Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead
of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to
get the most reward. The environment is a Markov decision process (MDP) with state set, $\mathcal{S}$, and action set, $\mathcal{A}$. The agent and the environment interact in a sequence of discrete steps, $t = 0, 1, 2, \ldots$ The state and action at one time step, $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$, determine the probability distribution for the state at the next time step, $s_{t+1} \in \mathcal{S}$, and, jointly, the distribution for the next reward, $r_{t+1} \in \Re$. The agent's objective is to choose each $a_t$ to maximize the subsequent return:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k},$$

where the discount rate, 0 ≤ γ ≤ 1, determines the relative weighting of immediate and delayed rewards. In some environments,
the interaction consists of a sequence of episodes, each starting in a given state and ending upon arrival in a terminal state,
terminating the series above. In other cases the interaction is continual, without interruption, and the sum may have an infinite
number of terms (in which case we usually assume γ < 1). Infinite horizon cases with γ = 1 are also possible though less common
(e.g., see Mahadevan, 1996).
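
As a minimal sketch of the interaction and the discounted return described above, the following Python code rolls out one episode and sums its rewards with discounting. The four-state chain environment, the `step(state, action, rng)` interface, and the names `discounted_return`, `run_episode`, and `chain_step` are all illustrative assumptions, not part of the original formulation.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[k] plays the role of r_{t+1+k}."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def run_episode(step, policy, start_state, rng, max_steps=1000):
    """Interact for one episode and return the list of rewards received.

    `step(state, action, rng) -> (next_state, reward, done)` is a
    hypothetical environment interface assumed for this sketch.
    """
    state, rewards = start_state, []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = step(state, action, rng)
        rewards.append(reward)
        if done:  # arrival in a terminal state ends the series
            break
    return rewards


def chain_step(state, action, rng, slip=0.1):
    """Hypothetical chain MDP with states 0..3; state 3 is terminal.

    With probability `slip` the agent stays put, so the next state is
    drawn from a distribution that depends only on (state, action).
    Every step yields a reward of -1.
    """
    intended = state + 1 if action == "right" else max(state - 1, 0)
    next_state = state if rng.random() < slip else intended
    return next_state, -1.0, next_state == 3


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = run_episode(chain_step, lambda s: "right", 0, rng)
    print(len(rewards), discounted_return(rewards, gamma=0.9))
```

Accumulating backwards avoids computing powers of γ explicitly. Because the episode terminates, the sum is finite even with γ = 1; in the continual case one would truncate the rollout and rely on γ < 1 to keep the tail small.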