What Is Reinforcement Learning?

Reinforcement Learning (RL) is a computational framework for sequential decision making under uncertainty. An agent interacts with an environment, taking actions to maximize the cumulative reward it receives over time. Unlike supervised learning — where a teacher labels every example — the agent must discover good behavior purely through trial, error, and delayed feedback.

Four defining properties separate RL from other learning paradigms:

• Reward signal ($r_t$): the learning signal is scalar, delayed, and often sparse; no explicit "correct" action is given.
• Sequential state ($s_t$): actions affect future states, so decisions form a chain in which current choices constrain future options.
• Policy ($\pi$): the goal is to learn a policy, a mapping from states to actions, that maximizes long-run reward.
• Discounting ($\gamma$): future rewards are discounted by $\gamma \in [0,1)$; a reward now is worth more than the same reward later.

The Agent–Environment Interface

At each discrete time step $t = 0, 1, 2, \ldots$, the agent observes state $S_t \in \mathcal{S}$, selects action $A_t \in \mathcal{A}(S_t)$, and the environment responds with a new state $S_{t+1}$ and scalar reward $R_{t+1} \in \mathbb{R}$.

[Diagram: the agent–environment loop. The agent's policy π(a|s) emits action Aₜ; the environment's dynamics P(s'|s,a) and reward function R(s,a) return the next state Sₜ₊₁ and reward Rₜ₊₁.]

The Return

The agent's objective is to maximize the expected return $G_t$, the discounted sum of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0,1)$ serves two purposes: it ensures mathematical convergence when the horizon is infinite, and it encodes the idea that immediate rewards are more valuable than distant ones. When $\gamma = 0$, the agent is myopic, caring only about immediate reward. As $\gamma \to 1$, the agent becomes fully far-sighted.

Note the elegant recursive structure: $G_t = R_{t+1} + \gamma G_{t+1}$. This recursion is the engine behind every RL algorithm.
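
A quick numerical check of this recursion: the following minimal Python sketch (the reward sequence and discount are invented for illustration) computes every $G_t$ of an episode in a single backward pass.

def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for all t via G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G   # rewards[t] plays the role of R_{t+1}
        returns[t] = G
    return returns

print(discounted_returns([0.0, 0.0, 1.0]))  # ≈ [0.81, 0.9, 1.0]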

Markov Decision Processes

Almost all RL problems are formalized as Markov Decision Processes (MDPs). An MDP is a mathematical framework for modeling sequential decision making when outcomes are partly random and partly under the control of a decision maker.

Definition — MDP

An MDP is a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:

  • $\mathcal{S}$ — set of states
  • $\mathcal{A}$ — set of actions
  • $P(s' \mid s, a)$ — transition probability: probability of landing in state $s'$ after taking action $a$ in state $s$
  • $R(s, a, s')$ — reward function: expected immediate reward
  • $\gamma \in [0,1)$ — discount factor
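
To make the tuple concrete, here is a minimal NumPy sketch of a two-state, two-action MDP; all numbers are invented for illustration, and later sketches in this section reuse these arrays.

import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9
# P[s, a, s'] = probability of landing in s' after taking a in s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a, s'] = expected immediate reward for that transition
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0   # reaching state 1 pays +1

assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution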

The Markov Property

The key assumption is the Markov property: the future depends on the present only through the current state, not the history of how we got there.

$$P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1} \mid S_t, A_t)$$

In other words: $S_t$ is a sufficient statistic for all past experience. The current state captures everything relevant about the past.

Policies

A policy $\pi$ specifies how the agent chooses actions. It can be:

• Stochastic policy $\pi(a|s)$: a distribution over actions given the state. More general; enables exploration and mixed strategies.
• Deterministic policy $\pi(s) = a$: maps each state to a single action. Often sufficient for fully observable MDPs.

A policy $\pi$ is a probability distribution over actions given states, satisfying $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for all $s \in \mathcal{S}$. When we say an agent "follows policy $\pi$," we mean it draws actions $A_t \sim \pi(\cdot \mid S_t)$ at each step.
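
In code, "following $\pi$" is just sampling from a per-state categorical distribution. A minimal sketch, with a made-up policy table over the toy MDP above:

import numpy as np

rng = np.random.default_rng(0)
# pi[s, a] = probability of action a in state s; each row sums to 1
pi = np.array([[0.7, 0.3],
               [0.5, 0.5]])

def act(pi, s):
    """Draw A_t ~ pi(.|S_t)."""
    return rng.choice(pi.shape[1], p=pi[s])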

Finite vs. Infinite Horizon

Setting | Definition | Return
Episodic | Episodes end at terminal states | $G_t = \sum_{k=0}^{T-t-1} R_{t+k+1}$
Continuing (discounted) | No terminal state, $\gamma < 1$ | $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Average reward | No discount, ergodic chain | $r(\pi) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} R_t$

Value Functions

Value functions answer the question: how good is it to be in a given state (or to take a given action from a given state)? They quantify the expected long-term reward under a particular policy, providing the key signal for learning and planning.

State-Value Function $V^\pi$

The state-value function under policy $\pi$ is the expected return when starting in state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s\right]$$ (state-value)

Action-Value Function $Q^\pi$

The action-value function (or Q-function) is the expected return when taking action $a$ in state $s$, then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a\right]$$ (action-value)

Relationship Between $V$ and $Q$

The two value functions are related by a marginalization over the policy:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)$$

And conversely, $Q$ can be expressed in terms of $V$:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
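
For a tabular MDP both identities are one-liners. A sketch, reusing the P, R, gamma, and pi arrays from the toy MDP above:

import numpy as np

def q_from_v(V, P, R, gamma):
    """Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    return (P * (R + gamma * V[None, None, :])).sum(axis=2)

def v_from_q(Q, pi):
    """V(s) = sum_a pi(a|s) Q(s,a)."""
    return (pi * Q).sum(axis=1)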

Optimal Value Functions

A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$. The optimal value functions are:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$

Crucially, for finite MDPs, an optimal policy always exists and can be found by acting greedily with respect to $Q^*$:

$$\pi^*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q^*(s, a') \\ 0 & \text{otherwise} \end{cases}$$

Advantage Function

The advantage function measures how much better action $a$ is compared to the average action under the current policy:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

$A^\pi(s,a) > 0$ means action $a$ is better than average; $A^\pi(s,a) < 0$ means worse than average. Advantage is central to policy gradient methods (PPO, A3C) because it reduces variance in gradient estimates.

Bellman Equations

The Bellman equations are the fundamental recursive decompositions of value functions. They state that the value of a state equals the immediate reward plus the discounted value of the next state. Every practical RL algorithm exploits this structure.

Bellman Expectation Equation

Expanding $V^\pi$ using the definition of return:

$$\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ &= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] \end{aligned}$$
(Bellman expectation for $V^\pi$)
$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')\right]$$ (Bellman expectation for $Q^\pi$)

Bellman Optimality Equations

For the optimal value functions, we take the maximum over actions instead of the expectation under $\pi$:

$$V^*(s) = \max_{a} \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V^*(s')\right]$$ (Bellman optimality for $V^*$)
$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$ (Bellman optimality for $Q^*$)
Key Insight: The Bellman optimality equation is a system of nonlinear equations, one per state. For finite MDPs it has a unique solution $V^*$, but solving it directly requires knowing the model $P$ and $R$; even the linear Bellman expectation system for a fixed policy costs $O(|\mathcal{S}|^3)$ to solve exactly via matrix inversion. This motivates approximate and model-free methods.

Bellman Operator

Define the Bellman operator $\mathcal{T}^\pi$ mapping functions $V:\mathcal{S}\to\mathbb{R}$ to functions of the same type:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V(s')\right]$$

$V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$: $\mathcal{T}^\pi V^\pi = V^\pi$. The operator is a $\gamma$-contraction in the $\ell_\infty$ norm:

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi V'\|_\infty \leq \gamma \|V - V'\|_\infty$$

By the Banach fixed-point theorem, repeated application of $\mathcal{T}^\pi$ converges geometrically to $V^\pi$, regardless of initialization. This is the theoretical guarantee behind iterative algorithms like policy evaluation.

Dynamic Programming

Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies given a perfect model of the MDP. They exploit the Bellman equations iteratively. DP is foundational — while impractical at scale (it requires full model knowledge and is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep), it defines the target that all approximate RL methods pursue.

Policy Evaluation (Prediction)

Given a policy $\pi$, compute $V^\pi$ by iterating the Bellman expectation operator until convergence:

Algorithm: Iterative Policy Evaluation
  Input: policy π, threshold θ > 0
  Initialize: V(s) ← 0 for all s ∈ 𝒮
  Repeat:
      Δ ← 0
      For each s ∈ 𝒮:
          v ← V(s)
          V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
  Until Δ < θ
  Return V ≈ V^π
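
The sweep translates directly to NumPy. A sketch assuming the toy P, R, gamma, and pi arrays defined earlier:

import numpy as np

def policy_evaluation(pi, P, R, gamma, theta=1e-8):
    """Iterate the Bellman expectation backup until the sup-norm change < theta."""
    V = np.zeros(P.shape[0])
    while True:
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = (pi * Q).sum(axis=1)     # (T^pi V)(s)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new                 # approx V^pi
        V = V_new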

Policy Improvement

Given $V^\pi$, we can construct a policy $\pi'$ that is at least as good by acting greedily:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

The Policy Improvement Theorem guarantees $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$. If no improvement is possible, $\pi = \pi'$, and by the Bellman optimality equation, $\pi = \pi^*$.

Policy Iteration

Alternating between policy evaluation and improvement converges to the optimal policy in a finite number of steps (since there are finitely many policies):

Algorithm: Policy Iteration
  Initialize: π arbitrarily (e.g., random)
  Repeat:
      // Policy Evaluation
      V ← PolicyEvaluation(π)
      // Policy Improvement
      policy_stable ← true
      For each s ∈ 𝒮:
          old_action ← π(s)
          π(s) ← argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          If old_action ≠ π(s): policy_stable ← false
  Until policy_stable
  Return π = π*, V = V*

Value Iteration

Policy iteration requires a full policy evaluation sweep before each improvement. Value Iteration combines the two into a single update per state by using the Bellman optimality operator:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V_k(s')\right]$$
Algorithm: Value Iteration
  Initialize: V(s) ← 0 for all s; threshold θ > 0
  Repeat:
      Δ ← 0
      For each s ∈ 𝒮:
          v ← V(s)
          V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
  Until Δ < θ
  Output: π*(s) = argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
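
The same structure with a max in place of the policy average gives value iteration; a sketch over the toy arrays:

import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backups; returns approx V* and the greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q(s,a)
        V = V_new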
[Interactive demo: value iteration on a gridworld with a goal (+1), a pit (−1), and walls; cell shading runs from low to high value, and the display tracks the iteration count and the maximum Δ per sweep.]

Monte Carlo Methods

Monte Carlo (MC) methods learn from complete episodes of experience without requiring a model of the environment. Instead of computing expected values analytically, MC estimates them by averaging actual returns sampled from interactions.

Key Assumption: Monte Carlo requires episodic tasks; the environment must eventually terminate so the full return $G_t$ can be computed. MC cannot be applied directly to continuing tasks.

First-Visit vs. Every-Visit MC

For estimating $V^\pi(s)$, we average returns following visits to $s$. Two variants:

Method | Which visits count? | Property
First-visit MC | Only the first visit to $s$ in each episode | Unbiased; samples are i.i.d.
Every-visit MC | Every visit to $s$ in each episode | Biased but consistent; more data-efficient

Both converge to $V^\pi(s)$ as the number of visits $\to \infty$ (by the law of large numbers).

Algorithm: First-Visit MC Prediction
  Initialize: V(s) ← 0, Returns(s) ← [] for all s
  For each episode:
      Generate episode: S_0, A_0, R_1, S_1, A_1, R_2, …, S_{T−1}, A_{T−1}, R_T
      G ← 0
      For t = T−1, T−2, …, 0:
          G ← γG + R_{t+1}
          If S_t does not appear in S_0, …, S_{t−1}:
              Append G to Returns(S_t)
              V(S_t) ← mean(Returns(S_t))
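
A runnable sketch on the toy MDP, with the extra assumption (made purely so episodes terminate) that state 1 is terminal:

import numpy as np

rng = np.random.default_rng(0)

def sample_episode(pi, P, R, s0=0, terminal=1, max_steps=100):
    """Roll out one episode; returns aligned lists of states and rewards."""
    states, rewards, s = [], [], s0
    for _ in range(max_steps):
        a = rng.choice(pi.shape[1], p=pi[s])
        s2 = rng.choice(P.shape[2], p=P[s, a])
        states.append(s)
        rewards.append(R[s, a, s2])
        s = s2
        if s == terminal:
            break
    return states, rewards

def first_visit_mc(pi, P, R, gamma, episodes=2000):
    returns = {s: [] for s in range(P.shape[0])}
    for _ in range(episodes):
        states, rewards = sample_episode(pi, P, R)
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            if states[t] not in states[:t]:      # first visit only
                returns[states[t]].append(G)
    return {s: np.mean(g) for s, g in returns.items() if g}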

MC Control: Exploring Starts

For control (finding the optimal policy), we estimate $Q^\pi(s,a)$ and improve the policy. The exploring starts assumption ensures all state-action pairs are visited:

Algorithm: MC Control with Exploring Starts
  Initialize: π(s) arbitrary; Q(s,a) ← 0; Returns(s,a) ← [] for all s, a
  For each episode:
      Choose S_0, A_0 randomly (exploring starts)
      Generate episode using π from S_0, A_0
      G ← 0
      For t = T−1, …, 0:
          G ← γG + R_{t+1}
          If (S_t, A_t) does not appear in (S_0, A_0), …, (S_{t−1}, A_{t−1}):   // first visit
              Append G to Returns(S_t, A_t)
              Q(S_t, A_t) ← mean(Returns(S_t, A_t))
              π(S_t) ← argmax_a Q(S_t, a)

Incremental Update Rule

Instead of storing all returns, we can update incrementally using a running mean:

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left[G_t - V(S_t)\right]$$

In non-stationary environments, we use a constant step-size $\alpha$ to weight recent returns more heavily:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$

This step-size form is the prototype for all RL update rules. The quantity $[G_t - V(S_t)]$ is the prediction error — how wrong our current estimate was.

Bias–Variance Trade-off

MC uses the true (sampled) return $G_t$, making it unbiased. However, $G_t = \sum_{k} \gamma^k R_{t+k+1}$ is a sum of many random variables, giving it high variance. This is the fundamental tension in RL estimation that TD methods partially resolve.

Temporal Difference Learning

Temporal Difference (TD) learning is arguably the most important idea in RL. It combines the sampling advantage of MC (no model required) with the bootstrapping of DP (updates from estimates, not full returns). TD methods can learn from incomplete episodes, online, at every step.

TD(0): One-Step TD Prediction

Instead of waiting for the full return $G_t$, TD(0) bootstraps using the immediate reward and the current estimate of the next state's value:

$$\underbrace{V(S_t)}_{\text{old estimate}} \leftarrow V(S_t) + \alpha \underbrace{\left[\underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t)\right]}_{\text{TD error } \delta_t}$$ (TD(0) update)

The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is a one-step prediction error. It is zero when $V$ is self-consistent — i.e., when $V = V^\pi$.

• TD error ($\delta_t$): $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$; the signal driving all TD learning.
• TD target ($R_{t+1} + \gamma V(S_{t+1})$): a bootstrap estimate of the true return; one step of the Bellman backup.
• Learning rate ($\alpha$): controls how quickly estimates are updated; standard choice $\alpha \in (0,1)$.
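
The entire method fits in a few lines. A sketch of one tabular TD(0) backup (V is a NumPy array or list of state values):

def td0_update(V, s, r, s2, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) backup and return the TD error."""
    target = r + (0.0 if terminal else gamma * V[s2])   # TD target
    delta = target - V[s]                               # TD error
    V[s] += alpha * delta
    return delta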

SARSA: On-Policy TD Control

SARSA (State, Action, Reward, State', Action') updates Q-values using transitions $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ where $A_{t+1}$ is selected by the same policy being learned:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$ (SARSA)

Because SARSA uses the action actually taken by the policy ($A_{t+1} \sim \pi$), it is on-policy: it learns about the policy it is currently executing, including its exploratory actions. With an $\varepsilon$-greedy policy and decaying $\varepsilon$, SARSA converges to the optimal policy.

Q-Learning: Off-Policy TD Control

Q-Learning decouples the behavior policy (used to generate data) from the target policy (being learned) by maximizing over next-state actions:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$ (Q-Learning, Watkins 1989)

By using $\max_{a'} Q(S_{t+1}, a')$ as the target, Q-Learning directly approximates $Q^*$, independent of the behavior policy. This is the key advantage: data collected by any policy (including a random one) can be used to learn the optimal Q-function.

Algorithm: Q-Learning (Tabular)
  Initialize: Q(s,a) ← 0 for all s, a; Q(terminal, ·) ← 0
  For each episode:
      Initialize S
      Repeat (for each step):
          Choose A from S using ε-greedy(Q)
          Take action A; observe R, S'
          Q(S,A) ← Q(S,A) + α [R + γ max_a Q(S',a) − Q(S,A)]
          S ← S'
      Until S is terminal
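
A runnable tabular sketch on the toy MDP, again treating state 1 as terminal:

import numpy as np

rng = np.random.default_rng(0)

def q_learning(P, R, gamma=0.9, alpha=0.1, eps=0.1, episodes=500, terminal=1):
    nS, nA = P.shape[0], P.shape[1]
    Q = np.zeros((nS, nA))
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # epsilon-greedy behavior policy
            a = int(rng.integers(nA)) if rng.random() < eps else int(Q[s].argmax())
            s2 = rng.choice(nS, p=P[s, a])
            target = R[s, a, s2] + gamma * Q[s2].max() * (s2 != terminal)
            Q[s, a] += alpha * (target - Q[s, a])       # off-policy max backup
            s = s2
    return Q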

Expected SARSA

A middle ground that uses the expected value over next actions rather than a sample:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma \sum_{a'} \pi(a'|S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$

SARSA vs. Q-Learning: The Cliff Example

Algorithm | Policy type | Learns | Cliff behavior
SARSA | On-policy | $Q^{\pi_\varepsilon}$ (ε-greedy optimal) | Safe path (avoids cliff due to ε-greedy)
Q-Learning | Off-policy | $Q^*$ (true optimal) | Optimal path (cliff edge, risky)
Expected SARSA | Both | $Q^*$ or $Q^\pi$ | Optimal path; lower variance than SARSA

TD(λ): Eligibility Traces

TD(λ) unifies MC and TD by looking $n$ steps ahead and blending all $n$-step returns:

$$G_t^{(\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$ is the $n$-step return. This is implemented efficiently with eligibility traces $z_t(s)$:

$$z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbf{1}[S_t = s], \qquad V(s) \leftarrow V(s) + \alpha\,\delta_t\, z_t(s)$$

When $\lambda = 0$, TD($\lambda$) reduces to TD(0); when $\lambda = 1$, it recovers MC (for episodic tasks). Intermediate values such as $\lambda \approx 0.9$ often work well in practice.
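
Per time step, the trace mechanics are three lines. A sketch (V and z are NumPy arrays over states; delta is the TD error from the TD(0) update above):

def td_lambda_step(V, z, s, delta, alpha=0.1, gamma=0.9, lam=0.9):
    """Accumulating-trace TD(lambda): decay traces, bump s, update all states."""
    z *= gamma * lam        # z_t(s) = gamma * lambda * z_{t-1}(s) ...
    z[s] += 1.0             # ... + 1[S_t = s]
    V += alpha * delta * z  # every state moves in proportion to its trace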

[Interactive demo: a Q-learning agent on a gridworld with a start cell, a goal (+10), a pit (−5), and walls; arrows mark each state's current best action, and the display tracks episodes, average reward, and steps per episode.]

Policy Gradient Methods

Value-based methods like Q-Learning find the optimal policy indirectly by learning a value function and then acting greedily. Policy gradient methods optimize the policy directly, parameterizing it as $\pi_\theta(a|s)$ and following the gradient of expected return with respect to $\theta$.

The Policy Gradient Objective

Define the performance objective $J(\theta)$ as the expected return under policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[G_0\right] = \sum_{s} d^{\pi_\theta}(s) \sum_a \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$$

where $d^{\pi_\theta}(s)$ is the stationary state distribution under $\pi_\theta$ (discounted or undiscounted, depending on the setting).

The Policy Gradient Theorem

Computing $\nabla_\theta J(\theta)$ is non-trivial because $d^{\pi_\theta}$ also depends on $\theta$. The Policy Gradient Theorem (Sutton et al., 1999) provides a clean expression:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(A_t | S_t) \cdot Q^{\pi_\theta}(S_t, A_t)\right]$$
Intuition: The policy gradient increases the log-probability of actions in proportion to their Q-value. Good actions (high $Q$) are made more likely; bad actions (low $Q$) are suppressed. Crucially, the derivative of $d^{\pi_\theta}$ does not appear in the formula; it cancels out.

The Log-Derivative Trick (Score Function Estimator)

The key identity enabling unbiased gradient estimation:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}$$

This lets us rewrite $\nabla_\theta \mathbb{E}_{\pi_\theta}[f(a)]$ as $\mathbb{E}_{\pi_\theta}\!\left[f(a)\,\nabla_\theta \log \pi_\theta(a|s)\right]$, an expectation under the policy itself, which can therefore be estimated by Monte Carlo sampling from trajectories.

REINFORCE Algorithm

The simplest policy gradient method, using Monte Carlo returns as the baseline-free estimate of $Q^{\pi_\theta}$:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t$$
Algorithm: REINFORCE
  Initialize: θ (policy parameters)
  For each episode:
      Generate trajectory τ = (S_0, A_0, R_1, …, S_T) following π_θ
      For t = 0, 1, …, T−1:
          G_t ← Σ_{k=t}^{T−1} γ^{k−t} R_{k+1}   // actual return from t
          θ ← θ + α γ^t G_t ∇_θ log π_θ(A_t | S_t)
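
A compact tabular sketch with a softmax policy theta[s, a] on the toy MDP (state 1 terminal, as before); for the softmax, grad log pi(a|s) is one-hot(a) minus pi(.|s), as derived in the softmax section below:

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(P, R, gamma=0.9, alpha=0.05, episodes=2000, terminal=1):
    theta = np.zeros((P.shape[0], P.shape[1]))   # action preferences h(s,a)
    for _ in range(episodes):
        traj, s = [], 0
        while s != terminal and len(traj) < 100:
            probs = softmax(theta[s])
            a = rng.choice(len(probs), p=probs)
            s2 = rng.choice(P.shape[2], p=P[s, a])
            traj.append((s, a, R[s, a, s2]))
            s = s2
        G = 0.0
        for t, (s, a, r) in reversed(list(enumerate(traj))):
            G = r + gamma * G                    # return from step t
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0                   # one-hot(a) - pi(.|s)
            theta[s] += alpha * (gamma ** t) * G * grad_log
    return theta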

Variance Reduction: Baselines

REINFORCE has high variance because $G_t$ varies widely across episodes. We can subtract any baseline $b(s)$ that does not depend on $a$ without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(A_t|S_t)\cdot\left(G_t - b(S_t)\right)\right]$$ (valid because $\mathbb{E}[\nabla \log \pi \cdot b] = 0$)

The optimal baseline that minimizes variance is approximately $b^*(s) \approx V^{\pi_\theta}(s)$, giving the advantage $A^{\pi_\theta}(S_t, A_t) = G_t - V^{\pi_\theta}(S_t)$ as the gradient weight. This is the key insight behind Actor-Critic methods.

The Softmax (Gibbs) Policy

For discrete actions, a common parameterization with preference function $h_\theta(s,a) \in \mathbb{R}$:

$$\pi_\theta(a|s) = \frac{e^{h_\theta(s,a)}}{\sum_{a'} e^{h_\theta(s,a')}}, \qquad \nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta h_\theta(s,a) - \sum_{a'}\pi_\theta(a'|s)\nabla_\theta h_\theta(s,a')$$

Gaussian Policy for Continuous Actions

For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^d$:

$$\pi_\theta(a|s) = \mathcal{N}\!\left(\mu_\theta(s),\,\sigma_\theta^2(s)\right), \qquad \nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma_\theta^2(s)}\nabla_\theta \mu_\theta(s)$$

(The gradient shown is with respect to the mean parameters; if $\sigma_\theta$ is also learned, its log-likelihood term contributes an additional component.)

Actor-Critic Methods

Actor-Critic methods maintain two function approximators: the actor $\pi_\theta(a|s)$ (the policy) and the critic $V_w(s)$ (the value function). The critic evaluates the current policy, and the actor uses the critic's feedback to update the policy via policy gradients.

• Actor ($\pi_\theta$): the policy. Updated by the policy gradient, weighted by the advantage from the critic.
• Critic ($V_w$): the value-function estimate. Updated by TD learning; provides a low-variance gradient signal.
• Advantage estimate ($\hat{A}_t$): $\delta_t = R_{t+1}+\gamma V_w(S_{t+1})-V_w(S_t)$. Used to weight the policy gradient.

The One-Step Actor-Critic Update

$$\begin{aligned} &\delta_t = R_{t+1} + \gamma V_w(S_{t+1}) - V_w(S_t) \qquad \text{(TD error)}\\ &w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(S_t) \qquad \text{(critic update)}\\ &\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(A_t|S_t) \qquad \text{(actor update)} \end{aligned}$$ (one-step A-C)
Algorithm: One-Step Actor-Critic
  Initialize: θ (actor), w (critic), step sizes α_θ, α_w
  For each episode:
      Initialize S
      I ← 1
      Repeat (for each step):
          A ~ π_θ(·|S)
          Take A; observe R, S'
          δ ← R + γ V_w(S') − V_w(S)    // V_w(S') ≡ 0 if S' is terminal
          w ← w + α_w δ ∇_w V_w(S)
          θ ← θ + α_θ I δ ∇_θ log π_θ(A|S)
          I ← γI; S ← S'
      Until S is terminal

Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015) provides a smooth bias-variance trade-off for the advantage estimate using an exponentially weighted average of k-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

When $\lambda = 0$: one-step TD advantage. When $\lambda = 1$: Monte Carlo advantage. GAE with $\lambda \approx 0.95$ is the standard in modern deep RL (PPO, A3C, TRPO).
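
The estimator is a single backward pass over a trajectory. A sketch (values carries one extra bootstrap entry for $V(S_T)$):

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_hat_t = sum_l (gamma*lam)^l delta_{t+l} by backward recursion."""
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv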

A3C: Asynchronous Advantage Actor-Critic

Mnih et al. (2016) proposed running multiple actor-critic agents in parallel, each interacting with its own copy of the environment. Asynchronous gradient updates decorrelate experience without a replay buffer:

$$\mathcal{L}(\theta) = -\mathbb{E}\!\left[\sum_t \log \pi_\theta(A_t|S_t)\hat{A}_t - \beta H[\pi_\theta(\cdot|S_t)]\right]$$

The entropy bonus $H[\pi_\theta] = -\sum_a \pi_\theta(a|s)\log\pi_\theta(a|s)$ encourages exploration by penalizing overly deterministic policies.

Deep Reinforcement Learning

Deep RL combines neural network function approximation with RL algorithms. The state space is no longer tabulated — instead, a neural network $f_\theta : \mathcal{S} \to \mathbb{R}^{|\mathcal{A}|}$ approximates $Q^*$ or $\pi^*$. This enables RL to scale to high-dimensional inputs like images, text, or continuous vectors.

DQN: Deep Q-Network

Mnih et al. (2013/2015) showed that Q-Learning with neural networks is unstable due to correlated updates and a non-stationary target. Two key stabilization tricks:

• Experience replay ($\mathcal{D}$): store transitions $(s,a,r,s')$ in a replay buffer $\mathcal{D}$ and sample random minibatches to break temporal correlations.
• Target network ($\theta^-$): a periodically frozen copy of the Q-network used to compute targets, reducing oscillation.

The DQN loss function minimizes the mean-squared Bellman error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s', a';\, \theta^-) - Q(s, a;\, \theta)\right)^2\right]$$ (DQN loss)
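
Stripped of the networks themselves, the target and loss are simple array operations. A NumPy sketch in which q_next stands in for the frozen target network's outputs and done marks terminal transitions (both names are assumptions of this sketch):

import numpy as np

def dqn_targets(r, q_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminal states."""
    return r + gamma * q_next.max(axis=1) * (1.0 - done)

def dqn_loss(q_pred, actions, y):
    """Mean-squared Bellman error on the actions actually taken."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return np.mean((y - chosen) ** 2)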

Improvements to DQN

Variant | Key idea | Motivation
Double DQN | Separate networks for action selection vs. evaluation | Reduces overestimation bias in Q-targets
Dueling DQN | $Q(s,a;\theta) = V(s;\theta) + A(s,a;\theta) - \bar{A}$ | More efficient learning of $V$ when actions don't matter
PER | Sample transitions with prob $\propto |\delta_t|^\alpha$ | More frequent updates on surprising transitions
Rainbow | All of the above + n-step + distributional | State-of-the-art sample efficiency

PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) is the dominant policy gradient algorithm for practical deep RL. It addresses the challenge that large policy gradient steps can catastrophically change the policy. The clipped objective constrains how much $\pi_\theta$ can change from the old policy $\pi_{\theta_\text{old}}$:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$ (PPO-Clip)

where $r_t(\theta) = \dfrac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}$ is the probability ratio. The clipping prevents the ratio from going too far from 1, ensuring stable updates without the complexity of a trust region constraint.
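
The clipped surrogate itself is a one-liner over a batch. A NumPy sketch (log-probabilities and advantage estimates are assumed to be given):

import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))   # maximize this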

SAC: Soft Actor-Critic

SAC (Haarnoja et al., 2018) augments the reward with an entropy term to encourage maximum-entropy policies:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\!\left[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right]$$

The temperature parameter $\alpha$ controls the exploration-exploitation trade-off. SAC is off-policy (uses a replay buffer), sample-efficient, and robust to hyperparameters, making it the standard for continuous control tasks.

Landscape of RL Algorithms

Algorithm | Type | Action space | Key property
DQN | Value-based, off-policy | Discrete | Image inputs, Atari
DDPG | Actor-Critic, off-policy | Continuous | Deterministic policy gradient
TD3 | Actor-Critic, off-policy | Continuous | Clipped double Q, delayed actor
SAC | Actor-Critic, off-policy | Continuous | Entropy regularization
PPO | Policy gradient, on-policy | Both | Stable, scalable, default choice
TRPO | Policy gradient, on-policy | Both | Trust region, monotonic improvement
AlphaZero | Model-based + MCTS | Discrete | Self-play, planning

Exploration vs. Exploitation

The exploration-exploitation dilemma is fundamental to RL: should the agent exploit its current knowledge to maximize immediate reward, or explore to gather information that might lead to better long-term returns? This tension appears in every RL setting.

ε-Greedy Policy

The simplest exploration strategy: with probability $\varepsilon$ take a random action, otherwise take the greedy action.

$$\pi_\varepsilon(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & a = \arg\max_{a'} Q(s,a') \\[6pt] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
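
In code, for a tabular Q (a minimal sketch):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """With probability eps explore uniformly; otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())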

The Multi-Armed Bandit

The bandit problem is the simplest RL setting: $|\mathcal{S}| = 1$, $|\mathcal{A}| = k$ arms, each with unknown reward distribution $R_i \sim P_i$. Goal: maximize total reward over $T$ steps. Define regret as the gap between the optimal strategy and the chosen one:

$$\text{Regret}(T) = T \cdot Q^*(a^*) - \sum_{t=1}^{T} \mathbb{E}[R_t] = \sum_{a} \Delta_a\, \mathbb{E}[N_a(T)]$$

where $\Delta_a = Q^*(a^*) - Q^*(a)$ is the gap for suboptimal arm $a$, and $N_a(T)$ is the number of times arm $a$ was pulled.

UCB: Upper Confidence Bound

UCB strikes the exploration-exploitation balance using optimism: select the arm that has the highest upper confidence bound on its true mean reward:

$$A_t = \arg\max_a \left[\hat{Q}(a) + c\sqrt{\frac{\ln t}{N_a(t)}}\right]$$

The first term is the estimated reward; the second is the uncertainty bonus. Rarely-sampled arms get large bonuses, ensuring systematic exploration. UCB1 achieves regret $O(\log T)$, which is asymptotically optimal (Lai-Robbins lower bound).
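
A sketch of the selection rule (it assumes every arm has already been pulled once, so counts ≥ 1 and the bonus is defined):

import numpy as np

def ucb_action(q_hat, counts, t, c=2.0):
    """Pick the arm with the largest optimistic upper bound."""
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_hat + bonus))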

Thompson Sampling

A Bayesian exploration strategy: maintain a posterior distribution over arm rewards $P_a | \text{data}$, sample a reward estimate from each posterior, then act greedily on the samples:

Algorithm: Thompson Sampling (Beta-Bernoulli)
  Initialize: α_a = 1, β_a = 1 for each arm a   // uniform prior
  For t = 1, 2, …, T:
      For each arm a: sample θ̃_a ~ Beta(α_a, β_a)
      Pull arm A_t = argmax_a θ̃_a
      Observe reward R_t ∈ {0, 1}
      If R_t = 1: α_{A_t} ← α_{A_t} + 1
      Else: β_{A_t} ← β_{A_t} + 1
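
The same loop in NumPy; the true arm means below are invented for the demo and hidden from the agent:

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.6, 0.3, 0.7])   # unknown to the agent
alpha = np.ones(5)                                  # Beta(1,1) = uniform prior
beta = np.ones(5)

for t in range(10_000):
    samples = rng.beta(alpha, beta)     # one draw from each arm's posterior
    a = int(samples.argmax())           # act greedily on the samples
    r = 1.0 if rng.random() < true_means[a] else 0.0
    alpha[a] += r                       # Bernoulli posterior update
    beta[a] += 1.0 - r

print(alpha / (alpha + beta))           # arm 4's posterior mean approaches 0.7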
[Interactive demo: a 5-armed bandit with unknown true means, comparing the cumulative regret of ε-greedy, UCB, and Thompson Sampling.]

Intrinsic Motivation & Curiosity

In environments with sparse rewards, pure ε-greedy exploration is insufficient. Modern approaches augment the reward with an intrinsic bonus that rewards visiting novel or uncertain states:

$$\tilde{r}_t = r_t + \beta \cdot \underbrace{b(s_t)}_{\text{intrinsic bonus}}$$

Common intrinsic bonus designs include count-based novelty bonuses ($b(s) \propto 1/\sqrt{N(s)}$), prediction-error curiosity such as the Intrinsic Curiosity Module (ICM), and Random Network Distillation (RND), which rewards states where a fixed random target network is still poorly predicted.

Summary: Key RL Concepts

Concept | Symbol | Description
State | $s \in \mathcal{S}$ | Current situation of the environment
Action | $a \in \mathcal{A}$ | Choice made by the agent
Reward | $r_t \in \mathbb{R}$ | Immediate feedback signal
Return | $G_t = \sum_k \gamma^k r_{t+k+1}$ | Cumulative discounted reward
Policy | $\pi(a|s)$ | Behavior of the agent
State-value | $V^\pi(s)$ | Expected return from state $s$ under $\pi$
Action-value | $Q^\pi(s,a)$ | Expected return from $(s,a)$ under $\pi$
Advantage | $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Relative value of action $a$ vs. average
TD error | $\delta_t = r + \gamma V(s') - V(s)$ | One-step prediction error
Bellman operator | $\mathcal{T}^\pi V$ | One-step backup; fixed point $= V^\pi$