What Is Reinforcement Learning?

Reinforcement Learning (RL) is a computational framework for sequential decision making under uncertainty. An agent interacts with an environment, taking actions to maximize the cumulative reward it receives over time. Unlike supervised learning — where a teacher labels every example — the agent must discover good behavior purely through trial, error, and delayed feedback.

Four defining properties separate RL from other learning paradigms:

• Reward signal ($r_t$): the learning signal is scalar, delayed, and often sparse; no explicit "correct" action is given.
• Sequential state ($s_t$): actions affect future states, so decisions form a chain in which current choices constrain future options.
• Policy ($\pi$): the goal is to learn a policy, a mapping from states to actions, that maximizes long-run reward.
• Discounting ($\gamma$): future rewards are discounted by $\gamma \in [0,1)$; a reward now is worth more than the same reward later.

The Agent–Environment Interface

At each discrete time step $t = 0, 1, 2, \ldots$, the agent observes state $S_t \in \mathcal{S}$, selects action $A_t \in \mathcal{A}(S_t)$, and the environment responds with a new state $S_{t+1}$ and scalar reward $R_{t+1} \in \mathbb{R}$.

[Diagram: the agent–environment loop. The agent's policy π(a|s) emits action Aₜ; the environment's dynamics P(s'|s,a) and reward function R(s,a) return the next state Sₜ₊₁ and reward Rₜ₊₁.]

The Return

The agent's objective is to maximize the expected return $G_t$, the discounted sum of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma \in [0,1)$ serves two purposes: it ensures mathematical convergence when the horizon is infinite, and it encodes the idea that immediate rewards are more valuable than distant ones. When $\gamma = 0$, the agent is myopic, caring only about immediate reward. As $\gamma \to 1$, the agent becomes fully far-sighted.

Note the elegant recursive structure: $G_t = R_{t+1} + \gamma G_{t+1}$. This recursion is the engine behind every RL algorithm.
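
A quick numerical check of this recursion: the following minimal Python sketch (the reward sequence and discount are invented for illustration) computes every $G_t$ of an episode in a single backward pass.

def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for all t via G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G   # rewards[t] plays the role of R_{t+1}
        returns[t] = G
    return returns

print(discounted_returns([0.0, 0.0, 1.0]))  # ≈ [0.81, 0.9, 1.0]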

Markov Decision Processes

Almost all RL problems are formalized as Markov Decision Processes (MDPs). An MDP is a mathematical framework for modeling sequential decision making when outcomes are partly random and partly under the control of a decision maker.

Definition — MDP

An MDP is a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:

  • $\mathcal{S}$ — set of states
  • $\mathcal{A}$ — set of actions
  • $P(s' \mid s, a)$ — transition probability: probability of landing in state $s'$ after taking action $a$ in state $s$
  • $R(s, a, s')$ — reward function: expected immediate reward
  • $\gamma \in [0,1)$ — discount factor
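
To make the tuple concrete, here is a minimal NumPy sketch of a two-state, two-action MDP; all numbers are invented for illustration, and later sketches in this section reuse these arrays.

import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9
# P[s, a, s'] = probability of landing in s' after taking a in s
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a, s'] = expected immediate reward for that transition
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0   # reaching state 1 pays +1

assert np.allclose(P.sum(axis=2), 1.0)   # each (s, a) row is a distribution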

The Markov Property

The key assumption is the Markov property: the future depends on the present only through the current state, not the history of how we got there.

$$P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1} \mid S_t, A_t)$$

In other words: $S_t$ is a sufficient statistic for all past experience. The current state captures everything relevant about the past.

Policies

A policy $\pi$ specifies how the agent chooses actions. It can be:

• Stochastic policy $\pi(a|s)$: a distribution over actions given the state. More general; enables exploration and mixed strategies.
• Deterministic policy $\pi(s) = a$: maps each state to a single action. Often sufficient for fully observable MDPs.

A policy $\pi$ is a probability distribution over actions given states, satisfying $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for all $s \in \mathcal{S}$. When we say an agent "follows policy $\pi$," we mean it draws actions $A_t \sim \pi(\cdot \mid S_t)$ at each step.
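
In code, "following $\pi$" is just sampling from a per-state categorical distribution. A minimal sketch, with a made-up policy table over the toy MDP above:

import numpy as np

rng = np.random.default_rng(0)
# pi[s, a] = probability of action a in state s; each row sums to 1
pi = np.array([[0.7, 0.3],
               [0.5, 0.5]])

def act(pi, s):
    """Draw A_t ~ pi(.|S_t)."""
    return rng.choice(pi.shape[1], p=pi[s])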

Finite vs. Infinite Horizon

Setting | Definition | Return
Episodic | Episodes end at terminal states | $G_t = \sum_{k=0}^{T-t-1} R_{t+k+1}$
Continuing (discounted) | No terminal state, $\gamma < 1$ | $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Average reward | No discount, ergodic chain | $r(\pi) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} R_t$

Value Functions

Value functions answer the question: how good is it to be in a given state (or to take a given action from a given state)? They quantify the expected long-term reward under a particular policy, providing the key signal for learning and planning.

State-Value Function $V^\pi$

The state-value function under policy $\pi$ is the expected return when starting in state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s\right]$$ (state-value)

Action-Value Function $Q^\pi$

The action-value function (or Q-function) is the expected return when taking action $a$ in state $s$, then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a\right]$$ (action-value)

Relationship Between $V$ and $Q$

The two value functions are related by a marginalization over the policy:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^\pi(s, a)$$

And conversely, $Q$ can be expressed in terms of $V$:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
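
For a tabular MDP both identities are one-liners. A sketch, reusing the P, R, gamma, and pi arrays from the toy MDP above:

import numpy as np

def q_from_v(V, P, R, gamma):
    """Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    return (P * (R + gamma * V[None, None, :])).sum(axis=2)

def v_from_q(Q, pi):
    """V(s) = sum_a pi(a|s) Q(s,a)."""
    return (pi * Q).sum(axis=1)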

Optimal Value Functions

A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$. The optimal value functions are:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$

Crucially, for finite MDPs, an optimal policy always exists and can be found by acting greedily with respect to $Q^*$:

$$\pi^*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q^*(s, a') \\ 0 & \text{otherwise} \end{cases}$$

Advantage Function

The advantage function measures how much better action $a$ is compared to the average action under the current policy:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

$A^\pi(s,a) > 0$ means action $a$ is better than average; $A^\pi(s,a) < 0$ means worse than average. Advantage is central to policy gradient methods (PPO, A3C) because it reduces variance in gradient estimates.

Bellman Equations

The Bellman equations are the fundamental recursive decompositions of value functions. They state that the value of a state equals the immediate reward plus the discounted value of the next state. Every practical RL algorithm exploits this structure.

Bellman Expectation Equation

Expanding $V^\pi$ using the definition of return:

$$\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ &= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] \end{aligned}$$
(Bellman expectation for $V^\pi$)
$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s')\, Q^\pi(s',a')\right]$$ (Bellman expectation for $Q^\pi$)

Bellman Optimality Equations

For the optimal value functions, we take the maximum over actions instead of the expectation under $\pi$:

$$V^*(s) = \max_{a} \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V^*(s')\right]$$ (Bellman optimality for $V^*$)
$$Q^*(s,a) = \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$ (Bellman optimality for $Q^*$)
Key Insight: The Bellman optimality equation is a system of nonlinear equations, one per state. For finite MDPs it has a unique solution $V^*$, but solving it directly requires knowing the model $P$ and $R$; even the linear Bellman expectation system for a fixed policy costs $O(|\mathcal{S}|^3)$ to solve exactly via matrix inversion. This motivates approximate and model-free methods.

Bellman Operator

Define the Bellman operator $\mathcal{T}^\pi$ mapping functions $V:\mathcal{S}\to\mathbb{R}$ to functions of the same type:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V(s')\right]$$

$V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$: $\mathcal{T}^\pi V^\pi = V^\pi$. The operator is a $\gamma$-contraction in the $\ell_\infty$ norm:

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi V'\|_\infty \leq \gamma \|V - V'\|_\infty$$

By the Banach fixed-point theorem, repeated application of $\mathcal{T}^\pi$ converges geometrically to $V^\pi$, regardless of initialization. This is the theoretical guarantee behind iterative algorithms like policy evaluation.

Dynamic Programming

Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies given a perfect model of the MDP. They exploit the Bellman equations iteratively. DP is foundational — while impractical at scale (it requires full model knowledge and is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep), it defines the target that all approximate RL methods pursue.

Policy Evaluation (Prediction)

Given a policy $\pi$, compute $V^\pi$ by iterating the Bellman expectation operator until convergence:

Algorithm: Iterative Policy Evaluation
  Input: policy π, threshold θ > 0
  Initialize: V(s) ← 0 for all s ∈ 𝒮
  Repeat:
      Δ ← 0
      For each s ∈ 𝒮:
          v ← V(s)
          V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
  Until Δ < θ
  Return V ≈ V^π
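
The sweep translates directly to NumPy. A sketch assuming the toy P, R, gamma, and pi arrays defined earlier:

import numpy as np

def policy_evaluation(pi, P, R, gamma, theta=1e-8):
    """Iterate the Bellman expectation backup until the sup-norm change < theta."""
    V = np.zeros(P.shape[0])
    while True:
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = (pi * Q).sum(axis=1)     # (T^pi V)(s)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new                 # approx V^pi
        V = V_new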

Policy Improvement

Given $V^\pi$, we can construct a policy $\pi'$ that is at least as good by acting greedily:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

The Policy Improvement Theorem guarantees $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$. If no improvement is possible, $\pi = \pi'$, and by the Bellman optimality equation, $\pi = \pi^*$.

Policy Iteration

Alternating between policy evaluation and improvement converges to the optimal policy in a finite number of steps (since there are finitely many policies):

Algorithm: Policy Iteration
  Initialize: π arbitrarily (e.g., random)
  Repeat:
      // Policy Evaluation
      V ← PolicyEvaluation(π)
      // Policy Improvement
      policy_stable ← true
      For each s ∈ 𝒮:
          old_action ← π(s)
          π(s) ← argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          If old_action ≠ π(s): policy_stable ← false
  Until policy_stable
  Return π = π*, V = V*

Value Iteration

Policy iteration requires a full policy evaluation sweep before each improvement. Value Iteration combines the two into a single update per state by using the Bellman optimality operator:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s'|s,a)\!\left[R(s,a,s') + \gamma V_k(s')\right]$$
Algorithm: Value Iteration
  Initialize: V(s) ← 0 for all s; threshold θ > 0
  Repeat:
      Δ ← 0
      For each s ∈ 𝒮:
          v ← V(s)
          V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
          Δ ← max(Δ, |v − V(s)|)
  Until Δ < θ
  Output: π*(s) = argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]
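
The same structure with a max in place of the policy average gives value iteration; a sketch over the toy arrays:

import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Bellman optimality backups; returns approx V* and the greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)   # pi*(s) = argmax_a Q(s,a)
        V = V_new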
[Interactive demo: value iteration on a gridworld with a goal (+1), a pit (−1), and walls; cell shading runs from low to high value, and the display tracks the iteration count and the maximum Δ per sweep.]

Monte Carlo Methods

Monte Carlo (MC) methods learn from complete episodes of experience without requiring a model of the environment. Instead of computing expected values analytically, MC estimates them by averaging actual returns sampled from interactions.

Key Assumption: Monte Carlo requires episodic tasks; the environment must eventually terminate so the full return $G_t$ can be computed. MC cannot be applied directly to continuing tasks.

First-Visit vs. Every-Visit MC

For estimating $V^\pi(s)$, we average returns following visits to $s$. Two variants:

Method | Which visits count? | Property
First-visit MC | Only the first visit to $s$ in each episode | Unbiased; samples are i.i.d.
Every-visit MC | Every visit to $s$ in each episode | Biased but consistent; more data-efficient

Both converge to $V^\pi(s)$ as the number of visits $\to \infty$ (by the law of large numbers).

Algorithm: First-Visit MC Prediction
  Initialize: V(s) ← 0, Returns(s) ← [] for all s
  For each episode:
      Generate episode: S_0, A_0, R_1, S_1, A_1, R_2, …, S_{T−1}, A_{T−1}, R_T
      G ← 0
      For t = T−1, T−2, …, 0:
          G ← γG + R_{t+1}
          If S_t does not appear in S_0, …, S_{t−1}:
              Append G to Returns(S_t)
              V(S_t) ← mean(Returns(S_t))
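
A runnable sketch on the toy MDP, with the extra assumption (made purely so episodes terminate) that state 1 is terminal:

import numpy as np

rng = np.random.default_rng(0)

def sample_episode(pi, P, R, s0=0, terminal=1, max_steps=100):
    """Roll out one episode; returns aligned lists of states and rewards."""
    states, rewards, s = [], [], s0
    for _ in range(max_steps):
        a = rng.choice(pi.shape[1], p=pi[s])
        s2 = rng.choice(P.shape[2], p=P[s, a])
        states.append(s)
        rewards.append(R[s, a, s2])
        s = s2
        if s == terminal:
            break
    return states, rewards

def first_visit_mc(pi, P, R, gamma, episodes=2000):
    returns = {s: [] for s in range(P.shape[0])}
    for _ in range(episodes):
        states, rewards = sample_episode(pi, P, R)
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            if states[t] not in states[:t]:      # first visit only
                returns[states[t]].append(G)
    return {s: np.mean(g) for s, g in returns.items() if g}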

MC Control: Exploring Starts

For control (finding the optimal policy), we estimate $Q^\pi(s,a)$ and improve the policy. The exploring starts assumption ensures all state-action pairs are visited:

Algorithm: MC Control with Exploring Starts
  Initialize: π(s) arbitrary; Q(s,a) ← 0; Returns(s,a) ← [] for all s, a
  For each episode:
      Choose S_0, A_0 randomly (exploring starts)
      Generate episode using π from S_0, A_0
      G ← 0
      For t = T−1, …, 0:
          G ← γG + R_{t+1}
          If (S_t, A_t) does not appear in (S_0, A_0), …, (S_{t−1}, A_{t−1}):   // first visit
              Append G to Returns(S_t, A_t)
              Q(S_t, A_t) ← mean(Returns(S_t, A_t))
              π(S_t) ← argmax_a Q(S_t, a)

Incremental Update Rule

Instead of storing all returns, we can update incrementally using a running mean:

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left[G_t - V(S_t)\right]$$

In non-stationary environments, we use a constant step-size $\alpha$ to weight recent returns more heavily:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$

This step-size form is the prototype for all RL update rules. The quantity $[G_t - V(S_t)]$ is the prediction error — how wrong our current estimate was.

Bias–Variance Trade-off

MC uses the true (sampled) return $G_t$, making it unbiased. However, $G_t = \sum_{k} \gamma^k R_{t+k+1}$ is a sum of many random variables, giving it high variance. This is the fundamental tension in RL estimation that TD methods partially resolve.

Temporal Difference Learning

Temporal Difference (TD) learning is arguably the most important idea in RL. It combines the sampling advantage of MC (no model required) with the bootstrapping of DP (updates from estimates, not full returns). TD methods can learn from incomplete episodes, online, at every step.

TD(0): One-Step TD Prediction

Instead of waiting for the full return $G_t$, TD(0) bootstraps using the immediate reward and the current estimate of the next state's value:

$$\underbrace{V(S_t)}_{\text{old estimate}} \leftarrow V(S_t) + \alpha \underbrace{\left[\underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t)\right]}_{\text{TD error } \delta_t}$$ (TD(0) update)

The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is a one-step prediction error. It is zero when $V$ is self-consistent — i.e., when $V = V^\pi$.

• TD error ($\delta_t$): $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$; the signal driving all TD learning.
• TD target ($R_{t+1} + \gamma V(S_{t+1})$): a bootstrap estimate of the true return; one step of the Bellman backup.
• Learning rate ($\alpha$): controls how quickly estimates are updated; standard choice $\alpha \in (0,1)$.
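
The entire method fits in a few lines. A sketch of one tabular TD(0) backup (V is a NumPy array or list of state values):

def td0_update(V, s, r, s2, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) backup and return the TD error."""
    target = r + (0.0 if terminal else gamma * V[s2])   # TD target
    delta = target - V[s]                               # TD error
    V[s] += alpha * delta
    return delta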

SARSA: On-Policy TD Control

SARSA (State, Action, Reward, State', Action') updates Q-values using transitions $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ where $A_{t+1}$ is selected by the same policy being learned:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$ (SARSA)

Because SARSA uses the action actually taken by the policy ($A_{t+1} \sim \pi$), it is on-policy: it learns about the policy it is currently executing, including its exploratory actions. With an $\varepsilon$-greedy policy and decaying $\varepsilon$, SARSA converges to the optimal policy.

Q-Learning: Off-Policy TD Control

Q-Learning decouples the behavior policy (used to generate data) from the target policy (being learned) by maximizing over next-state actions:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$ (Q-Learning, Watkins 1989)

By using $\max_{a'} Q(S_{t+1}, a')$ as the target, Q-Learning directly approximates $Q^*$, independent of the behavior policy. This is the key advantage: data collected by any policy (including a random one) can be used to learn the optimal Q-function.

Algorithm: Q-Learning (Tabular)
  Initialize: Q(s,a) ← 0 for all s, a; Q(terminal, ·) ← 0
  For each episode:
      Initialize S
      Repeat (for each step):
          Choose A from S using ε-greedy(Q)
          Take action A; observe R, S'
          Q(S,A) ← Q(S,A) + α [R + γ max_a Q(S',a) − Q(S,A)]
          S ← S'
      Until S is terminal
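
A runnable tabular sketch on the toy MDP, again treating state 1 as terminal:

import numpy as np

rng = np.random.default_rng(0)

def q_learning(P, R, gamma=0.9, alpha=0.1, eps=0.1, episodes=500, terminal=1):
    nS, nA = P.shape[0], P.shape[1]
    Q = np.zeros((nS, nA))
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # epsilon-greedy behavior policy
            a = int(rng.integers(nA)) if rng.random() < eps else int(Q[s].argmax())
            s2 = rng.choice(nS, p=P[s, a])
            target = R[s, a, s2] + gamma * Q[s2].max() * (s2 != terminal)
            Q[s, a] += alpha * (target - Q[s, a])       # off-policy max backup
            s = s2
    return Q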

Expected SARSA

A middle ground that uses the expected value over next actions rather than a sample:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\!\left[R_{t+1} + \gamma \sum_{a'} \pi(a'|S_{t+1}) Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$

SARSA vs. Q-Learning: The Cliff Example

Algorithm | Policy type | Learns | Cliff behavior
SARSA | On-policy | $Q^{\pi_\varepsilon}$ (ε-greedy optimal) | Safe path (avoids cliff due to ε-greedy)
Q-Learning | Off-policy | $Q^*$ (true optimal) | Optimal path (cliff edge, risky)
Expected SARSA | Both | $Q^*$ or $Q^\pi$ | Optimal path; lower variance than SARSA

TD(λ): Eligibility Traces

TD(λ) unifies MC and TD by looking $n$ steps ahead and blending all $n$-step returns:

$$G_t^{(\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

where $G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$ is the $n$-step return. This is implemented efficiently with eligibility traces $z_t(s)$:

$$z_t(s) = \gamma\lambda\, z_{t-1}(s) + \mathbf{1}[S_t = s], \qquad V(s) \leftarrow V(s) + \alpha\,\delta_t\, z_t(s)$$

When $\lambda = 0$, TD($\lambda$) reduces to TD(0); when $\lambda = 1$, it recovers MC (for episodic tasks). Intermediate values such as $\lambda \approx 0.9$ often work well in practice.
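
Per time step, the trace mechanics are three lines. A sketch (V and z are NumPy arrays over states; delta is the TD error from the TD(0) update above):

def td_lambda_step(V, z, s, delta, alpha=0.1, gamma=0.9, lam=0.9):
    """Accumulating-trace TD(lambda): decay traces, bump s, update all states."""
    z *= gamma * lam        # z_t(s) = gamma * lambda * z_{t-1}(s) ...
    z[s] += 1.0             # ... + 1[S_t = s]
    V += alpha * delta * z  # every state moves in proportion to its trace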

[Interactive demo: a Q-learning agent on a gridworld with a start cell, a goal (+10), a pit (−5), and walls; arrows mark each state's current best action, and the display tracks episodes, average reward, and steps per episode.]

Policy Gradient Methods

Value-based methods like Q-Learning find the optimal policy indirectly by learning a value function and then acting greedily. Policy gradient methods optimize the policy directly, parameterizing it as $\pi_\theta(a|s)$ and following the gradient of expected return with respect to $\theta$.

The Policy Gradient Objective

Define the performance objective $J(\theta)$ as the expected return under policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[G_0\right] = \sum_{s} d^{\pi_\theta}(s) \sum_a \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)$$

where $d^{\pi_\theta}(s)$ is the stationary state distribution under $\pi_\theta$ (discounted or undiscounted, depending on the setting).

The Policy Gradient Theorem

Computing $\nabla_\theta J(\theta)$ is non-trivial because $d^{\pi_\theta}$ also depends on $\theta$. The Policy Gradient Theorem (Sutton et al., 1999) provides a clean expression:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(A_t | S_t) \cdot Q^{\pi_\theta}(S_t, A_t)\right]$$
Intuition: The policy gradient increases the log-probability of actions in proportion to their Q-value. Good actions (high $Q$) are made more likely; bad actions (low $Q$) are suppressed. Crucially, the derivative of $d^{\pi_\theta}$ does not appear in the formula; it cancels out.

The Log-Derivative Trick (Score Function Estimator)

The key identity enabling unbiased gradient estimation:

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}$$

This lets us rewrite $\nabla_\theta \mathbb{E}_{\pi_\theta}[f(a)]$ as $\mathbb{E}_{\pi_\theta}\!\left[f(a)\,\nabla_\theta \log \pi_\theta(a|s)\right]$, an expectation under the policy itself, which can therefore be estimated by Monte Carlo sampling from trajectories.

REINFORCE Algorithm

The simplest policy gradient method, using Monte Carlo returns as the baseline-free estimate of $Q^{\pi_\theta}$:

$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t|S_t) \cdot G_t$$
Algorithm: REINFORCE
  Initialize: θ (policy parameters)
  For each episode:
      Generate trajectory τ = (S_0, A_0, R_1, …, S_T) following π_θ
      For t = 0, 1, …, T−1:
          G_t ← Σ_{k=t}^{T−1} γ^{k−t} R_{k+1}   // actual return from t
          θ ← θ + α γ^t G_t ∇_θ log π_θ(A_t | S_t)
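
A compact tabular sketch with a softmax policy theta[s, a] on the toy MDP (state 1 terminal, as before); for the softmax, grad log pi(a|s) is one-hot(a) minus pi(.|s), as derived in the softmax section below:

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(P, R, gamma=0.9, alpha=0.05, episodes=2000, terminal=1):
    theta = np.zeros((P.shape[0], P.shape[1]))   # action preferences h(s,a)
    for _ in range(episodes):
        traj, s = [], 0
        while s != terminal and len(traj) < 100:
            probs = softmax(theta[s])
            a = rng.choice(len(probs), p=probs)
            s2 = rng.choice(P.shape[2], p=P[s, a])
            traj.append((s, a, R[s, a, s2]))
            s = s2
        G = 0.0
        for t, (s, a, r) in reversed(list(enumerate(traj))):
            G = r + gamma * G                    # return from step t
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0                   # one-hot(a) - pi(.|s)
            theta[s] += alpha * (gamma ** t) * G * grad_log
    return theta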

Variance Reduction: Baselines

REINFORCE has high variance because $G_t$ varies widely across episodes. We can subtract any baseline $b(s)$ that does not depend on $a$ without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(A_t|S_t)\cdot\left(G_t - b(S_t)\right)\right]$$ (valid because $\mathbb{E}[\nabla \log \pi \cdot b] = 0$)

The optimal baseline that minimizes variance is approximately $b^*(s) \approx V^{\pi_\theta}(s)$, giving the advantage $A^{\pi_\theta}(S_t, A_t) = G_t - V^{\pi_\theta}(S_t)$ as the gradient weight. This is the key insight behind Actor-Critic methods.

The Softmax (Gibbs) Policy

For discrete actions, a common parameterization with preference function $h_\theta(s,a) \in \mathbb{R}$:

$$\pi_\theta(a|s) = \frac{e^{h_\theta(s,a)}}{\sum_{a'} e^{h_\theta(s,a')}}, \qquad \nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta h_\theta(s,a) - \sum_{a'}\pi_\theta(a'|s)\nabla_\theta h_\theta(s,a')$$

Gaussian Policy for Continuous Actions

For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^d$:

$$\pi_\theta(a|s) = \mathcal{N}\!\left(\mu_\theta(s),\,\sigma_\theta^2(s)\right), \qquad \nabla_\theta \log \pi_\theta(a|s) = \frac{(a - \mu_\theta(s))}{\sigma_\theta^2(s)}\nabla_\theta \mu_\theta(s)$$

(The gradient shown is with respect to the mean parameters; if $\sigma_\theta$ is also learned, its log-likelihood term contributes an additional component.)

Actor-Critic Methods

Actor-Critic methods maintain two function approximators: the actor $\pi_\theta(a|s)$ (the policy) and the critic $V_w(s)$ (the value function). The critic evaluates the current policy, and the actor uses the critic's feedback to update the policy via policy gradients.

• Actor ($\pi_\theta$): the policy. Updated by the policy gradient, weighted by the advantage from the critic.
• Critic ($V_w$): the value-function estimate. Updated by TD learning; provides a low-variance gradient signal.
• Advantage estimate ($\hat{A}_t$): $\delta_t = R_{t+1}+\gamma V_w(S_{t+1})-V_w(S_t)$. Used to weight the policy gradient.

The One-Step Actor-Critic Update

$$\begin{aligned} &\delta_t = R_{t+1} + \gamma V_w(S_{t+1}) - V_w(S_t) \qquad \text{(TD error)}\\ &w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(S_t) \qquad \text{(critic update)}\\ &\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(A_t|S_t) \qquad \text{(actor update)} \end{aligned}$$ (one-step A-C)
Algorithm: One-Step Actor-Critic
  Initialize: θ (actor), w (critic), step sizes α_θ, α_w
  For each episode:
      Initialize S
      I ← 1
      Repeat (for each step):
          A ~ π_θ(·|S)
          Take A; observe R, S'
          δ ← R + γ V_w(S') − V_w(S)    // V_w(S') ≡ 0 if S' is terminal
          w ← w + α_w δ ∇_w V_w(S)
          θ ← θ + α_θ I δ ∇_θ log π_θ(A|S)
          I ← γI; S ← S'
      Until S is terminal

Generalized Advantage Estimation (GAE)

GAE (Schulman et al., 2015) provides a smooth bias-variance trade-off for the advantage estimate using an exponentially weighted average of k-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

When $\lambda = 0$: one-step TD advantage. When $\lambda = 1$: Monte Carlo advantage. GAE with $\lambda \approx 0.95$ is the standard in modern deep RL (PPO, A3C, TRPO).
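
The estimator is a single backward pass over a trajectory. A sketch (values carries one extra bootstrap entry for $V(S_T)$):

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute A_hat_t = sum_l (gamma*lam)^l delta_{t+l} by backward recursion."""
    T = len(rewards)
    adv, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv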

A3C: Asynchronous Advantage Actor-Critic

Mnih et al. (2016) proposed running multiple actor-critic agents in parallel, each interacting with its own copy of the environment. Asynchronous gradient updates decorrelate experience without a replay buffer:

$$\mathcal{L}(\theta) = -\mathbb{E}\!\left[\sum_t \log \pi_\theta(A_t|S_t)\hat{A}_t - \beta H[\pi_\theta(\cdot|S_t)]\right]$$

The entropy bonus $H[\pi_\theta] = -\sum_a \pi_\theta(a|s)\log\pi_\theta(a|s)$ encourages exploration by penalizing overly deterministic policies.

Deep Reinforcement Learning

Deep RL combines neural network function approximation with RL algorithms. The state space is no longer tabulated — instead, a neural network $f_\theta : \mathcal{S} \to \mathbb{R}^{|\mathcal{A}|}$ approximates $Q^*$ or $\pi^*$. This enables RL to scale to high-dimensional inputs like images, text, or continuous vectors.

DQN: Deep Q-Network

Mnih et al. (2013/2015) showed that Q-Learning with neural networks is unstable due to correlated updates and a non-stationary target. Two key stabilization tricks:

• Experience replay ($\mathcal{D}$): store transitions $(s,a,r,s')$ in a replay buffer $\mathcal{D}$ and sample random minibatches to break temporal correlations.
• Target network ($\theta^-$): a periodically frozen copy of the Q-network used to compute targets, reducing oscillation.

The DQN loss function minimizes the mean-squared Bellman error:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\left(r + \gamma \max_{a'} Q(s', a';\, \theta^-) - Q(s, a;\, \theta)\right)^2\right]$$ (DQN loss)
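
Stripped of the networks themselves, the target and loss are simple array operations. A NumPy sketch in which q_next stands in for the frozen target network's outputs and done marks terminal transitions (both names are assumptions of this sketch):

import numpy as np

def dqn_targets(r, q_next, done, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminal states."""
    return r + gamma * q_next.max(axis=1) * (1.0 - done)

def dqn_loss(q_pred, actions, y):
    """Mean-squared Bellman error on the actions actually taken."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return np.mean((y - chosen) ** 2)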

Improvements to DQN

Variant | Key idea | Motivation
Double DQN | Separate networks for action selection vs. evaluation | Reduces overestimation bias in Q-targets
Dueling DQN | $Q(s,a;\theta) = V(s;\theta) + A(s,a;\theta) - \bar{A}$ | More efficient learning of $V$ when actions don't matter
PER | Sample transitions with prob $\propto |\delta_t|^\alpha$ | More frequent updates on surprising transitions
Rainbow | All of the above + n-step + distributional | State-of-the-art sample efficiency

PPO: Proximal Policy Optimization

PPO (Schulman et al., 2017) is the dominant policy gradient algorithm for practical deep RL. It addresses the challenge that large policy gradient steps can catastrophically change the policy. The clipped objective constrains how much $\pi_\theta$ can change from the old policy $\pi_{\theta_\text{old}}$:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$ (PPO-Clip)

where $r_t(\theta) = \dfrac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}$ is the probability ratio. The clipping prevents the ratio from going too far from 1, ensuring stable updates without the complexity of a trust region constraint.
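
The clipped surrogate itself is a one-liner over a batch. A NumPy sketch (log-probabilities and advantage estimates are assumed to be given):

import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))   # maximize this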

SAC: Soft Actor-Critic

SAC (Haarnoja et al., 2018) augments the reward with an entropy term to encourage maximum-entropy policies:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\!\left[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right]$$

The temperature parameter $\alpha$ controls the exploration-exploitation trade-off. SAC is off-policy (uses a replay buffer), sample-efficient, and robust to hyperparameters, making it the standard for continuous control tasks.

Landscape of RL Algorithms

Algorithm | Type | Action space | Key property
DQN | Value-based, off-policy | Discrete | Image inputs, Atari
DDPG | Actor-Critic, off-policy | Continuous | Deterministic policy gradient
TD3 | Actor-Critic, off-policy | Continuous | Clipped double Q, delayed actor
SAC | Actor-Critic, off-policy | Continuous | Entropy regularization
PPO | Policy gradient, on-policy | Both | Stable, scalable, default choice
TRPO | Policy gradient, on-policy | Both | Trust region, monotonic improvement
AlphaZero | Model-based + MCTS | Discrete | Self-play, planning

Exploration vs. Exploitation

The exploration-exploitation dilemma is fundamental to RL: should the agent exploit its current knowledge to maximize immediate reward, or explore to gather information that might lead to better long-term returns? This tension appears in every RL setting.

ε-Greedy Policy

The simplest exploration strategy: with probability $\varepsilon$ take a random action, otherwise take the greedy action.

$$\pi_\varepsilon(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & a = \arg\max_{a'} Q(s,a') \\[6pt] \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
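
In code, for a tabular Q (a minimal sketch):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, eps=0.1):
    """With probability eps explore uniformly; otherwise act greedily."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())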

The Multi-Armed Bandit

The bandit problem is the simplest RL setting: $|\mathcal{S}| = 1$, $|\mathcal{A}| = k$ arms, each with unknown reward distribution $R_i \sim P_i$. Goal: maximize total reward over $T$ steps. Define regret as the gap between the optimal strategy and the chosen one:

$$\text{Regret}(T) = T \cdot Q^*(a^*) - \sum_{t=1}^{T} \mathbb{E}[R_t] = \sum_{a} \Delta_a\, \mathbb{E}[N_a(T)]$$

where $\Delta_a = Q^*(a^*) - Q^*(a)$ is the gap for suboptimal arm $a$, and $N_a(T)$ is the number of times arm $a$ was pulled.

UCB: Upper Confidence Bound

UCB strikes the exploration-exploitation balance using optimism: select the arm that has the highest upper confidence bound on its true mean reward:

$$A_t = \arg\max_a \left[\hat{Q}(a) + c\sqrt{\frac{\ln t}{N_a(t)}}\right]$$

The first term is the estimated reward; the second is the uncertainty bonus. Rarely-sampled arms get large bonuses, ensuring systematic exploration. UCB1 achieves regret $O(\log T)$, which is asymptotically optimal (Lai-Robbins lower bound).
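
A sketch of the selection rule (it assumes every arm has already been pulled once, so counts ≥ 1 and the bonus is defined):

import numpy as np

def ucb_action(q_hat, counts, t, c=2.0):
    """Pick the arm with the largest optimistic upper bound."""
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_hat + bonus))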

Thompson Sampling

A Bayesian exploration strategy: maintain a posterior distribution over arm rewards $P_a | \text{data}$, sample a reward estimate from each posterior, then act greedily on the samples:

Algorithm: Thompson Sampling (Beta-Bernoulli)
  Initialize: α_a = 1, β_a = 1 for each arm a   // uniform prior
  For t = 1, 2, …, T:
      For each arm a: sample θ̃_a ~ Beta(α_a, β_a)
      Pull arm A_t = argmax_a θ̃_a
      Observe reward R_t ∈ {0, 1}
      If R_t = 1: α_{A_t} ← α_{A_t} + 1
      Else: β_{A_t} ← β_{A_t} + 1
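
The same loop in NumPy; the true arm means below are invented for the demo and hidden from the agent:

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.6, 0.3, 0.7])   # unknown to the agent
alpha = np.ones(5)                                  # Beta(1,1) = uniform prior
beta = np.ones(5)

for t in range(10_000):
    samples = rng.beta(alpha, beta)     # one draw from each arm's posterior
    a = int(samples.argmax())           # act greedily on the samples
    r = 1.0 if rng.random() < true_means[a] else 0.0
    alpha[a] += r                       # Bernoulli posterior update
    beta[a] += 1.0 - r

print(alpha / (alpha + beta))           # arm 4's posterior mean approaches 0.7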
[Interactive demo: a 5-armed bandit with unknown true means, comparing the cumulative regret of ε-greedy, UCB, and Thompson Sampling.]

Intrinsic Motivation & Curiosity

In environments with sparse rewards, pure ε-greedy exploration is insufficient. Modern approaches augment the reward with an intrinsic bonus that rewards visiting novel or uncertain states:

$$\tilde{r}_t = r_t + \beta \cdot \underbrace{b(s_t)}_{\text{intrinsic bonus}}$$

Common intrinsic bonus designs include count-based novelty bonuses ($b(s) \propto 1/\sqrt{N(s)}$), prediction-error curiosity such as the Intrinsic Curiosity Module (ICM), and Random Network Distillation (RND), which rewards states where a fixed random target network is still poorly predicted.

Summary: Key RL Concepts

Concept | Symbol | Description
State | $s \in \mathcal{S}$ | Current situation of the environment
Action | $a \in \mathcal{A}$ | Choice made by the agent
Reward | $r_t \in \mathbb{R}$ | Immediate feedback signal
Return | $G_t = \sum_k \gamma^k r_{t+k+1}$ | Cumulative discounted reward
Policy | $\pi(a|s)$ | Behavior of the agent
State-value | $V^\pi(s)$ | Expected return from state $s$ under $\pi$
Action-value | $Q^\pi(s,a)$ | Expected return from $(s,a)$ under $\pi$
Advantage | $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ | Relative value of action $a$ vs. average
TD error | $\delta_t = r + \gamma V(s') - V(s)$ | One-step prediction error
Bellman operator | $\mathcal{T}^\pi V$ | One-step backup; fixed point $= V^\pi$