What Is Reinforcement Learning?
Reinforcement Learning (RL) is a computational framework for sequential decision making under uncertainty. An agent interacts with an environment, taking actions to maximize the cumulative reward it receives over time. Unlike supervised learning — where a teacher labels every example — the agent must discover good behavior purely through trial, error, and delayed feedback.
Three defining properties separate RL from other learning paradigms:
- There is no supervisor: the agent receives only a scalar reward signal, never the correct action.
- Feedback is delayed: the consequences of an action may only become apparent many steps later.
- The agent's actions shape the data it sees: experience is sequential, not independent and identically distributed.
The Agent–Environment Interface
At each discrete time step $t = 0, 1, 2, \ldots$, the agent observes state $S_t \in \mathcal{S}$, selects action $A_t \in \mathcal{A}(S_t)$, and the environment responds with a new state $S_{t+1}$ and scalar reward $R_{t+1} \in \mathbb{R}$.
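To make the interface concrete, here is a minimal interaction loop sketched against the Gymnasium API; the `CartPole-v1` environment, the random action choice, and the step budget are illustrative placeholders rather than part of the text above.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)              # observe S_0
total_reward = 0.0

for t in range(200):
    action = env.action_space.sample()     # stand-in for sampling A_t ~ pi(. | S_t)
    obs, reward, terminated, truncated, info = env.step(action)  # receive S_{t+1}, R_{t+1}
    total_reward += reward
    if terminated or truncated:            # episode ended; start a new one
        obs, info = env.reset()

env.close()
```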
The Return
The agent's objective is to maximize the expected return $G_t$, the discounted sum of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The discount factor $\gamma \in [0,1)$ serves two purposes: it ensures mathematical convergence when the horizon is infinite, and it encodes the idea that immediate rewards are more valuable than distant ones. When $\gamma = 0$, the agent is myopic, caring only about immediate reward. As $\gamma \to 1$, the agent becomes fully far-sighted.
Note the elegant recursive structure: $G_t = R_{t+1} + \gamma G_{t+1}$. This recursion is the engine behind every RL algorithm.
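The recursion also suggests how returns are computed in practice: a single backward pass over an episode's rewards. A minimal sketch in plain Python (the reward list and discount are arbitrary example values):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every t via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):   # walk the episode backwards
        g = rewards[t] + gamma * g            # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))
```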
Markov Decision Processes
Almost all RL problems are formalized as Markov Decision Processes (MDPs). An MDP is a mathematical framework for modeling sequential decision making when outcomes are partly random and partly under the control of a decision maker.
An MDP is a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:
- $\mathcal{S}$ — set of states
- $\mathcal{A}$ — set of actions
- $P(s' \mid s, a)$ — transition probability: probability of landing in state $s'$ after taking action $a$ in state $s$
- $R(s, a, s')$ — reward function: expected immediate reward received after taking action $a$ in state $s$ and landing in $s'$
- $\gamma \in [0,1)$ — discount factor
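As a concrete illustration of the 5-tuple above, a tiny made-up MDP can be written down directly as plain Python dictionaries; the state and action names, probabilities, and rewards are invented for the example.

```python
# Two states and two actions of a toy, made-up MDP.
states  = ["cool", "hot"]
actions = ["wait", "work"]

# P[s][a] lists (next_state, probability) pairs; each list sums to 1.
P = {
    "cool": {"wait": [("cool", 1.0)],
             "work": [("cool", 0.7), ("hot", 0.3)]},
    "hot":  {"wait": [("cool", 0.6), ("hot", 0.4)],
             "work": [("hot", 1.0)]},
}

# R[s][a][s'] is the expected immediate reward for that transition.
R = {
    "cool": {"wait": {"cool": 0.0},
             "work": {"cool": 2.0, "hot": 2.0}},
    "hot":  {"wait": {"cool": 0.0, "hot": 0.0},
             "work": {"hot": -1.0}},
}

gamma = 0.9  # discount factor
```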
The Markov Property
The key assumption is the Markov property: the future depends on the present only through the current state, not the history of how we got there.
In other words: $S_t$ is a sufficient statistic for all past experience. The current state captures everything relevant about the past.
Policies
A policy $\pi$ specifies how the agent chooses actions. It can be deterministic, $a = \pi(s)$, or stochastic, $A_t \sim \pi(\cdot \mid S_t)$.
A policy $\pi$ is a probability distribution over actions given states, satisfying $\sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$ for all $s \in \mathcal{S}$. When we say an agent "follows policy $\pi$," we mean it draws actions $A_t \sim \pi(\cdot \mid S_t)$ at each step.
Finite vs. Infinite Horizon
| Setting | Definition | Return |
|---|---|---|
| Episodic | Episodes end at terminal states | $G_t = \sum_{k=0}^{T-t-1} R_{t+k+1}$ |
| Continuing (discounted) | No terminal state, $\gamma < 1$ | $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ |
| Average reward | No discount, ergodic chain | $r(\pi) = \lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} R_t$ |
Value Functions
Value functions answer the question: how good is it to be in a given state (or to take a given action from a given state)? They quantify the expected long-term reward under a particular policy, providing the key signal for learning and planning.
State-Value Function $V^\pi$
The state-value function under policy $\pi$ is the expected return when starting in state $s$ and following policy $\pi$ thereafter:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$
Action-Value Function $Q^\pi$
The action-value function (or Q-function) is the expected return when taking action $a$ in state $s$, then following policy $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
Relationship Between $V$ and $Q$
The two value functions are related by a marginalization over the policy:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^\pi(s,a)$$
And conversely, $Q$ can be expressed in terms of $V$:

$$Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
Optimal Value Functions
A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \geq V^\pi(s)$ for all $s \in \mathcal{S}$ and all policies $\pi$. The optimal value functions are:

$$V^*(s) = \max_\pi V^\pi(s), \qquad Q^*(s,a) = \max_\pi Q^\pi(s,a)$$
Crucially, for finite MDPs, an optimal policy always exists and can be found by acting greedily with respect to $Q^*$:

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s,a)$$
Advantage Function
The advantage function measures how much better action $a$ is compared to the average action under the current policy:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
$A^\pi(s,a) > 0$ means action $a$ is better than average; $A^\pi(s,a) < 0$ means worse than average. Advantage is central to policy gradient methods (PPO, A3C) because it reduces variance in gradient estimates.
Bellman Equations
The Bellman equations are the fundamental recursive decompositions of value functions. They state that the value of a state equals the immediate reward plus the discounted value of the next state. Every practical RL algorithm exploits this structure.
Bellman Expectation Equation
Expanding $V^\pi$ using the definition of return:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
Bellman Optimality Equations
For the optimal value functions, we take the maximum over actions instead of the expectation under $\pi$:

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

$$Q^*(s,a) = \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]$$
Bellman Operator
Define the Bellman operator $\mathcal{T}^\pi$ mapping functions $V:\mathcal{S}\to\mathbb{R}$ to functions of the same type:

$$(\mathcal{T}^\pi V)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V(s')\right]$$
$V^\pi$ is the unique fixed point of $\mathcal{T}^\pi$: $\mathcal{T}^\pi V^\pi = V^\pi$. The operator is a $\gamma$-contraction in the $\ell_\infty$ norm:

$$\|\mathcal{T}^\pi V - \mathcal{T}^\pi U\|_\infty \leq \gamma\, \|V - U\|_\infty$$
By the Banach fixed-point theorem, repeated application of $\mathcal{T}^\pi$ converges geometrically to $V^\pi$, regardless of initialization. This is the theoretical guarantee behind iterative algorithms like policy evaluation.
Dynamic Programming
Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies given a perfect model of the MDP. They exploit the Bellman equations iteratively. DP is foundational — while impractical at scale (it requires full model knowledge and is $O(|\mathcal{S}|^2 |\mathcal{A}|)$ per sweep), it defines the target that all approximate RL methods pursue.
Policy Evaluation (Prediction)
Given a policy $\pi$, compute $V^\pi$ by iterating the Bellman expectation operator until convergence:

$$V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V_k(s')\right] \quad \text{for all } s \in \mathcal{S}$$
Policy Improvement
Given $V^\pi$, we can construct a policy $\pi'$ that is at least as good by acting greedily:

$$\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right] = \arg\max_{a} Q^\pi(s,a)$$
The Policy Improvement Theorem guarantees $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$. If no improvement is possible, $\pi = \pi'$, and by the Bellman optimality equation, $\pi = \pi^*$.
Policy Iteration
Alternating between policy evaluation and improvement converges to the optimal policy in a finite number of steps (since there are finitely many policies):

$$\pi_0 \xrightarrow{\text{E}} V^{\pi_0} \xrightarrow{\text{I}} \pi_1 \xrightarrow{\text{E}} V^{\pi_1} \xrightarrow{\text{I}} \pi_2 \xrightarrow{\text{E}} \cdots \xrightarrow{\text{I}} \pi^* \xrightarrow{\text{E}} V^*$$

where E denotes a policy evaluation step and I a policy improvement step.
Value Iteration
Policy iteration requires a full policy evaluation sweep before each improvement. Value Iteration combines the two into a single update per state by using the Bellman optimality operator:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s,a)\left[R(s,a,s') + \gamma V_k(s')\right]$$
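A minimal sketch of this backup with NumPy, run on a small randomly generated MDP stored as transition and reward tensors; the sizes, seed, and convergence threshold are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# P[s, a, s'] = transition probability; each (s, a) row is normalised to sum to 1.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
# R[s, a, s'] = expected immediate reward (arbitrary values for the demo).
R = rng.normal(size=(n_states, n_actions, n_states))

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma * V(s')]
    Q = np.einsum("ijk,ijk->ij", P, R + gamma * V)
    V_new = Q.max(axis=1)                      # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:       # stop when the sweep changes nothing
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)                     # greedy policy w.r.t. the converged values
print("V* ~", np.round(V, 3), "greedy actions:", pi_star)
```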
Monte Carlo Methods
Monte Carlo (MC) methods learn from complete episodes of experience without requiring a model of the environment. Instead of computing expected values analytically, MC estimates them by averaging actual returns sampled from interactions.
First-Visit vs. Every-Visit MC
For estimating $V^\pi(s)$, we average returns following visits to $s$. Two variants:
| Method | Which visit to count? | Property |
|---|---|---|
| First-visit MC | Only the first visit to $s$ in each episode | Unbiased; samples are i.i.d. |
| Every-visit MC | Every visit to $s$ in each episode | Biased but consistent; more data efficient |
Both converge to $V^\pi(s)$ as the number of visits $\to \infty$ (by the law of large numbers).
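A sketch of first-visit MC prediction under the assumption that episodes are supplied as lists of `(state, reward)` pairs generated by some fixed policy; this episode format is an illustrative convention, not something specified above.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """Estimate V^pi(s) by averaging returns that follow the first visit to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:                       # episode: [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}                           # state -> time of its first visit
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Backward pass to compute G_t for every time step.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            returns[t] = g
        # Average the return following the first visit only.
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_cnt[s] += 1
            V[s] = returns_sum[s] / returns_cnt[s]
    return dict(V)
```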
MC Control: Exploring Starts
For control (finding the optimal policy), we estimate $Q^\pi(s,a)$ and improve the policy. The exploring starts assumption ensures all state-action pairs are visited: every episode begins from a state-action pair $(s,a)$ chosen so that each pair has a nonzero probability of being the start, after which the current policy is followed.
Incremental Update Rule
Instead of storing all returns, we can update incrementally using a running mean:

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left[G_t - V(S_t)\right]$$
In non-stationary environments, we use a constant step-size $\alpha$ to weight recent returns more heavily:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$
This step-size form is the prototype for all RL update rules. The quantity $[G_t - V(S_t)]$ is the prediction error — how wrong our current estimate was.
Bias–Variance Trade-off
MC uses the true (sampled) return $G_t$, making it unbiased. However, $G_t = \sum_{k} \gamma^k R_{t+k+1}$ is a sum of many random variables, giving it high variance. This is the fundamental tension in RL estimation that TD methods partially resolve.
Temporal Difference Learning
Temporal Difference (TD) learning is arguably the most important idea in RL. It combines the sampling advantage of MC (no model required) with the bootstrapping of DP (updates from estimates, not full returns). TD methods can learn from incomplete episodes, online, at every step.
TD(0): One-Step TD Prediction
Instead of waiting for the full return $G_t$, TD(0) bootstraps using the immediate reward and the current estimate of the next state's value:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$
The TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is a one-step prediction error. It is zero when $V$ is self-consistent — i.e., when $V = V^\pi$.
SARSA: On-Policy TD Control
SARSA (State, Action, Reward, State', Action') updates Q-values using transitions $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ where $A_{t+1}$ is selected by the same policy being learned:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$
Because SARSA uses the action actually taken by the policy ($A_{t+1} \sim \pi$), it is on-policy: it learns about the policy it is currently executing, including its exploratory actions. With an $\varepsilon$-greedy policy whose $\varepsilon$ decays appropriately (the GLIE conditions), SARSA converges to the optimal policy.
Q-Learning: Off-Policy TD Control
Q-Learning decouples the behavior policy (used to generate data) from the target policy (being learned) by maximizing over next-state actions:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$
By using $\max_{a'} Q(S_{t+1}, a')$ as the target, Q-Learning directly approximates $Q^*$, independent of the behavior policy. This is the key advantage: data collected by any policy (including a random one) can be used to learn the optimal Q-function.
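A minimal tabular Q-Learning loop against a Gymnasium environment with discrete states and actions; `FrozenLake-v1` and all hyperparameters below are assumed example choices, not prescribed by the text.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy generates the data
        a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # off-policy target: bootstrap from the best next-state action
        target = r + gamma * (0.0 if terminated else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```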
Expected SARSA
A middle ground that uses the expected value over next actions rather than a sample:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a') - Q(S_t, A_t)\right]$$
SARSA vs. Q-Learning: The Cliff Example
| Algorithm | Policy type | Learns | Cliff behavior |
|---|---|---|---|
| SARSA | On-policy | $Q^{\pi_\varepsilon}$ (ε-greedy optimal) | Safe path (avoids cliff due to ε-greedy) |
| Q-Learning | Off-policy | $Q^*$ (true optimal) | Optimal path (cliff edge, risky) |
| Expected SARSA | Both | $Q^*$ or $Q^\pi$ | Optimal; lower variance than SARSA |
TD(λ): Eligibility Traces
TD(λ) unifies MC and TD by looking $n$ steps ahead and blending all $n$-step returns into the λ-return:

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$
where $G_t^{(n)} = \sum_{k=0}^{n-1}\gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$ is the $n$-step return. This is implemented efficiently with eligibility traces $z_t(s)$:

$$z_t(s) = \gamma \lambda\, z_{t-1}(s) + \mathbb{1}[S_t = s], \qquad V(s) \leftarrow V(s) + \alpha\, \delta_t\, z_t(s) \quad \text{for all } s$$
When $\lambda = 0$: recovers TD(0). When $\lambda = 1$: recovers MC (for episodic tasks). Typical values $\lambda \approx 0.9$ often work best in practice.
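A sketch of tabular TD(λ) prediction with accumulating traces, assuming integer-indexed states and a stream of `(s, r, s_next, done)` transitions produced by some policy; that transition format and the step sizes are assumptions for illustration.

```python
import numpy as np

def td_lambda(transitions, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) prediction with accumulating eligibility traces."""
    V = np.zeros(n_states)
    z = np.zeros(n_states)                        # eligibility trace per state
    for s, r, s_next, done in transitions:        # assumed transition format
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
        z *= gamma * lam                          # decay every trace
        z[s] += 1.0                               # accumulate trace for the visited state
        V += alpha * delta * z                    # credit all recently visited states
        if done:
            z[:] = 0.0                            # reset traces at episode boundaries
    return V
```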
Policy Gradient Methods
Value-based methods like Q-Learning find the optimal policy indirectly by learning a value function and then acting greedily. Policy gradient methods optimize the policy directly, parameterizing it as $\pi_\theta(a|s)$ and following the gradient of expected return with respect to $\theta$.
The Policy Gradient Objective
Define the performance objective $J(\theta)$ as the expected return under policy $\pi_\theta$:

$$J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)$$
where $d^{\pi_\theta}(s)$ is the stationary state distribution under $\pi_\theta$ (discounted or undiscounted, depending on the setting).
The Policy Gradient Theorem
Computing $\nabla_\theta J(\theta)$ is non-trivial because $d^{\pi_\theta}$ also depends on $\theta$. The Policy Gradient Theorem (Sutton et al., 1999) provides a clean expression:

$$\nabla_\theta J(\theta) \propto \sum_{s} d^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s,a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}_{\pi_\theta}\left[Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$
The Log-Derivative Trick (Score Function Estimator)
The key identity enabling unbiased gradient estimation:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$
This allows us to write $\nabla_\theta\, \mathbb{E}_{\pi_\theta}[f(a)]$ as $\mathbb{E}_{\pi_\theta}[f(a)\nabla_\theta \log \pi_\theta(a|s)]$, an expectation under $\pi_\theta$ that can be estimated by Monte Carlo sampling from trajectories.
REINFORCE Algorithm
The simplest policy gradient method, using Monte Carlo returns as the baseline-free estimate of $Q^{\pi_\theta}$:

$$\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$$
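A sketch of this update on the simplest possible case, a single-state (bandit) problem with a softmax policy over a table of action preferences, so the return of each one-step episode is just its reward. The arm means, step size, and number of steps are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.8])     # hidden arm means (arbitrary)
theta = np.zeros(3)                        # action preferences h_theta(a)
alpha = 0.1

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

for step in range(5000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)                # sample A ~ pi_theta
    g = rng.normal(true_means[a], 1.0)     # return G of this one-step episode
    # For a softmax over preferences, grad log pi(a) = one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * g * grad_log_pi       # REINFORCE update

print("learned policy:", np.round(softmax(theta), 3))
```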
Variance Reduction: Baselines
REINFORCE has high variance because $G_t$ varies widely across episodes. We can subtract any baseline $b(s)$ that does not depend on $a$ without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\left(G_t - b(S_t)\right)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\right]$$
The optimal baseline that minimizes variance is approximately $b^*(s) \approx V^{\pi_\theta}(s)$, giving the advantage $A^{\pi_\theta}(S_t, A_t) = G_t - V^{\pi_\theta}(S_t)$ as the gradient weight. This is the key insight behind Actor-Critic methods.
The Softmax (Gibbs) Policy
For discrete actions, a common parameterization with preference function $h_\theta(s,a) \in \mathbb{R}$:

$$\pi_\theta(a \mid s) = \frac{\exp\left(h_\theta(s,a)\right)}{\sum_{a'} \exp\left(h_\theta(s,a')\right)}$$
Gaussian Policy for Continuous Actions
For continuous action spaces $\mathcal{A} \subseteq \mathbb{R}^d$, the policy outputs the parameters of a (typically diagonal) Gaussian:

$$\pi_\theta(a \mid s) = \mathcal{N}\!\left(a;\ \mu_\theta(s),\ \Sigma_\theta(s)\right)$$
Actor-Critic Methods
Actor-Critic methods maintain two function approximators: the actor $\pi_\theta(a|s)$ (the policy) and the critic $V_w(s)$ (the value function). The critic evaluates the current policy, and the actor uses the critic's feedback to update the policy via policy gradients.
The One-Step Actor-Critic Update
At each step the critic's TD error serves as the advantage estimate, and both sets of parameters are updated online:

$$\delta_t = R_{t+1} + \gamma V_w(S_{t+1}) - V_w(S_t)$$

$$w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(S_t), \qquad \theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$$
Generalized Advantage Estimation (GAE)
GAE (Schulman et al., 2015) provides a smooth bias-variance trade-off for the advantage estimate using an exponentially weighted average of k-step TD errors:

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
When $\lambda = 0$: one-step TD advantage. When $\lambda = 1$: Monte Carlo advantage. GAE with $\lambda \approx 0.95$ is the standard in modern deep RL (PPO, A3C, TRPO).
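A sketch of the GAE computation for a single trajectory segment, assuming per-step rewards and value estimates are given as arrays and that the segment does not end in a terminal state (the bootstrap value for the final state is passed separately; terminal handling is omitted).

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards[t] = R_{t+1}, values[t] = V(S_t), last_value = V(S_T) used for bootstrapping.
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of future deltas
        adv[t] = running
        next_value = values[t]
    returns = adv + np.asarray(values)     # regression targets for the value function
    return adv, returns
```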
A3C: Asynchronous Advantage Actor-Critic
Mnih et al. (2016) proposed running multiple actor-critic agents in parallel, each interacting with its own copy of the environment. Asynchronous gradient updates decorrelate experience without a replay buffer. Each worker follows an advantage actor-critic gradient augmented with an entropy bonus:

$$g_\theta = \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, \hat{A}(S_t, A_t) + \beta\, \nabla_\theta H\left[\pi_\theta(\cdot \mid S_t)\right]$$
The entropy bonus $H[\pi_\theta] = -\sum_a \pi_\theta(a|s)\log\pi_\theta(a|s)$ encourages exploration by penalizing overly deterministic policies.
Deep Reinforcement Learning
Deep RL combines neural network function approximation with RL algorithms. The state space is no longer tabulated — instead, a neural network $f_\theta : \mathcal{S} \to \mathbb{R}^{|\mathcal{A}|}$ approximates $Q^*$ or $\pi^*$. This enables RL to scale to high-dimensional inputs like images, text, or continuous vectors.
DQN: Deep Q-Network
Mnih et al. (2013/2015) showed that Q-Learning with neural networks is unstable due to correlated updates and a non-stationary target. Two key stabilization tricks:
- Experience replay: store transitions $(s, a, r, s')$ in a buffer and train on randomly sampled minibatches, breaking temporal correlations.
- Target network: compute bootstrap targets with a periodically updated copy $\theta^-$ of the Q-network, so the target stays fixed between updates.
The DQN loss function minimizes the mean-squared Bellman error:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$
Improvements to DQN
| Variant | Key idea | Motivation |
|---|---|---|
| Double DQN | Separate networks for action selection vs. evaluation | Reduces overestimation bias in Q-targets |
| Dueling DQN | $Q(s,a;\theta) = V(s;\theta) + A(s,a;\theta) - \bar{A}$ | More efficient learning of $V$ when actions don't matter |
| Prioritized Experience Replay (PER) | Sample transitions with prob. ∝ $|\delta_t|^\alpha$ | More frequent updates on surprising transitions |
| Rainbow | All of the above + n-step + distributional | State-of-the-art sample efficiency |
PPO: Proximal Policy Optimization
PPO (Schulman et al., 2017) is the dominant policy gradient algorithm for practical deep RL. It addresses the challenge that large policy gradient steps can catastrophically change the policy. The clipped objective constrains how much $\pi_\theta$ can change from the old policy $\pi_{\theta_\text{old}}$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right) \hat{A}_t\right)\right]$$
where $r_t(\theta) = \dfrac{\pi_\theta(A_t|S_t)}{\pi_{\theta_\text{old}}(A_t|S_t)}$ is the probability ratio. The clipping prevents the ratio from going too far from 1, ensuring stable updates without the complexity of a trust region constraint.
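A sketch of the clipped surrogate computed with NumPy, given batches of log-probabilities under the new and old policies and advantage estimates; all inputs are assumed, and a real implementation would compute this inside an autodiff framework and take gradients of it.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate L^CLIP for a batch; maximise this (or minimise its negative)."""
    ratio = np.exp(logp_new - logp_old)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()                 # E_t[min(., .)]
```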
SAC: Soft Actor-Critic
SAC (Haarnoja et al., 2018) augments the reward with an entropy term to encourage maximum-entropy policies:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha\, H\left[\pi(\cdot \mid s_t)\right]\right]$$
The temperature parameter $\alpha$ controls the exploration-exploitation trade-off. SAC is off-policy (uses a replay buffer), sample-efficient, and robust to hyperparameters, making it the standard for continuous control tasks.
Landscape of RL Algorithms
| Algorithm | Type | Action space | Key property |
|---|---|---|---|
| DQN | Value-based, off-policy | Discrete | Image inputs, Atari |
| DDPG | Actor-Critic, off-policy | Continuous | Deterministic policy gradient |
| TD3 | Actor-Critic, off-policy | Continuous | Clipped double Q, delayed actor |
| SAC | Actor-Critic, off-policy | Continuous | Entropy regularization |
| PPO | Policy gradient, on-policy | Both | Stable, scalable, default choice |
| TRPO | Policy gradient, on-policy | Both | Trust region, monotonic improvement |
| AlphaZero | Model-based + MCTS | Discrete | Self-play, planning |
Exploration vs. Exploitation
The exploration-exploitation dilemma is fundamental to RL: should the agent exploit its current knowledge to maximize immediate reward, or explore to gather information that might lead to better long-term returns? This tension appears in every RL setting.
ε-Greedy Policy
The simplest exploration strategy: with probability $\varepsilon$ take a random action, otherwise take the greedy action.
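A one-function sketch of the rule, assuming a tabular row of Q-values for the current state:

```python
import numpy as np

def epsilon_greedy(Q_row, eps):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(len(Q_row))   # explore
    return int(np.argmax(Q_row))               # exploit
```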
The Multi-Armed Bandit
The bandit problem is the simplest RL setting: $|\mathcal{S}| = 1$, $|\mathcal{A}| = k$ arms, each with unknown reward distribution $R_i \sim P_i$. Goal: maximize total reward over $T$ steps. Define regret as the gap between the optimal strategy and the chosen one:

$$\text{Regret}(T) = T\, Q^*(a^*) - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \sum_{a} \Delta_a\, \mathbb{E}\left[N_a(T)\right]$$
where $\Delta_a = Q^*(a^*) - Q^*(a)$ is the gap for suboptimal arm $a$, and $N_a(T)$ is the number of times arm $a$ was pulled.
UCB: Upper Confidence Bound
UCB strikes the exploration-exploitation balance using optimism: select the arm that has the highest upper confidence bound on its true mean reward:

$$A_t = \arg\max_{a}\left[\hat{Q}_t(a) + c\, \sqrt{\frac{\ln t}{N_a(t)}}\right]$$
The first term is the estimated reward; the second is the uncertainty bonus. Rarely-sampled arms get large bonuses, ensuring systematic exploration. UCB1 achieves regret $O(\log T)$, which is asymptotically optimal (Lai-Robbins lower bound).
Thompson Sampling
A Bayesian exploration strategy: maintain a posterior distribution over arm rewards $P_a \mid \text{data}$, sample a reward estimate from each posterior, then act greedily on the samples:

$$\tilde{\mu}_a \sim P(\mu_a \mid \text{data}) \ \text{for each arm } a, \qquad A_t = \arg\max_{a} \tilde{\mu}_a$$
A standard exercise is to simulate a 5-armed bandit with unknown true means and compare ε-greedy, UCB, and Thompson Sampling.
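Below is a compact simulation sketch of that comparison. The Gaussian reward model, the unit-variance Gaussian posterior used for Thompson Sampling, the horizon, and all constants are illustrative assumptions rather than part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = rng.normal(size=5)          # hidden arm means
T = 2000

def pull(a):                             # unit-variance Gaussian rewards
    return rng.normal(true_means[a], 1.0)

def run(select):
    n, s, total = np.zeros(5), np.zeros(5), 0.0   # pull counts, reward sums
    for t in range(1, T + 1):
        a = select(t, n, s)
        r = pull(a)
        n[a] += 1; s[a] += r; total += r
    return total

def eps_greedy(t, n, s, eps=0.1):
    if rng.random() < eps or n.min() == 0:        # explore, or try untested arms first
        return rng.integers(5)
    return int(np.argmax(s / n))

def ucb(t, n, s, c=2.0):
    if n.min() == 0:
        return int(np.argmin(n))                  # pull every arm once before using bounds
    return int(np.argmax(s / n + c * np.sqrt(np.log(t) / n)))

def thompson(t, n, s):
    # Gaussian posterior assuming an N(0, 1) prior and known unit reward variance
    post_mean = s / (n + 1.0)
    post_std = 1.0 / np.sqrt(n + 1.0)
    return int(np.argmax(rng.normal(post_mean, post_std)))

for name, strategy in [("eps-greedy", eps_greedy), ("UCB", ucb), ("Thompson", thompson)]:
    print(f"{name:>10}: total reward = {run(strategy):.1f}")
```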
Intrinsic Motivation & Curiosity
In environments with sparse rewards, pure ε-greedy exploration is insufficient. Modern approaches augment the reward with an intrinsic bonus that rewards visiting novel or uncertain states:

$$r_t^{\text{total}} = r_t^{\text{ext}} + \beta\, b(s_t)$$
Common intrinsic bonus designs:
- Count-based: $b(s) = 1/\sqrt{N(s)}$ — less visited states get larger bonuses
- Prediction error: $b(s_t) = \|\hat{s}_{t+1} - s_{t+1}\|^2$ — error of a forward model predicts novelty (ICM, Pathak et al. 2017)
- Random Network Distillation (RND): $b(s) = \|f_\theta(s) - f_{\bar\theta}(s)\|^2$ — distance between a trained and fixed random network
- Episodic curiosity: reward for visiting states dissimilar to those seen in the current episode
Summary: Key RL Concepts
| Concept | Symbol | Description |
|---|---|---|
| State | $s \in \mathcal{S}$ | Current situation of the environment |
| Action | $a \in \mathcal{A}$ | Choice made by the agent |
| Reward | $r_t \in \mathbb{R}$ | Immediate feedback signal |
| Return | $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ | Cumulative discounted reward |
| Policy | $\pi(a|s)$ | Behavior of the agent |
| State-value | $V^\pi(s)$ | Expected return from state $s$ under $\pi$ |
| Action-value | $Q^\pi(s,a)$ | Expected return from $(s,a)$ under $\pi$ |
| Advantage | $A^\pi(s,a) = Q - V$ | Relative value of action $a$ vs. average |
| TD error | $\delta_t = r + \gamma V(s') - V(s)$ | One-step prediction error |
| Bellman operator | $\mathcal{T}^\pi V$ | One-step backup; fixed point = $V^\pi$ |