{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "_jQ1tEQCxwRx" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-08-16T02:25:19.423282Z", "iopub.status.busy": "2024-08-16T02:25:19.423061Z", "iopub.status.idle": "2024-08-16T02:25:19.427001Z", "shell.execute_reply": "2024-08-16T02:25:19.426364Z" }, "id": "V_sgB_5dx1f1" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "p62G8M_viUJp" }, "source": [ "# Playing CartPole with the Actor-Critic method\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-mJ2i6jvZ3sK" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " View on TensorFlow.org\n", " \n", " \n", " \n", " Run in Google Colab\n", " \n", " \n", " \n", " View source on GitHub\n", " \n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "kFgN7h_wiUJq" }, "source": [ "This tutorial demonstrates how to implement the [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) method using TensorFlow to train an agent on the [Open AI Gym](https://www.gymlibrary.dev/) [`CartPole-v0`](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) environment.\n", "The reader is assumed to have some familiarity with [policy gradient methods](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) of [(deep) reinforcement learning](https://en.wikipedia.org/wiki/Deep_reinforcement_learning).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_kA10ZKRR0hi" }, "source": [ "**Actor-Critic methods**\n", "\n", "Actor-Critic methods are [temporal difference (TD) learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) methods that represent the policy function independent of the value function.\n", "\n", "A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state.\n", "A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.\n", "\n", "In the Actor-Critic method, the policy is referred to as the *actor* that proposes a set of possible actions given a state, and the estimated value function is referred to as the *critic*, which evaluates actions taken by the *actor* based on the given policy.\n", "\n", "In this tutorial, both the *Actor* and *Critic* will be represented using one neural network with two outputs.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rBfiafKSRs2k" }, "source": [ "**`CartPole-v0`**\n", "\n", "In the [`CartPole-v0` environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/), a pole is attached to a cart moving along a frictionless track.\n", "The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of `-1` or `+1` to the cart.\n", "A reward of `+1` is given for every time step the pole remains upright.\n", "An episode ends when: 1) the pole is more than 15 degrees from vertical; or 2) the cart moves more than 2.4 units from the center.\n", "\n", "
\n", "
\n", " \n", "
\n", " Trained actor-critic model in Cartpole-v0 environment\n", "
\n", "
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XSNVK0AeRoJd" }, "source": [ "The problem is considered \"solved\" when the average total reward for the episode reaches 195 over 100 consecutive trials." ] }, { "cell_type": "markdown", "metadata": { "id": "glLwIctHiUJq" }, "source": [ "## Setup\n", "\n", "Import necessary packages and configure global settings.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:19.430598Z", "iopub.status.busy": "2024-08-16T02:25:19.430384Z", "iopub.status.idle": "2024-08-16T02:25:28.284689Z", "shell.execute_reply": "2024-08-16T02:25:28.283565Z" }, "id": "13l6BbxKhCKp" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting gym[classic_control]\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading gym-0.26.2.tar.gz (721 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Installing build dependencies ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25h Getting requirements to build wheel ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n", "\u001b[?25hRequirement already satisfied: numpy>=1.18.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from gym[classic_control]) (1.26.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting cloudpickle>=1.2.0 (from gym[classic_control])\r\n", " Using cached cloudpickle-3.0.0-py3-none-any.whl.metadata (7.0 kB)\r\n", "Collecting gym-notices>=0.0.4 (from gym[classic_control])\r\n", " Downloading gym_notices-0.0.8-py3-none-any.whl.metadata (1.0 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: importlib-metadata>=4.8.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from gym[classic_control]) (8.2.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting pygame==2.1.0 (from gym[classic_control])\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading pygame-2.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.5 kB)\r\n", "Requirement already satisfied: zipp>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from importlib-metadata>=4.8.0->gym[classic_control]) (3.20.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading pygame-2.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Using cached cloudpickle-3.0.0-py3-none-any.whl (20 kB)\r\n", "Downloading gym_notices-0.0.8-py3-none-any.whl (3.0 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Building wheels for collected packages: gym\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Building wheel for gym (pyproject.toml) ... 
\u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n", "\u001b[?25h Created wheel for gym: filename=gym-0.26.2-py3-none-any.whl size=827621 sha256=5c026f387ddccc50f4e466c2a3aef61e075d6c1f5dc7340297a8e5d078f55848\r\n", " Stored in directory: /home/kbuilder/.cache/pip/wheels/af/2b/30/5e78b8b9599f2a2286a582b8da80594f654bf0e18d825a4405\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully built gym\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: gym-notices, pygame, cloudpickle, gym\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed cloudpickle-3.0.0 gym-0.26.2 gym-notices-0.0.8 pygame-2.1.0\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting pyglet\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading pyglet-2.0.17-py3-none-any.whl.metadata (7.9 kB)\r\n", "Downloading pyglet-2.0.17-py3-none-any.whl (936 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: pyglet\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed pyglet-2.0.17\r\n" ] } ], "source": [ "!pip install gym[classic_control]\n", "!pip install pyglet" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:28.289308Z", "iopub.status.busy": "2024-08-16T02:25:28.289003Z", "iopub.status.idle": "2024-08-16T02:25:38.389710Z", "shell.execute_reply": "2024-08-16T02:25:38.388375Z" }, "id": "WBeQhPi2S4m5" }, "outputs": [], "source": [ "%%bash\n", "# Install additional packages for visualization\n", "sudo apt-get install -y python-opengl > /dev/null 2>&1\n", "pip install git+https://github.com/tensorflow/docs > /dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:38.394227Z", "iopub.status.busy": "2024-08-16T02:25:38.393882Z", "iopub.status.idle": "2024-08-16T02:25:41.082095Z", "shell.execute_reply": "2024-08-16T02:25:41.081147Z" }, "id": "tT4N3qYviUJr" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-08-16 02:25:38.808866: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-08-16 02:25:38.830167: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-08-16 02:25:38.836483: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import collections\n", "import gym\n", "import numpy as np\n", "import statistics\n", "import tensorflow as tf\n", "import tqdm\n", "\n", "from matplotlib import pyplot as plt\n", "from tensorflow.keras import layers\n", "from typing import Any, List, Sequence, Tuple\n", "\n", "\n", "# Create the environment\n", "env = gym.make(\"CartPole-v1\")\n", "\n", "# Set seed for experiment reproducibility\n", "seed = 42\n", "tf.random.set_seed(seed)\n", "np.random.seed(seed)\n", "\n", "# Small epsilon value for stabilizing division operations\n", "eps = np.finfo(np.float32).eps.item()" ] 
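}, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, optional sanity check (not part of the original tutorial), you can inspect the environment you just created. The sketch below relies only on the standard Gym attributes `env.observation_space` and `env.action_space`, and on the `(state, info)` return signature of `env.reset` that is also used later in this notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (illustrative; not required for training).\n", "# CartPole observations are 4-dimensional vectors (cart position, cart velocity,\n", "# pole angle, pole angular velocity) and there are two discrete actions.\n", "print('Observation space:', env.observation_space)\n", "print('Action space:', env.action_space)\n", "\n", "# In Gym >= 0.26, `env.reset` returns an initial observation and an info dict.\n", "initial_state, info = env.reset(seed=seed)\n", "print('Example initial state:', initial_state)" ]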
}, { "cell_type": "markdown", "metadata": { "id": "AOUCe2D0iUJu" }, "source": [ "## The model\n", "\n", "The *Actor* and *Critic* will be modeled using one neural network that generates the action probabilities and Critic value respectively. This tutorial uses model subclassing to define the model.\n", "\n", "During the forward pass, the model will take in the state as the input and will output both action probabilities and critic value $V$, which models the state-dependent [value function](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions). The goal is to train a model that chooses actions based on a policy $\\pi$ that maximizes expected [return](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#reward-and-return).\n", "\n", "For `CartPole-v0`, there are four values representing the state: cart position, cart-velocity, pole angle and pole velocity respectively. The agent can take two actions to push the cart left (`0`) and right (`1`), respectively.\n", "\n", "Refer to [Gym's Cart Pole documentation page](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) and [_Neuronlike adaptive elements that can solve difficult learning control problems_](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) by Barto, Sutton and Anderson (1983) for more information.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.086442Z", "iopub.status.busy": "2024-08-16T02:25:41.085990Z", "iopub.status.idle": "2024-08-16T02:25:41.091949Z", "shell.execute_reply": "2024-08-16T02:25:41.091131Z" }, "id": "aXKbbMC-kmuv" }, "outputs": [], "source": [ "class ActorCritic(tf.keras.Model):\n", " \"\"\"Combined actor-critic network.\"\"\"\n", "\n", " def __init__(\n", " self,\n", " num_actions: int,\n", " num_hidden_units: int):\n", " \"\"\"Initialize.\"\"\"\n", " super().__init__()\n", "\n", " self.common = layers.Dense(num_hidden_units, activation=\"relu\")\n", " self.actor = layers.Dense(num_actions)\n", " self.critic = layers.Dense(1)\n", "\n", " def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:\n", " x = self.common(inputs)\n", " return self.actor(x), self.critic(x)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.094870Z", "iopub.status.busy": "2024-08-16T02:25:41.094623Z", "iopub.status.idle": "2024-08-16T02:25:41.102456Z", "shell.execute_reply": "2024-08-16T02:25:41.101589Z" }, "id": "nWyxJgjLn68c" }, "outputs": [], "source": [ "num_actions = env.action_space.n # 2\n", "num_hidden_units = 128\n", "\n", "model = ActorCritic(num_actions, num_hidden_units)" ] }, { "cell_type": "markdown", "metadata": { "id": "hk92njFziUJw" }, "source": [ "## Train the agent\n", "\n", "To train the agent, you will follow these steps:\n", "\n", "1. Run the agent on the environment to collect training data per episode.\n", "2. Compute expected return at each time step.\n", "3. Compute the loss for the combined Actor-Critic model.\n", "4. Compute gradients and update network parameters.\n", "5. Repeat 1-4 until either success criterion or max episodes has been reached.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "R2nde2XDs8Gh" }, "source": [ "### 1. Collect training data\n", "\n", "As in supervised learning, in order to train the actor-critic model, you need\n", "to have training data. 
However, in order to collect such data, the model would\n", "need to be \"run\" in the environment.\n", "\n", "Training data is collected for each episode. Then at each time step, the model's forward pass will be run on the environment's state in order to generate action probabilities and the critic value based on the current policy parameterized by the model's weights.\n", "\n", "The next action will be sampled from the action probabilities generated by the model, which would then be applied to the environment, causing the next state and reward to be generated.\n", "\n", "This process is implemented in the `run_episode` function, which uses TensorFlow operations so that it can later be compiled into a TensorFlow graph for faster training. Note that `tf.TensorArray`s were used to support Tensor iteration on variable length arrays." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.105998Z", "iopub.status.busy": "2024-08-16T02:25:41.105500Z", "iopub.status.idle": "2024-08-16T02:25:41.110510Z", "shell.execute_reply": "2024-08-16T02:25:41.109614Z" }, "id": "5URrbGlDSAGx" }, "outputs": [], "source": [ "# Wrap Gym's `env.step` call as an operation in a TensorFlow function.\n", "# This would allow it to be included in a callable TensorFlow graph.\n", "\n", "@tf.numpy_function(Tout=[tf.float32, tf.int32, tf.int32])\n", "def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:\n", " \"\"\"Returns state, reward and done flag given an action.\"\"\"\n", "\n", " state, reward, done, truncated, info = env.step(action)\n", " return (state.astype(np.float32),\n", " np.array(reward, np.int32),\n", " np.array(done, np.int32))\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.113607Z", "iopub.status.busy": "2024-08-16T02:25:41.113359Z", "iopub.status.idle": "2024-08-16T02:25:41.120888Z", "shell.execute_reply": "2024-08-16T02:25:41.120062Z" }, "id": "a4qVRV063Cl9" }, "outputs": [], "source": [ "def run_episode(\n", " initial_state: tf.Tensor,\n", " model: tf.keras.Model,\n", " max_steps: int) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:\n", " \"\"\"Runs a single episode to collect training data.\"\"\"\n", "\n", " action_probs = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)\n", " values = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)\n", " rewards = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)\n", "\n", " initial_state_shape = initial_state.shape\n", " state = initial_state\n", "\n", " for t in tf.range(max_steps):\n", " # Convert state into a batched tensor (batch size = 1)\n", " state = tf.expand_dims(state, 0)\n", "\n", " # Run the model and to get action probabilities and critic value\n", " action_logits_t, value = model(state)\n", "\n", " # Sample next action from the action probability distribution\n", " action = tf.random.categorical(action_logits_t, 1)[0, 0]\n", " action_probs_t = tf.nn.softmax(action_logits_t)\n", "\n", " # Store critic values\n", " values = values.write(t, tf.squeeze(value))\n", "\n", " # Store log probability of the action chosen\n", " action_probs = action_probs.write(t, action_probs_t[0, action])\n", "\n", " # Apply action to the environment to get next state and reward\n", " state, reward, done = env_step(action)\n", " state.set_shape(initial_state_shape)\n", "\n", " # Store reward\n", " rewards = rewards.write(t, reward)\n", "\n", " if tf.cast(done, tf.bool):\n", " 
break\n", "\n", " action_probs = action_probs.stack()\n", " values = values.stack()\n", " rewards = rewards.stack()\n", "\n", " return action_probs, values, rewards" ] }, { "cell_type": "markdown", "metadata": { "id": "lBnIHdz22dIx" }, "source": [ "### 2. Compute the expected returns\n", "\n", "The sequence of rewards for each timestep $t$, $\\{r_{t}\\}^{T}_{t=1}$ collected during one episode is converted into a sequence of expected returns $\\{G_{t}\\}^{T}_{t=1}$ in which the sum of rewards is taken from the current timestep $t$ to $T$ and each reward is multiplied with an exponentially decaying discount factor $\\gamma$:\n", "\n", "$$G_{t} = \\sum^{T}_{t'=t} \\gamma^{t'-t}r_{t'}$$\n", "\n", "Since $\\gamma\\in(0,1)$, rewards further out from the current timestep are given less weight.\n", "\n", "Intuitively, expected return simply implies that rewards now are better than rewards later. In a mathematical sense, it is to ensure that the sum of the rewards converges.\n", "\n", "To stabilize training, the resulting sequence of returns is also standardized (i.e. to have zero mean and unit standard deviation).\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.124326Z", "iopub.status.busy": "2024-08-16T02:25:41.124079Z", "iopub.status.idle": "2024-08-16T02:25:41.129954Z", "shell.execute_reply": "2024-08-16T02:25:41.129082Z" }, "id": "jpEwFyl315dl" }, "outputs": [], "source": [ "def get_expected_return(\n", " rewards: tf.Tensor,\n", " gamma: float,\n", " standardize: bool = True) -> tf.Tensor:\n", " \"\"\"Compute expected returns per timestep.\"\"\"\n", "\n", " n = tf.shape(rewards)[0]\n", " returns = tf.TensorArray(dtype=tf.float32, size=n)\n", "\n", " # Start from the end of `rewards` and accumulate reward sums\n", " # into the `returns` array\n", " rewards = tf.cast(rewards[::-1], dtype=tf.float32)\n", " discounted_sum = tf.constant(0.0)\n", " discounted_sum_shape = discounted_sum.shape\n", " for i in tf.range(n):\n", " reward = rewards[i]\n", " discounted_sum = reward + gamma * discounted_sum\n", " discounted_sum.set_shape(discounted_sum_shape)\n", " returns = returns.write(i, discounted_sum)\n", " returns = returns.stack()[::-1]\n", "\n", " if standardize:\n", " returns = ((returns - tf.math.reduce_mean(returns)) /\n", " (tf.math.reduce_std(returns) + eps))\n", "\n", " return returns" ] }, { "cell_type": "markdown", "metadata": { "id": "qhr50_Czxazw" }, "source": [ "### 3. 
The Actor-Critic loss\n", "\n", "Since you're using a hybrid Actor-Critic model, the chosen loss function is a combination of Actor and Critic losses for training, as shown below:\n", "\n", "$$L = L_{actor} + L_{critic}$$" ] }, { "cell_type": "markdown", "metadata": { "id": "nOQIJuG1xdTH" }, "source": [ "#### The Actor loss\n", "\n", "The Actor loss is based on [policy gradients with the Critic as a state-dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.\n", "\n", "$$L_{actor} = -\\sum^{T}_{t=1} \\log\\pi_{\\theta}(a_{t} | s_{t})[G(s_{t}, a_{t}) - V^{\\pi}_{\\theta}(s_{t})]$$\n", "\n", "where:\n", "- $T$: the number of timesteps per episode, which can vary per episode\n", "- $s_{t}$: the state at timestep $t$\n", "- $a_{t}$: the action chosen at timestep $t$ given state $s_{t}$\n", "- $\\pi_{\\theta}$: the policy (Actor) parameterized by $\\theta$\n", "- $V^{\\pi}_{\\theta}$: the value function (Critic), also parameterized by $\\theta$\n", "- $G = G_{t}$: the expected return for a given state-action pair at timestep $t$\n", "\n", "The sum is negated so that minimizing the combined loss maximizes the probabilities of actions that yield higher rewards.\n", "\n", "
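" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The log-probability term $\\log\\pi_{\\theta}(a_{t} | s_{t})$ comes from the actor's logits. The minimal sketch below (with made-up logits and actions; it is not the tutorial's exact code) shows how it can be computed directly with `tf.nn.log_softmax`. The tutorial itself stores the softmax probabilities in `run_episode` and takes their log in `compute_loss`, which yields the same quantity.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch with made-up numbers (not part of the training code):\n", "# compute log pi(a_t | s_t) directly from the actor's logits.\n", "toy_logits = tf.constant([[2.0, 0.5], [0.1, 1.2]])  # hypothetical logits for 2 timesteps\n", "toy_actions = tf.constant([0, 1])                   # hypothetical sampled actions\n", "toy_log_probs = tf.nn.log_softmax(toy_logits)       # log-probabilities for every action\n", "toy_log_pi = tf.gather(toy_log_probs, toy_actions, batch_dims=1)  # log-prob of each chosen action\n", "print(toy_log_pi.numpy())\n", "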
" ] }, { "cell_type": "markdown", "metadata": { "id": "Y304O4OAxiAv" }, "source": [ "##### The Advantage\n", "\n", "The $G - V$ term in our $L_{actor}$ formulation is called the [Advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions), which indicates how much better an action is given a particular state over a random action selected according to the policy $\\pi$ for that state.\n", "\n", "While it's possible to exclude a baseline, this may result in high variance during training. And the nice thing about choosing the critic $V$ as a baseline is that it trained to be as close as possible to $G$, leading to a lower variance.\n", "\n", "In addition, without the Critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n", "\n", "For instance, suppose that two actions for a given state would yield the same expected return. Without the Critic, the algorithm would try to raise the probability of these actions based on the objective $J$. With the Critic, it may turn out that there's no Advantage ($G - V = 0$), and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "1hrPLrgGxlvb" }, "source": [ "#### The Critic loss\n", "\n", "Training $V$ to be as close possible to $G$ can be set up as a regression problem with the following loss function:\n", "\n", "$$L_{critic} = L_{\\delta}(G, V^{\\pi}_{\\theta})$$\n", "\n", "where $L_{\\delta}$ is the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss), which is less sensitive to outliers in data than squared-error loss.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.133349Z", "iopub.status.busy": "2024-08-16T02:25:41.133103Z", "iopub.status.idle": "2024-08-16T02:25:41.138007Z", "shell.execute_reply": "2024-08-16T02:25:41.137182Z" }, "id": "9EXwbEez6n9m" }, "outputs": [], "source": [ "huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)\n", "\n", "def compute_loss(\n", " action_probs: tf.Tensor,\n", " values: tf.Tensor,\n", " returns: tf.Tensor) -> tf.Tensor:\n", " \"\"\"Computes the combined Actor-Critic loss.\"\"\"\n", "\n", " advantage = returns - values\n", "\n", " action_log_probs = tf.math.log(action_probs)\n", " actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)\n", "\n", " critic_loss = huber_loss(values, returns)\n", "\n", " return actor_loss + critic_loss" ] }, { "cell_type": "markdown", "metadata": { "id": "HSYkQOmRfV75" }, "source": [ "### 4. Define the training step to update parameters\n", "\n", "All of the steps above are combined into a training step that is run every episode. All steps leading up to the loss function are executed with the `tf.GradientTape` context to enable automatic differentiation.\n", "\n", "This tutorial uses the Adam optimizer to apply the gradients to the model parameters.\n", "\n", "The sum of the undiscounted rewards, `episode_reward`, is also computed in this step. This value will be used later on to evaluate if the success criterion is met.\n", "\n", "The `tf.function` context is applied to the `train_step` function so that it can be compiled into a callable TensorFlow graph, which can lead to 10x speedup in training.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:41.141327Z", "iopub.status.busy": "2024-08-16T02:25:41.141061Z", "iopub.status.idle": "2024-08-16T02:25:43.359205Z", "shell.execute_reply": "2024-08-16T02:25:43.358172Z" }, "id": "QoccrkF3IFCg" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1723775141.663718 57252 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1723775141.667620 57252 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1723775141.671341 57252 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. 
See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n" ] } ], "source": [ "optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)\n", "\n", "\n", "@tf.function\n", "def train_step(\n", " initial_state: tf.Tensor,\n", " model: tf.keras.Model,\n", " optimizer: tf.keras.optimizers.Optimizer,\n", " gamma: float,\n", " max_steps_per_episode: int) -> tf.Tensor:\n", " \"\"\"Runs a model training step.\"\"\"\n", "\n", " with tf.GradientTape() as tape:\n", "\n", " # Run the model for one episode to collect training data\n", " action_probs, values, rewards = run_episode(\n", " initial_state, model, max_steps_per_episode)\n", "\n", " # Calculate the expected returns\n", " returns = get_expected_return(rewards, gamma)\n", "\n", " # Convert training data to appropriate TF tensor shapes\n", " action_probs, values, returns = [\n", " tf.expand_dims(x, 1) for x in [action_probs, values, returns]]\n", "\n", " # Calculate the loss values to update our network\n", " loss = compute_loss(action_probs, values, returns)\n", "\n", " # Compute the gradients from the loss\n", " grads = tape.gradient(loss, model.trainable_variables)\n", "\n", " # Apply the gradients to the model's parameters\n", " optimizer.apply_gradients(zip(grads, model.trainable_variables))\n", "\n", " episode_reward = tf.math.reduce_sum(rewards)\n", "\n", " return episode_reward" ] }, { "cell_type": "markdown", "metadata": { "id": "HFvZiDoAflGK" }, "source": [ "### 5. Run the training loop\n", "\n", "Training is executed by running the training step until either the success criterion or maximum number of episodes is reached. \n", "\n", "A running record of episode rewards is kept in a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency.\n", "\n", "Depending on your runtime, training can finish in less than a minute." 
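] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bookkeeping described above can be illustrated in isolation. The short sketch below (with made-up reward values) shows how a fixed-length `collections.deque` discards the oldest episode reward once it is full, and how the running mean over the retained rewards is computed with `statistics.mean`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch with made-up episode rewards; the real loop below uses maxlen=100.\n", "# `collections` and `statistics` were imported in the Setup section.\n", "demo_rewards: collections.deque = collections.deque(maxlen=3)\n", "for r in [10, 20, 30, 40]:\n", "  demo_rewards.append(r)  # once full, the oldest reward is dropped automatically\n", "  print(list(demo_rewards), statistics.mean(demo_rewards))"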
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:25:43.362874Z", "iopub.status.busy": "2024-08-16T02:25:43.362590Z", "iopub.status.idle": "2024-08-16T02:26:44.057276Z", "shell.execute_reply": "2024-08-16T02:26:44.056525Z" }, "id": "kbmBxnzLiUJx" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", " 0%| | 0/10000 [00:00= 475 over 500\n", "# consecutive trials\n", "reward_threshold = 475\n", "running_reward = 0\n", "\n", "# The discount factor for future rewards\n", "gamma = 0.99\n", "\n", "# Keep the last episodes reward\n", "episodes_reward: collections.deque = collections.deque(maxlen=min_episodes_criterion)\n", "\n", "t = tqdm.trange(max_episodes)\n", "for i in t:\n", " initial_state, info = env.reset()\n", " initial_state = tf.constant(initial_state, dtype=tf.float32)\n", " episode_reward = int(train_step(\n", " initial_state, model, optimizer, gamma, max_steps_per_episode))\n", "\n", " episodes_reward.append(episode_reward)\n", " running_reward = statistics.mean(episodes_reward)\n", "\n", "\n", " t.set_postfix(\n", " episode_reward=episode_reward, running_reward=running_reward)\n", "\n", " # Show the average episode reward every 10 episodes\n", " if i % 10 == 0:\n", " pass # print(f'Episode {i}: average reward: {avg_reward}')\n", "\n", " if running_reward > reward_threshold and i >= min_episodes_criterion:\n", " break\n", "\n", "print(f'\\nSolved at episode {i}: average reward: {running_reward:.2f}!')" ] }, { "cell_type": "markdown", "metadata": { "id": "ru8BEwS1EmAv" }, "source": [ "## Visualization\n", "\n", "After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for Gym to render the environment's images correctly in Colab." 
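] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before rendering, you can optionally estimate how well the greedy policy performs over a few evaluation episodes. The helper below is not part of the original tutorial; it is a minimal sketch that reuses the `env` and `model` objects defined above and the same Gym `reset`/`step` signatures used elsewhere in this notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional evaluation sketch (not part of the original tutorial).\n", "def evaluate(env: gym.Env, model: tf.keras.Model,\n", "             n_episodes: int = 5, max_steps: int = 500) -> float:\n", "  \"\"\"Returns the average reward of the greedy policy over a few episodes.\"\"\"\n", "  total_reward = 0.0\n", "  for _ in range(n_episodes):\n", "    state, info = env.reset()\n", "    for _ in range(max_steps):\n", "      state = tf.expand_dims(tf.constant(state, dtype=tf.float32), 0)\n", "      action_logits, _ = model(state)\n", "      action = int(np.argmax(np.squeeze(action_logits)))  # greedy action\n", "      state, reward, done, truncated, info = env.step(action)\n", "      total_reward += reward\n", "      if done or truncated:\n", "        break\n", "  return total_reward / n_episodes\n", "\n", "print('Average evaluation reward:', evaluate(env, model))"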
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:26:44.060829Z", "iopub.status.busy": "2024-08-16T02:26:44.060575Z", "iopub.status.idle": "2024-08-16T02:26:47.181623Z", "shell.execute_reply": "2024-08-16T02:26:47.180821Z" }, "id": "qbIMMkfmRHyC" }, "outputs": [], "source": [ "# Render an episode and save as a GIF file\n", "\n", "from IPython import display as ipythondisplay\n", "from PIL import Image\n", "\n", "render_env = gym.make(\"CartPole-v1\", render_mode='rgb_array')\n", "\n", "def render_episode(env: gym.Env, model: tf.keras.Model, max_steps: int):\n", " state, info = env.reset()\n", " state = tf.constant(state, dtype=tf.float32)\n", " screen = env.render()\n", " images = [Image.fromarray(screen)]\n", "\n", " for i in range(1, max_steps + 1):\n", " state = tf.expand_dims(state, 0)\n", " action_probs, _ = model(state)\n", " action = np.argmax(np.squeeze(action_probs))\n", "\n", " state, reward, done, truncated, info = env.step(action)\n", " state = tf.constant(state, dtype=tf.float32)\n", "\n", " # Render screen every 10 steps\n", " if i % 10 == 0:\n", " screen = env.render()\n", " images.append(Image.fromarray(screen))\n", "\n", " if done:\n", " break\n", "\n", " return images\n", "\n", "\n", "# Save GIF image\n", "images = render_episode(render_env, model, max_steps_per_episode)\n", "image_file = 'cartpole-v1.gif'\n", "# loop=0: loop forever, duration=1: play each frame for 1ms\n", "images[0].save(\n", " image_file, save_all=True, append_images=images[1:], loop=0, duration=1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T02:26:47.185484Z", "iopub.status.busy": "2024-08-16T02:26:47.185183Z", "iopub.status.idle": "2024-08-16T02:26:47.197699Z", "shell.execute_reply": "2024-08-16T02:26:47.197111Z" }, "id": "TLd720SejKmf" }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tensorflow_docs.vis.embed as embed\n", "embed.embed_file(image_file)" ] }, { "cell_type": "markdown", "metadata": { "id": "lnq9Hzo1Po6X" }, "source": [ "## Next steps\n", "\n", "This tutorial demonstrated how to implement the Actor-Critic method using Tensorflow.\n", "\n", "As a next step, you could try training a model on a different environment in Gym.\n", "\n", "For additional information regarding Actor-Critic methods and the Cartpole-v0 problem, you may refer to the following resources:\n", "\n", "- [The Actor-Critic method](https://hal.inria.fr/hal-00840470/document)\n", "- [The Actor-Critic lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n", "- [Cart Pole learning control problem \\[Barto, et al. 
1983\\]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)\n", "\n", "For more reinforcement learning examples in TensorFlow, you can check the following resources:\n", "- [Reinforcement learning code examples (keras.io)](https://keras.io/examples/rl/)\n", "- [TF-Agents reinforcement learning library](https://www.tensorflow.org/agents)\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "_jQ1tEQCxwRx" ], "name": "actor_critic.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 0 }