
\n", "\n", "##### Advantage\n", "\n", "The $G - V$ term in our $L_{actor}$ formulation is called the [advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions). It indicates how much better an action taken in a particular state is than a random action selected according to the policy $\\pi$ for that state.\n", "\n", "While it's possible to exclude a baseline, doing so may result in high variance during training. The nice thing about choosing the critic $V$ as the baseline is that it is trained to be as close as possible to $G$, which leads to lower variance.\n", "\n", "In addition, without the critic, the algorithm would try to increase the probabilities of actions taken in a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n", "\n", "For instance, suppose that two actions for a given state would yield the same expected return. Without the critic, the algorithm would try to raise the probability of both actions based on the objective $J$. With the critic, it may turn out that there's no advantage ($G - V = 0$), so nothing is gained by increasing the actions' probabilities and the gradients are zero.\n", "\n", "
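The two-action example above can be sketched with a few made-up numbers. This is only an illustration in plain Python rather than TensorFlow: the probabilities, returns, and critic values are hypothetical, and `advantages` and `actor_loss` are helper names introduced here, not part of the tutorial's code.

```python
import math


def advantages(returns, values):
    """Per-step advantage G - V."""
    return [g - v for g, v in zip(returns, values)]


def actor_loss(action_probs, returns, values):
    """Actor loss -sum(log pi(a|s) * (G - V)) over the sampled actions."""
    return -sum(
        math.log(p) * a
        for p, a in zip(action_probs, advantages(returns, values)))


# Hypothetical data: two sampled actions that yield the same return.
probs = [0.5, 0.5]
returns = [10.0, 10.0]

# No baseline (V = 0): the loss still pushes both probabilities up.
no_baseline = actor_loss(probs, returns, values=[0.0, 0.0])

# Critic that predicts the return exactly (V = G): the advantage is zero.
with_critic = actor_loss(probs, returns, values=[10.0, 10.0])

print(no_baseline, with_critic)
```

With a zero advantage the loss is zero for any choice of probabilities, so its gradient with respect to the policy is zero as well, which matches the behavior described in the text.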

\n", "\n", "#### Critic loss\n", "\n", "Training $V$ to be as close as possible to $G$ can be set up as a regression problem with the following loss function:\n", "\n", "$$L_{critic} = L_{\\delta}(G, V^{\\pi}_{\\theta})$$\n", "\n", "where $L_{\\delta}$ is the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss), which is less sensitive to outliers in data than squared-error loss.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "9EXwbEez6n9m" }, "outputs": [], "source": [ "huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)\n", "\n", "def compute_loss(\n", "    action_probs: tf.Tensor,  \n", "    values: tf.Tensor,  \n", "    returns: tf.Tensor) -> tf.Tensor:\n", "  \"\"\"Computes the combined actor-critic loss.\"\"\"\n", "\n", "  # Advantage: how much the observed return exceeds the critic's estimate\n", "  advantage = returns - values\n", "\n", "  action_log_probs = tf.math.log(action_probs)\n", "  actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)\n", "\n", "  # Regress the critic's value estimates toward the observed returns\n", "  critic_loss = huber_loss(values, returns)\n", "\n", "  return actor_loss + critic_loss" ] }, { "cell_type": "markdown", "metadata": { "id": "HSYkQOmRfV75" }, "source": [ "### 4. Defining the training step to update parameters\n", "\n", "We combine all of the steps above into a training step that is run every episode. 
All steps leading up to the loss function are executed inside a `tf.GradientTape` context to enable automatic differentiation.\n", "\n", "We use the Adam optimizer to apply the gradients to the model parameters.\n", "\n", "We also compute the sum of the undiscounted rewards, `episode_reward`, in this step. It is used later to evaluate whether the success criterion has been met.\n", "\n", "We apply the `tf.function` decorator to the `train_step` function so that it can be compiled into a callable TensorFlow graph, which can lead to a 10x speedup in training.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "QoccrkF3IFCg" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)\n", "\n", "\n", "@tf.function\n", "def train_step(\n", "    initial_state: tf.Tensor, \n", "    model: tf.keras.Model, \n", "    optimizer: tf.keras.optimizers.Optimizer, \n", "    gamma: float, \n", "    max_steps_per_episode: int) -> tf.Tensor:\n", "  \"\"\"Runs a model training step.\"\"\"\n", "\n", "  with tf.GradientTape() as tape:\n", "\n", "    # Run the model for one episode to collect training data\n", "    action_probs, values, rewards = run_episode(\n", "        initial_state, model, max_steps_per_episode) \n", "\n", "    # Calculate expected returns\n", "    returns = get_expected_return(rewards, gamma)\n", "\n", "    # Convert training data to appropriate TF tensor shapes\n", "    action_probs, values, returns = [\n", "        tf.expand_dims(x, 1) for x in [action_probs, values, returns]] \n", "\n", "    # Calculate the loss values used to update the network\n", "    loss = compute_loss(action_probs, values, returns)\n", "\n", "  # Compute the gradients from the loss\n", "  grads = tape.gradient(loss, model.trainable_variables)\n", "\n", "  # Apply the gradients to the model's parameters\n", "  optimizer.apply_gradients(zip(grads, model.trainable_variables))\n", "\n", "  # Sum of undiscounted rewards for this episode\n", "  episode_reward = tf.math.reduce_sum(rewards)\n", "\n", "  return episode_reward" ] }, { "cell_type": "markdown", 
"metadata": { "id": "HFvZiDoAflGK" }, "source": [ "### 5. Run the training loop\n", "\n", "We execute training by run the training step until either the success criterion or maximum number of episodes is reached. \n", "\n", "We keep a running record of episode rewards using a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency. \n", "\n", "Depending on your runtime, training can finish in less than a minute." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "kbmBxnzLiUJx" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", " 0%| | 0/10000 [00:00= 195 over 100 \n", "# consecutive trials\n", "reward_threshold = 195\n", "running_reward = 0\n", "\n", "# Discount factor for future rewards\n", "gamma = 0.99\n", "\n", "with tqdm.trange(max_episodes) as t:\n", " for i in t:\n", " initial_state = tf.constant(env.reset(), dtype=tf.float32)\n", " episode_reward = int(train_step(\n", " initial_state, model, optimizer, gamma, max_steps_per_episode))\n", "\n", " running_reward = episode_reward*0.01 + running_reward*.99\n", " \n", " t.set_description(f'Episode {i}')\n", " t.set_postfix(\n", " episode_reward=episode_reward, running_reward=running_reward)\n", " \n", " # Show average episode reward every 10 episodes\n", " if i % 10 == 0:\n", " pass # print(f'Episode {i}: average reward: {avg_reward}')\n", " \n", " if running_reward > reward_threshold: \n", " break\n", "\n", "print(f'\\nSolved at episode {i}: average reward: {running_reward:.2f}!')" ] }, { "cell_type": "markdown", "metadata": { "id": "ru8BEwS1EmAv" }, "source": [ "## Visualization\n", "\n", "After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. 
Note that additional packages need to be installed for OpenAI Gym to render the environment's images correctly in Colab." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "qbIMMkfmRHyC" }, "outputs": [], "source": [ "# Render an episode and save as a GIF file\n", "\n", "from IPython import display as ipythondisplay\n", "from PIL import Image\n", "from pyvirtualdisplay import Display\n", "\n", "\n", "display = Display(visible=0, size=(400, 300))\n", "display.start()\n", "\n", "\n", "def render_episode(env: gym.Env, model: tf.keras.Model, max_steps: int): \n", "  screen = env.render(mode='rgb_array')\n", "  im = Image.fromarray(screen)\n", "\n", "  images = [im]\n", "  \n", "  state = tf.constant(env.reset(), dtype=tf.float32)\n", "  for i in range(1, max_steps + 1):\n", "    state = tf.expand_dims(state, 0)\n", "    action_probs, _ = model(state)\n", "    action = np.argmax(np.squeeze(action_probs))\n", "\n", "    state, _, done, _ = env.step(action)\n", "    state = tf.constant(state, dtype=tf.float32)\n", "\n", "    # Render screen every 10 steps\n", "    if i % 10 == 0:\n", "      screen = env.render(mode='rgb_array')\n", "      images.append(Image.fromarray(screen))\n", "  \n", "    if done:\n", "      break\n", "  \n", "  return images\n", "\n", "\n", "# Save GIF image\n", "images = render_episode(env, model, max_steps_per_episode)\n", "image_file = 'cartpole-v0.gif'\n", "# loop=0: loop forever, duration=1: play each frame for 1ms\n", "images[0].save(\n", "    image_file, save_all=True, append_images=images[1:], loop=0, duration=1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "TLd720SejKmf" }, "outputs": [ { "data": { "text/html": [ "