{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "# Distributed training with Keras" ] }, { "cell_type": "markdown", "metadata": { "id": "r6P32iYYV27b" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " View on TensorFlow.org\n", " \n", " Run in Google Colab\n", " \n", " View source on GitHub\n", " \n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## Overview\n", "\n", "The `tf.distribute.Strategy` API provides an abstraction for distributing your training across multiple processing units. It allows you to carry out distributed training using existing models and training code with minimal changes.\n", "\n", "This tutorial demonstrates how to use the `tf.distribute.MirroredStrategy` to perform in-graph replication with _synchronous training on many GPUs on one machine_. The strategy essentially copies all of the model's variables to each processor. Then, it uses [all-reduce](http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/) to combine the gradients from all processors, and applies the combined value to all copies of the model.\n", "\n", "You will use the `tf.keras` APIs to build the model and `Model.fit` for training it. (To learn about distributed training with a custom training loop and the `MirroredStrategy`, check out [this tutorial](custom_training.ipynb).)\n", "\n", "`MirroredStrategy` trains your model on multiple GPUs on a single machine. For _synchronous training on many GPUs on multiple workers_, use the `tf.distribute.MultiWorkerMirroredStrategy` with the [Keras Model.fit](multi_worker_with_keras.ipynb) or [a custom training loop](multi_worker_with_ctl.ipynb). For other options, refer to the [Distributed training guide](../../guide/distributed_training.ipynb).\n", "\n", "To learn about various other strategies, there is the [Distributed training with TensorFlow](../../guide/distributed_training.ipynb) guide." ] }, { "cell_type": "markdown", "metadata": { "id": "Dney9v7BsJij" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "r8S3ublR7Ay8" }, "outputs": [], "source": [ "import tensorflow_datasets as tfds\n", "import tensorflow as tf\n", "\n", "import os\n", "\n", "# Load the TensorBoard notebook extension.\n", "%load_ext tensorboard" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SkocY8tgRd3H" }, "outputs": [], "source": [ "print(tf.__version__)" ] }, { "cell_type": "markdown", "metadata": { "id": "hXhefksNKk2I" }, "source": [ "## Download the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "OtnnUwvmB3X5" }, "source": [ "Load the MNIST dataset from [TensorFlow Datasets](https://www.tensorflow.org/datasets). This returns a dataset in the `tf.data` format.\n", "\n", "Setting the `with_info` argument to `True` includes the metadata for the entire dataset, which is being saved here to `info`. Among other things, this metadata object includes the number of train and test examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iXMJ3G9NB3X6" }, "outputs": [], "source": [ "datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)\n", "\n", "mnist_train, mnist_test = datasets['train'], datasets['test']" ] }, { "cell_type": "markdown", "metadata": { "id": "GrjVhv-eKuHD" }, "source": [ "## Define the distribution strategy" ] }, { "cell_type": "markdown", "metadata": { "id": "TlH8vx6BB3X9" }, "source": [ "Create a `MirroredStrategy` object. This will handle distribution and provide a context manager (`MirroredStrategy.scope`) to build your model inside." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4j0tdf4YB3X9" }, "outputs": [], "source": [ "strategy = tf.distribute.MirroredStrategy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cY3KA_h2iVfN" }, "outputs": [], "source": [ "print('Number of devices: {}'.format(strategy.num_replicas_in_sync))" ] }, { "cell_type": "markdown", "metadata": { "id": "lNbPv0yAleW8" }, "source": [ "## Set up the input pipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "psozqcuptXhK" }, "source": [ "When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. In general, use the largest batch size that fits the GPU memory and tune the learning rate accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p1xWxKcnhar9" }, "outputs": [], "source": [ "# You can also do info.splits.total_num_examples to get the total\n", "# number of examples in the dataset.\n", "\n", "num_train_examples = info.splits['train'].num_examples\n", "num_test_examples = info.splits['test'].num_examples\n", "\n", "BUFFER_SIZE = 10000\n", "\n", "BATCH_SIZE_PER_REPLICA = 64\n", "BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync" ] }, { "cell_type": "markdown", "metadata": { "id": "0Wm5rsL2KoDF" }, "source": [ "Define a function that normalizes the image pixel values from the `[0, 255]` range to the `[0, 1]` range ([feature scaling](https://en.wikipedia.org/wiki/Feature_scaling)):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Eo9a46ZeJCkm" }, "outputs": [], "source": [ "def scale(image, label):\n", " image = tf.cast(image, tf.float32)\n", " image /= 255\n", "\n", " return image, label" ] }, { "cell_type": "markdown", "metadata": { "id": "WZCa5RLc5A91" }, "source": [ "Apply this `scale` function to the training and test data, and then use the `tf.data.Dataset` APIs to shuffle the training data (`Dataset.shuffle`), and batch it (`Dataset.batch`). Notice that you are also keeping an in-memory cache of the training data to improve performance (`Dataset.cache`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gRZu2maChwdT" }, "outputs": [], "source": [ "train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)\n", "eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "4xsComp8Kz5H" }, "source": [ "## Create the model and instantiate the optimizer" ] }, { "cell_type": "markdown", "metadata": { "id": "1BnQYQTpB3YA" }, "source": [ "Within the context of `Strategy.scope`, create and compile the model using the Keras API:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IexhL_vIB3YA" }, "outputs": [], "source": [ "with strategy.scope():\n", " model = tf.keras.Sequential([\n", " tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),\n", " tf.keras.layers.MaxPooling2D(),\n", " tf.keras.layers.Flatten(),\n", " tf.keras.layers.Dense(64, activation='relu'),\n", " tf.keras.layers.Dense(10)\n", " ])\n", "\n", " model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),\n", " metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "DCDKFcNJzdcd" }, "source": [ "For this toy example with the MNIST dataset, you will be using the Adam optimizer's default learning rate of 0.001.\n", "\n", "For larger datasets, the key benefit of distributed training is to learn more in each training step, because each step processes more training data in parallel, which allows for a larger learning rate (within the limits of the model and dataset)." ] }, { "cell_type": "markdown", "metadata": { "id": "8i6OU5W9Vy2u" }, "source": [ "## Define the callbacks\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YOXO5nvvK3US" }, "source": [ "Define the following [Keras Callbacks](https://www.tensorflow.org/guide/keras/train_and_evaluate):\n", "\n", "- `tf.keras.callbacks.TensorBoard`: writes a log for TensorBoard, which allows you to visualize the graphs.\n", "- `tf.keras.callbacks.ModelCheckpoint`: saves the model at a certain frequency, such as after every epoch.\n", "- `tf.keras.callbacks.BackupAndRestore`: provides the fault tolerance functionality by backing up the model and current epoch number. Learn more in the _Fault tolerance_ section of the [Multi-worker training with Keras](multi_worker_with_keras.ipynb) tutorial.\n", "- `tf.keras.callbacks.LearningRateScheduler`: schedules the learning rate to change after, for example, every epoch/batch.\n", "\n", "For illustrative purposes, add a [custom callback](https://www.tensorflow.org/guide/keras/custom_callback) called `PrintLR` to display the *learning rate* in the notebook.\n", "\n", "**Note:** Use the `BackupAndRestore` callback instead of `ModelCheckpoint` as the main mechanism to restore the training state upon a restart from a job failure. Since `BackupAndRestore` only supports eager mode, in graph mode consider using `ModelCheckpoint`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A9bwLCcXzSgy" }, "outputs": [], "source": [ "# Define the checkpoint directory to store the checkpoints.\n", "checkpoint_dir = './training_checkpoints'\n", "# Define the name of the checkpoint files.\n", "checkpoint_prefix = os.path.join(checkpoint_dir, \"ckpt_{epoch}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wpU-BEdzJDbK" }, "outputs": [], "source": [ "# Define a function for decaying the learning rate.\n", "# You can define any decay function you need.\n", "def decay(epoch):\n", " if epoch < 3:\n", " return 1e-3\n", " elif epoch >= 3 and epoch < 7:\n", " return 1e-4\n", " else:\n", " return 1e-5" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jKhiMgXtKq2w" }, "outputs": [], "source": [ "# Define a callback for printing the learning rate at the end of each epoch.\n", "class PrintLR(tf.keras.callbacks.Callback):\n", " def on_epoch_end(self, epoch, logs=None):\n", " print('\\nLearning rate for epoch {} is {}'.format( epoch + 1, model.optimizer.lr.numpy()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YVqAbR6YyNQh" }, "outputs": [], "source": [ "# Put all the callbacks together.\n", "callbacks = [\n", " tf.keras.callbacks.TensorBoard(log_dir='./logs'),\n", " tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,\n", " save_weights_only=True),\n", " tf.keras.callbacks.LearningRateScheduler(decay),\n", " PrintLR()\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "70HXgDQmK46q" }, "source": [ "## Train and evaluate" ] }, { "cell_type": "markdown", "metadata": { "id": "6EophnOAB3YD" }, "source": [ "Now, train the model in the usual way by calling Keras `Model.fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7MVw_6CqB3YD" }, "outputs": [], "source": [ "EPOCHS = 12\n", "\n", "model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)" ] }, { "cell_type": "markdown", "metadata": { "id": "NUcWAUUupIvG" }, "source": [ "Check for saved checkpoints:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JQ4zeSTxKEhB" }, "outputs": [], "source": [ "# Check the checkpoint directory.\n", "!ls {checkpoint_dir}" ] }, { "cell_type": "markdown", "metadata": { "id": "qor53h7FpMke" }, "source": [ "To check how well the model performs, load the latest checkpoint and call `Model.evaluate` on the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JtEwxiTgpQoP" }, "outputs": [], "source": [ "model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n", "\n", "eval_loss, eval_acc = model.evaluate(eval_dataset)\n", "\n", "print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))" ] }, { "cell_type": "markdown", "metadata": { "id": "IIeF2RWfYu4N" }, "source": [ "To visualize the output, launch TensorBoard and view the logs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vtyAZO0DoKu_" }, "outputs": [], "source": [ "%tensorboard --logdir=logs" ] }, { "cell_type": "markdown", "metadata": { "id": "a0a82d26d6bd" }, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LnyscOkvKKBR" }, "outputs": [], "source": [ "!ls -sh ./logs" ] }, { "cell_type": "markdown", "metadata": { "id": "kBLlogrDvMgg" }, "source": [ "## Save the model" ] }, { "cell_type": "markdown", "metadata": { "id": "Xa87y_A0vRma" }, "source": [ "Save the model to a `.keras` zip archive using `Model.save`. After your model is saved, you can load it with or without the `Strategy.scope`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "h8Q4MKOLwG7K" }, "outputs": [], "source": [ "path = 'my_model.keras'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4HvcDmVsvQoa" }, "outputs": [], "source": [ "model.save(path)" ] }, { "cell_type": "markdown", "metadata": { "id": "vKJT4w5JwVPI" }, "source": [ "Now, load the model without `Strategy.scope`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T_gT0RbRvQ3o" }, "outputs": [], "source": [ "unreplicated_model = tf.keras.models.load_model(path)\n", "\n", "unreplicated_model.compile(\n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " optimizer=tf.keras.optimizers.Adam(),\n", " metrics=['accuracy'])\n", "\n", "eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)\n", "\n", "print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))" ] }, { "cell_type": "markdown", "metadata": { "id": "YBLzcRF0wbDe" }, "source": [ "Load the model with `Strategy.scope`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BBVo3WGGwd9a" }, "outputs": [], "source": [ "with strategy.scope():\n", " replicated_model = tf.keras.models.load_model(path)\n", " replicated_model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " optimizer=tf.keras.optimizers.Adam(),\n", " metrics=['accuracy'])\n", "\n", " eval_loss, eval_acc = replicated_model.evaluate(eval_dataset)\n", " print ('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))" ] }, { "cell_type": "markdown", "metadata": { "id": "MUZwaz4AKjtD" }, "source": [ "### Additional resources\n", "\n", "More examples that use different distribution strategies with the Keras `Model.fit` API:\n", "\n", "1. The [Solve GLUE tasks using BERT on TPU](https://www.tensorflow.org/text/tutorials/bert_glue) tutorial uses `tf.distribute.MirroredStrategy` for training on GPUs and `tf.distribute.TPUStrategy` on TPUs.\n", "1. The [Save and load a model using a distribution strategy](save_and_load.ipynb) tutorial demonstates how to use the SavedModel APIs with `tf.distribute.Strategy`.\n", "1. The [official TensorFlow models](https://github.com/tensorflow/models/tree/master/official) can be configured to run multiple distribution strategies.\n", "\n", "To learn more about TensorFlow distribution strategies:\n", "\n", "1. The [Custom training with tf.distribute.Strategy](custom_training.ipynb) tutorial shows how to use the `tf.distribute.MirroredStrategy` for single-worker training with a custom training loop.\n", "1. The [Multi-worker training with Keras](multi_worker_with_keras.ipynb) tutorial shows how to use the `MultiWorkerMirroredStrategy` with `Model.fit`.\n", "1. The [Custom training loop with Keras and MultiWorkerMirroredStrategy](multi_worker_with_ctl.ipynb) tutorial shows how to use the `MultiWorkerMirroredStrategy` with Keras and a custom training loop.\n", "1. The [Distributed training in TensorFlow](https://www.tensorflow.org/guide/distributed_training) guide provides an overview of the available distribution strategies.\n", "1. The [Better performance with tf.function](../../guide/function.ipynb) guide provides information about other strategies and tools, such as the [TensorFlow Profiler](../../guide/profiler.md) you can use to optimize the performance of your TensorFlow models.\n", "\n", "Note: `tf.distribute.Strategy` is actively under development and TensorFlow will be adding more examples and tutorials in the near future. 
Please give it a try. Your feedback is welcome—feel free to submit it via [issues on GitHub](https://github.com/tensorflow/tensorflow/issues/new)." ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "keras.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }