{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "wJcYs_ERTnnI" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2022-12-14T20:18:44.977236Z", "iopub.status.busy": "2022-12-14T20:18:44.976791Z", "iopub.status.idle": "2022-12-14T20:18:44.980535Z", "shell.execute_reply": "2022-12-14T20:18:44.979983Z" }, "id": "HMUDt0CiUJk9" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "77z2OchJTk0l" }, "source": [ "# Migrate single-worker multi-GPU training\n", "\n", "\n", " \n", " \n", " \n", " \n", "
View on TensorFlow.org Run in Google Colab\n", " View source on GitHub\n", " Download notebook
" ] }, { "cell_type": "markdown", "metadata": { "id": "meUTrR4I6m1C" }, "source": [ "This guide demonstrates how to migrate single-worker multi-GPU workflows from TensorFlow 1 to TensorFlow 2.\n", "\n", "To perform synchronous training across multiple GPUs on one machine:\n", "\n", "- In TensorFlow 1, use the `tf.estimator.Estimator` API with `tf.distribute.MirroredStrategy`.\n", "- In TensorFlow 2, use [Keras Model.fit](https://tensorflow.google.cn/tutorials/distribute/keras) or a [custom training loop](https://tensorflow.google.cn/tutorials/distribute/custom_training) with `tf.distribute.MirroredStrategy`. For details, see the [Distributed training with TensorFlow](https://tensorflow.google.cn/guide/distributed_training#mirroredstrategy) guide." ] }, { "cell_type": "markdown", "metadata": { "id": "YdZSoIXEbhg-" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "6d466b39d0db" }, "source": [ "Start with the imports and a simple dataset for demonstration purposes:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T20:18:44.984208Z", "iopub.status.busy": "2022-12-14T20:18:44.983707Z", "iopub.status.idle": "2022-12-14T20:18:46.879537Z", "shell.execute_reply": "2022-12-14T20:18:46.878827Z" }, "id": "iE0vSfMXumKI" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 20:18:45.916015: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 20:18:45.916109: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 20:18:45.916118: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] } ], "source": [ "import tensorflow as tf\n", "import tensorflow.compat.v1 as tf1" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T20:18:46.884071Z", "iopub.status.busy": "2022-12-14T20:18:46.883276Z", "iopub.status.idle": "2022-12-14T20:18:46.887892Z", "shell.execute_reply": "2022-12-14T20:18:46.887287Z" }, "id": "m7rnGxsXtDkV" }, "outputs": [], "source": [ "features = [[1., 1.5], [2., 2.5], [3., 3.5]]\n", "labels = [[0.3], [0.5], [0.7]]\n", "eval_features = [[4., 4.5], [5., 5.5], [6., 6.5]]\n", "eval_labels = [[0.8], [0.9], [1.]]" ] }, { "cell_type": "markdown", "metadata": { "id": "4uXff1BEssdE" }, "source": [ "## TensorFlow 1: Single-worker distributed training with tf.estimator.Estimator" ] }, { "cell_type": "markdown", "metadata": { "id": "A9560BqEOTpb" }, "source": [ "This example demonstrates the canonical TensorFlow 1 workflow for single-worker multi-GPU training. You set the distribution strategy (`tf.distribute.MirroredStrategy`) through the `config` parameter of `tf.estimator.Estimator`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T20:18:46.891543Z", "iopub.status.busy": "2022-12-14T20:18:46.890933Z", "iopub.status.idle": "2022-12-14T20:18:55.201255Z", "shell.execute_reply": "2022-12-14T20:18:55.200403Z" }, "id": "lqe9obf7suIj" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Initializing RunConfig with distribution strategies.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Not using Distribute Coordinator.\n" ] }, { "name": "stdout", 
"output_type": "stream", "text": [ "WARNING:tensorflow:Using temporary folder as model directory: /tmpfs/tmp/tmpuwsixyzs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using config: {'_model_dir': '/tmpfs/tmp/tmpuwsixyzs', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true\n", "graph_options {\n", " rewrite_options {\n", " meta_optimizer_iterations: ONE\n", " }\n", "}\n", ", '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': , '_device_fn': None, '_protocol': None, '_eval_distribute': , '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_distribute_coordinator_mode': None}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Not using Distribute Coordinator.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Running training and evaluation locally (non-distributed).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. 
Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1244: StrategyBase.configure (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "use `update_config_proto` instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py:461: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.\n", " warnings.warn(\"To make it possible to preserve tf.data options across \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/adagrad.py:138: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is 
deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Call initializer instance with the dtype argument instead of passing it to the constructor\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. 
https://github.com/tensorflow/tensorflow/issues/56089\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Create CheckpointSaverHook.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/util.py:95: DistributedIteratorV1.initialize (from tensorflow.python.distribute.v1.input_lib) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use the iterator's `initializer` property instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Graph was finalized.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saving checkpoints for 0 into /tmpfs/tmp/tmpuwsixyzs/model.ckpt.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 20:18:52.851713: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.853006: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. 
Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.865105: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.865600: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.875518: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.876039: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.883065: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:52.883549: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. 
Registered: device='CPU'\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Starting evaluation at 2022-12-14T20:18:54\n" ] }, { 
"name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Graph was finalized.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Restoring parameters from /tmpfs/tmp/tmpuwsixyzs/model.ckpt-0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Inference Time : 0.46575s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Finished evaluation at 2022-12-14-20:18:55\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saving dict for global step 0: global_step = 0, loss = 15.352239\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 20:18:54.899023: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.900364: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.903415: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.903922: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. 
Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.919177: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.919679: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.928562: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorFromStringHandle' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorFromStringHandle}}\n", "\t. Registered: device='CPU'\n", "\n", "2022-12-14 20:18:54.929037: W tensorflow/core/grappler/utils/graph_view.cc:836] No registered 'MultiDeviceIteratorGetNextFromShard' OpKernel for GPU devices compatible with node {{node MultiDeviceIteratorGetNextFromShard}}\n", "\t. 
Registered: device='CPU'\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saving 'checkpoint_path' summary for global step 0: /tmpfs/tmp/tmpuwsixyzs/model.ckpt-0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Loss for final step: None.\n" ] }, { "data": { "text/plain": [ "({'loss': 15.352239, 'global_step': 0}, [])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def _input_fn():\n", "  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n", "\n", "def _eval_input_fn():\n", "  return tf1.data.Dataset.from_tensor_slices(\n", "      (eval_features, eval_labels)).batch(1)\n", "\n", "def _model_fn(features, labels, mode):\n", "  logits = tf1.layers.Dense(1)(features)\n", "  loss = tf1.losses.mean_squared_error(labels=labels, predictions=logits)\n", "  optimizer = tf1.train.AdagradOptimizer(0.05)\n", "  train_op = optimizer.minimize(loss, global_step=tf1.train.get_global_step())\n", "  return tf1.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n", "\n", "strategy = tf1.distribute.MirroredStrategy()\n", "config = tf1.estimator.RunConfig(\n", "    train_distribute=strategy, eval_distribute=strategy)\n", "estimator = tf1.estimator.Estimator(model_fn=_model_fn, config=config)\n", "\n", "train_spec = tf1.estimator.TrainSpec(input_fn=_input_fn)\n", "eval_spec = tf1.estimator.EvalSpec(input_fn=_eval_input_fn)\n", "tf1.estimator.train_and_evaluate(estimator, train_spec, eval_spec)" ] }, { "cell_type": "markdown", "metadata": { "id": "KEmzBjfnsxwT" }, "source": [ "## TensorFlow 2: Single-worker training with Keras" ] }, { "cell_type": "markdown", "metadata": { "id": "fkgkGf_AOaRR" }, "source": [ "When migrating to TensorFlow 2, you can use the Keras APIs with `tf.distribute.MirroredStrategy`.\n", "\n", "If you build your model with the `tf.keras` APIs and train it with Keras `Model.fit`, the main difference is that you instantiate the Keras model, optimizer, and metrics within the `Strategy.scope` context, instead of defining a `config` for `tf.estimator.Estimator`.\n", "\n", "If you need a custom training loop, see the [Using 
tf.distribute.Strategy with custom training loops](https://tensorflow.google.cn/guide/distributed_training#using_tfdistributestrategy_with_custom_training_loops) guide." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T20:18:55.205306Z", "iopub.status.busy": "2022-12-14T20:18:55.204559Z", "iopub.status.idle": "2022-12-14T20:18:55.216520Z", "shell.execute_reply": "2022-12-14T20:18:55.215657Z" }, "id": "atVciNgPs0fw" }, "outputs": [], "source": [ "dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n", "eval_dataset = tf.data.Dataset.from_tensor_slices(\n", "    (eval_features, eval_labels)).batch(1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T20:18:55.219899Z", "iopub.status.busy": "2022-12-14T20:18:55.219374Z", "iopub.status.idle": "2022-12-14T20:18:59.444442Z", "shell.execute_reply": "2022-12-14T20:18:59.443703Z" }, "id": "Kip65sYBlKiu" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ 
"INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 20:18:55.307178: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: \"TensorSliceDataset/_2\"\n", "op: \"TensorSliceDataset\"\n", "input: \"Placeholder/_0\"\n", "input: \"Placeholder/_1\"\n", "attr {\n", " key: \"Toutput_types\"\n", " value {\n", " list {\n", " type: DT_FLOAT\n", " type: DT_FLOAT\n", " }\n", " }\n", "}\n", "attr {\n", " key: \"_cardinality\"\n", " value {\n", " i: 3\n", " }\n", "}\n", "attr {\n", " key: \"is_files\"\n", " value {\n", " b: false\n", " }\n", "}\n", "attr {\n", " key: \"metadata\"\n", " value {\n", " s: \"\\n\\025TensorSliceDataset:24\"\n", " }\n", "}\n", "attr {\n", " key: \"output_shapes\"\n", " value {\n", " list {\n", " shape {\n", " dim {\n", " size: 2\n", " }\n", " }\n", " shape {\n", " dim {\n", " size: 1\n", " }\n", " }\n", " }\n", " }\n", "}\n", "attr {\n", " key: \"replicate_on_split\"\n", " value {\n", " b: false\n", " }\n", "}\n", "experimental_type {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_DATASET\n", " args {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_FLOAT\n", " }\n", " }\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_FLOAT\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/3 
[=========>....................] - ETA: 5s - loss: 1.1238" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "3/3 [==============================] - 3s 8ms/step - loss: 2.0533\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 20:18:58.668860: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:784] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: \"TensorSliceDataset/_2\"\n", "op: \"TensorSliceDataset\"\n", "input: \"Placeholder/_0\"\n", "input: \"Placeholder/_1\"\n", "attr {\n", " key: \"Toutput_types\"\n", " value {\n", " list {\n", " type: DT_FLOAT\n", " type: DT_FLOAT\n", " }\n", " }\n", "}\n", "attr {\n", " key: \"_cardinality\"\n", " value {\n", " i: 3\n", " }\n", "}\n", "attr {\n", " key: \"is_files\"\n", " value {\n", " b: false\n", " }\n", "}\n", "attr {\n", " key: \"metadata\"\n", " value {\n", " s: \"\\n\\025TensorSliceDataset:26\"\n", " }\n", "}\n", "attr {\n", " key: \"output_shapes\"\n", " value {\n", " list {\n", " shape {\n", " dim {\n", " size: 2\n", " }\n", " }\n", " shape {\n", " dim {\n", " size: 1\n", " }\n", " }\n", " }\n", " }\n", "}\n", "attr {\n", " key: \"replicate_on_split\"\n", " value {\n", " b: false\n", " }\n", "}\n", "experimental_type {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_DATASET\n", " args {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_FLOAT\n", " }\n", " }\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_FLOAT\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/3 [=========>....................] 
- ETA: 1s - loss: 3.5378" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "3/3 [==============================] - 1s 6ms/step - loss: 5.7745\n" ] }, { "data": { "text/plain": [ "{'loss': 5.774545192718506}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "strategy = tf.distribute.MirroredStrategy()\n", "with strategy.scope():\n", "  model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])\n", "  optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.05)\n", "\n", "model.compile(optimizer=optimizer, loss='mse')\n", "model.fit(dataset)\n", "model.evaluate(eval_dataset, return_dict=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "0431f3935485" }, "source": [ "## Next steps" ] }, { "cell_type": "markdown", "metadata": { "id": "a68d2a99f79b" }, "source": [ "To learn more about distributed training with `tf.distribute.MirroredStrategy` in TensorFlow 2, see the following documentation:\n", "\n", "- The [Distributed training on one machine with Keras](../../tutorials/distribute/keras) tutorial\n", "- The [Distributed training on one machine with a custom training loop](../../tutorials/distribute/custom_training) tutorial\n", "- The [Distributed training with TensorFlow](../../guide/distributed_training) guide\n", "- The [Using multiple GPUs](../../guide/gpu#using_multiple_gpus) guide\n", "- The [Optimize the performance on the multi-GPU single host (with the TensorFlow Profiler)](../../guide/gpu_performance_analysis#2_optimize_the_performance_on_the_multi-gpu_single_host) guide" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "mirrored_strategy.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 0 }