{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Tce3stUlHN0L"
   },
   "source": [
    "##### Copyright 2019 The TensorFlow Authors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "cellView": "form",
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.028385Z",
     "iopub.status.busy": "2024-01-11T18:07:04.028107Z",
     "iopub.status.idle": "2024-01-11T18:07:04.031889Z",
     "shell.execute_reply": "2024-01-11T18:07:04.031316Z"
    },
    "id": "tuOe1ymfHZPu"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MfBg1C5NB3X0"
   },
   "source": [
    "# Keras と MultiWorkerMirroredStrategy を使用したカスタムトレーニングループ\n",
    "\n",
    "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
    "  <td>     <a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl\">     <img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\">     TensorFlow.org で表示</a>\n",
    "</td>\n",
    "  <td>     <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/ja/tutorials/distribute/multi_worker_with_ctl.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">Google Colabで実行</a>\n",
    "</td>\n",
    "  <td><a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/ja/tutorials/distribute/multi_worker_with_ctl.ipynb\">     <img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">     GitHubでソースを表示</a></td>\n",
    "  <td><a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/ja/tutorials/distribute/multi_worker_with_ctl.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\">ノートブックをダウンロード</a></td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xHxb-dlhMIzW"
   },
   "source": [
    "## 概要\n",
    "\n",
    "このチュートリアルでは、`tf.distribute.Strategy` API を使用して、Keras モデルと[カスタムトレーニングループ](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch)でマルチワーカー分散トレーニングを実行する方法を実演します。トレーニングループは `tf.distribute.MultiWorkerMirroredStrategy` を介して分散され、[単一のワーカー](custom_training.ipynb)で実行するように設計された `tf.keras` モデルが、最小限のコード変更で複数のワーカーでシームレスに機能します。カスタムトレーニングループは、モデルのデバッグを容易にするでけでなく、柔軟なトレーニングとより優れた制御を提供します。詳細については、[基本的なトレーニングループの作成](../../guide/basic_training_loops.ipynb)、[ゼロからのトレーニングループの作成](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch)、[カスタムトレーニング](../customization/custom_training_walkthrough.ipynb)を参照してください。\n",
    "\n",
    "`tf.keras.Model.fit` で `MultiWorkerMirroredStrategy` を使用する方法については、この[チュートリアル](multi_worker_with_keras.ipynb)を参照してください。\n",
    "\n",
    "<code>tf.distribute.Strategy</code> API の理解をさらに深めるには、<a>TensorFlow での分散型トレーニング</a>ガイドを参照してください。このガイドでは、TensorFlow がサポートする分散ストラテジーの概要が提供されています。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MUXex9ctTuDB"
   },
   "source": [
    "## セットアップ\n",
    "\n",
    "まず、必要なものをインポートします。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.035164Z",
     "iopub.status.busy": "2024-01-11T18:07:04.034955Z",
     "iopub.status.idle": "2024-01-11T18:07:04.041177Z",
     "shell.execute_reply": "2024-01-11T18:07:04.040558Z"
    },
    "id": "bnYxvfLD-LW-"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import sys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Zz0EY91y3mxy"
   },
   "source": [
    "TensorFlow をインポートする前に、環境にいくつかの変更を加えます。\n",
    "\n",
    "- すべての GPU を無効にします。これにより、すべてのワーカーが同じ GPU を使用しようとすることによって発生するエラーが防止されます。実際のアプリケーションでは、各ワーカーは異なるマシン上にあります。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.044566Z",
     "iopub.status.busy": "2024-01-11T18:07:04.044030Z",
     "iopub.status.idle": "2024-01-11T18:07:04.047090Z",
     "shell.execute_reply": "2024-01-11T18:07:04.046513Z"
    },
    "id": "685pbYEY3jGC"
   },
   "outputs": [],
   "source": [
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"-1\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7X1MS6385BWi"
   },
   "source": [
    "- `'TF_CONFIG'` 環境変数をリセットします。これについては後で詳しく説明します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.050384Z",
     "iopub.status.busy": "2024-01-11T18:07:04.049798Z",
     "iopub.status.idle": "2024-01-11T18:07:04.052876Z",
     "shell.execute_reply": "2024-01-11T18:07:04.052319Z"
    },
    "id": "WEJLYa2_7OZF"
   },
   "outputs": [],
   "source": [
    "os.environ.pop('TF_CONFIG', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Rd4L9Ii77SS8"
   },
   "source": [
    "- 現在のディレクトリが Python のパス上にあることを確認してください。これにより、ノートブックは後で `%%writefile` によって書き込まれたファイルをインポートできます。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.055631Z",
     "iopub.status.busy": "2024-01-11T18:07:04.055424Z",
     "iopub.status.idle": "2024-01-11T18:07:04.058575Z",
     "shell.execute_reply": "2024-01-11T18:07:04.057906Z"
    },
    "id": "hPBuZUNSZmrQ"
   },
   "outputs": [],
   "source": [
    "if '.' not in sys.path:\n",
    "  sys.path.insert(0, '.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pDhHuMjb7bfU"
   },
   "source": [
    "次に TensorFlow をインポートします。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:04.061537Z",
     "iopub.status.busy": "2024-01-11T18:07:04.061306Z",
     "iopub.status.idle": "2024-01-11T18:07:06.413893Z",
     "shell.execute_reply": "2024-01-11T18:07:06.413202Z"
    },
    "id": "vHNvttzV43sA"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:04.484008: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2024-01-11 18:07:04.484054: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2024-01-11 18:07:04.485585: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0S2jpf6Sx50i"
   },
   "source": [
    "### データセットとモデルの定義"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fLW6D2TzvC-4"
   },
   "source": [
    "次に、単純なモデルとデータセットの設定を使用して `mnist.py` ファイルを作成します。この Python ファイルは、このチュートリアルのワーカープロセスによって使用されます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.418559Z",
     "iopub.status.busy": "2024-01-11T18:07:06.418132Z",
     "iopub.status.idle": "2024-01-11T18:07:06.423737Z",
     "shell.execute_reply": "2024-01-11T18:07:06.423087Z"
    },
    "id": "dma_wUAxZqo2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing mnist.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mnist.py\n",
    "\n",
    "import os\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "def mnist_dataset(batch_size):\n",
    "  (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()\n",
    "  # The `x` arrays are in uint8 and have values in the range [0, 255].\n",
    "  # You need to convert them to float32 with values in the range [0, 1]\n",
    "  x_train = x_train / np.float32(255)\n",
    "  y_train = y_train.astype(np.int64)\n",
    "  train_dataset = tf.data.Dataset.from_tensor_slices(\n",
    "      (x_train, y_train)).shuffle(60000)\n",
    "  return train_dataset\n",
    "\n",
    "def dataset_fn(global_batch_size, input_context):\n",
    "  batch_size = input_context.get_per_replica_batch_size(global_batch_size)\n",
    "  dataset = mnist_dataset(batch_size)\n",
    "  dataset = dataset.shard(input_context.num_input_pipelines,\n",
    "                          input_context.input_pipeline_id)\n",
    "  dataset = dataset.batch(batch_size)\n",
    "  return dataset\n",
    "\n",
    "def build_cnn_model():\n",
    "  regularizer = tf.keras.regularizers.L2(1e-5)\n",
    "  return tf.keras.Sequential([\n",
    "      tf.keras.Input(shape=(28, 28)),\n",
    "      tf.keras.layers.Reshape(target_shape=(28, 28, 1)),\n",
    "      tf.keras.layers.Conv2D(32, 3,\n",
    "                             activation='relu',\n",
    "                             kernel_regularizer=regularizer),\n",
    "      tf.keras.layers.Flatten(),\n",
    "      tf.keras.layers.Dense(128,\n",
    "                            activation='relu',\n",
    "                            kernel_regularizer=regularizer),\n",
    "      tf.keras.layers.Dense(10, kernel_regularizer=regularizer)\n",
    "  ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JmgZwwymxqt5"
   },
   "source": [
    "## マルチワーカー構成\n",
    "\n",
    "これから、マルチワーカートレーニングを見ていきます。TensorFlow では、複数のマシンでのトレーニングに `'TF_CONFIG'` 環境変数が必要です。各マシンは異なる役割を持つ場合があります。以下で使用される `'TF_CONFIG'` 変数は、クラスタの一部である各ワーカーのクラスタ構成を指定する JSON 文字列です。これは、`cluster_resolver.TFConfigClusterResolver` を使用してクラスタを指定するためのデフォルトの方法ですが、`distribute.cluster_resolver` モジュールで利用可能な他のオプションがあります。`'TF_CONFIG'` 変数の設定の詳細は、[分散型トレーニングガイド](../../guide/distributed_training.ipynb)を参照してください。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SS8WhvRhe_Ya"
   },
   "source": [
    "### クラスタについて説明する\n",
    "\n",
    "以下に構成の例を示します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.426986Z",
     "iopub.status.busy": "2024-01-11T18:07:06.426733Z",
     "iopub.status.idle": "2024-01-11T18:07:06.430359Z",
     "shell.execute_reply": "2024-01-11T18:07:06.429729Z"
    },
    "id": "XK1eTYvSZiX7"
   },
   "outputs": [],
   "source": [
    "tf_config = {\n",
    "    'cluster': {\n",
    "        'worker': ['localhost:12345', 'localhost:23456']\n",
    "    },\n",
    "    'task': {'type': 'worker', 'index': 0}\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JjgwJbPKZkJL"
   },
   "source": [
    "`tf_config` は Python の単なるローカル変数であることに注意してください。トレーニング構成に使用するには、JSON としてシリアル化し、`'TF_CONFIG'` 環境変数に配置します。JSON 文字列としてシリアル化された同じ `'TF_CONFIG'` を次に示します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.433531Z",
     "iopub.status.busy": "2024-01-11T18:07:06.433274Z",
     "iopub.status.idle": "2024-01-11T18:07:06.439865Z",
     "shell.execute_reply": "2024-01-11T18:07:06.439274Z"
    },
    "id": "yY-T0YDQZjbu"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'{\"cluster\": {\"worker\": [\"localhost:12345\", \"localhost:23456\"]}, \"task\": {\"type\": \"worker\", \"index\": 0}}'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AUBmYRZqxthH"
   },
   "source": [
    "`TF_CONFIG` には、`cluster` と `task` の 2 つのコンポーネントがあります。\n",
    "\n",
    "- `'cluster'` はすべてのワーカーで同じであり、トレーニングクラスタに関する情報を提供します。これは、`'worker'` などのさまざまなタイプのジョブで構成されるディクショナリです。`MultiWorkerMirroredStrategy` を使用するマルチワーカートレーニングでは、通常、一般的な `'worker'` の作業に加えて、チェックポイントの保存や TensorBoard のサマリーファイルの書き込みなど、ほかよりタスクを担う `'worker'` が 1 つあります。こういったワーカーは、`'chief'` ワーカーと呼ばれ、`'index'` 0 のワーカーがチーフワーカーに指定されるようになっています。\n",
    "\n",
    "- `'task'` は現在のタスクの情報を提供し、ワーカーごとに異なります。タスクはそのワーカーの `'type'` と `'index'` を指定します。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8YFpxrcsZ2xG"
   },
   "source": [
    "この例では、タスクの `'type'` を `'worker'`、そしてタスクの `'index'` を `0` に指定します。つまり、このような設定を持つマシンが最初のワーカーであり、チーフワーカーとして指定されて他のワーカーよりも多くの作業を実行します。他のマシンには、`'TF_CONFIG'` 環境変数も設定されており、同一の `'cluster'` ディクショナリも必要ですが、タスクの `'type'` やタスクの `'index'` は、それらのマシンの役割に応じて異なります。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aogb74kHxynz"
   },
   "source": [
    "このチュートリアルでは、例として `'localhost'` 上に 2 つのワーカーを持つ `'TF_CONFIG'` の設定方法を紹介します。実際には、外部 IP アドレスとポートに複数のワーカーを作成し、各ワーカーに適切な `'TF_CONFIG'` を設定します。\n",
    "\n",
    "この例では、2 つのワーカーを使用します。最初のワーカーの `'TF_CONFIG'` は上に示されています。2  番目のワーカーには、`tf_config['task']['index']=1` を設定します。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cIlkfWmjz1PG"
   },
   "source": [
    "### ノートブックの環境変数とサブプロセス"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FcjAbuGY1ACJ"
   },
   "source": [
    "サブプロセスは、親から環境変数を継承します。したがって、この Jupyter ノートブックプロセスで環境変数を設定すると、次のようになります。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.443319Z",
     "iopub.status.busy": "2024-01-11T18:07:06.443087Z",
     "iopub.status.idle": "2024-01-11T18:07:06.446077Z",
     "shell.execute_reply": "2024-01-11T18:07:06.445496Z"
    },
    "id": "PH2gHn2_0_U8"
   },
   "outputs": [],
   "source": [
    "os.environ['GREETINGS'] = 'Hello TensorFlow!'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gQkIX-cg18md"
   },
   "source": [
    "サブプロセスから環境変数にアクセスできます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.449111Z",
     "iopub.status.busy": "2024-01-11T18:07:06.448868Z",
     "iopub.status.idle": "2024-01-11T18:07:06.485987Z",
     "shell.execute_reply": "2024-01-11T18:07:06.485096Z"
    },
    "id": "pquKO6IA18G5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello TensorFlow!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo ${GREETINGS}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "af6BCA-Y2fpz"
   },
   "source": [
    "次のセクションでは、これを使用して `TF_CONFIG` をワーカーサブプロセスに渡します。この方法で実際にジョブを起動することは決してありませんが、このチュートリアルで最小限のマルチワーカーの例を示すためには十分です。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UhNtHfuxCGVy"
   },
   "source": [
    "## MultiWorkerMirroredStrategy\n",
    "\n",
    "モデルをトレーニングする前に、まず `tf.distribute.MultiWorkerMirroredStrategy` のインスタンスを作成します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.489858Z",
     "iopub.status.busy": "2024-01-11T18:07:06.489585Z",
     "iopub.status.idle": "2024-01-11T18:07:06.623882Z",
     "shell.execute_reply": "2024-01-11T18:07:06.623246Z"
    },
    "id": "1uFSHCJXMrQ-"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Using MirroredStrategy with devices ('/device:CPU:0',)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:06.593731: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    }
   ],
   "source": [
    "strategy = tf.distribute.MultiWorkerMirroredStrategy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N0iv7SyyAohc"
   },
   "source": [
    "注意: `tf.distribute.MultiWorkerMirroredStrategy` を呼び出すと、`'TF_CONFIG'` が解析され、TensorFlow の GRPC サーバーが開始されます。したがって、`tf.distribute.Strategy` をインスタンス化する前に、`'TF_CONFIG'` 環境変数を設定する必要があります。時間を節約するために、このチュートリアルではサーバーを起動する必要がありません。完全な例は、このチュートリアルの最後のセクションにあります。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TS4S-faBHHam"
   },
   "source": [
    "`tf.distribute.Strategy.scope` を使用して、モデルの構築時にストラテジーを使用する必要があることを指定します。これにより、ストラテジーは変数の配置などを制御できます。すべてのワーカーの各デバイスのモデルのレイヤーにすべての変数のコピーが作成されます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.627200Z",
     "iopub.status.busy": "2024-01-11T18:07:06.626959Z",
     "iopub.status.idle": "2024-01-11T18:07:06.724006Z",
     "shell.execute_reply": "2024-01-11T18:07:06.723320Z"
    },
    "id": "nXV49tG1_opc"
   },
   "outputs": [],
   "source": [
    "import mnist\n",
    "with strategy.scope():\n",
    "  # Model building needs to be within `strategy.scope()`.\n",
    "  multi_worker_model = mnist.build_cnn_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DSYkM-on6r3Y"
   },
   "source": [
    "## ワーカー間でデータを自動シャーディングする\n",
    "\n",
    "マルチワーカー トレーニングでは、収束と再現性を確保するために*データセットのシャーディング*が必要です。シャーディングとは、各ワーカーにデータセット全体のサブセットを渡すことを意味し、単一のワーカーでのトレーニングに似たエクスペリエンスを作成するのに役立ちます。以下の例では、`tf.distribute` のデフォルトの自動シャーディング ポリシーを使用しています。`tf.data.experimental.DistributeOptions` の `tf.data.experimental.AutoShardPolicy` を設定してカスタマイズすることもできます。詳細については、[分散入力チュートリアル](input.ipynb)の*シャーディング*セクションを参照してください。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:06.728032Z",
     "iopub.status.busy": "2024-01-11T18:07:06.727463Z",
     "iopub.status.idle": "2024-01-11T18:07:07.329483Z",
     "shell.execute_reply": "2024-01-11T18:07:07.328681Z"
    },
    "id": "65-p36pt6rUF"
   },
   "outputs": [],
   "source": [
    "per_worker_batch_size = 64\n",
    "num_workers = len(tf_config['cluster']['worker'])\n",
    "global_batch_size = per_worker_batch_size * num_workers\n",
    "\n",
    "with strategy.scope():\n",
    "  multi_worker_dataset = strategy.distribute_datasets_from_function(\n",
    "      lambda input_context: mnist.dataset_fn(global_batch_size, input_context))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rkNzSR3g60iP"
   },
   "source": [
    "## カスタムトレーニングループを定義してモデルをトレーニングする\n",
    "\n",
    "オプティマイザーを指定します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.333760Z",
     "iopub.status.busy": "2024-01-11T18:07:07.333500Z",
     "iopub.status.idle": "2024-01-11T18:07:07.345689Z",
     "shell.execute_reply": "2024-01-11T18:07:07.345036Z"
    },
    "id": "NoMr4_zTeKSn"
   },
   "outputs": [],
   "source": [
    "with strategy.scope():\n",
    "  # The creation of optimizer and train_accuracy needs to be in\n",
    "  # `strategy.scope()` as well, since they create variables.\n",
    "  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n",
    "  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n",
    "      name='train_accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RmrDcAii4B5O"
   },
   "source": [
    "`tf.function` でトレーニング ステップを定義します。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.348933Z",
     "iopub.status.busy": "2024-01-11T18:07:07.348668Z",
     "iopub.status.idle": "2024-01-11T18:07:07.355204Z",
     "shell.execute_reply": "2024-01-11T18:07:07.354623Z"
    },
    "id": "znXWN5S3eUDB"
   },
   "outputs": [],
   "source": [
    "@tf.function\n",
    "def train_step(iterator):\n",
    "  \"\"\"Training step function.\"\"\"\n",
    "\n",
    "  def step_fn(inputs):\n",
    "    \"\"\"Per-Replica step function.\"\"\"\n",
    "    x, y = inputs\n",
    "    with tf.GradientTape() as tape:\n",
    "      predictions = multi_worker_model(x, training=True)\n",
    "      per_example_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
    "          from_logits=True,\n",
    "          reduction=tf.keras.losses.Reduction.NONE)(y, predictions)\n",
    "      loss = tf.nn.compute_average_loss(per_example_loss)\n",
    "      model_losses = multi_worker_model.losses\n",
    "      if model_losses:\n",
    "        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))\n",
    "\n",
    "    grads = tape.gradient(loss, multi_worker_model.trainable_variables)\n",
    "    optimizer.apply_gradients(\n",
    "        zip(grads, multi_worker_model.trainable_variables))\n",
    "    train_accuracy.update_state(y, predictions)\n",
    "    return loss\n",
    "\n",
    "  per_replica_losses = strategy.run(step_fn, args=(next(iterator),))\n",
    "  return strategy.reduce(\n",
    "      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eFXHsUVBy0Rx"
   },
   "source": [
    "### チェックポイントの保存と復元\n",
    "\n",
    "カスタムトレーニングループを作成するときは、Keras コールバックに依存するのではなく、[チェックポイントの保存](../../guide/checkpoint.ipynb)を手動で処理する必要があります。`MultiWorkerMirroredStrategy` の場合、チェックポイントまたは完全なモデルを保存するには、すべてのワーカーが含まれる必要があることに注意してください。チーフワーカーだけを保存しようとすると、デッドロックが発生する可能性があるためです。ワーカーは、相互に上書きしないように、異なるパスに書き込む必要もあります。ディレクトリを構成する方法の例を次に示します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.358448Z",
     "iopub.status.busy": "2024-01-11T18:07:07.357852Z",
     "iopub.status.idle": "2024-01-11T18:07:07.363437Z",
     "shell.execute_reply": "2024-01-11T18:07:07.362872Z"
    },
    "id": "LcFO6x1KyjhI"
   },
   "outputs": [],
   "source": [
    "from multiprocessing import util\n",
    "checkpoint_dir = os.path.join(util.get_temp_dir(), 'ckpt')\n",
    "\n",
    "def _is_chief(task_type, task_id, cluster_spec):\n",
    "  return (task_type is None\n",
    "          or task_type == 'chief'\n",
    "          or (task_type == 'worker'\n",
    "              and task_id == 0\n",
    "              and \"chief\" not in cluster_spec.as_dict()))\n",
    "\n",
    "def _get_temp_dir(dirpath, task_id):\n",
    "  base_dirpath = 'workertemp_' + str(task_id)\n",
    "  temp_dir = os.path.join(dirpath, base_dirpath)\n",
    "  tf.io.gfile.makedirs(temp_dir)\n",
    "  return temp_dir\n",
    "\n",
    "def write_filepath(filepath, task_type, task_id, cluster_spec):\n",
    "  dirpath = os.path.dirname(filepath)\n",
    "  base = os.path.basename(filepath)\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    dirpath = _get_temp_dir(dirpath, task_id)\n",
    "  return os.path.join(dirpath, base)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nrcdPHtG4ObO"
   },
   "source": [
    "モデルをトラッキングする `tf.train.Checkpoint` を 1 つ作成します。これは `tf.train.CheckpointManager` により管理されるため、最新のチェックポイントのみが保存されます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.366091Z",
     "iopub.status.busy": "2024-01-11T18:07:07.365874Z",
     "iopub.status.idle": "2024-01-11T18:07:07.374114Z",
     "shell.execute_reply": "2024-01-11T18:07:07.373503Z"
    },
    "id": "4rURT2pI4aqV"
   },
   "outputs": [],
   "source": [
    "epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64), name='epoch')\n",
    "step_in_epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64),\n",
    "    name='step_in_epoch')\n",
    "task_type, task_id = (strategy.cluster_resolver.task_type,\n",
    "                      strategy.cluster_resolver.task_id)\n",
    "# Normally, you don't need to manually instantiate a `ClusterSpec`, but in this\n",
    "# illustrative example you did not set `'TF_CONFIG'` before initializing the\n",
    "# strategy. Check out the next section for \"real-world\" usage.\n",
    "cluster_spec = tf.train.ClusterSpec(tf_config['cluster'])\n",
    "\n",
    "checkpoint = tf.train.Checkpoint(\n",
    "    model=multi_worker_model, epoch=epoch, step_in_epoch=step_in_epoch)\n",
    "\n",
    "write_checkpoint_dir = write_filepath(checkpoint_dir, task_type, task_id,\n",
    "                                      cluster_spec)\n",
    "checkpoint_manager = tf.train.CheckpointManager(\n",
    "    checkpoint, directory=write_checkpoint_dir, max_to_keep=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RO7cbN40XD5v"
   },
   "source": [
    "チェックポイントを復元する必要があれば、便利な `tf.train.latest_checkpoint` 関数を使用して、保存された最新のチェックポイントを見つけることができます (または `tf.train.CheckpointManager.restore_or_initialize` を呼び出します)。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.377022Z",
     "iopub.status.busy": "2024-01-11T18:07:07.376774Z",
     "iopub.status.idle": "2024-01-11T18:07:07.380176Z",
     "shell.execute_reply": "2024-01-11T18:07:07.379604Z"
    },
    "id": "gniynaQj6HMV"
   },
   "outputs": [],
   "source": [
    "latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
    "if latest_checkpoint:\n",
    "  checkpoint.restore(latest_checkpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1j9JuI-h6ObW"
   },
   "source": [
    "チェックポイントを復元した後、カスタムトレーニングループのトレーニングを続行できます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:07.383368Z",
     "iopub.status.busy": "2024-01-11T18:07:07.382833Z",
     "iopub.status.idle": "2024-01-11T18:07:11.518759Z",
     "shell.execute_reply": "2024-01-11T18:07:11.518005Z"
    },
    "id": "kZzXZCh45FY6"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:07.596472: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 0, accuracy: 0.828348, train_loss: 0.563694.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 1, accuracy: 0.928125, train_loss: 0.244830.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 2, accuracy: 0.945201, train_loss: 0.188251.\n"
     ]
    }
   ],
   "source": [
    "num_epochs = 3\n",
    "num_steps_per_epoch = 70\n",
    "\n",
    "while epoch.numpy() < num_epochs:\n",
    "  iterator = iter(multi_worker_dataset)\n",
    "  total_loss = 0.0\n",
    "  num_batches = 0\n",
    "\n",
    "  while step_in_epoch.numpy() < num_steps_per_epoch:\n",
    "    total_loss += train_step(iterator)\n",
    "    num_batches += 1\n",
    "    step_in_epoch.assign_add(1)\n",
    "\n",
    "  train_loss = total_loss / num_batches\n",
    "  print('Epoch: %d, accuracy: %f, train_loss: %f.'\n",
    "                %(epoch.numpy(), train_accuracy.result(), train_loss))\n",
    "\n",
    "  train_accuracy.reset_states()\n",
    "\n",
    "  # Once the `CheckpointManager` is set up, you're now ready to save, and remove\n",
    "  # the checkpoints non-chief workers saved.\n",
    "  checkpoint_manager.save()\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    tf.io.gfile.rmtree(write_checkpoint_dir)\n",
    "\n",
    "  epoch.assign_add(1)\n",
    "  step_in_epoch.assign(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0W1Osks466DE"
   },
   "source": [
    "## 完全なコードの概要"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jfYpmIxO6Jck"
   },
   "source": [
    "これまでに説明したすべての手順の概要は、以下のとおりです。\n",
    "\n",
    "1. ワーカープロセスを作成します。\n",
    "2. `'TF_CONFIG'` をワーカープロセスに渡します。\n",
    "3. 各ワークプロセスで、トレーニングコードを含む以下のスクリプトを実行します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.522675Z",
     "iopub.status.busy": "2024-01-11T18:07:11.522375Z",
     "iopub.status.idle": "2024-01-11T18:07:11.528883Z",
     "shell.execute_reply": "2024-01-11T18:07:11.528203Z"
    },
    "id": "MIDCESkVzN6M"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing main.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile main.py\n",
    "#@title File: `main.py`\n",
    "import os\n",
    "import json\n",
    "import tensorflow as tf\n",
    "import mnist\n",
    "from multiprocessing import util\n",
    "\n",
    "per_worker_batch_size = 64\n",
    "tf_config = json.loads(os.environ['TF_CONFIG'])\n",
    "num_workers = len(tf_config['cluster']['worker'])\n",
    "global_batch_size = per_worker_batch_size * num_workers\n",
    "\n",
    "num_epochs = 3\n",
    "num_steps_per_epoch=70\n",
    "\n",
    "# Checkpoint saving and restoring\n",
    "def _is_chief(task_type, task_id, cluster_spec):\n",
    "  return (task_type is None\n",
    "          or task_type == 'chief'\n",
    "          or (task_type == 'worker'\n",
    "              and task_id == 0\n",
    "              and 'chief' not in cluster_spec.as_dict()))\n",
    "\n",
    "def _get_temp_dir(dirpath, task_id):\n",
    "  base_dirpath = 'workertemp_' + str(task_id)\n",
    "  temp_dir = os.path.join(dirpath, base_dirpath)\n",
    "  tf.io.gfile.makedirs(temp_dir)\n",
    "  return temp_dir\n",
    "\n",
    "def write_filepath(filepath, task_type, task_id, cluster_spec):\n",
    "  dirpath = os.path.dirname(filepath)\n",
    "  base = os.path.basename(filepath)\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    dirpath = _get_temp_dir(dirpath, task_id)\n",
    "  return os.path.join(dirpath, base)\n",
    "\n",
    "checkpoint_dir = os.path.join(util.get_temp_dir(), 'ckpt')\n",
    "\n",
    "# Define Strategy\n",
    "strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
    "\n",
    "with strategy.scope():\n",
    "  # Model building/compiling need to be within `tf.distribute.Strategy.scope`.\n",
    "  multi_worker_model = mnist.build_cnn_model()\n",
    "\n",
    "  multi_worker_dataset = strategy.distribute_datasets_from_function(\n",
    "      lambda input_context: mnist.dataset_fn(global_batch_size, input_context))\n",
    "  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n",
    "  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n",
    "      name='train_accuracy')\n",
    "\n",
    "@tf.function\n",
    "def train_step(iterator):\n",
    "  \"\"\"Training step function.\"\"\"\n",
    "\n",
    "  def step_fn(inputs):\n",
    "    \"\"\"Per-Replica step function.\"\"\"\n",
    "    x, y = inputs\n",
    "    with tf.GradientTape() as tape:\n",
    "      predictions = multi_worker_model(x, training=True)\n",
    "      per_example_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
    "          from_logits=True,\n",
    "          reduction=tf.keras.losses.Reduction.NONE)(y, predictions)\n",
    "      loss = tf.nn.compute_average_loss(per_example_loss)\n",
    "      model_losses = multi_worker_model.losses\n",
    "      if model_losses:\n",
    "        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))\n",
    "\n",
    "    grads = tape.gradient(loss, multi_worker_model.trainable_variables)\n",
    "    optimizer.apply_gradients(\n",
    "        zip(grads, multi_worker_model.trainable_variables))\n",
    "    train_accuracy.update_state(y, predictions)\n",
    "\n",
    "    return loss\n",
    "\n",
    "  per_replica_losses = strategy.run(step_fn, args=(next(iterator),))\n",
    "  return strategy.reduce(\n",
    "      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)\n",
    "\n",
    "epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64), name='epoch')\n",
    "step_in_epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64),\n",
    "    name='step_in_epoch')\n",
    "\n",
    "task_type, task_id, cluster_spec = (strategy.cluster_resolver.task_type,\n",
    "                                    strategy.cluster_resolver.task_id,\n",
    "                                    strategy.cluster_resolver.cluster_spec())\n",
    "\n",
    "checkpoint = tf.train.Checkpoint(\n",
    "    model=multi_worker_model, epoch=epoch, step_in_epoch=step_in_epoch)\n",
    "\n",
    "write_checkpoint_dir = write_filepath(checkpoint_dir, task_type, task_id,\n",
    "                                      cluster_spec)\n",
    "checkpoint_manager = tf.train.CheckpointManager(\n",
    "    checkpoint, directory=write_checkpoint_dir, max_to_keep=1)\n",
    "\n",
    "# Restoring the checkpoint\n",
    "latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
    "if latest_checkpoint:\n",
    "  checkpoint.restore(latest_checkpoint)\n",
    "\n",
    "# Resume our CTL training\n",
    "while epoch.numpy() < num_epochs:\n",
    "  iterator = iter(multi_worker_dataset)\n",
    "  total_loss = 0.0\n",
    "  num_batches = 0\n",
    "\n",
    "  while step_in_epoch.numpy() < num_steps_per_epoch:\n",
    "    total_loss += train_step(iterator)\n",
    "    num_batches += 1\n",
    "    step_in_epoch.assign_add(1)\n",
    "\n",
    "  train_loss = total_loss / num_batches\n",
    "  print('Epoch: %d, accuracy: %f, train_loss: %f.'\n",
    "                %(epoch.numpy(), train_accuracy.result(), train_loss))\n",
    "\n",
    "  train_accuracy.reset_states()\n",
    "\n",
    "  checkpoint_manager.save()\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    tf.io.gfile.rmtree(write_checkpoint_dir)\n",
    "\n",
    "  epoch.assign_add(1)\n",
    "  step_in_epoch.assign(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ItVOvPN1qnZ6"
   },
   "source": [
    "現在のディレクトリには、両方の Python ファイルが含まれています。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.532421Z",
     "iopub.status.busy": "2024-01-11T18:07:11.532155Z",
     "iopub.status.idle": "2024-01-11T18:07:11.596064Z",
     "shell.execute_reply": "2024-01-11T18:07:11.595230Z"
    },
    "id": "bi6x05Sr60O9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "main.py\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mnist.py\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "ls *.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qmEEStPS6vR_"
   },
   "source": [
    "JSON は `TF_CONFIG` をシリアル化し、環境変数に追加します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.599947Z",
     "iopub.status.busy": "2024-01-11T18:07:11.599661Z",
     "iopub.status.idle": "2024-01-11T18:07:11.603682Z",
     "shell.execute_reply": "2024-01-11T18:07:11.603041Z"
    },
    "id": "9uu3g7vV7Bbt"
   },
   "outputs": [],
   "source": [
    "os.environ['TF_CONFIG'] = json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MsY3dQLK7jdf"
   },
   "source": [
    "これで、`main.py` を実行し、`'TF_CONFIG'` を使用するワーカープロセスを起動できます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.606900Z",
     "iopub.status.busy": "2024-01-11T18:07:11.606658Z",
     "iopub.status.idle": "2024-01-11T18:07:11.610540Z",
     "shell.execute_reply": "2024-01-11T18:07:11.609894Z"
    },
    "id": "txMXaq8d8N_S"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All background processes were killed.\n"
     ]
    }
   ],
   "source": [
    "# first kill any previous runs\n",
    "%killbgscripts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.613551Z",
     "iopub.status.busy": "2024-01-11T18:07:11.613284Z",
     "iopub.status.idle": "2024-01-11T18:07:11.668946Z",
     "shell.execute_reply": "2024-01-11T18:07:11.667866Z"
    },
    "id": "qnSma_Ck7r-r"
   },
   "outputs": [],
   "source": [
    "%%bash --bg\n",
    "python main.py &> job_0.log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZChyazqS7v0P"
   },
   "source": [
    "上記のコマンドについて注意すべき点がいくつかあります。\n",
    "\n",
    "1. [ノートブック「マジック」](https://ipython.readthedocs.io/en/stable/interactive/magics.html)である `%%bash` を使用して、いくつかの bash コマンドを実行します。\n",
    "2. このワーカーは終了しないため、`--bg` フラグを使用して `bash` プロセスをバックグラウンドで実行します。このワーカーは始める前にすべてのワーカーを待ちます。\n",
    "\n",
    "バックグラウンドのワーカープロセスはこのノートブックに出力を出力しないため、`&>` で出力をファイルにリダイレクトし、何が起こったかを検査できます。\n",
    "\n",
    "プロセスが開始するまで数秒待ちます。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:11.672944Z",
     "iopub.status.busy": "2024-01-11T18:07:11.672670Z",
     "iopub.status.idle": "2024-01-11T18:07:31.696607Z",
     "shell.execute_reply": "2024-01-11T18:07:31.695845Z"
    },
    "id": "Hm2yrULE9281"
   },
   "outputs": [],
   "source": [
    "import time\n",
    "time.sleep(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZFPoNxg_9_Mx"
   },
   "source": [
    "これまでにワーカーのログファイルに出力されたものを検査します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:31.700773Z",
     "iopub.status.busy": "2024-01-11T18:07:31.700488Z",
     "iopub.status.idle": "2024-01-11T18:07:31.762527Z",
     "shell.execute_reply": "2024-01-11T18:07:31.761758Z"
    },
    "id": "vZEOuVgQ9-hn"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.173295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.173354: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.174760: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:14.114778: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cat job_0.log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RqZhVF7L_KOy"
   },
   "source": [
    "ログファイルの最後の行は `Started server with target: grpc://localhost:12345` であるはずです。最初のワーカーは準備が整い、他のすべてのワーカーの準備が整うのを待っています。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Pi8vPNNA_l4a"
   },
   "source": [
    "2 番目のワーカーのプロセスを始めるように `tf_config` を更新します。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:31.766323Z",
     "iopub.status.busy": "2024-01-11T18:07:31.766034Z",
     "iopub.status.idle": "2024-01-11T18:07:31.770001Z",
     "shell.execute_reply": "2024-01-11T18:07:31.769340Z"
    },
    "id": "lAiYkkPu_Jqd"
   },
   "outputs": [],
   "source": [
    "tf_config['task']['index'] = 1\n",
    "os.environ['TF_CONFIG'] = json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0AshGVO0_x0w"
   },
   "source": [
    "次に、2 番目のワーカーを起動します。すべてのワーカーがアクティブであるため、これによりトレーニングが開始されます（したがって、このプロセスをバックグラウンドで実行する必要はありません）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:31.773323Z",
     "iopub.status.busy": "2024-01-11T18:07:31.773047Z",
     "iopub.status.idle": "2024-01-11T18:07:46.172579Z",
     "shell.execute_reply": "2024-01-11T18:07:46.170778Z"
    },
    "id": "_ESVtyQ9_xjx"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "python main.py > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hX4FA2O2AuAn"
   },
   "source": [
    "最初のワーカーにより書き込まれたログを再確認すると、そのモデルのトレーニングに参加していることがわかります。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:46.176876Z",
     "iopub.status.busy": "2024-01-11T18:07:46.176602Z",
     "iopub.status.idle": "2024-01-11T18:07:46.241946Z",
     "shell.execute_reply": "2024-01-11T18:07:46.241010Z"
    },
    "id": "rc6hw3yTBKXX"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.173295: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.173354: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:12.174760: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:14.114778: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:274] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-01-11 18:07:35.302084: W tensorflow/core/framework/dataset.cc:959] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 0, accuracy: 0.823103, train_loss: 0.584635.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 1, accuracy: 0.923103, train_loss: 0.254365.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 2, accuracy: 0.949888, train_loss: 0.180697.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cat job_0.log"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-01-11T18:07:46.245422Z",
     "iopub.status.busy": "2024-01-11T18:07:46.245067Z",
     "iopub.status.idle": "2024-01-11T18:07:46.249997Z",
     "shell.execute_reply": "2024-01-11T18:07:46.249349Z"
    },
    "id": "sG5_1UgrgniF"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All background processes were killed.\n"
     ]
    }
   ],
   "source": [
    "# Delete the `'TF_CONFIG'`, and kill any background tasks so they don't affect the next section.\n",
    "os.environ.pop('TF_CONFIG', None)\n",
    "%killbgscripts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bhxMXa0AaZkK"
   },
   "source": [
    "## マルチワーカートレーニングの詳細\n",
    "\n",
    "このチュートリアルでは、マルチワーカーセットアップのカスタムトレーニングループのワークフローを紹介しました。他のトピックの詳細な説明は、カスタムトレーニングループ向けの [Keras を使用したマルチワーカートレーニング (`tf.keras.Model.fit`)](multi_worker_with_keras.ipynb) チュートリアルを参照してください。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ega2hdOQEmy_"
   },
   "source": [
    "## 詳細情報\n",
    "\n",
    "1. [TensorFlow での分散型トレーニング](../../guide/distributed_training.ipynb)ガイドでは、利用可能な分散ストラテジーの概要が説明されています。\n",
    "2. [公式モデル](https://github.com/tensorflow/models/tree/master/official)。この多くは、複数の分散ストラテジーで実行するように構成できます。\n",
    "3. `tf.function` ガイドの[パフォーマンスの改善](../../guide/function.ipynb)では、その他のストラテジーや、TensorFlow モデルのパフォーマンスを最適化するために使用できる<a>ツール</a>に関する情報が提供されています。\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "multi_worker_with_ctl.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}