{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Tce3stUlHN0L"
   },
   "source": [
    "##### Copyright 2019 The TensorFlow Authors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "cellView": "form",
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.937910Z",
     "iopub.status.busy": "2022-12-15T02:05:53.937282Z",
     "iopub.status.idle": "2022-12-15T02:05:53.941584Z",
     "shell.execute_reply": "2022-12-15T02:05:53.940978Z"
    },
    "id": "tuOe1ymfHZPu"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MfBg1C5NB3X0"
   },
   "source": [
    "# Keras 및 MultiWorkerMirroredStrategy를 사용한 사용자 정의 훈련 루프\n",
    "\n",
    "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
    "  <td><a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\">TensorFlow.org에서 보기</a></td>\n",
    "  <td><a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/ko/tutorials/distribute/multi_worker_with_ctl.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\">Google Colab에서 실행하기</a></td>\n",
    "  <td><a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/ko/tutorials/distribute/multi_worker_with_ctl.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\">GitHub에서 소스 보기</a></td>\n",
    "  <td><a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/ko/tutorials/distribute/multi_worker_with_ctl.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\">노트북 다운로드하기</a></td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xHxb-dlhMIzW"
   },
   "source": [
    "## 개요\n",
    "\n",
    "이 튜토리얼에서는 <code>tf.distribute.Strategy</code> API를 사용하여 Keras 모델 및 <a>사용자 정의 훈련 루프</a>로 다중 작업자 분산 훈련을 수행하는 방법을 보여줍니다. 훈련 루프는 `tf.distribute.MultiWorkerMirroredStrategy`를 통해 배포되므로 <a>단일 작업자</a>에서 실행되도록 설계된 <code>tf.keras</code> 모델은 최소한의 코드 변경만으로 여러 작업자에서 원활하게 작동할 수 있습니다. 사용자 정의 훈련 루프는 훈련에 대한 유연성과 더 많은 통제력을 제공하는 동시에 모델을 디버그하기 더 쉽게 만들어줍니다. [기본 훈련 루프](../../guide/basic_training_loops.ipynb), [처음부터 훈련 루프 작성하기](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch) 및 [사용자 정의 훈련](../customization/custom_training_walkthrough.ipynb)에 대해 자세히 알아보세요.\n",
    "\n",
    "`tf.keras.Model.fit`를 `MultiWorkerMirroredStrategy`과 함께 사용하는 방법을 찾고 있다면 이 [튜토리얼](multi_worker_with_keras.ipynb)을 대신 참조하세요.\n",
    "\n",
    "<code>tf.distribute.Strategy</code> API를 심층적으로 이해하는 데 관심이 있는 분들은 <a>TensorFlow로 분산 훈련하기</a> 가이드에서 TensorFlow가 제공하는 분산 훈련 전략들을 훑어보실 수 있습니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MUXex9ctTuDB"
   },
   "source": [
    "## 설정\n",
    "\n",
    "먼저, 몇 가지 필요한 패키지를 가져옵니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.945595Z",
     "iopub.status.busy": "2022-12-15T02:05:53.944983Z",
     "iopub.status.idle": "2022-12-15T02:05:53.952540Z",
     "shell.execute_reply": "2022-12-15T02:05:53.951864Z"
    },
    "id": "bnYxvfLD-LW-"
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import sys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Zz0EY91y3mxy"
   },
   "source": [
    "TensorFlow를 가져오기 전에 환경을 일부 변경합니다.\n",
    "\n",
    "- 모든 GPU를 비활성화합니다. 그러면 모든 작업자가 동일한 GPU를 사용하려고 하여 발생하는 오류가 방지됩니다. 실제 애플리케이션에서는 각 작업자가 다른 시스템에 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.956120Z",
     "iopub.status.busy": "2022-12-15T02:05:53.955552Z",
     "iopub.status.idle": "2022-12-15T02:05:53.958737Z",
     "shell.execute_reply": "2022-12-15T02:05:53.958139Z"
    },
    "id": "685pbYEY3jGC"
   },
   "outputs": [],
   "source": [
    "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"-1\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7X1MS6385BWi"
   },
   "source": [
    "- `'TF_CONFIG'` 환경 변수를 재설정합니다(나중에 이에 대해 자세히 볼 수 있음)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.962333Z",
     "iopub.status.busy": "2022-12-15T02:05:53.961769Z",
     "iopub.status.idle": "2022-12-15T02:05:53.964974Z",
     "shell.execute_reply": "2022-12-15T02:05:53.964372Z"
    },
    "id": "WEJLYa2_7OZF"
   },
   "outputs": [],
   "source": [
    "os.environ.pop('TF_CONFIG', None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Rd4L9Ii77SS8"
   },
   "source": [
    "- 현재 디렉터리가 Python의 경로에 있는지 확인합니다. 그렇게 되어 있으면 나중에 `%%writefile`이 작성한 파일을 노트북에서 가져올 수 있습니다.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.968274Z",
     "iopub.status.busy": "2022-12-15T02:05:53.967796Z",
     "iopub.status.idle": "2022-12-15T02:05:53.971014Z",
     "shell.execute_reply": "2022-12-15T02:05:53.970394Z"
    },
    "id": "hPBuZUNSZmrQ"
   },
   "outputs": [],
   "source": [
    "if '.' not in sys.path:\n",
    "  sys.path.insert(0, '.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pDhHuMjb7bfU"
   },
   "source": [
    "이제 TensorFlow를 가져옵니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:53.974578Z",
     "iopub.status.busy": "2022-12-15T02:05:53.973981Z",
     "iopub.status.idle": "2022-12-15T02:05:56.159719Z",
     "shell.execute_reply": "2022-12-15T02:05:56.158893Z"
    },
    "id": "vHNvttzV43sA"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:05:55.039887: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n",
      "2022-12-15 02:05:55.040002: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n",
      "2022-12-15 02:05:55.040014: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0S2jpf6Sx50i"
   },
   "source": [
    "### 데이터세트 및 모델 정의"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fLW6D2TzvC-4"
   },
   "source": [
    "다음으로, 간단한 모델 및 데이터세트 설정으로 `mnist.py` 파일을 만듭니다. 이 Python 파일은 이 튜토리얼의 작업자 프로세스에서 사용됩니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.164924Z",
     "iopub.status.busy": "2022-12-15T02:05:56.163876Z",
     "iopub.status.idle": "2022-12-15T02:05:56.169852Z",
     "shell.execute_reply": "2022-12-15T02:05:56.169185Z"
    },
    "id": "dma_wUAxZqo2"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing mnist.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile mnist.py\n",
    "\n",
    "import os\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "\n",
    "def mnist_dataset(batch_size):\n",
    "  (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()\n",
    "  # The `x` arrays are in uint8 and have values in the range [0, 255].\n",
    "  # You need to convert them to float32 with values in the range [0, 1]\n",
    "  x_train = x_train / np.float32(255)\n",
    "  y_train = y_train.astype(np.int64)\n",
    "  train_dataset = tf.data.Dataset.from_tensor_slices(\n",
    "      (x_train, y_train)).shuffle(60000)\n",
    "  return train_dataset\n",
    "\n",
    "def dataset_fn(global_batch_size, input_context):\n",
    "  batch_size = input_context.get_per_replica_batch_size(global_batch_size)\n",
    "  dataset = mnist_dataset(batch_size)\n",
    "  dataset = dataset.shard(input_context.num_input_pipelines,\n",
    "                          input_context.input_pipeline_id)\n",
    "  dataset = dataset.batch(batch_size)\n",
    "  return dataset\n",
    "\n",
    "def build_cnn_model():\n",
    "  return tf.keras.Sequential([\n",
    "      tf.keras.Input(shape=(28, 28)),\n",
    "      tf.keras.layers.Reshape(target_shape=(28, 28, 1)),\n",
    "      tf.keras.layers.Conv2D(32, 3, activation='relu'),\n",
    "      tf.keras.layers.Flatten(),\n",
    "      tf.keras.layers.Dense(128, activation='relu'),\n",
    "      tf.keras.layers.Dense(10)\n",
    "  ])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JmgZwwymxqt5"
   },
   "source": [
    "## 다중 작업자 구성\n",
    "\n",
    "이제 다중 작업자 훈련의 세계로 들어가 보겠습니다. TensorFlow에서 `'TF_CONFIG'` 환경 변수는 여러 시스템에 대한 훈련에 필요합니다. 시스템마다 역할이 다를 수 있습니다. 아래에 사용된 `'TF_CONFIG'` 변수는 클러스터의 일부인 각 작업자에 대한 클러스터 구성을 지정하는 JSON 문자열입니다. 이것은 `cluster_resolver.TFConfigClusterResolver`를 사용하여 클러스터를 지정하는 기본 방법이지만, `distribute.cluster_resolver` 모듈에서 사용할 수 있는 다른 옵션들도 있습니다. <a>분산 훈련 가이드</a>에서 <code>'TF_CONFIG'</code> 변수 설정에 대해 자세히 알아보세요."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SS8WhvRhe_Ya"
   },
   "source": [
    "### 클러스터 설명하기\n",
    "\n",
    "다음은 구성의 예입니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.173196Z",
     "iopub.status.busy": "2022-12-15T02:05:56.172946Z",
     "iopub.status.idle": "2022-12-15T02:05:56.176830Z",
     "shell.execute_reply": "2022-12-15T02:05:56.176173Z"
    },
    "id": "XK1eTYvSZiX7"
   },
   "outputs": [],
   "source": [
    "tf_config = {\n",
    "    'cluster': {\n",
    "        'worker': ['localhost:12345', 'localhost:23456']\n",
    "    },\n",
    "    'task': {'type': 'worker', 'index': 0}\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JjgwJbPKZkJL"
   },
   "source": [
    "`tf_config`는 단순히 Python의 지역 변수입니다. 훈련 구성에 사용하려면 JSON으로 직렬화하고 `'TF_CONFIG'` 환경 변수에 배치합니다. 다음은 JSON 문자열로 직렬화된 동일한 `'TF_CONFIG'`입니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.180125Z",
     "iopub.status.busy": "2022-12-15T02:05:56.179608Z",
     "iopub.status.idle": "2022-12-15T02:05:56.186274Z",
     "shell.execute_reply": "2022-12-15T02:05:56.185631Z"
    },
    "id": "yY-T0YDQZjbu"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'{\"cluster\": {\"worker\": [\"localhost:12345\", \"localhost:23456\"]}, \"task\": {\"type\": \"worker\", \"index\": 0}}'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AUBmYRZqxthH"
   },
   "source": [
    "`'TF_CONFIG'`에는 `'cluster'`와 `'task'`라는 두 가지 구성 요소가 있습니다.\n",
    "\n",
    "- `'cluster'`는 모든 작업자에게 동일하며 `'worker'`와 같은 다양한 유형의 작업으로 구성된 사전인 훈련 클러스터에 대한 정보를 제공합니다. `MultiWorkerMirroredStrategy`를 사용한 다중 작업자 훈련에는 일반적으로 일반 `'worker'`가 수행하는 작업 외에 TensorBoard에 대한 체크포인트를 저장하고 요약 파일을 작성하는 등 약간 더 많은 일을 처리하는 `'worker'`가 하나 있습니다. 이러한 작업자를 `'chief'` 작업자라고 하며, `'index'`가 0인 `'worker'`를 메인 `worker`로 지정하는 것이 관례입니다.\n",
    "\n",
    "- `'task'`는 현재 작업에 대한 정보를 제공하며 작업자마다 다릅니다. 이를 통해 해당 작업자의 `'type'`과 `'index'`가 지정됩니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8YFpxrcsZ2xG"
   },
   "source": [
    "이 예에서는 `'type'` 작업을 `'worker'`로 설정하고 `'index'` 작업을 `0`으로 설정합니다. 이 시스템은 첫 번째 작업자이며 메인 작업자로 지정되어 다른 작업자보다 더 많은 일을 처리하게 됩니다. 다른 시스템에도 `'TF_CONFIG'` 환경 변수가 설정되어 있어야 하며, 동일한 `'cluster'` 사전이 있어야 하지만 해당 시스템의 역할에 따라 다른 작업 `'type'` 또는 작업 `'index'`가 있어야 합니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aogb74kHxynz"
   },
   "source": [
    "설명을 위해 이 튜토리얼에서는 `'localhost'`에 두 작업자가 있는 `'TF_CONFIG'`를 설정하는 방법을 보여줍니다. 실제로 사용자는 외부 IP 주소/포트에 여러 작업자를 만들고 각 작업자에 `'TF_CONFIG'`를 적절하게 설정합니다.\n",
    "\n",
    "이 예제에서는 두 작업자를 사용합니다. 첫 번째 작업자의 `'TF_CONFIG'`는 위와 같습니다. 두 번째 작업자의 경우 `tf_config['task']['index']=1`을 설정합니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cIlkfWmjz1PG"
   },
   "source": [
    "### 노트북의 환경 변수 및 하위 프로세스"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FcjAbuGY1ACJ"
   },
   "source": [
    "하위 프로세스는 상위 요소로부터 환경 변수를 상속합니다. 따라서 이 Jupyter Notebook 프로세스에서 다음과 같이 환경 변수를 설정하는 경우:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.190230Z",
     "iopub.status.busy": "2022-12-15T02:05:56.189552Z",
     "iopub.status.idle": "2022-12-15T02:05:56.193068Z",
     "shell.execute_reply": "2022-12-15T02:05:56.192396Z"
    },
    "id": "PH2gHn2_0_U8"
   },
   "outputs": [],
   "source": [
    "os.environ['GREETINGS'] = 'Hello TensorFlow!'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gQkIX-cg18md"
   },
   "source": [
    "하위 프로세스에서 환경 변수에 액세스할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.196763Z",
     "iopub.status.busy": "2022-12-15T02:05:56.196100Z",
     "iopub.status.idle": "2022-12-15T02:05:56.251997Z",
     "shell.execute_reply": "2022-12-15T02:05:56.250970Z"
    },
    "id": "pquKO6IA18G5"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello TensorFlow!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo ${GREETINGS}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "af6BCA-Y2fpz"
   },
   "source": [
    "다음 섹션에서는 이를 사용하여 작업자 하위 프로세스에 `'TF_CONFIG'`를 전달합니다. 이런 식으로 작업을 시작하지는 않겠지만 이 튜토리얼의 목적인 최소 다중 작업자 예제를 보여주는 데는 충분합니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UhNtHfuxCGVy"
   },
   "source": [
    "## MultiWorkerMirroredStrategy\n",
    "\n",
    "모델을 훈련하기 전에 먼저 `tf.distribute.MultiWorkerMirroredStrategy`의 인스턴스를 만듭니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.256593Z",
     "iopub.status.busy": "2022-12-15T02:05:56.255950Z",
     "iopub.status.idle": "2022-12-15T02:05:56.306828Z",
     "shell.execute_reply": "2022-12-15T02:05:56.306061Z"
    },
    "id": "1uFSHCJXMrQ-"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:CPU:0',), communication = CommunicationImplementation.AUTO\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:05:56.272689: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    }
   ],
   "source": [
    "strategy = tf.distribute.MultiWorkerMirroredStrategy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "N0iv7SyyAohc"
   },
   "source": [
    "참고: `'TF_CONFIG'`가 구문 분석되고 TensorFlow의 GRPC 서버는 `tf.distribute.MultiWorkerMirroredStrategy`를 호출할 때 시작됩니다. 따라서 `tf.distribute.Strategy`를 인스턴스화하기 전에 `'TF_CONFIG'` 환경 변수를 설정해야 합니다. 이 튜토리얼의 예제에서는 시간을 절약하기 위해 이를 보여주지 않으므로 서버를 시작할 필요가 없습니다. 이 튜토리얼의 마지막 섹션에 전체 예제가 나와 있습니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TS4S-faBHHam"
   },
   "source": [
    "`tf.distribute.Strategy.scope`를 사용하여 모델을 빌드할 때 사용해야 하는 전략을 지정합니다. 이를 통해 전략에서 변수 배치와 같은 사항을 제어할 수 있습니다. 모든 작업자에 걸쳐 각 장치의 모델 레이어에 있는 모든 변수의 복사본을 생성합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.310266Z",
     "iopub.status.busy": "2022-12-15T02:05:56.310010Z",
     "iopub.status.idle": "2022-12-15T02:05:56.411262Z",
     "shell.execute_reply": "2022-12-15T02:05:56.410425Z"
    },
    "id": "nXV49tG1_opc"
   },
   "outputs": [],
   "source": [
    "import mnist\n",
    "with strategy.scope():\n",
    "  # Model building needs to be within `strategy.scope()`.\n",
    "  multi_worker_model = mnist.build_cnn_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DSYkM-on6r3Y"
   },
   "source": [
    "## 작업자 간에 데이터 자동 샤딩하기\n",
    "\n",
    "다중 작업자 훈련에서는 수렴과 재현성을 보장하기 위해 *데이터세트 샤딩*이 필요합니다. 샤딩은 각 작업자에게 전체 데이터세트의 일부를 전달하는 것을 의미합니다. 이는 단일 작업자에 대한 훈련과 유사한 경험을 만드는 데 도움이 됩니다. 아래 예에서는 `tf.distribute`의 기본 자동 샤딩 정책을 이용하고 있습니다. `tf.data.experimental.AutoShardPolicy`의 `tf.data.experimental.DistributeOptions`를 설정하여 이를 사용자 정의할 수도 있습니다. 자세한 내용은 <a>분산 입력 튜토리얼</a>의 <em>샤딩</em> 섹션을 참조하세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.415863Z",
     "iopub.status.busy": "2022-12-15T02:05:56.415230Z",
     "iopub.status.idle": "2022-12-15T02:05:56.956239Z",
     "shell.execute_reply": "2022-12-15T02:05:56.955451Z"
    },
    "id": "65-p36pt6rUF"
   },
   "outputs": [],
   "source": [
    "per_worker_batch_size = 64\n",
    "num_workers = len(tf_config['cluster']['worker'])\n",
    "global_batch_size = per_worker_batch_size * num_workers\n",
    "\n",
    "with strategy.scope():\n",
    "  multi_worker_dataset = strategy.distribute_datasets_from_function(\n",
    "      lambda input_context: mnist.dataset_fn(global_batch_size, input_context))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rkNzSR3g60iP"
   },
   "source": [
    "## 사용자 정의 훈련 루프 정의 및 모델 훈련하기\n",
    "\n",
    "옵티마이저 지정:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.960906Z",
     "iopub.status.busy": "2022-12-15T02:05:56.960286Z",
     "iopub.status.idle": "2022-12-15T02:05:56.972975Z",
     "shell.execute_reply": "2022-12-15T02:05:56.972229Z"
    },
    "id": "NoMr4_zTeKSn"
   },
   "outputs": [],
   "source": [
    "with strategy.scope():\n",
    "  # The creation of optimizer and train_accuracy needs to be in\n",
    "  # `strategy.scope()` as well, since they create variables.\n",
    "  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n",
    "  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n",
    "      name='train_accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RmrDcAii4B5O"
   },
   "source": [
    "`tf.function`을 사용하여 훈련 단계를 정의합니다.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.977213Z",
     "iopub.status.busy": "2022-12-15T02:05:56.976548Z",
     "iopub.status.idle": "2022-12-15T02:05:56.983127Z",
     "shell.execute_reply": "2022-12-15T02:05:56.982486Z"
    },
    "id": "znXWN5S3eUDB"
   },
   "outputs": [],
   "source": [
    "@tf.function\n",
    "def train_step(iterator):\n",
    "  \"\"\"Training step function.\"\"\"\n",
    "\n",
    "  def step_fn(inputs):\n",
    "    \"\"\"Per-Replica step function.\"\"\"\n",
    "    x, y = inputs\n",
    "    with tf.GradientTape() as tape:\n",
    "      predictions = multi_worker_model(x, training=True)\n",
    "      per_batch_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
    "          from_logits=True,\n",
    "          reduction=tf.keras.losses.Reduction.NONE)(y, predictions)\n",
    "      loss = tf.nn.compute_average_loss(\n",
    "          per_batch_loss, global_batch_size=global_batch_size)\n",
    "\n",
    "    grads = tape.gradient(loss, multi_worker_model.trainable_variables)\n",
    "    optimizer.apply_gradients(\n",
    "        zip(grads, multi_worker_model.trainable_variables))\n",
    "    train_accuracy.update_state(y, predictions)\n",
    "    return loss\n",
    "\n",
    "  per_replica_losses = strategy.run(step_fn, args=(next(iterator),))\n",
    "  return strategy.reduce(\n",
    "      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eFXHsUVBy0Rx"
   },
   "source": [
    "### 체크포인트 저장 및 복원\n",
    "\n",
    "사용자 정의 훈련 루프를 작성할 때 Keras 콜백에 의존하는 대신 수동으로 [체크포인트 저장](../../guide/checkpoint.ipynb)을 처리해야 합니다. `MultiWorkerMirroredStrategy`의 경우 체크포인트 또는 전체 모델을 저장하려면 모든 작업자가 참여해야 합니다. 메인 작업자만 저장하려고 하면 교착 상태가 발생할 수 있기 때문입니다. 또한 작업자는 서로 덮어쓰지 않도록 다른 경로에 작성해야 합니다. 다음은 디렉터리를 구성하는 방법의 예입니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.986540Z",
     "iopub.status.busy": "2022-12-15T02:05:56.986175Z",
     "iopub.status.idle": "2022-12-15T02:05:56.992236Z",
     "shell.execute_reply": "2022-12-15T02:05:56.991606Z"
    },
    "id": "LcFO6x1KyjhI"
   },
   "outputs": [],
   "source": [
    "from multiprocessing import util\n",
    "checkpoint_dir = os.path.join(util.get_temp_dir(), 'ckpt')\n",
    "\n",
    "def _is_chief(task_type, task_id, cluster_spec):\n",
    "  return (task_type is None\n",
    "          or task_type == 'chief'\n",
    "          or (task_type == 'worker'\n",
    "              and task_id == 0\n",
    "              and \"chief\" not in cluster_spec.as_dict()))\n",
    "\n",
    "def _get_temp_dir(dirpath, task_id):\n",
    "  base_dirpath = 'workertemp_' + str(task_id)\n",
    "  temp_dir = os.path.join(dirpath, base_dirpath)\n",
    "  tf.io.gfile.makedirs(temp_dir)\n",
    "  return temp_dir\n",
    "\n",
    "def write_filepath(filepath, task_type, task_id, cluster_spec):\n",
    "  dirpath = os.path.dirname(filepath)\n",
    "  base = os.path.basename(filepath)\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    dirpath = _get_temp_dir(dirpath, task_id)\n",
    "  return os.path.join(dirpath, base)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nrcdPHtG4ObO"
   },
   "source": [
    "`tf.train.CheckpointManager`에 의해 관리되는 모델을 추적하는 하나의 `tf.train.Checkpoint`를 생성하여 최신 체크포인트만 보존되도록 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:56.995617Z",
     "iopub.status.busy": "2022-12-15T02:05:56.995121Z",
     "iopub.status.idle": "2022-12-15T02:05:57.003161Z",
     "shell.execute_reply": "2022-12-15T02:05:57.002526Z"
    },
    "id": "4rURT2pI4aqV"
   },
   "outputs": [],
   "source": [
    "epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64), name='epoch')\n",
    "step_in_epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64),\n",
    "    name='step_in_epoch')\n",
    "task_type, task_id = (strategy.cluster_resolver.task_type,\n",
    "                      strategy.cluster_resolver.task_id)\n",
    "# Normally, you don't need to manually instantiate a `ClusterSpec`, but in this \n",
    "# illustrative example you did not set `'TF_CONFIG'` before initializing the\n",
    "# strategy. Check out the next section for \"real-world\" usage.\n",
    "cluster_spec = tf.train.ClusterSpec(tf_config['cluster'])\n",
    "\n",
    "checkpoint = tf.train.Checkpoint(\n",
    "    model=multi_worker_model, epoch=epoch, step_in_epoch=step_in_epoch)\n",
    "\n",
    "write_checkpoint_dir = write_filepath(checkpoint_dir, task_type, task_id,\n",
    "                                      cluster_spec)\n",
    "checkpoint_manager = tf.train.CheckpointManager(\n",
    "    checkpoint, directory=write_checkpoint_dir, max_to_keep=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RO7cbN40XD5v"
   },
   "source": [
    "이제 체크포인트를 복원해야 할 때 편리한 `tf.train.latest_checkpoint` 함수를 사용하거나 `tf.train.CheckpointManager.restore_or_initialize`를 호출하여 저장된 최신 체크포인트를 찾을 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:57.006730Z",
     "iopub.status.busy": "2022-12-15T02:05:57.006041Z",
     "iopub.status.idle": "2022-12-15T02:05:57.009713Z",
     "shell.execute_reply": "2022-12-15T02:05:57.009098Z"
    },
    "id": "gniynaQj6HMV"
   },
   "outputs": [],
   "source": [
    "latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
    "if latest_checkpoint:\n",
    "  checkpoint.restore(latest_checkpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1j9JuI-h6ObW"
   },
   "source": [
    "체크포인트를 복원한 후, 사용자 정의 훈련 루프의 훈련을 계속할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:05:57.013264Z",
     "iopub.status.busy": "2022-12-15T02:05:57.012680Z",
     "iopub.status.idle": "2022-12-15T02:06:03.604522Z",
     "shell.execute_reply": "2022-12-15T02:06:03.603672Z"
    },
    "id": "kZzXZCh45FY6"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:05:57.289580: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n",
      "Instructions for updating:\n",
      "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 0, accuracy: 0.819531, train_loss: 0.572494.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 1, accuracy: 0.929129, train_loss: 0.238400.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 2, accuracy: 0.956250, train_loss: 0.157569.\n"
     ]
    }
   ],
   "source": [
    "num_epochs = 3\n",
    "num_steps_per_epoch = 70\n",
    "\n",
    "while epoch.numpy() < num_epochs:\n",
    "  iterator = iter(multi_worker_dataset)\n",
    "  total_loss = 0.0\n",
    "  num_batches = 0\n",
    "\n",
    "  while step_in_epoch.numpy() < num_steps_per_epoch:\n",
    "    total_loss += train_step(iterator)\n",
    "    num_batches += 1\n",
    "    step_in_epoch.assign_add(1)\n",
    "\n",
    "  train_loss = total_loss / num_batches\n",
    "  print('Epoch: %d, accuracy: %f, train_loss: %f.'\n",
    "                %(epoch.numpy(), train_accuracy.result(), train_loss))\n",
    "\n",
    "  train_accuracy.reset_states()\n",
    "\n",
    "  # Once the `CheckpointManager` is set up, you're now ready to save, and remove\n",
    "  # the checkpoints non-chief workers saved.\n",
    "  checkpoint_manager.save()\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    tf.io.gfile.rmtree(write_checkpoint_dir)\n",
    "\n",
    "  epoch.assign_add(1)\n",
    "  step_in_epoch.assign(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0W1Osks466DE"
   },
   "source": [
    "## 전체 코드 한 눈에 보기"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jfYpmIxO6Jck"
   },
   "source": [
    "지금까지 논의된 모든 절차를 요약하면 다음과 같습니다.\n",
    "\n",
    "1. 작업자 프로세스를 만듭니다.\n",
    "2. 작업자 프로세스에 `'TF_CONFIG'`를 전달합니다.\n",
    "3. 각 작업 프로세스에서 훈련 코드가 포함된 아래 스크립트를 실행하도록 합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.608748Z",
     "iopub.status.busy": "2022-12-15T02:06:03.608341Z",
     "iopub.status.idle": "2022-12-15T02:06:03.614906Z",
     "shell.execute_reply": "2022-12-15T02:06:03.614198Z"
    },
    "id": "MIDCESkVzN6M"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing main.py\n"
     ]
    }
   ],
   "source": [
    "%%writefile main.py\n",
    "#@title File: `main.py`\n",
    "import os\n",
    "import json\n",
    "import tensorflow as tf\n",
    "import mnist\n",
    "from multiprocessing import util\n",
    "\n",
    "per_worker_batch_size = 64\n",
    "tf_config = json.loads(os.environ['TF_CONFIG'])\n",
    "num_workers = len(tf_config['cluster']['worker'])\n",
    "global_batch_size = per_worker_batch_size * num_workers\n",
    "\n",
    "num_epochs = 3\n",
    "num_steps_per_epoch=70\n",
    "\n",
    "# Checkpoint saving and restoring\n",
    "def _is_chief(task_type, task_id, cluster_spec):\n",
    "  return (task_type is None\n",
    "          or task_type == 'chief'\n",
    "          or (task_type == 'worker'\n",
    "              and task_id == 0\n",
    "              and 'chief' not in cluster_spec.as_dict()))\n",
    "    \n",
    "def _get_temp_dir(dirpath, task_id):\n",
    "  base_dirpath = 'workertemp_' + str(task_id)\n",
    "  temp_dir = os.path.join(dirpath, base_dirpath)\n",
    "  tf.io.gfile.makedirs(temp_dir)\n",
    "  return temp_dir\n",
    "\n",
    "def write_filepath(filepath, task_type, task_id, cluster_spec):\n",
    "  dirpath = os.path.dirname(filepath)\n",
    "  base = os.path.basename(filepath)\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    dirpath = _get_temp_dir(dirpath, task_id)\n",
    "  return os.path.join(dirpath, base)\n",
    "\n",
    "checkpoint_dir = os.path.join(util.get_temp_dir(), 'ckpt')\n",
    "\n",
    "# Define Strategy\n",
    "strategy = tf.distribute.MultiWorkerMirroredStrategy()\n",
    "\n",
    "with strategy.scope():\n",
    "  # Model building/compiling need to be within `tf.distribute.Strategy.scope`.\n",
    "  multi_worker_model = mnist.build_cnn_model()\n",
    "\n",
    "  multi_worker_dataset = strategy.distribute_datasets_from_function(\n",
    "      lambda input_context: mnist.dataset_fn(global_batch_size, input_context))        \n",
    "  optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)\n",
    "  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n",
    "      name='train_accuracy')\n",
    "\n",
    "@tf.function\n",
    "def train_step(iterator):\n",
    "  \"\"\"Training step function.\"\"\"\n",
    "\n",
    "  def step_fn(inputs):\n",
    "    \"\"\"Per-Replica step function.\"\"\"\n",
    "    x, y = inputs\n",
    "    with tf.GradientTape() as tape:\n",
    "      predictions = multi_worker_model(x, training=True)\n",
    "      per_batch_loss = tf.keras.losses.SparseCategoricalCrossentropy(\n",
    "          from_logits=True,\n",
    "          reduction=tf.keras.losses.Reduction.NONE)(y, predictions)\n",
    "      loss = tf.nn.compute_average_loss(\n",
    "          per_batch_loss, global_batch_size=global_batch_size)\n",
    "\n",
    "    grads = tape.gradient(loss, multi_worker_model.trainable_variables)\n",
    "    optimizer.apply_gradients(\n",
    "        zip(grads, multi_worker_model.trainable_variables))\n",
    "    train_accuracy.update_state(y, predictions)\n",
    "\n",
    "    return loss\n",
    "\n",
    "  per_replica_losses = strategy.run(step_fn, args=(next(iterator),))\n",
    "  return strategy.reduce(\n",
    "      tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)\n",
    "\n",
    "epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64), name='epoch')\n",
    "step_in_epoch = tf.Variable(\n",
    "    initial_value=tf.constant(0, dtype=tf.dtypes.int64),\n",
    "    name='step_in_epoch')\n",
    "\n",
    "task_type, task_id, cluster_spec = (strategy.cluster_resolver.task_type,\n",
    "                                    strategy.cluster_resolver.task_id,\n",
    "                                    strategy.cluster_resolver.cluster_spec())\n",
    "\n",
    "checkpoint = tf.train.Checkpoint(\n",
    "    model=multi_worker_model, epoch=epoch, step_in_epoch=step_in_epoch)\n",
    "\n",
    "write_checkpoint_dir = write_filepath(checkpoint_dir, task_type, task_id,\n",
    "                                      cluster_spec)\n",
    "checkpoint_manager = tf.train.CheckpointManager(\n",
    "    checkpoint, directory=write_checkpoint_dir, max_to_keep=1)\n",
    "\n",
    "# Restoring the checkpoint\n",
    "latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)\n",
    "if latest_checkpoint:\n",
    "  checkpoint.restore(latest_checkpoint)\n",
    "\n",
    "# Resume our CTL training\n",
    "while epoch.numpy() < num_epochs:\n",
    "  iterator = iter(multi_worker_dataset)\n",
    "  total_loss = 0.0\n",
    "  num_batches = 0\n",
    "\n",
    "  while step_in_epoch.numpy() < num_steps_per_epoch:\n",
    "    total_loss += train_step(iterator)\n",
    "    num_batches += 1\n",
    "    step_in_epoch.assign_add(1)\n",
    "\n",
    "  train_loss = total_loss / num_batches\n",
    "  print('Epoch: %d, accuracy: %f, train_loss: %f.'\n",
    "                %(epoch.numpy(), train_accuracy.result(), train_loss))\n",
    "  \n",
    "  train_accuracy.reset_states()\n",
    "\n",
    "  checkpoint_manager.save()\n",
    "  if not _is_chief(task_type, task_id, cluster_spec):\n",
    "    tf.io.gfile.rmtree(write_checkpoint_dir)\n",
    "\n",
    "  epoch.assign_add(1)\n",
    "  step_in_epoch.assign(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ItVOvPN1qnZ6"
   },
   "source": [
    "현재 디렉터리에는 이제 두 Python 파일이 모두 포함됩니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.618777Z",
     "iopub.status.busy": "2022-12-15T02:06:03.618107Z",
     "iopub.status.idle": "2022-12-15T02:06:03.679420Z",
     "shell.execute_reply": "2022-12-15T02:06:03.678425Z"
    },
    "id": "bi6x05Sr60O9"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "main.py\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mnist.py\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "ls *.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qmEEStPS6vR_"
   },
   "source": [
    "따라서 `'TF_CONFIG'`를 JSON 직렬화하고 환경 변수에 추가합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.683863Z",
     "iopub.status.busy": "2022-12-15T02:06:03.683159Z",
     "iopub.status.idle": "2022-12-15T02:06:03.687952Z",
     "shell.execute_reply": "2022-12-15T02:06:03.687235Z"
    },
    "id": "9uu3g7vV7Bbt"
   },
   "outputs": [],
   "source": [
    "os.environ['TF_CONFIG'] = json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MsY3dQLK7jdf"
   },
   "source": [
    "이제 `main.py`를 실행하고 `'TF_CONFIG'`를 사용하는 작업자 프로세스를 시작할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.691687Z",
     "iopub.status.busy": "2022-12-15T02:06:03.691051Z",
     "iopub.status.idle": "2022-12-15T02:06:03.695167Z",
     "shell.execute_reply": "2022-12-15T02:06:03.694427Z"
    },
    "id": "txMXaq8d8N_S"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All background processes were killed.\n"
     ]
    }
   ],
   "source": [
    "# first kill any previous runs\n",
    "%killbgscripts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.698518Z",
     "iopub.status.busy": "2022-12-15T02:06:03.697879Z",
     "iopub.status.idle": "2022-12-15T02:06:03.753875Z",
     "shell.execute_reply": "2022-12-15T02:06:03.752765Z"
    },
    "id": "qnSma_Ck7r-r"
   },
   "outputs": [],
   "source": [
    "%%bash --bg\n",
    "python main.py &> job_0.log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZChyazqS7v0P"
   },
   "source": [
    "위의 명령에서 몇 가지 주의할 사항이 있습니다.\n",
    "\n",
    "1. 일부 bash 명령을 실행하기 위해 [노트북 \"매직\"](https://ipython.readthedocs.io/en/stable/interactive/magics.html)인 `%%bash`가 사용됩니다.\n",
    "2. `--bg` 플래그를 사용하여 백그라운드에서 `bash` 프로세스를 실행합니다. 이 작업자는 종료되지 않기 때문입니다. 시작하기 전에 모든 작업자를 기다립니다.\n",
    "\n",
    "백그라운드 작업자 프로세스는 이 노트북에 출력을 인쇄하지 않습니다. `&>`는 출력을 파일로 리디렉션하여 발생한 상황을 검사할 수 있도록 합니다.\n",
    "\n",
    "프로세스가 시작될 때까지 몇 초 동안 기다립니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:03.759437Z",
     "iopub.status.busy": "2022-12-15T02:06:03.758442Z",
     "iopub.status.idle": "2022-12-15T02:06:23.783878Z",
     "shell.execute_reply": "2022-12-15T02:06:23.782898Z"
    },
    "id": "Hm2yrULE9281"
   },
   "outputs": [],
   "source": [
    "import time\n",
    "time.sleep(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZFPoNxg_9_Mx"
   },
   "source": [
    "이제 지금까지 작업자의 로그 파일에 대한 출력을 확인합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:23.788956Z",
     "iopub.status.busy": "2022-12-15T02:06:23.788265Z",
     "iopub.status.idle": "2022-12-15T02:06:23.846927Z",
     "shell.execute_reply": "2022-12-15T02:06:23.845997Z"
    },
    "id": "vZEOuVgQ9-hn"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.951967: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.952071: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.952083: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:06.050491: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cat job_0.log"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RqZhVF7L_KOy"
   },
   "source": [
    "로그 파일의 마지막 줄은 다음과 같아야 합니다: `Started server with target: grpc://localhost:12345`. 이제 첫 번째 작업자가 준비되었으며 다른 모든 작업자가 계속 진행할 준비가 되기를 기다립니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Pi8vPNNA_l4a"
   },
   "source": [
    "두 번째 작업자 프로세스가 선택하도록 `tf_config`를 업데이트합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:23.851017Z",
     "iopub.status.busy": "2022-12-15T02:06:23.850695Z",
     "iopub.status.idle": "2022-12-15T02:06:23.854828Z",
     "shell.execute_reply": "2022-12-15T02:06:23.854113Z"
    },
    "id": "lAiYkkPu_Jqd"
   },
   "outputs": [],
   "source": [
    "tf_config['task']['index'] = 1\n",
    "os.environ['TF_CONFIG'] = json.dumps(tf_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0AshGVO0_x0w"
   },
   "source": [
    "이제 두 번째 작업자를 시작합니다. 모든 작업자가 활성 상태이므로 훈련이 시작됩니다(따라서 이 프로세스를 백그라운드에 둘 필요가 없음)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:23.858207Z",
     "iopub.status.busy": "2022-12-15T02:06:23.857919Z",
     "iopub.status.idle": "2022-12-15T02:06:39.937338Z",
     "shell.execute_reply": "2022-12-15T02:06:39.936179Z"
    },
    "id": "_ESVtyQ9_xjx"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "python main.py > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hX4FA2O2AuAn"
   },
   "source": [
    "첫 번째 작업자가 작성한 로그를 다시 확인하면 작업자가 해당 모델 훈련에 참여했음을 알 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:39.942724Z",
     "iopub.status.busy": "2022-12-15T02:06:39.941864Z",
     "iopub.status.idle": "2022-12-15T02:06:40.007095Z",
     "shell.execute_reply": "2022-12-15T02:06:40.005921Z"
    },
    "id": "rc6hw3yTBKXX"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.951967: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.952071: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:04.952083: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:06.050491: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2022-12-15 02:06:27.412697: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Instructions for updating:\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 0, accuracy: 0.823996, train_loss: 0.581225.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 1, accuracy: 0.925558, train_loss: 0.249244.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch: 2, accuracy: 0.946317, train_loss: 0.186680.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cat job_0.log"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2022-12-15T02:06:40.012161Z",
     "iopub.status.busy": "2022-12-15T02:06:40.011831Z",
     "iopub.status.idle": "2022-12-15T02:06:40.017272Z",
     "shell.execute_reply": "2022-12-15T02:06:40.016523Z"
    },
    "id": "sG5_1UgrgniF"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All background processes were killed.\n"
     ]
    }
   ],
   "source": [
    "# Delete the `'TF_CONFIG'`, and kill any background tasks so they don't affect the next section.\n",
    "os.environ.pop('TF_CONFIG', None)\n",
    "%killbgscripts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bhxMXa0AaZkK"
   },
   "source": [
    "## 심층적 다중 작업자 훈련\n",
    "\n",
    "이 튜토리얼에서는 다중 작업자 설정의 사용자 정의 훈련 루프 워크플로를 보여주었습니다. 다른 주제에 대한 자세한 설명은 사용자 정의 훈련 루프에 적용할 수 있는 [Keras를 사용한 다중 작업자 훈련(`tf.keras.Model.fit`)](multi_worker_with_keras.ipynb) 튜토리얼에서 확인할 수 있습니다."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ega2hdOQEmy_"
   },
   "source": [
    "## 자세히 알아보기\n",
    "\n",
    "1. [TensorFlow에서 분산 훈련하기](../../guide/distributed_training.ipynb) 가이드는 사용 가능한 분산 전략을 간략히 소개합니다.\n",
    "2. 많은 [공식 모델](https://github.com/tensorflow/models/tree/master/official)은 다양한 분산 전략을 실행하도록 설정할 수 있습니다.\n",
    "3. `tf.function` 가이드의 [성능 섹션](../../guide/function.ipynb)에서 TensorFlow 모델 성능 최적화를 위해 사용할 수 있는 다른 전략과 [도구](../../guide/profiler.md)를 소개합니다.\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "multi_worker_with_ctl.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}