{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FEL3NlTTDlSX" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-01-11T18:11:57.141727Z", "iopub.status.busy": "2024-01-11T18:11:57.141186Z", "iopub.status.idle": "2024-01-11T18:11:57.144722Z", "shell.execute_reply": "2024-01-11T18:11:57.144171Z" }, "id": "FlUw7tSKbtg4" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "77z2OchJTk0l" }, "source": [ "# TF2 に移行されたトレーニングパイプラインをデバッグする\n", "\n", "\n", " \n", " \n", " \n", " \n", "
TensorFlow.org で表示 Google Colab で実行 GitHub でソースを表示ノートブックをダウンロード
" ] }, { "cell_type": "markdown", "metadata": { "id": "zTwPu-w6M5sz" }, "source": [ "このノートブックでは、TensorFlow 2(TF2)に移行されたトレーニングパイプラインをデバッグする方法を説明します。内容は以下のとおりです。\n", "\n", "1. トレーニングパイプラインをデバッグするための推奨される手順とコードサンプル\n", "2. デバッグ用ツール\n", "3. その他の関連リソース\n", "\n", "比較用の TensorFlow 1(TF1.x)コードとトレーニング済みモデルがあり、同等の検証精度を達成する TF2 モデルを構築することを前提とします。\n", "\n", "このノートブックは、トレーニングや推論の速度やメモリ使用量に関するデバッグパフォーマンスの問題は**取り上げません**。" ] }, { "cell_type": "markdown", "metadata": { "id": "fKm9R4CtOAP3" }, "source": [ "## デバッグワークフロー\n", "\n", "以下は、TF2 トレーニングパイプラインをデバッグするための一般的なワークフローです。これらの手順を順番に実行する必要はありません。中間ステップでモデルをテストし、デバッグ範囲を絞り込む二分探索アプローチを使用することもできます。\n", "\n", "1. コンパイルエラーとランタイムエラーを修正する\n", "\n", "2. シングルフォワードパスの検証(別の[ガイド](./validate_correctness.ipynb))\n", "\n", " a. 単一の CPU デバイスの場合\n", "\n", " - 変数が 1 回だけ作成されることを確認する\n", " - 変数の数、名前、形状が一致していることを確認する\n", " - すべての変数をリセットし、すべてのランダム性を無効にして数値の等価性をチェックする\n", " - 乱数生成の調整、推論における数値的等価性をチェックする\n", " - (オプション)チェックポイントが正しく読み込まれ、TF1.x/TF2 モデルが同一の出力を生成することを確認する\n", "\n", " b. 単一の GPU/TPU デバイスの場合\n", "\n", " c. マルチデバイスストラテジー\n", "\n", "3. 数ステップのモデルトレーニングの数値的等価性の検証(コードサンプルは以下で入手可能)\n", "\n", " a. 単一の CPU デバイスでの小規模な固定データを使用した単一のトレーニングステップの検証。具体的には、次のコンポーネントの数値的等価性を確認する\n", "\n", " - 損失計算\n", " - 指標\n", " - 学習率\n", " - 勾配の計算と更新\n", "\n", " b. 3 つ以上のステップをトレーニングした後に統計をチェックして、モメンタムなどのオプティマイザの動作を検証する。単一の CPU デバイスで固定データを使用する。\n", "\n", " c. 単一の GPU/TPU デバイス\n", "\n", " d. マルチデバイスストラテジーを使用(以下の [MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) のイントロを参照)\n", "\n", "4. 実際のデータセットでのエンドツーエンドの収束テスト\n", "\n", " a. TensorBoard でトレーニングの動作を確認する\n", "\n", " - 単純なオプティマイザを使用する(SGD と単純な分布戦略。 最初に `tf.distribute.OneDeviceStrategy` を使用する)。\n", " - トレーニング指標\n", " - 評価指標\n", " - 固有のランダム性に対する妥当な許容範囲を把握する\n", "\n", " b. 高度なオプティマイザ/学習率スケジューラ/分散ストラテジーとの同等性をチェックする\n", "\n", " c. 混合精度使用時の同等性をチェックする\n", "\n", "5. 追加の乗積ベンチマーク" ] }, { "cell_type": "markdown", "metadata": { "id": "XKakQBI9-FLb" }, "source": [ "## セットアップ" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:11:57.148474Z", "iopub.status.busy": "2024-01-11T18:11:57.148027Z", "iopub.status.idle": "2024-01-11T18:12:39.290827Z", "shell.execute_reply": "2024-01-11T18:12:39.289975Z" }, "id": "i1ghHyXl-Oqd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n", "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\r\n", "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\r\n", "\u001b[0m" ] } ], "source": [ "# The `DeterministicRandomTestTool` is only available from Tensorflow 2.8:\n", "!pip install -q \"tensorflow==2.9.*\"" ] }, { "cell_type": "markdown", "metadata": { "id": "usyRSlIRl3r2" }, "source": [ "### 1 つのフォワードパスの検証\n", "\n", "チェックポイントの読み込みを含む 1 つのフォワードパスの検証については、別の [colab](./validate_correctness.ipynb) で説明しています。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:39.295322Z", "iopub.status.busy": "2024-01-11T18:12:39.295059Z", "iopub.status.idle": "2024-01-11T18:12:41.295845Z", "shell.execute_reply": "2024-01-11T18:12:41.294896Z" }, "id": "HVBQbsZeVL_V" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-01-11 18:12:39.615665: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n" ] } ], "source": [ "import sys\n", "import unittest\n", "import numpy as np\n", "\n", "import tensorflow as tf\n", "import tensorflow.compat.v1 as v1" ] }, { "cell_type": "markdown", "metadata": { "id": "4M104dt7m5cC" }, "source": [ "### 数ステップのモデルトレーニングの数値的等価性検証" ] }, { "cell_type": "markdown", "metadata": { "id": "v2Nz2Ni1EkMz" }, "source": [ "モデル構成を設定し、偽のデータセットを準備します。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:41.300321Z", "iopub.status.busy": "2024-01-11T18:12:41.299900Z", "iopub.status.idle": "2024-01-11T18:12:41.305463Z", "shell.execute_reply": "2024-01-11T18:12:41.304647Z" }, "id": "hUxXadzKU9rT" }, "outputs": [], "source": [ "params = {\n", " 'input_size': 3,\n", " 'num_classes': 3,\n", " 'layer_1_size': 2,\n", " 'layer_2_size': 2,\n", " 'num_train_steps': 100,\n", " 'init_lr': 1e-3,\n", " 'end_lr': 0.0,\n", " 'decay_steps': 1000,\n", " 'lr_power': 1.0,\n", "}\n", "\n", "# make a small fixed dataset\n", "fake_x = np.ones((2, params['input_size']), dtype=np.float32)\n", "fake_y = np.zeros((2, params['num_classes']), dtype=np.int32)\n", "fake_y[0][0] = 1\n", "fake_y[1][1] = 1\n", "\n", "step_num = 3" ] }, { "cell_type": "markdown", "metadata": { "id": "lV_n3Ukmz4Un" }, "source": [ "TF1.x モデルを定義します。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:41.309043Z", "iopub.status.busy": "2024-01-11T18:12:41.308783Z", "iopub.status.idle": "2024-01-11T18:12:41.317936Z", "shell.execute_reply": "2024-01-11T18:12:41.317117Z" }, "id": "ATa5fzL8mAwl" }, "outputs": [], "source": [ "# Assume there is an existing TF1.x model using estimator API\n", "# Wrap the model_fn to log necessary tensors for result comparison\n", "class SimpleModelWrapper():\n", " def __init__(self):\n", " self.logged_ops = {}\n", " self.logs = {\n", " 'step': [],\n", " 'lr': [],\n", " 'loss': [],\n", " 'grads_and_vars': [],\n", " 'layer_out': []}\n", " \n", " def model_fn(self, features, labels, mode, params):\n", " out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", " out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", " logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n", " loss = tf.compat.v1.losses.softmax_cross_entropy(labels, logits)\n", "\n", " # skip EstimatorSpec details for prediction and evaluation \n", " if mode == tf.estimator.ModeKeys.PREDICT:\n", " pass\n", " if mode == tf.estimator.ModeKeys.EVAL:\n", " pass\n", " assert mode == tf.estimator.ModeKeys.TRAIN\n", "\n", " global_step = tf.compat.v1.train.get_or_create_global_step()\n", " lr = tf.compat.v1.train.polynomial_decay(\n", " learning_rate=params['init_lr'],\n", " global_step=global_step,\n", " decay_steps=params['decay_steps'],\n", " end_learning_rate=params['end_lr'],\n", " power=params['lr_power'])\n", " \n", " optmizer = tf.compat.v1.train.GradientDescentOptimizer(lr)\n", " grads_and_vars = optmizer.compute_gradients(\n", " loss=loss,\n", " var_list=graph.get_collection(\n", " tf.compat.v1.GraphKeys.TRAINABLE_VARIABLES))\n", " train_op = optmizer.apply_gradients(\n", " grads_and_vars,\n", " global_step=global_step)\n", " \n", " # log tensors\n", " self.logged_ops['step'] = global_step\n", " self.logged_ops['lr'] = lr\n", " self.logged_ops['loss'] = loss\n", " self.logged_ops['grads_and_vars'] = grads_and_vars\n", " self.logged_ops['layer_out'] = {\n", " 'layer_1': out_1,\n", " 'layer_2': out_2,\n", " 'logits': logits}\n", "\n", " return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n", "\n", " def update_logs(self, logs):\n", " for key in logs.keys():\n", " model_tf1.logs[key].append(logs[key])" ] }, { "cell_type": "markdown", "metadata": { "id": "kki9yILSKS7f" }, "source": [ "次の [`v1.keras.utils.DeterministicRandomTestTool`](https://www.tensorflow.org/api_docs/python/tf/compat/v1/keras/utils/DeterministicRandomTestTool) クラスは、コンテキストマネージャ `scope()` を提供し、 TF1 グラフ/セッションと Eager execution の両方でステートフルなランダム演算が同じシードを使用できるようになります。\n", "\n", "このツールには、次の 2 つのテストモードがあります。\n", "\n", "1. `constant` は、呼び出された回数に関係なく、1 つの演算ごとに同じシードを使用します。\n", "2. `num_random_ops` は、以前に観測されたステートフルなランダム演算の数を演算シードとして使用します。\n", "\n", "これは、変数の作成と初期化に使用されるステートフルなランダム演算と、計算で使用されるステートフルなランダム演算(ドロップアウトレイヤーなど)の両方に適用されます。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:41.321188Z", "iopub.status.busy": "2024-01-11T18:12:41.320918Z", "iopub.status.idle": "2024-01-11T18:12:41.599698Z", "shell.execute_reply": "2024-01-11T18:12:41.598887Z" }, "id": "X6Y3RWMoKOl8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_46125/2689227634.py:1: The name tf.keras.utils.DeterministicRandomTestTool is deprecated. Please use tf.compat.v1.keras.utils.DeterministicRandomTestTool instead.\n", "\n" ] } ], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')" ] }, { "cell_type": "markdown", "metadata": { "id": "mk5-ZzxcErX5" }, "source": [ "TF1.x モデルを Graph モードで実行します。数値的等価性を比較するために、最初の 3 つのトレーニングステップの統計を収集します。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:41.603395Z", "iopub.status.busy": "2024-01-11T18:12:41.603149Z", "iopub.status.idle": "2024-01-11T18:12:42.490309Z", "shell.execute_reply": "2024-01-11T18:12:42.489638Z" }, "id": "r5zhJHvsWA24" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-01-11 18:12:42.162090: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", "2024-01-11 18:12:42.162225: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory\n", "2024-01-11 18:12:42.162316: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory\n", "2024-01-11 18:12:42.162410: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory\n", "2024-01-11 18:12:42.230109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory\n", "2024-01-11 18:12:42.230297: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.\n", "Skipping registering GPU devices...\n", "/tmpfs/tmp/ipykernel_46125/1984550333.py:14: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", " out_1 = tf.compat.v1.layers.dense(features, units=params['layer_1_size'])\n", "/tmpfs/tmp/ipykernel_46125/1984550333.py:15: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", " out_2 = tf.compat.v1.layers.dense(out_1, units=params['layer_2_size'])\n", "/tmpfs/tmp/ipykernel_46125/1984550333.py:16: UserWarning: `tf.layers.dense` is deprecated and will be removed in a future version. Please use `tf.keras.layers.Dense` instead.\n", " logits = tf.compat.v1.layers.dense(out_2, units=params['num_classes'])\n" ] } ], "source": [ "with random_tool.scope():\n", " graph = tf.Graph()\n", " with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", " model_tf1 = SimpleModelWrapper()\n", " # build the model\n", " inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", " labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", " spec = model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", " train_op = spec.train_op\n", "\n", " sess.run(tf.compat.v1.global_variables_initializer())\n", " for step in range(step_num):\n", " # log everything and update the model for one step\n", " logs, _ = sess.run(\n", " [model_tf1.logged_ops, train_op],\n", " feed_dict={inputs: fake_x, labels: fake_y})\n", " model_tf1.update_logs(logs)" ] }, { "cell_type": "markdown", "metadata": { "id": "eZxjI8Nxz9Ea" }, "source": [ "TF2 モデルを定義します。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.494530Z", "iopub.status.busy": "2024-01-11T18:12:42.494280Z", "iopub.status.idle": "2024-01-11T18:12:42.503231Z", "shell.execute_reply": "2024-01-11T18:12:42.502432Z" }, "id": "AA67rh2TkS1M" }, "outputs": [], "source": [ "class SimpleModel(tf.keras.Model):\n", " def __init__(self, params, *args, **kwargs):\n", " super(SimpleModel, self).__init__(*args, **kwargs)\n", " # define the model\n", " self.dense_1 = tf.keras.layers.Dense(params['layer_1_size'])\n", " self.dense_2 = tf.keras.layers.Dense(params['layer_2_size'])\n", " self.out = tf.keras.layers.Dense(params['num_classes'])\n", " learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(\n", " initial_learning_rate=params['init_lr'],\n", " decay_steps=params['decay_steps'],\n", " end_learning_rate=params['end_lr'],\n", " power=params['lr_power']) \n", " self.optimizer = tf.keras.optimizers.legacy.SGD(learning_rate_fn)\n", " self.compiled_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)\n", " self.logs = {\n", " 'lr': [],\n", " 'loss': [],\n", " 'grads': [],\n", " 'weights': [],\n", " 'layer_out': []}\n", "\n", " def call(self, inputs):\n", " out_1 = self.dense_1(inputs)\n", " out_2 = self.dense_2(out_1)\n", " logits = self.out(out_2)\n", " # log output features for every layer for comparison\n", " layer_wise_out = {\n", " 'layer_1': out_1,\n", " 'layer_2': out_2,\n", " 'logits': logits}\n", " self.logs['layer_out'].append(layer_wise_out)\n", " return logits\n", "\n", " def train_step(self, data):\n", " x, y = data\n", " with tf.GradientTape() as tape:\n", " logits = self(x)\n", " loss = self.compiled_loss(y, logits)\n", " grads = tape.gradient(loss, self.trainable_weights)\n", " # log training statistics\n", " step = self.optimizer.iterations.numpy()\n", " self.logs['lr'].append(self.optimizer.learning_rate(step).numpy())\n", " self.logs['loss'].append(loss.numpy())\n", " self.logs['grads'].append(grads)\n", " self.logs['weights'].append(self.trainable_weights)\n", " # update model\n", " self.optimizer.apply_gradients(zip(grads, self.trainable_weights))\n", " return" ] }, { "cell_type": "markdown", "metadata": { "id": "I5smAcaEE8nX" }, "source": [ "TF2 モデルを eager モードで実行します。数値的等価性を比較するために、最初の 3 つのトレーニングステップの統計を収集します。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.506726Z", "iopub.status.busy": "2024-01-11T18:12:42.506469Z", "iopub.status.idle": "2024-01-11T18:12:42.556193Z", "shell.execute_reply": "2024-01-11T18:12:42.555588Z" }, "id": "Q0AbXF_eE8cS" }, "outputs": [], "source": [ "random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", "with random_tool.scope():\n", " model_tf2 = SimpleModel(params)\n", " for step in range(step_num):\n", " model_tf2.train_step([fake_x, fake_y])" ] }, { "cell_type": "markdown", "metadata": { "id": "cjJDjLcAz_gU" }, "source": [ "最初のいくつかのトレーニングステップの数値的等価性を比較します。\n", "\n", "また、[正当性と数値的等価性を検証するノートブック](./validate_correctness.ipynb)で、数値的等価性に関する追加のアドバイスを確認することもできます。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.559474Z", "iopub.status.busy": "2024-01-11T18:12:42.559230Z", "iopub.status.idle": "2024-01-11T18:12:42.595226Z", "shell.execute_reply": "2024-01-11T18:12:42.594375Z" }, "id": "6CbCUbsCiabC" }, "outputs": [], "source": [ "np.testing.assert_allclose(model_tf1.logs['lr'], model_tf2.logs['lr'])\n", "np.testing.assert_allclose(model_tf1.logs['loss'], model_tf2.logs['loss'])\n", "for step in range(step_num):\n", " for name in model_tf1.logs['layer_out'][step]:\n", " np.testing.assert_allclose(\n", " model_tf1.logs['layer_out'][step][name],\n", " model_tf2.logs['layer_out'][step][name])" ] }, { "cell_type": "markdown", "metadata": { "id": "dhVuuciimLIY" }, "source": [ "#### 単体テスト" ] }, { "cell_type": "markdown", "metadata": { "id": "sXZYFC6Hhqeb" }, "source": [ "移行コードのデバッグに役立ついくつかの種類の単体テストがあります。\n", "\n", "1. 1 つのフォワードパスの検証\n", "2. 数ステップのモデルトレーニングの数値的等価性検証\n", "3. ベンチマーク推論性能\n", "4. トレーニング済みのモデルが固定された単純なデータポイントに対して正しい予測を行う\n", "\n", "`@parameterized.parameters` を使用して、さまざまな構成でモデルをテストできます。[詳細(コードサンプル付き)](https://github.com/abseil/abseil-py/blob/master/absl/testing/parameterized.py)はこちらを参照してください。\n", "\n", "セッション API と Eager execution を同じテストケースで実行できます。以下のコードスニペットは、その方法を示しています。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.599160Z", "iopub.status.busy": "2024-01-11T18:12:42.598911Z", "iopub.status.idle": "2024-01-11T18:12:42.608968Z", "shell.execute_reply": "2024-01-11T18:12:42.608280Z" }, "id": "CdHqkgPPM2Bj" }, "outputs": [], "source": [ "import unittest\n", "\n", "class TestNumericalEquivalence(unittest.TestCase):\n", "\n", " # copied from code samples above\n", " def setup(self):\n", " # record statistics for 100 training steps\n", " step_num = 100\n", "\n", " # setup TF 1 model\n", " random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", " with random_tool.scope():\n", " # run TF1.x code in graph mode with context management\n", " graph = tf.Graph()\n", " with graph.as_default(), tf.compat.v1.Session(graph=graph) as sess:\n", " self.model_tf1 = SimpleModelWrapper()\n", " # build the model\n", " inputs = tf.compat.v1.placeholder(tf.float32, shape=(None, params['input_size']))\n", " labels = tf.compat.v1.placeholder(tf.float32, shape=(None, params['num_classes']))\n", " spec = self.model_tf1.model_fn(inputs, labels, tf.estimator.ModeKeys.TRAIN, params)\n", " train_op = spec.train_op\n", "\n", " sess.run(tf.compat.v1.global_variables_initializer())\n", " for step in range(step_num):\n", " # log everything and update the model for one step\n", " logs, _ = sess.run(\n", " [self.model_tf1.logged_ops, train_op],\n", " feed_dict={inputs: fake_x, labels: fake_y})\n", " self.model_tf1.update_logs(logs)\n", "\n", " # setup TF2 model\n", " random_tool = v1.keras.utils.DeterministicRandomTestTool(mode='num_random_ops')\n", " with random_tool.scope():\n", " self.model_tf2 = SimpleModel(params)\n", " for step in range(step_num):\n", " self.model_tf2.train_step([fake_x, fake_y])\n", " \n", " def test_learning_rate(self):\n", " np.testing.assert_allclose(\n", " self.model_tf1.logs['lr'],\n", " self.model_tf2.logs['lr'])\n", "\n", " def test_training_loss(self):\n", " # adopt different tolerance strategies before and after 10 steps\n", " first_n_step = 10\n", "\n", " # absolute difference is limited below 1e-5\n", " # set `equal_nan` to be False to detect potential NaN loss issues\n", " abosolute_tolerance = 1e-5\n", " np.testing.assert_allclose(\n", " actual=self.model_tf1.logs['loss'][:first_n_step],\n", " desired=self.model_tf2.logs['loss'][:first_n_step],\n", " atol=abosolute_tolerance,\n", " equal_nan=False)\n", " \n", " # relative difference is limited below 5%\n", " relative_tolerance = 0.05\n", " np.testing.assert_allclose(self.model_tf1.logs['loss'][first_n_step:],\n", " self.model_tf2.logs['loss'][first_n_step:],\n", " rtol=relative_tolerance,\n", " equal_nan=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "gshSQdKIddpZ" }, "source": [ "## デバッグツール" ] }, { "cell_type": "markdown", "metadata": { "id": "CkMfCaJRclKv" }, "source": [ "### tf.print\n", "\n", "tf.print と print/logging.info の比較\n", "\n", "- 構成可能な引数を使用すると、`tf.print` は出力されたテンソルの各次元の最初と最後のいくつかの要素を再帰的に表示できます。詳細については、[API ドキュメント](https://www.tensorflow.org/api_docs/python/tf/print)を参照してください。\n", "- Eager execution では、`print` と `tf.print` の両方がテンソルの値を出力します。ただし、`print` にはデバイスからホストへのコピーが含まれる場合があり、コードが遅くなる可能性があります。\n", "- `tf.function` 内での使用を含む Graph モードでは、`tf.print` を使用して実際のテンソル値を出力する必要があります。`tf.print` はグラフ内の演算にコンパイルされますが、`print` と `logging.info` はトレース時にしかログに記録されません(多くの場合、これは希望されないことだと思います)。\n", "- `tf.print` は、`tf.RaggedTensor` や `tf.sparse.SparseTensor` などの複合テンソルの出力もサポートしています。\n", "- また、コールバックを使用して、指標と変数を監視することもできます。[logs dict](https://www.tensorflow.org/guide/keras/custom_callback#usage_of_logs_dict) と [self.model 属性](https://www.tensorflow.org/guide/keras/custom_callback#usage_of_selfmodel_attribute)でカスタムコールバックを使用する方法を確認してください。" ] }, { "cell_type": "markdown", "metadata": { "id": "S-5h3cX8Dc50" }, "source": [ "tf.print と tf.function 内の print の比較" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.612425Z", "iopub.status.busy": "2024-01-11T18:12:42.612172Z", "iopub.status.idle": "2024-01-11T18:12:42.674548Z", "shell.execute_reply": "2024-01-11T18:12:42.673903Z" }, "id": "gRED9FMyDKih" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "[2]\n" ] } ], "source": [ "# `print` prints info of tensor object\n", "# `tf.print` prints the tensor value\n", "@tf.function\n", "def dummy_func(num):\n", " num += 1\n", " print(num)\n", " tf.print(num)\n", " return num\n", "\n", "_ = dummy_func(tf.constant([1.0]))\n", "\n", "# Output:\n", "# Tensor(\"add:0\", shape=(1,), dtype=float32)\n", "# [2]" ] }, { "cell_type": "markdown", "metadata": { "id": "3QroLA_zDK2w" }, "source": [ "tf.distribute.Strategy\n", "\n", "- `tf.print` を含む `tf.function` がワーカーで実行される場合(たとえば、`TPUStrategy` または `ParameterServerStrategy` を使用する場合)、出力された値を見つけるには、ワーカー/パラメータサーバーログを確認する必要があります。\n", "- `print` または `logging.info` の場合、`ParameterServerStrategy` を使用するとログがコーディネータに出力され、TPU を使用する場合は、ログは worker0 の STDOUT に出力されます。\n", "\n", "tf.keras.Model\n", "\n", "- Sequential API モデルと Functional API モデルを使用する場合、モデル入力やいくつかのレイヤーの後の中間特徴などの値を出力する場合は、次のオプションがあります。\n", " 1. 入力を `tf.print ` する[カスタムレイヤーを作成します。](https://www.tensorflow.org/guide/keras/custom_layers_and_models)\n", " 2. 調査する中間出力をモデル出力に含めます。\n", "- `tf.keras.layers.Lambda` レイヤーには(逆)シリアル化の制限があります。チェックポイントの読み込みの問題を回避するには、カスタムサブクラス化されたレイヤーを記述します。詳しくは、[API ドキュメント](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda)を参照してください。\n", "- 実際の値にアクセスできない場合、`tf.keras.callbacks.LambdaCallback` で中間出力を `tf.print` することはできませんが、シンボリック Keras テンソルオブジェクトにのみアクセスできます。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aKazGTr1ZUMG" }, "source": [ "オプション 1: カスタムレイヤーを作成します。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:42.678178Z", "iopub.status.busy": "2024-01-11T18:12:42.677633Z", "iopub.status.idle": "2024-01-11T18:12:43.070640Z", "shell.execute_reply": "2024-01-11T18:12:43.069644Z" }, "id": "8w4aY7wO0B4W" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[-0.327884018]\n", " [-0.109294683]\n", " [-0.218589365]]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/1 [==============================] - ETA: 0s - loss: 0.6077" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "1/1 [==============================] - 0s 277ms/step - loss: 0.6077\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class PrintLayer(tf.keras.layers.Layer):\n", " def call(self, inputs):\n", " tf.print(inputs)\n", " return inputs\n", "\n", "def get_model():\n", " inputs = tf.keras.layers.Input(shape=(1,))\n", " out_1 = tf.keras.layers.Dense(4)(inputs)\n", " out_2 = tf.keras.layers.Dense(1)(out_1)\n", " # use custom layer to tf.print intermediate features\n", " out_3 = PrintLayer()(out_2)\n", " model = tf.keras.Model(inputs=inputs, outputs=out_3)\n", " return model\n", "\n", "model = get_model()\n", "model.compile(optimizer=\"adam\", loss=\"mse\")\n", "model.fit([1, 2, 3], [0.0, 0.0, 1.0])" ] }, { "cell_type": "markdown", "metadata": { "id": "KNESOatq7iM9" }, "source": [ "オプション 2: 調査する中間出力をモデル出力に含めます。\n", "\n", "このような場合、`Model.fit` を使用するには、いくつかの[カスタマイズ](https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit)が必要になる場合があることに注意してください。" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:12:43.074671Z", "iopub.status.busy": "2024-01-11T18:12:43.074151Z", "iopub.status.idle": "2024-01-11T18:12:43.079185Z", "shell.execute_reply": "2024-01-11T18:12:43.078273Z" }, "id": "MiifvdLk7g9J" }, "outputs": [], "source": [ "def get_model():\n", " inputs = tf.keras.layers.Input(shape=(1,))\n", " out_1 = tf.keras.layers.Dense(4)(inputs)\n", " out_2 = tf.keras.layers.Dense(1)(out_1)\n", " # include intermediate values in model outputs\n", " model = tf.keras.Model(\n", " inputs=inputs,\n", " outputs={\n", " 'inputs': inputs,\n", " 'out_1': out_1,\n", " 'out_2': out_2})\n", " return model" ] }, { "cell_type": "markdown", "metadata": { "id": "MvIKDZpHSLmQ" }, "source": [ "### pdb\n", "\n", "端末と Colab の両方で [pdb](https://docs.python.org/3/library/pdb.html) を使用して、デバッグ用の中間値を調べることができます。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Qu0n4O2umyT7" }, "source": [ "### TensorBoard でグラフを可視化する\n", "\n", "[TensorBoard を使用すると TensorFlow のグラフを調べられます](https://www.tensorflow.org/tensorboard/graphs)。TensorBoard は、要約を視覚化する優れたツールで [colab でもサポート](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks)されています。これを使用して、トレーニングプロセスを通じて TF1.x モデルと移行された TF2 モデルの間で学習率、モデルの重み、勾配スケーリング、トレーニング/検証指標、および、モデルの中間出力を比較し、値が期待どおりになっているかどうかを確認できます。" ] }, { "cell_type": "markdown", "metadata": { "id": "vBnxB6_xzlnT" }, "source": [ "### TensorFlow Profiler\n", "\n", "[TensorFlow Profiler](https://www.tensorflow.org/guide/profiler) は、GPU/TPU での実行タイムラインを視覚化するのに役立ちます。基本的な使い方については、この [Colab デモ](https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras)を参照してください。" ] }, { "cell_type": "markdown", "metadata": { "id": "9wNmCSHBpiGM" }, "source": [ "### MultiProcessRunner\n", "\n", "[MultiProcessRunner](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/multi_process_runner.py#L108) は、MultiWorkerMirroredStrategy と ParameterServerStrategy でデバッグする際に便利なツールです。使用法については、[この具体的な例](https://github.com/keras-team/keras/blob/master/keras/integration_test/mwms_multi_process_runner_test.py)を参照してください。\n", "\n", "特にこれら 2 つのストラテジーのケースでは、1) フローをカバーする単体テストを用意し、2) 単体テストでこれを使用して失敗を再現してみることをお勧めします。これは、修正を試みるたびに実際の分散ジョブが起動されることを避けるためです。" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "migration_debugging.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }