{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2022-12-14T21:09:20.568832Z", "iopub.status.busy": "2022-12-14T21:09:20.568366Z", "iopub.status.idle": "2022-12-14T21:09:20.572020Z", "shell.execute_reply": "2022-12-14T21:09:20.571498Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "qFdPvlXBOdUN" }, "source": [ "# 混合精度" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## 概述\n", "\n", "混合精度是指训练时在模型中同时使用 16 位和 32 位浮点类型,从而加快运行速度,减少内存使用的一种训练方法。通过让模型的某些部分保持使用 32 位类型以保持数值稳定性,可以缩短模型的单步用时,而在评估指标(如准确率)方面仍可以获得同等的训练效果。本指南介绍如何使用 Keras 混合精度 API 来加快模型速度。利用此 API 可以在现代 GPU 上将性能提高三倍以上,而在 TPU 上可以提高 60%。" ] }, { "cell_type": "markdown", "metadata": { "id": "3vsYi_bv7gS_" }, "source": [ "如今,大多数模型使用 float32 dtype,这种数据类型占用 32 位内存。但是,还有两种精度较低的 dtype,即 float16 和 bfloat16,它们都是占用 16 位内存。现代加速器使用 16 位 dtype 执行运算的速度更快,因为它们有执行 16 位计算的专用硬件,并且从内存中读取 16 位 dtype 的速度也更快。\n", "\n", "NVIDIA GPU 使用 float16 执行运算的速度比使用 float32 快,而 TPU 使用 bfloat16 执行运算的速度也比使用 float32 快。因此,在这些设备上应尽可能使用精度较低的 dtype。但是,出于对数值的要求,为了让模型训练获得相同的质量,一些变量和计算仍需使用 float32。利用 Keras 混合精度 API,float16 或 bfloat16 可以与 float32 混合使用,从而既可以获得 float16/bfloat16 的性能优势,也可以获得 float32 的数值稳定性。\n", "\n", "注:在本指南中,术语“数值稳定性”是指使用较低精度的 dtype(而不是较高精度的 dtype)对模型质量的影响。如果使用 float16 或 bfloat16 执行运算,则与使用 float32 执行运算相比,使用这些较低精度的 dtype 会导致模型获得的评估准确率或其他指标相对较低,那么我们就说这种运算“数值不稳定”。" ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## 设置" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:20.575426Z", "iopub.status.busy": "2022-12-14T21:09:20.575200Z", "iopub.status.idle": "2022-12-14T21:09:22.498402Z", "shell.execute_reply": "2022-12-14T21:09:22.497632Z" }, "id": "IqR2PQG4ZaZ0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 21:09:21.518797: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 21:09:21.518898: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", 
"2022-12-14 21:09:21.518908: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] } ], "source": [ "import tensorflow as tf\n", "\n", "from tensorflow import keras\n", "from tensorflow.keras import layers\n", "from tensorflow.keras import mixed_precision" ] }, { "cell_type": "markdown", "metadata": { "id": "814VXqdh8Q0r" }, "source": [ "## 支持的硬件\n", "\n", "虽然混合精度在大多数硬件上都可以运行,但是在最新的 NVIDIA GPU 和 Cloud TPU 上才能加速模型。NVIDIA GPU 支持混合使用 float16 和 float32,而 TPU 则支持混合使用 bfloat16 和 float32。\n", "\n", "在 NVIDIA GPU 中,计算能力为 7.0 或更高的 GPU 可以获得混合精度的最大性能优势,因为这些型号具有称为 Tensor 核心的特殊硬件单元,可以加速 float16 矩阵乘法和卷积运算。旧款 GPU 使用混合精度无法实现数学运算性能优势,不过可以节省内存和带宽,因此也可以在一定程度上提高速度。您可以在 NVIDIA 的 [CUDA GPU 网页](https://developer.nvidia.com/cuda-gpus)上查询 GPU 的计算能力。可以最大程度从混合精度受益的 GPU 示例包括 RTX GPU、V100 和 A100。" ] }, { "cell_type": "markdown", "metadata": { "id": "-q2hisD60F0_" }, "source": [ "注:如果在 Google Colab 中运行本指南中示例,则 GPU 运行时通常会连接 P100。P100 的计算能力为 6.0,预计速度提升不明显。\n", "\n", "您可以使用以下命令检查 GPU 类型。如果要使用此命令,必须安装 NVIDIA 驱动程序,否则会引发错误。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.503113Z", "iopub.status.busy": "2022-12-14T21:09:22.502346Z", "iopub.status.idle": "2022-12-14T21:09:22.662857Z", "shell.execute_reply": "2022-12-14T21:09:22.662044Z" }, "id": "j-Yzg_lfkoa_" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-60e44e96-c278-c603-6807-22566f25de74)\r\n", "GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-6702b791-ca8d-bd30-257a-f7397b6dfc84)\r\n", "GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-95b64f9f-e125-8b72-616a-849e27f562f7)\r\n", "GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-ab60c099-7b6d-b830-e8eb-4ed4bd54e0dc)\r\n" ] } ], "source": [ "!nvidia-smi -L" ] }, { "cell_type": "markdown", "metadata": { 
"id": "hu_pvZDN0El3" }, "source": [ "所有 Cloud TPU 均支持 bfloat16。\n", "\n", "即使在预计无法提升速度的 CPU 和旧款 GPU 上,混合精度 API 仍可以用于单元测试、调试或试用 API。不过,在 CPU 上,混合精度的运行速度会明显变慢。" ] }, { "cell_type": "markdown", "metadata": { "id": "HNOmvumB-orT" }, "source": [ "## 设置 dtype 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "54ecYY2Hn16E" }, "source": [ "要在 Keras 中使用混合精度,您需要创建一个 `tf.keras.mixed_precision.Policy`(通常称为 *dtype 策略*)。dtype 策略指定层运行时所使用的 dtype。在本指南中,您将从字符串 `'mixed_float16'` 构造策略,并将其设置为全局策略。这会导致随后创建的层使用 float16 和 float32 的混合精度。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.666888Z", "iopub.status.busy": "2022-12-14T21:09:22.666463Z", "iopub.status.idle": "2022-12-14T21:09:22.798341Z", "shell.execute_reply": "2022-12-14T21:09:22.797698Z" }, "id": "x3kElPVH-siO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING\n", "Your GPUs may run slowly with dtype policy mixed_float16 because they do not have compute capability of at least 7.0. Your GPUs:\n", " Tesla P100-PCIE-16GB, compute capability 6.0 (x4)\n", "See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.\n", "If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. 
This message will only be logged once\n" ] } ], "source": [ "policy = mixed_precision.Policy('mixed_float16')\n", "mixed_precision.set_global_policy(policy)" ] }, { "cell_type": "markdown", "metadata": { "id": "6ids1rT_UM5q" }, "source": [ "您也可以直接将字符串传递给 `set_global_policy`,实践中通常采用这种简写。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.802046Z", "iopub.status.busy": "2022-12-14T21:09:22.801489Z", "iopub.status.idle": "2022-12-14T21:09:22.804593Z", "shell.execute_reply": "2022-12-14T21:09:22.804022Z" }, "id": "6a8iNFoBUSqR" }, "outputs": [], "source": [ "# Equivalent to the two lines above\n", "mixed_precision.set_global_policy('mixed_float16')" ] }, { "cell_type": "markdown", "metadata": { "id": "oGAMaa0Ho3yk" }, "source": [ "该策略指定了层的两个重要方面:层计算所使用的 dtype 和层变量的 dtype。上面的代码创建了一个 `mixed_float16` 策略(即通过将字符串 `'mixed_float16'` 传递给其构造函数而构建的 `mixed_precision.Policy`)。凭借此策略,层可以使用 float16 计算和 float32 变量。计算使用 float16 来提高性能,而变量使用 float32 来确保数值稳定性。您可以直接在策略中查询这些属性。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.807970Z", "iopub.status.busy": "2022-12-14T21:09:22.807391Z", "iopub.status.idle": "2022-12-14T21:09:22.810943Z", "shell.execute_reply": "2022-12-14T21:09:22.810331Z" }, "id": "GQRbYm4f8p-k" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Compute dtype: float16\n", "Variable dtype: float32\n" ] } ], "source": [ "print('Compute dtype: %s' % policy.compute_dtype)\n", "print('Variable dtype: %s' % policy.variable_dtype)" ] }, { "cell_type": "markdown", "metadata": { "id": "MOFEcna28o4T" }, "source": [ "如前所述,在计算能力至少为 7.0 的 NVIDIA GPU 上,`mixed_float16` 策略可以大幅提升性能。在其他 GPU 和 CPU 上,该策略也可以运行,但可能无法提升性能。对于 TPU,则应使用 `mixed_bfloat16` 策略。" ] }, { "cell_type": "markdown", "metadata": { "id": "cAHpt128tVpK" }, "source": [ "## 构建模型" ] }, { "cell_type": "markdown", "metadata": { "id": "nB6ujaR8qMAy" }, "source": 
[ "接下来,我们开始构建一个简单的模型。过小的模型往往无法获得混合精度的优势,因为 TensorFlow 运行时的开销通常占据大部分执行时间,导致 GPU 的性能提升几乎可以忽略不计。因此,如果使用 GPU,我们会构建两个比较大的 `Dense` 层,每个层具有 4096 个单元。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.814484Z", "iopub.status.busy": "2022-12-14T21:09:22.814087Z", "iopub.status.idle": "2022-12-14T21:09:26.117589Z", "shell.execute_reply": "2022-12-14T21:09:26.116851Z" }, "id": "0DQM24hL_14Q" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The model will run with 4096 units on a GPU\n" ] } ], "source": [ "inputs = keras.Input(shape=(784,), name='digits')\n", "if tf.config.list_physical_devices('GPU'):\n", " print('The model will run with 4096 units on a GPU')\n", " num_units = 4096\n", "else:\n", " # Use fewer units on CPUs so the model finishes in a reasonable amount of time\n", " print('The model will run with 64 units on a CPU')\n", " num_units = 64\n", "dense1 = layers.Dense(num_units, activation='relu', name='dense_1')\n", "x = dense1(inputs)\n", "dense2 = layers.Dense(num_units, activation='relu', name='dense_2')\n", "x = dense2(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "2dezdcqnOXHk" }, "source": [ "每个层都有一条策略,默认情况下会使用全局策略。因此,每个 `Dense` 层都具有 `mixed_float16` 策略,这是因为之前已将 `mixed_float16` 设置为全局策略。这样,dense 层就会执行 float16 计算,并使用 float32 变量。为了执行 float16 计算,它们会将输入转换为 float16 类型,因此,输出也是 float16 类型。它们的变量是 float32 类型,在调用层时,它们会将变量转换为 float16 类型,从而避免 dtype 不匹配所引起的错误。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.121570Z", "iopub.status.busy": "2022-12-14T21:09:26.120955Z", "iopub.status.idle": "2022-12-14T21:09:26.124819Z", "shell.execute_reply": "2022-12-14T21:09:26.124238Z" }, "id": "kC58MzP4PEcC" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "x.dtype: float16\n", "dense1.kernel.dtype: float32\n" ] } ], "source": [ "print(dense1.dtype_policy)\n", "print('x.dtype: 
%s' % x.dtype.name)\n", "# 'kernel' is dense1's variable\n", "print('dense1.kernel.dtype: %s' % dense1.kernel.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "_WAZeqDyqZcb" }, "source": [ "接下来创建输出预测。通常,您可以按如下方法创建输出预测,但是对于 float16,其结果不一定具有数值稳定性。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.128035Z", "iopub.status.busy": "2022-12-14T21:09:26.127526Z", "iopub.status.idle": "2022-12-14T21:09:26.143915Z", "shell.execute_reply": "2022-12-14T21:09:26.143267Z" }, "id": "ybBq1JDwNIbz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outputs dtype: float16\n" ] } ], "source": [ "# INCORRECT: softmax and model output will be float16, when it should be float32\n", "outputs = layers.Dense(10, activation='softmax', name='predictions')(x)\n", "print('Outputs dtype: %s' % outputs.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "D0gSWxc9NN7q" }, "source": [ "模型末尾的 softmax 激活值本应为 float32 类型。但由于 dtype 策略是 `mixed_float16`,softmax 激活通常会使用 float16 dtype 进行计算,并且会输出 float16 张量。\n", "\n", "这一问题可以通过分离 Dense 和 softmax 层,并将 `dtype='float32'` 传递至 softmax 层来解决。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.147032Z", "iopub.status.busy": "2022-12-14T21:09:26.146624Z", "iopub.status.idle": "2022-12-14T21:09:26.161597Z", "shell.execute_reply": "2022-12-14T21:09:26.161049Z" }, "id": "IGqCGn4BsODw" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outputs dtype: float32\n" ] } ], "source": [ "# CORRECT: softmax and model output are float32\n", "x = layers.Dense(10, name='dense_logits')(x)\n", "outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)\n", "print('Outputs dtype: %s' % outputs.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "tUdkY_DHsP8i" }, "source": [ "将 `dtype='float32'` 传递至 softmax 层的构造函数会将该层的 dtype 策略重写为 
`float32` 策略,从而由后者执行计算并保持变量为 float32 类型。同样,我们也可以传递 `dtype=mixed_precision.Policy('float32')`;层始终将 dtype 参数转换为策略。由于 `Activation` 层没有变量,因此会忽略该策略的变量 dtype,但是该策略的计算 dtype 为 float32,因此 softmax 和模型的输出也是 float32。\n", "\n", "您可以在模型中间添加 float16 类型的 softmax,但模型末尾的 softmax 应为 float32 类型。原因是,如果从 softmax 传递给损失函数的中间张量是 float16 或 bfloat16 类型,则会出现数值问题。\n", "\n", "如果您认为使用 float16 计算无法获得数值稳定性,则可以通过传递 `dtype='float32'`,将任何层的 dtype 重写为 float32 类型。但通常,只有模型的最后一层才需要这样重写,因为对大多数层来说,`mixed_float16` 和 `mixed_bfloat16` 的精度已经足够。\n", "\n", "即使模型不以 softmax 结尾,输出也仍应为 float32 类型。虽然对这一特定模型来说并非必需,但可以使用以下代码将模型输出转换为 float32 类型:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.165065Z", "iopub.status.busy": "2022-12-14T21:09:26.164405Z", "iopub.status.idle": "2022-12-14T21:09:26.169587Z", "shell.execute_reply": "2022-12-14T21:09:26.169018Z" }, "id": "dzVAoLI56jR8" }, "outputs": [], "source": [ "# The linear activation is an identity function. So this simply casts 'outputs'\n", "# to float32. 
In this particular case, 'outputs' is already float32 so this is a\n", "# no-op.\n", "outputs = layers.Activation('linear', dtype='float32')(outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "tpY4ZP7us5hA" }, "source": [ "接下来,完成并编译模型,并生成输入数据:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.172836Z", "iopub.status.busy": "2022-12-14T21:09:26.172470Z", "iopub.status.idle": "2022-12-14T21:09:26.540662Z", "shell.execute_reply": "2022-12-14T21:09:26.539919Z" }, "id": "g4OT3Z6kqYAL" }, "outputs": [], "source": [ "model = keras.Model(inputs=inputs, outputs=outputs)\n", "model.compile(loss='sparse_categorical_crossentropy',\n", " optimizer=keras.optimizers.RMSprop(),\n", " metrics=['accuracy'])\n", "\n", "(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()\n", "x_train = x_train.reshape(60000, 784).astype('float32') / 255\n", "x_test = x_test.reshape(10000, 784).astype('float32') / 255" ] }, { "cell_type": "markdown", "metadata": { "id": "0Sm8FJHegVRN" }, "source": [ "本示例将输入数据从 int8 强制转换为 float32。我们不转换为 float16 是因为在 CPU 上除以 255 时,float16 的运算速度比 float32 慢。在这种情况下,性能差距可以忽略不计,但一般来说,如果输入处理的数学运算需要在 CPU 上执行,则应使用 float32 类型。该模型的第一层会将输入转换为 float16,因为每一层都会将浮点输入强制转换为其计算 dtype。\n", "\n", "检索模型的初始权重。这样,以后就可以通过重新加载这些权重从头开始训练。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.545068Z", "iopub.status.busy": "2022-12-14T21:09:26.544481Z", "iopub.status.idle": "2022-12-14T21:09:26.664167Z", "shell.execute_reply": "2022-12-14T21:09:26.663472Z" }, "id": "0UYs-u_DgiA5" }, "outputs": [], "source": [ "initial_weights = model.get_weights()" ] }, { "cell_type": "markdown", "metadata": { "id": "zlqz6eVKs9aU" }, "source": [ "## 使用 Model.fit 训练模型\n", "\n", "接下来,训练模型:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.668223Z", "iopub.status.busy": 
"2022-12-14T21:09:26.667653Z", "iopub.status.idle": "2022-12-14T21:09:34.048180Z", "shell.execute_reply": "2022-12-14T21:09:34.047337Z" }, "id": "hxI7-0ewmC0A" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "6/6 [==============================] - 3s 202ms/step - loss: 2.5948 - accuracy: 0.4021 - val_loss: 0.9373 - val_accuracy: 0.7832\n", "Epoch 2/5\n", "6/6 [==============================] - 1s 153ms/step - loss: 0.9973 - accuracy: 0.7110 - val_loss: 0.6015 - val_accuracy: 0.8052\n", "Epoch 3/5\n", "6/6 [==============================] - 1s 155ms/step - loss: 0.5519 - accuracy: 0.8236 - val_loss: 0.3905 - val_accuracy: 0.8788\n", "Epoch 4/5\n", "6/6 [==============================] - 1s 156ms/step - loss: 0.4389 - accuracy: 0.8602 - val_loss: 0.3258 - val_accuracy: 0.9003\n", "Epoch 5/5\n", "6/6 [==============================] - 1s 154ms/step - loss: 0.3266 - accuracy: 0.8994 - val_loss: 0.2298 - val_accuracy: 0.9333\n", "313/313 - 1s - loss: 0.2349 - accuracy: 0.9330 - 635ms/epoch - 2ms/step\n", "Test loss: 0.23491430282592773\n", "Test accuracy: 0.9330000281333923\n" ] } ], "source": [ "history = model.fit(x_train, y_train,\n", " batch_size=8192,\n", " epochs=5,\n", " validation_split=0.2)\n", "test_scores = model.evaluate(x_test, y_test, verbose=2)\n", "print('Test loss:', test_scores[0])\n", "print('Test accuracy:', test_scores[1])\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MPhJ9OPWt4x5" }, "source": [ "请注意,模型会在日志中打印每个步骤的时间:例如,“25ms/step”。第一个周期可能会变慢,因为 TensorFlow 会花一些时间来优化模型,但之后每个步骤的时间应当会稳定下来。\n", "\n", "如果在 Colab 中运行本指南,您可以将混合精度与 float32 的性能进行比较。为此,请在“设置 dtype 策略”部分将策略从 `mixed_float16` 更改为 `float32`,然后重新运行所有代码单元,直到此代码点。在计算能力至少为 7.0 的 GPU 上,您会发现每个步骤的时间大大增加,表明混合精度提升了模型的速度。在继续学习本指南之前,请确保将策略改回 `mixed_float16` 并重新运行代码单元。\n", "\n", "在计算能力至少为 8.0 的 GPU(Ampere GPU 及更高版本)上,使用混合精度时,与使用 float32 相比,您可能看不到本指南中小模型的性能提升。这是由于使用 [TensorFloat-32](https://tensorflow.google.cn/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution) 导致的,它会在 `tf.linalg.matmul` 等某些 float32 运算中自动使用较低精度的数学计算。使用 float32 时,TensorFloat-32 
会展现混合精度的一些性能优势。不过,在真实模型中,由于内存带宽节省和 TensorFloat-32 不支持的运算,您通常仍会看到混合精度的显著性能提升。\n", "\n", "如果在 TPU 上运行混合精度,您会发现与在 GPU(尤其是 Ampere 架构之前的 GPU)上运行混合精度相比,性能提升并不明显。这是因为即使默认 dtype 策略为 float32,TPU 也会在后台执行一些 bfloat16 运算。这类似于 Ampere GPU 默认使用 TensorFloat-32 的方式。在实际模型上使用混合精度时,与 Ampere GPU 相比,TPU 获得的性能提升通常较少。\n", "\n", "对于很多实际模型,使用混合精度时还可以将批次大小加倍而不会耗尽内存,因为 float16 张量只需要使用 float32 一半的内存。不过,这对本文中所讲的小模型毫无意义,因为您几乎可以使用任何 dtype 来运行该模型,而每个批次可以包含有 60,000 张图片的整个 MNIST 数据集。" ] }, { "cell_type": "markdown", "metadata": { "id": "mNKMXlCvHgHb" }, "source": [ "## 损失放大\n", "\n", "损失放大是 `tf.keras.Model.fit` 使用 `mixed_float16` 策略自动执行,从而避免数值下溢的一种技术。本部分介绍什么是损失放大,下一部分介绍如何将其与自定义训练循环一起使用。" ] }, { "cell_type": "markdown", "metadata": { "id": "1xQX62t2ow0g" }, "source": [ "### 下溢和溢出\n", "\n", "float16 数据类型的动态范围比 float32 窄。这意味着大于 $65504$ 的数值会因溢出而变为无穷大,小于 $6.0 \\times 10^{-8}$ 的数值则会因下溢而变成零。float32 和 bfloat16 的动态范围要大得多,因此一般不会出现下溢或溢出的问题。\n", "\n", "例如:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.052071Z", "iopub.status.busy": "2022-12-14T21:09:34.051324Z", "iopub.status.idle": "2022-12-14T21:09:34.061270Z", "shell.execute_reply": "2022-12-14T21:09:34.060663Z" }, "id": "CHmXRb-yRWbE" }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = tf.constant(256, dtype='float16')\n", "(x ** 2).numpy() # Overflow" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.064560Z", "iopub.status.busy": "2022-12-14T21:09:34.063983Z", "iopub.status.idle": "2022-12-14T21:09:34.069174Z", "shell.execute_reply": "2022-12-14T21:09:34.068554Z" }, "id": "5unZLhN0RfQM" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = tf.constant(1e-5, dtype='float16')\n", "(x ** 2).numpy() # Underflow" ] }, { 
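"cell_type": "markdown", "metadata": {}, "source": [ "补充说明(非本指南原文;NumPy 没有 bfloat16 类型,这里用 NumPy 的 IEEE float16,其语义与 TensorFlow 的 float16 一致,可直接验证上文提到的溢出与下溢边界):\n", "\n", "```python\n", "import numpy as np\n", "\n", "info = np.finfo(np.float16)\n", "print(info.max)   # float16 最大正规值 65504.0,超过它即溢出为 inf\n", "print(info.tiny)  # float16 最小正规值,约 6.1e-5\n", "\n", "print(np.float16(256) * np.float16(256))    # inf:65536 溢出\n", "print(np.float16(1e-5) * np.float16(1e-5))  # 0.0:1e-10 下溢\n", "```\n" ] }, { 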
"cell_type": "markdown", "metadata": { "id": "pUIbhQypRVe_" }, "source": [ "实际上,float16 极少出现溢出的情况。此外,在正向传递中,下溢的情形也十分罕见。但是,在反向传递中,梯度可能因下溢而变为零。损失放大就是一个防止出现这种下溢的技巧。" ] }, { "cell_type": "markdown", "metadata": { "id": "FAL5qij_oNqJ" }, "source": [ "### 损失放大概述\n", "\n", "损失放大的基本概念非常简单:只需将损失乘以某个大数字(如 $1024$),即可得到*损失放大*值。这会将梯度放大 $1024$ 倍,大大降低了发生下溢的几率。计算出最终梯度后,将其除以 $1024$ 即可得到正确值。\n", "\n", "该过程的伪代码是:\n", "\n", "```\n", "loss_scale = 1024\n", "loss = model(inputs)\n", "loss *= loss_scale\n", "# Assume `grads` are float32. You do not want to divide float16 gradients.\n", "grads = compute_gradient(loss, model.trainable_variables)\n", "grads /= loss_scale\n", "```\n", "\n", "选择合适的损失标度比较困难。如果损失标度太小,梯度可能仍会因下溢而变为零。如果太大,则会出现相反的问题:梯度可能因溢出而变为无穷大。\n", "\n", "为了解决这一问题,TensorFlow 会动态确定损失标度,因此,您不必手动选择。如果使用 `tf.keras.Model.fit`,则会自动完成损失放大,您不必做任何额外的工作。如果您使用自定义训练循环,则必须显式使用特殊的优化器封装容器 `tf.keras.mixed_precision.LossScaleOptimizer` 才能使用损失放大。下一部分会对此进行介绍。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yqzbn8Ks9Q98" }, "source": [ "## 使用自定义训练循环训练模型" ] }, { "cell_type": "markdown", "metadata": { "id": "CRANRZZ69nA7" }, "source": [ "到目前为止,您已经使用 `tf.keras.Model.fit` 训练了一个具有混合精度的 Keras 模型。接下来,您会将混合精度与自定义训练循环一起使用。如果您还不知道什么是自定义训练循环,请先阅读[自定义训练指南](../tutorials/customization/custom_training_walkthrough.ipynb)。" ] }, { "cell_type": "markdown", "metadata": { "id": "wXTaM8EEyEuo" }, "source": [ "使用混合精度运行自定义训练循环需要对使用 float32 运行训练的模型进行两方面的更改:\n", "\n", "1. 使用混合精度构建模型(已完成)\n", "2. 
如果使用 `mixed_float16`,则明确使用损失放大。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "M2zpp7_65mTZ" }, "source": [ "对于步骤 (2),您将使用 `tf.keras.mixed_precision.LossScaleOptimizer` 类,此类会封装优化器并应用损失放大。默认情况下,它会动态地确定损失标度,因此您不必自行选择。按如下方式构造一个 `LossScaleOptimizer`。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.073113Z", "iopub.status.busy": "2022-12-14T21:09:34.072491Z", "iopub.status.idle": "2022-12-14T21:09:34.077961Z", "shell.execute_reply": "2022-12-14T21:09:34.077357Z" }, "id": "ogZN3rIH0vpj" }, "outputs": [], "source": [ "optimizer = keras.optimizers.RMSprop()\n", "optimizer = mixed_precision.LossScaleOptimizer(optimizer)" ] }, { "cell_type": "markdown", "metadata": { "id": "FVy5gnBqTE9z" }, "source": [ "如果您愿意,可以选择一个显式损失标度或以其他方式自定义损失放大行为,但强烈建议保留默认的损失放大行为,因为经过验证,它可以在所有已知模型上很好地工作。如果要自定义损失放大行为,请参阅 `tf.keras.mixed_precision.LossScaleOptimizer` 文档。" ] }, { "cell_type": "markdown", "metadata": { "id": "JZYEr5hA3MXZ" }, "source": [ "接下来,定义损失对象和 `tf.data.Dataset`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.081473Z", "iopub.status.busy": "2022-12-14T21:09:34.080926Z", "iopub.status.idle": "2022-12-14T21:09:34.492096Z", "shell.execute_reply": "2022-12-14T21:09:34.491400Z" }, "id": "9cE7Mm533hxe" }, "outputs": [], "source": [ "loss_object = tf.keras.losses.SparseCategoricalCrossentropy()\n", "train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", " .shuffle(10000).batch(8192))\n", "test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(8192)" ] }, { "cell_type": "markdown", "metadata": { "id": "4W0zxrxC3nww" }, "source": [ "接下来,定义训练步骤函数。您将使用损失放大优化器中的两个新方法来放大损失和缩小梯度:\n", "\n", "- `get_scaled_loss(loss)`:将损失值乘以损失标度值\n", "- `get_unscaled_gradients(gradients)`:获取一系列放大的梯度作为输入,并将每一个梯度除以损失标度,从而将其缩小为实际值\n", "\n", "为了防止梯度发生下溢,必须使用这些函数。随后,如果梯度全部没有出现 Inf 或 NaN 值,则 
`LossScaleOptimizer.apply_gradients` will apply those gradients. It will also update the loss scale, halving it if the gradients contained Inf or NaN values and potentially increasing it otherwise." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.496344Z", "iopub.status.busy": "2022-12-14T21:09:34.495682Z", "iopub.status.idle": "2022-12-14T21:09:34.500321Z", "shell.execute_reply": "2022-12-14T21:09:34.499703Z" }, "id": "V0vHlust4Rug" }, "outputs": [], "source": [ "@tf.function\n", "def train_step(x, y):\n", "  with tf.GradientTape() as tape:\n", "    predictions = model(x)\n", "    loss = loss_object(y, predictions)\n", "    scaled_loss = optimizer.get_scaled_loss(loss)\n", "  scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)\n", "  gradients = optimizer.get_unscaled_gradients(scaled_gradients)\n", "  optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n", "  return loss" ] }, { "cell_type": "markdown", "metadata": { "id": "rcFxEjia6YPQ" }, "source": [ "`LossScaleOptimizer` will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can quickly be determined. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality." ] }, { "cell_type": "markdown", "metadata": { "id": "IHIvKKhg4Y-G" }, "source": [ "Now, define the test step:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.503786Z", "iopub.status.busy": "2022-12-14T21:09:34.503286Z", "iopub.status.idle": "2022-12-14T21:09:34.506519Z", "shell.execute_reply": "2022-12-14T21:09:34.505954Z" }, "id": "nyk_xiZf42Tt" }, "outputs": [], "source": [ "@tf.function\n", "def test_step(x):\n", "  return model(x, training=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "hBs98MZyhBOB" }, "source": [ "Reload the initial weights of the model, so you can retrain from scratch:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.509739Z", "iopub.status.busy": "2022-12-14T21:09:34.509269Z", "iopub.status.idle": "2022-12-14T21:09:34.604393Z", "shell.execute_reply": 
"2022-12-14T21:09:34.603809Z" }, "id": "jpzOe3WEhFUJ" }, "outputs": [], "source": [ "model.set_weights(initial_weights)" ] }, { "cell_type": "markdown", "metadata": { "id": "s9Pi1ADM47Ud" }, "source": [ "最后,运行自定义训练循环:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.608139Z", "iopub.status.busy": "2022-12-14T21:09:34.607533Z", "iopub.status.idle": "2022-12-14T21:09:41.631948Z", "shell.execute_reply": "2022-12-14T21:09:41.631232Z" }, "id": "N274tJ3e4_6t" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0: loss=2.159815549850464, test accuracy=0.7400000095367432\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1: loss=0.9636000394821167, test accuracy=0.8058000206947327\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 2: loss=0.4238981604576111, test accuracy=0.8766999840736389\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 3: loss=0.3603671193122864, test accuracy=0.921999990940094\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: loss=0.3221040666103363, test accuracy=0.9355999827384949\n" ] } ], "source": [ "for epoch in range(5):\n", " epoch_loss_avg = tf.keras.metrics.Mean()\n", " test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='test_accuracy')\n", " for x, y in train_dataset:\n", " loss = train_step(x, y)\n", " epoch_loss_avg(loss)\n", " for x, y in test_dataset:\n", " predictions = test_step(x)\n", " test_accuracy.update_state(y, predictions)\n", " print('Epoch {}: loss={}, test accuracy={}'.format(epoch, epoch_loss_avg.result(), test_accuracy.result()))" ] }, { "cell_type": "markdown", "metadata": { "id": "d7daQKGerOFE" }, "source": [ "## GPU 性能提示\n", "\n", "下面是在 GPU 上使用混合精度时的一些性能提示。\n", "\n", "### 增大批次大小\n", "\n", "当使用混合精度时,如果不影响模型质量,可以尝试使用双倍批次大小运行。因为 float16 张量只使用一半内存,所以,您通常可以将批次大小增大一倍,而不会耗尽内存。增大批次大小通常可以提高训练吞吐量,即模型每秒可以运行的训练元素数量。\n", "\n", "### 
Ensuring GPU Tensor Cores are used\n", "\n", "As mentioned previously, modern NVIDIA GPUs use a special hardware unit called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8. In the examples below, an argument is bold if and only if it needs to be a multiple of 8 for Tensor Cores to be used.\n", "\n", "- tf.keras.layers.Dense(**units=64**)\n", "- tf.keras.layers.Conv2d(**filters=48**, kernel_size=7, stride=3)\n", "    - And similarly for other convolutional layers, such as tf.keras.layers.Conv3d\n", "- tf.keras.layers.LSTM(**units=64**)\n", "    - And similarly for other RNNs, such as tf.keras.layers.GRU\n", "- tf.keras.Model.fit(epochs=2, **batch_size=128**)\n", "\n", "You should try to use Tensor Cores when possible. If you want to learn more, the [NVIDIA deep learning performance guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html) describes the exact requirements for using Tensor Cores, as well as other Tensor Core-related performance information.\n", "\n", "### XLA\n", "\n", "XLA is a compiler that can further increase mixed precision performance, as well as float32 performance to a lesser extent. Refer to the [XLA guide](https://tensorflow.google.cn/xla) for details." ] }, { "cell_type": "markdown", "metadata": { "id": "2tFDX8fm6o_3" }, "source": [ "## Cloud TPU performance tips\n", "\n", "As with GPUs, you should try doubling your batch size, because bfloat16 tensors also use half the memory. Doubling the batch size may increase training throughput.\n", "\n", "TPUs do not require any other mixed precision-specific tuning to get optimal performance. They already require the use of XLA, and they benefit from having certain dimensions be multiples of $128$, but this applies equally to float32 as it does to mixed precision. Check the [Cloud TPU performance guide](https://cloud.google.com/tpu/docs/performance-guide) for general TPU performance tips, which apply to mixed precision as well as float32 tensors." ] }, { "cell_type": "markdown", "metadata": { "id": "--wSEU91wO9w" }, "source": [ "## Summary\n", "\n", "- You should use mixed precision if you use TPUs, or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.\n", "\n", "- You can use mixed precision with the following lines:\n", "\n", "  ```python\n", "  # On TPUs, use 'mixed_bfloat16' instead\n", "  mixed_precision.set_global_policy('mixed_float16')\n", "  ```\n", "\n", "- If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.\n", "- If you use a custom training loop with `mixed_float16`, in addition to the above lines, you need to wrap your optimizer with a `tf.keras.mixed_precision.LossScaleOptimizer`. Then call `optimizer.get_scaled_loss` to scale the loss, and `optimizer.get_unscaled_gradients` to unscale the gradients.\n", "- Double the training batch size if it does not reduce evaluation accuracy.\n", "- On GPUs, ensure most tensor dimensions are a multiple of $8$ to maximize performance.\n", "\n", "For an example of mixed precision using the `tf.keras.mixed_precision` API 
in action, check the [functions and classes related to training performance](https://github.com/tensorflow/models/blob/master/official/modeling/performance.py). Also check out the official models, such as [Transformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_encoder_block.py), for details.\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "mixed_precision.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 0 }