{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2022-12-14T21:09:20.568832Z", "iopub.status.busy": "2022-12-14T21:09:20.568366Z", "iopub.status.idle": "2022-12-14T21:09:20.572020Z", "shell.execute_reply": "2022-12-14T21:09:20.571498Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "qFdPvlXBOdUN" }, "source": [ "# 混合精度" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## 概述\n", "\n", "混合精度是指训练时在模型中同时使用 16 位和 32 位浮点类型,从而加快运行速度,减少内存使用的一种训练方法。通过让模型的某些部分保持使用 32 位类型以保持数值稳定性,可以缩短模型的单步用时,而在评估指标(如准确率)方面仍可以获得同等的训练效果。本指南介绍如何使用 Keras 混合精度 API 来加快模型速度。利用此 API 可以在现代 GPU 上将性能提高三倍以上,而在 TPU 上可以提高 60%。" ] }, { "cell_type": "markdown", "metadata": { "id": "3vsYi_bv7gS_" }, "source": [ "如今,大多数模型使用 float32 dtype,这种数据类型占用 32 位内存。但是,还有两种精度较低的 dtype,即 float16 和 bfloat16,它们都是占用 16 位内存。现代加速器使用 16 位 dtype 执行运算的速度更快,因为它们有执行 16 位计算的专用硬件,并且从内存中读取 16 位 dtype 的速度也更快。\n", "\n", "NVIDIA GPU 使用 float16 执行运算的速度比使用 float32 快,而 TPU 使用 bfloat16 执行运算的速度也比使用 float32 快。因此,在这些设备上应尽可能使用精度较低的 dtype。但是,出于对数值的要求,为了让模型训练获得相同的质量,一些变量和计算仍需使用 float32。利用 Keras 混合精度 API,float16 或 bfloat16 可以与 float32 混合使用,从而既可以获得 float16/bfloat16 的性能优势,也可以获得 float32 的数值稳定性。\n", "\n", "注:在本指南中,术语“数值稳定性”是指使用较低精度的 dtype(而不是较高精度的 dtype)对模型质量的影响。如果使用 float16 或 bfloat16 执行运算,则与使用 float32 执行运算相比,使用这些较低精度的 dtype 会导致模型获得的评估准确率或其他指标相对较低,那么我们就说这种运算“数值不稳定”。" ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## 设置" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:20.575426Z", "iopub.status.busy": "2022-12-14T21:09:20.575200Z", "iopub.status.idle": "2022-12-14T21:09:22.498402Z", "shell.execute_reply": "2022-12-14T21:09:22.497632Z" }, "id": "IqR2PQG4ZaZ0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 21:09:21.518797: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 21:09:21.518898: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", 
"2022-12-14 21:09:21.518908: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] } ], "source": [ "import tensorflow as tf\n", "\n", "from tensorflow import keras\n", "from tensorflow.keras import layers\n", "from tensorflow.keras import mixed_precision" ] }, { "cell_type": "markdown", "metadata": { "id": "814VXqdh8Q0r" }, "source": [ "## 支持的硬件\n", "\n", "虽然混合精度在大多数硬件上都可以运行,但是在最新的 NVIDIA GPU 和 Cloud TPU 上才能加速模型。NVIDIA GPU 支持混合使用 float16 和 float32,而 TPU 则支持混合使用 bfloat16 和 float32。\n", "\n", "在 NVIDIA GPU 中,计算能力为 7.0 或更高的 GPU 可以获得混合精度的最大性能优势,因为这些型号具有称为 Tensor 核心的特殊硬件单元,可以加速 float16 矩阵乘法和卷积运算。旧款 GPU 使用混合精度无法实现数学运算性能优势,不过可以节省内存和带宽,因此也可以在一定程度上提高速度。您可以在 NVIDIA 的 [CUDA GPU 网页](https://developer.nvidia.com/cuda-gpus)上查询 GPU 的计算能力。可以最大程度从混合精度受益的 GPU 示例包括 RTX GPU、V100 和 A100。" ] }, { "cell_type": "markdown", "metadata": { "id": "-q2hisD60F0_" }, "source": [ "注:如果在 Google Colab 中运行本指南中示例,则 GPU 运行时通常会连接 P100。P100 的计算能力为 6.0,预计速度提升不明显。\n", "\n", "您可以使用以下命令检查 GPU 类型。如果要使用此命令,必须安装 NVIDIA 驱动程序,否则会引发错误。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.503113Z", "iopub.status.busy": "2022-12-14T21:09:22.502346Z", "iopub.status.idle": "2022-12-14T21:09:22.662857Z", "shell.execute_reply": "2022-12-14T21:09:22.662044Z" }, "id": "j-Yzg_lfkoa_" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-60e44e96-c278-c603-6807-22566f25de74)\r\n", "GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-6702b791-ca8d-bd30-257a-f7397b6dfc84)\r\n", "GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-95b64f9f-e125-8b72-616a-849e27f562f7)\r\n", "GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-ab60c099-7b6d-b830-e8eb-4ed4bd54e0dc)\r\n" ] } ], "source": [ "!nvidia-smi -L" ] }, { "cell_type": "markdown", "metadata": { 
"id": "hu_pvZDN0El3" }, "source": [ "所有 Cloud TPU 均支持 bfloat16。\n", "\n", "即使在预计无法提升速度的 CPU 和旧款 GPU 上,混合精度 API 仍可以用于单元测试、调试或试用 API。不过,在 CPU 上,混合精度的运行速度会明显变慢。" ] }, { "cell_type": "markdown", "metadata": { "id": "HNOmvumB-orT" }, "source": [ "## 设置 dtype 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "54ecYY2Hn16E" }, "source": [ "要在 Keras 中使用混合精度,您需要创建一个 `tf.keras.mixed_precision.Policy`(通常称为 *dtype 策略*)。dtype 策略指定层运行时所使用的 dtype。在本指南中,您将从字符串 `'mixed_float16'` 构造策略,并将其设置为全局策略。这会导致随后创建的层使用 float16 和 float32 的混合精度。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.666888Z", "iopub.status.busy": "2022-12-14T21:09:22.666463Z", "iopub.status.idle": "2022-12-14T21:09:22.798341Z", "shell.execute_reply": "2022-12-14T21:09:22.797698Z" }, "id": "x3kElPVH-siO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING\n", "Your GPUs may run slowly with dtype policy mixed_float16 because they do not have compute capability of at least 7.0. Your GPUs:\n", " Tesla P100-PCIE-16GB, compute capability 6.0 (x4)\n", "See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.\n", "If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. 
This message will only be logged once\n" ] } ], "source": [ "policy = mixed_precision.Policy('mixed_float16')\n", "mixed_precision.set_global_policy(policy)" ] }, { "cell_type": "markdown", "metadata": { "id": "6ids1rT_UM5q" }, "source": [ "您也可以直接将字符串传递给 `set_global_policy`,实践中通常采用这种简写。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.802046Z", "iopub.status.busy": "2022-12-14T21:09:22.801489Z", "iopub.status.idle": "2022-12-14T21:09:22.804593Z", "shell.execute_reply": "2022-12-14T21:09:22.804022Z" }, "id": "6a8iNFoBUSqR" }, "outputs": [], "source": [ "# Equivalent to the two lines above\n", "mixed_precision.set_global_policy('mixed_float16')" ] }, { "cell_type": "markdown", "metadata": { "id": "oGAMaa0Ho3yk" }, "source": [ "该策略指定了层的两个重要方面:层计算所使用的 dtype 和层变量的 dtype。上面的代码创建了一个 `mixed_float16` 策略(即通过将字符串 `'mixed_float16'` 传递给其构造函数而构建的 `mixed_precision.Policy`)。凭借此策略,层可以使用 float16 计算和 float32 变量。计算使用 float16 来提高性能,而变量使用 float32 来确保数值稳定性。您可以直接在策略中查询这些属性。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.807970Z", "iopub.status.busy": "2022-12-14T21:09:22.807391Z", "iopub.status.idle": "2022-12-14T21:09:22.810943Z", "shell.execute_reply": "2022-12-14T21:09:22.810331Z" }, "id": "GQRbYm4f8p-k" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Compute dtype: float16\n", "Variable dtype: float32\n" ] } ], "source": [ "print('Compute dtype: %s' % policy.compute_dtype)\n", "print('Variable dtype: %s' % policy.variable_dtype)" ] }, { "cell_type": "markdown", "metadata": { "id": "MOFEcna28o4T" }, "source": [ "如前所述,在计算能力至少为 7.0 的 NVIDIA GPU 上,`mixed_float16` 策略可以大幅提升性能。在其他 GPU 和 CPU 上,该策略也可以运行,但可能无法提升性能。对于 TPU,则应使用 `mixed_bfloat16` 策略。" ] }, { "cell_type": "markdown", "metadata": { "id": "cAHpt128tVpK" }, "source": [ "## 构建模型" ] }, { "cell_type": "markdown", "metadata": { "id": "nB6ujaR8qMAy" }, "source": 
[ "接下来,我们开始构建一个简单的模型。过小的模型往往无法获得混合精度的优势,因为 TensorFlow 运行时的开销通常占据大部分执行时间,导致 GPU 的性能提升几乎可以忽略不计。因此,如果使用 GPU,我们会构建两个比较大的 `Dense` 层,每个层具有 4096 个单元。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:22.814484Z", "iopub.status.busy": "2022-12-14T21:09:22.814087Z", "iopub.status.idle": "2022-12-14T21:09:26.117589Z", "shell.execute_reply": "2022-12-14T21:09:26.116851Z" }, "id": "0DQM24hL_14Q" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The model will run with 4096 units on a GPU\n" ] } ], "source": [ "inputs = keras.Input(shape=(784,), name='digits')\n", "if tf.config.list_physical_devices('GPU'):\n", " print('The model will run with 4096 units on a GPU')\n", " num_units = 4096\n", "else:\n", " # Use fewer units on CPUs so the model finishes in a reasonable amount of time\n", " print('The model will run with 64 units on a CPU')\n", " num_units = 64\n", "dense1 = layers.Dense(num_units, activation='relu', name='dense_1')\n", "x = dense1(inputs)\n", "dense2 = layers.Dense(num_units, activation='relu', name='dense_2')\n", "x = dense2(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "2dezdcqnOXHk" }, "source": [ "每个层都有一条策略,默认情况下会使用全局策略。因此,每个 `Dense` 层都具有 `mixed_float16` 策略,这是因为之前已将 `mixed_float16` 设置为全局策略。这样,dense 层就会执行 float16 计算,并使用 float32 变量。为了执行 float16 计算,它们会将输入转换为 float16 类型,因此,输出也是 float16 类型。它们的变量是 float32 类型,在调用层时,它们会将变量转换为 float16 类型,从而避免 dtype 不匹配所引起的错误。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.121570Z", "iopub.status.busy": "2022-12-14T21:09:26.120955Z", "iopub.status.idle": "2022-12-14T21:09:26.124819Z", "shell.execute_reply": "2022-12-14T21:09:26.124238Z" }, "id": "kC58MzP4PEcC" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "x.dtype: float16\n", "dense1.kernel.dtype: float32\n" ] } ], "source": [ "print(dense1.dtype_policy)\n", "print('x.dtype: 
%s' % x.dtype.name)\n", "# 'kernel' is dense1's variable\n", "print('dense1.kernel.dtype: %s' % dense1.kernel.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "_WAZeqDyqZcb" }, "source": [ "接下来创建输出预测。通常,您可以按如下方法创建输出预测,但是对于 float16,其结果不一定具有数值稳定性。" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.128035Z", "iopub.status.busy": "2022-12-14T21:09:26.127526Z", "iopub.status.idle": "2022-12-14T21:09:26.143915Z", "shell.execute_reply": "2022-12-14T21:09:26.143267Z" }, "id": "ybBq1JDwNIbz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outputs dtype: float16\n" ] } ], "source": [ "# INCORRECT: softmax and model output will be float16, when it should be float32\n", "outputs = layers.Dense(10, activation='softmax', name='predictions')(x)\n", "print('Outputs dtype: %s' % outputs.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "D0gSWxc9NN7q" }, "source": [ "模型末尾的 softmax 激活值本应为 float32 类型。但由于 dtype 策略是 `mixed_float16`,softmax 激活通常会使用 float16 dtype 进行计算,并且会输出 float16 张量。\n", "\n", "这一问题可以通过分离 Dense 和 softmax 层,并将 `dtype='float32'` 传递至 softmax 层来解决。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.147032Z", "iopub.status.busy": "2022-12-14T21:09:26.146624Z", "iopub.status.idle": "2022-12-14T21:09:26.161597Z", "shell.execute_reply": "2022-12-14T21:09:26.161049Z" }, "id": "IGqCGn4BsODw" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Outputs dtype: float32\n" ] } ], "source": [ "# CORRECT: softmax and model output are float32\n", "x = layers.Dense(10, name='dense_logits')(x)\n", "outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)\n", "print('Outputs dtype: %s' % outputs.dtype.name)" ] }, { "cell_type": "markdown", "metadata": { "id": "tUdkY_DHsP8i" }, "source": [ "将 `dtype='float32'` 传递至 softmax 层的构造函数会将该层的 dtype 策略重写为 
`float32` 策略,从而由后者执行计算并保持变量为 float32 类型。同样,我们也可以传递 `dtype=mixed_precision.Policy('float32')`;层始终将 dtype 参数转换为策略。由于 `Activation` 层没有变量,因此会忽略该策略的变量 dtype,但是该策略的计算 dtype 为 float32,因此 softmax 和模型的输出也是 float32。\n", "\n", "您可以在模型中间添加 float16 类型的 softmax,但模型末尾的 softmax 应为 float32 类型。原因是,如果从 softmax 传递给损失函数的中间张量是 float16 或 bfloat16 类型,则会出现数值问题。\n", "\n", "如果您认为使用 float16 计算无法获得数值稳定性,则可以通过传递 `dtype='float32'`,将任何层的 dtype 重写为 float32 类型。但通常,只有模型的最后一层才需要这样重写,因为对大多数层来说,`mixed_float16` 和 `mixed_bfloat16` 的精度已经足够。\n", "\n", "即使模型不以 softmax 结尾,输出也仍应为 float32 类型。虽然对这一特定模型来说并非必需,但可以使用以下代码将模型输出转换为 float32 类型:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.165065Z", "iopub.status.busy": "2022-12-14T21:09:26.164405Z", "iopub.status.idle": "2022-12-14T21:09:26.169587Z", "shell.execute_reply": "2022-12-14T21:09:26.169018Z" }, "id": "dzVAoLI56jR8" }, "outputs": [], "source": [ "# The linear activation is an identity function. So this simply casts 'outputs'\n", "# to float32. 
In this particular case, 'outputs' is already float32 so this is a\n", "# no-op.\n", "outputs = layers.Activation('linear', dtype='float32')(outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "tpY4ZP7us5hA" }, "source": [ "接下来,完成并编译模型,并生成输入数据:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.172836Z", "iopub.status.busy": "2022-12-14T21:09:26.172470Z", "iopub.status.idle": "2022-12-14T21:09:26.540662Z", "shell.execute_reply": "2022-12-14T21:09:26.539919Z" }, "id": "g4OT3Z6kqYAL" }, "outputs": [], "source": [ "model = keras.Model(inputs=inputs, outputs=outputs)\n", "model.compile(loss='sparse_categorical_crossentropy',\n", " optimizer=keras.optimizers.RMSprop(),\n", " metrics=['accuracy'])\n", "\n", "(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()\n", "x_train = x_train.reshape(60000, 784).astype('float32') / 255\n", "x_test = x_test.reshape(10000, 784).astype('float32') / 255" ] }, { "cell_type": "markdown", "metadata": { "id": "0Sm8FJHegVRN" }, "source": [ "本示例将输入数据从 int8 强制转换为 float32。我们不转换为 float16 是因为在 CPU 上除以 255 时,float16 的运算速度比 float32 慢。在这种情况下,性能差距可以忽略不计,但一般来说,如果输入处理的数学运算需要在 CPU 上执行,则应使用 float32 类型。该模型的第一层会将输入转换为 float16,因为每一层都会将浮点输入强制转换为其计算 dtype。\n", "\n", "检索模型的初始权重。这样,以后就可以通过重新加载这些权重从头开始训练。" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.545068Z", "iopub.status.busy": "2022-12-14T21:09:26.544481Z", "iopub.status.idle": "2022-12-14T21:09:26.664167Z", "shell.execute_reply": "2022-12-14T21:09:26.663472Z" }, "id": "0UYs-u_DgiA5" }, "outputs": [], "source": [ "initial_weights = model.get_weights()" ] }, { "cell_type": "markdown", "metadata": { "id": "zlqz6eVKs9aU" }, "source": [ "## 使用 Model.fit 训练模型\n", "\n", "接下来,训练模型:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:26.668223Z", "iopub.status.busy": 
"2022-12-14T21:09:26.667653Z", "iopub.status.idle": "2022-12-14T21:09:34.048180Z", "shell.execute_reply": "2022-12-14T21:09:34.047337Z" }, "id": "hxI7-0ewmC0A" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1/5\n", "6/6 [==============================] - 3s 202ms/step - loss: 2.5948 - accuracy: 0.4021 - val_loss: 0.9373 - val_accuracy: 0.7832\n", "Epoch 2/5\n", "6/6 [==============================] - 1s 153ms/step - loss: 0.9973 - accuracy: 0.7110 - val_loss: 0.6015 - val_accuracy: 0.8052\n", "Epoch 3/5\n", "6/6 [==============================] - 1s 155ms/step - loss: 0.5519 - accuracy: 0.8236 - val_loss: 0.3905 - val_accuracy: 0.8788\n", "Epoch 4/5\n", "6/6 [==============================] - 1s 156ms/step - loss: 0.4389 - accuracy: 0.8602 - val_loss: 0.3258 - val_accuracy: 0.9003\n", "Epoch 5/5\n", "6/6 [==============================] - 1s 154ms/step - loss: 0.3266 - accuracy: 0.8994 - val_loss: 0.2298 - val_accuracy: 0.9333\n", "313/313 - 1s - loss: 0.2349 - accuracy: 0.9330 - 635ms/epoch - 2ms/step\n", "Test loss: 0.23491430282592773\n", "Test accuracy: 0.9330000281333923\n" ] } ], "source": [ "history = model.fit(x_train, y_train,\n", " batch_size=8192,\n", " epochs=5,\n", " validation_split=0.2)\n", "test_scores = model.evaluate(x_test, y_test, verbose=2)\n", "print('Test loss:', test_scores[0])\n", "print('Test accuracy:', test_scores[1])\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MPhJ9OPWt4x5" }, "source": [ "请注意,模型会在日志中打印每个步骤的时间:例如,“25ms/step”。第一个周期可能会变慢,因为 TensorFlow 会花一些时间来优化模型,但之后每个步骤的时间应当会稳定下来。\n", "\n", "如果在 Colab 中运行本指南,您可以将混合精度与 float32 的性能进行比较。为此,请在“设置 dtype 策略”部分将策略从 `mixed_float16` 更改为 `float32`,然后重新运行所有代码单元,直到此代码点。在计算能力至少为 7.0 的 GPU 上,您会发现每个步骤的时间大大增加,表明混合精度提升了模型的速度。在继续学习本指南之前,请确保将策略改回 `mixed_float16` 并重新运行代码单元。\n", "\n", "在计算能力至少为 8.0 的 GPU(Ampere GPU 及更高版本)上,使用混合精度时,与使用 float32 相比,您可能看不到本指南中小模型的性能提升。这是由于使用 [TensorFloat-32](https://tensorflow.google.cn/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution) 导致的,它会在 `tf.linalg.matmul` 等某些 float32 运算中自动使用较低精度的数学计算。使用 float32 时,TensorFloat-32 
会展现混合精度的一些性能优势。不过,在真实模型中,由于内存带宽节省和 TensorFloat-32 不支持的运算,您通常仍会看到混合精度的显著性能提升。\n", "\n", "如果在 TPU 上运行混合精度,您会发现与在 GPU(尤其是 Ampere 架构之前的 GPU)上运行混合精度相比,性能提升并不明显。这是因为即使默认 dtype 策略为 float32,TPU 也会在后台执行一些 bfloat16 运算。这类似于 Ampere GPU 默认使用 TensorFloat-32 的方式。在实际模型上使用混合精度时,与 Ampere GPU 相比,TPU 获得的性能提升通常较少。\n", "\n", "对于很多实际模型,使用混合精度时还可以将批次大小加倍而不会耗尽内存,因为 float16 张量只需要使用 float32 一半的内存。不过,这对本文中所讲的小模型毫无意义,因为您几乎可以使用任何 dtype 来运行该模型,而每个批次可以包含有 60,000 张图片的整个 MNIST 数据集。" ] }, { "cell_type": "markdown", "metadata": { "id": "mNKMXlCvHgHb" }, "source": [ "## 损失放大\n", "\n", "损失放大是 `tf.keras.Model.fit` 使用 `mixed_float16` 策略自动执行,从而避免数值下溢的一种技术。本部分介绍什么是损失放大,下一部分介绍如何将其与自定义训练循环一起使用。" ] }, { "cell_type": "markdown", "metadata": { "id": "1xQX62t2ow0g" }, "source": [ "### 下溢和溢出\n", "\n", "float16 数据类型的动态范围比 float32 窄。这意味着大于 $65504$ 的数值会因溢出而变为无穷大,小于 $6.0 \\times 10^{-8}$ 的数值则会因下溢而变成零。float32 和 bfloat16 的动态范围要大得多,因此一般不会出现下溢或溢出的问题。\n", "\n", "例如:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.052071Z", "iopub.status.busy": "2022-12-14T21:09:34.051324Z", "iopub.status.idle": "2022-12-14T21:09:34.061270Z", "shell.execute_reply": "2022-12-14T21:09:34.060663Z" }, "id": "CHmXRb-yRWbE" }, "outputs": [ { "data": { "text/plain": [ "inf" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = tf.constant(256, dtype='float16')\n", "(x ** 2).numpy() # Overflow" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.064560Z", "iopub.status.busy": "2022-12-14T21:09:34.063983Z", "iopub.status.idle": "2022-12-14T21:09:34.069174Z", "shell.execute_reply": "2022-12-14T21:09:34.068554Z" }, "id": "5unZLhN0RfQM" }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = tf.constant(1e-5, dtype='float16')\n", "(x ** 2).numpy() # Underflow" ] }, { 
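"cell_type": "markdown", "metadata": {}, "source": [ "补充说明(非本指南原文;NumPy 没有 bfloat16 类型,这里用 NumPy 的 IEEE float16,其语义与 TensorFlow 的 float16 一致,可直接验证上文提到的溢出与下溢边界):\n", "\n", "```python\n", "import numpy as np\n", "\n", "info = np.finfo(np.float16)\n", "print(info.max)   # float16 最大正规值 65504.0,超过它即溢出为 inf\n", "print(info.tiny)  # float16 最小正规值,约 6.1e-5\n", "\n", "print(np.float16(256) * np.float16(256))    # inf:65536 溢出\n", "print(np.float16(1e-5) * np.float16(1e-5))  # 0.0:1e-10 下溢\n", "```\n" ] }, { 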
"cell_type": "markdown", "metadata": { "id": "pUIbhQypRVe_" }, "source": [ "实际上,float16 极少出现溢出的情况。此外,在正向传递中,下溢的情形也十分罕见。但是,在反向传递中,梯度可能因下溢而变为零。损失放大就是一个防止出现这种下溢的技巧。" ] }, { "cell_type": "markdown", "metadata": { "id": "FAL5qij_oNqJ" }, "source": [ "### 损失放大概述\n", "\n", "损失放大的基本概念非常简单:只需将损失乘以某个大数字(如 $1024$),即可得到*损失放大*值。这会将梯度放大 $1024$ 倍,大大降低了发生下溢的几率。计算出最终梯度后,将其除以 $1024$ 即可得到正确值。\n", "\n", "该过程的伪代码是:\n", "\n", "```\n", "loss_scale = 1024\n", "loss = model(inputs)\n", "loss *= loss_scale\n", "# Assume `grads` are float32. You do not want to divide float16 gradients.\n", "grads = compute_gradient(loss, model.trainable_variables)\n", "grads /= loss_scale\n", "```\n", "\n", "选择合适的损失标度比较困难。如果损失标度太小,梯度可能仍会因下溢而变为零。如果太大,则会出现相反的问题:梯度可能因溢出而变为无穷大。\n", "\n", "为了解决这一问题,TensorFlow 会动态确定损失标度,因此,您不必手动选择。如果使用 `tf.keras.Model.fit`,则会自动完成损失放大,您不必做任何额外的工作。如果您使用自定义训练循环,则必须显式使用特殊的优化器封装容器 `tf.keras.mixed_precision.LossScaleOptimizer` 才能使用损失放大。下一部分会对此进行介绍。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yqzbn8Ks9Q98" }, "source": [ "## 使用自定义训练循环训练模型" ] }, { "cell_type": "markdown", "metadata": { "id": "CRANRZZ69nA7" }, "source": [ "到目前为止,您已经使用 `tf.keras.Model.fit` 训练了一个具有混合精度的 Keras 模型。接下来,您会将混合精度与自定义训练循环一起使用。如果您还不知道什么是自定义训练循环,请先阅读[自定义训练指南](../tutorials/customization/custom_training_walkthrough.ipynb)。" ] }, { "cell_type": "markdown", "metadata": { "id": "wXTaM8EEyEuo" }, "source": [ "使用混合精度运行自定义训练循环需要对使用 float32 运行训练的模型进行两方面的更改:\n", "\n", "1. 使用混合精度构建模型(已完成)\n", "2. 
如果使用 `mixed_float16`,则明确使用损失放大。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "M2zpp7_65mTZ" }, "source": [ "对于步骤 (2),您将使用 `tf.keras.mixed_precision.LossScaleOptimizer` 类,此类会封装优化器并应用损失放大。默认情况下,它会动态地确定损失标度,因此您不必自行选择。按如下方式构造一个 `LossScaleOptimizer`。" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.073113Z", "iopub.status.busy": "2022-12-14T21:09:34.072491Z", "iopub.status.idle": "2022-12-14T21:09:34.077961Z", "shell.execute_reply": "2022-12-14T21:09:34.077357Z" }, "id": "ogZN3rIH0vpj" }, "outputs": [], "source": [ "optimizer = keras.optimizers.RMSprop()\n", "optimizer = mixed_precision.LossScaleOptimizer(optimizer)" ] }, { "cell_type": "markdown", "metadata": { "id": "FVy5gnBqTE9z" }, "source": [ "如果您愿意,可以选择一个显式损失标度或以其他方式自定义损失放大行为,但强烈建议保留默认的损失放大行为,因为经过验证,它可以在所有已知模型上很好地工作。如果要自定义损失放大行为,请参阅 `tf.keras.mixed_precision.LossScaleOptimizer` 文档。" ] }, { "cell_type": "markdown", "metadata": { "id": "JZYEr5hA3MXZ" }, "source": [ "接下来,定义损失对象和 `tf.data.Dataset`:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.081473Z", "iopub.status.busy": "2022-12-14T21:09:34.080926Z", "iopub.status.idle": "2022-12-14T21:09:34.492096Z", "shell.execute_reply": "2022-12-14T21:09:34.491400Z" }, "id": "9cE7Mm533hxe" }, "outputs": [], "source": [ "loss_object = tf.keras.losses.SparseCategoricalCrossentropy()\n", "train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))\n", " .shuffle(10000).batch(8192))\n", "test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(8192)" ] }, { "cell_type": "markdown", "metadata": { "id": "4W0zxrxC3nww" }, "source": [ "接下来,定义训练步骤函数。您将使用损失放大优化器中的两个新方法来放大损失和缩小梯度:\n", "\n", "- `get_scaled_loss(loss)`:将损失值乘以损失标度值\n", "- `get_unscaled_gradients(gradients)`:获取一系列放大的梯度作为输入,并将每一个梯度除以损失标度,从而将其缩小为实际值\n", "\n", "为了防止梯度发生下溢,必须使用这些函数。随后,如果梯度全部没有出现 Inf 或 NaN 值,则 
`LossScaleOptimizer.apply_gradients` will apply those gradients. It will also update the loss scale, halving it if the gradients contained Inf or NaN values and potentially increasing it otherwise." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.496344Z", "iopub.status.busy": "2022-12-14T21:09:34.495682Z", "iopub.status.idle": "2022-12-14T21:09:34.500321Z", "shell.execute_reply": "2022-12-14T21:09:34.499703Z" }, "id": "V0vHlust4Rug" }, "outputs": [], "source": [ "@tf.function\n", "def train_step(x, y):\n", "  with tf.GradientTape() as tape:\n", "    predictions = model(x)\n", "    loss = loss_object(y, predictions)\n", "    scaled_loss = optimizer.get_scaled_loss(loss)\n", "  scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)\n", "  gradients = optimizer.get_unscaled_gradients(scaled_gradients)\n", "  optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n", "  return loss" ] }, { "cell_type": "markdown", "metadata": { "id": "rcFxEjia6YPQ" }, "source": [ "`LossScaleOptimizer` will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can quickly be determined. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality." ] }, { "cell_type": "markdown", "metadata": { "id": "IHIvKKhg4Y-G" }, "source": [ "Now, define the test step:\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.503786Z", "iopub.status.busy": "2022-12-14T21:09:34.503286Z", "iopub.status.idle": "2022-12-14T21:09:34.506519Z", "shell.execute_reply": "2022-12-14T21:09:34.505954Z" }, "id": "nyk_xiZf42Tt" }, "outputs": [], "source": [ "@tf.function\n", "def test_step(x):\n", "  return model(x, training=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "hBs98MZyhBOB" }, "source": [ "Reload the initial weights of the model, so you can retrain from scratch:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.509739Z", "iopub.status.busy": "2022-12-14T21:09:34.509269Z", "iopub.status.idle": "2022-12-14T21:09:34.604393Z", "shell.execute_reply": 
"2022-12-14T21:09:34.603809Z" }, "id": "jpzOe3WEhFUJ" }, "outputs": [], "source": [ "model.set_weights(initial_weights)" ] }, { "cell_type": "markdown", "metadata": { "id": "s9Pi1ADM47Ud" }, "source": [ "最后,运行自定义训练循环:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T21:09:34.608139Z", "iopub.status.busy": "2022-12-14T21:09:34.607533Z", "iopub.status.idle": "2022-12-14T21:09:41.631948Z", "shell.execute_reply": "2022-12-14T21:09:41.631232Z" }, "id": "N274tJ3e4_6t" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0: loss=2.159815549850464, test accuracy=0.7400000095367432\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 1: loss=0.9636000394821167, test accuracy=0.8058000206947327\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 2: loss=0.4238981604576111, test accuracy=0.8766999840736389\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 3: loss=0.3603671193122864, test accuracy=0.921999990940094\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 4: loss=0.3221040666103363, test accuracy=0.9355999827384949\n" ] } ], "source": [ "for epoch in range(5):\n", " epoch_loss_avg = tf.keras.metrics.Mean()\n", " test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(\n", " name='test_accuracy')\n", " for x, y in train_dataset:\n", " loss = train_step(x, y)\n", " epoch_loss_avg(loss)\n", " for x, y in test_dataset:\n", " predictions = test_step(x)\n", " test_accuracy.update_state(y, predictions)\n", " print('Epoch {}: loss={}, test accuracy={}'.format(epoch, epoch_loss_avg.result(), test_accuracy.result()))" ] }, { "cell_type": "markdown", "metadata": { "id": "d7daQKGerOFE" }, "source": [ "## GPU 性能提示\n", "\n", "下面是在 GPU 上使用混合精度时的一些性能提示。\n", "\n", "### 增大批次大小\n", "\n", "当使用混合精度时,如果不影响模型质量,可以尝试使用双倍批次大小运行。因为 float16 张量只使用一半内存,所以,您通常可以将批次大小增大一倍,而不会耗尽内存。增大批次大小通常可以提高训练吞吐量,即模型每秒可以运行的训练元素数量。\n", "\n", "### 
Ensuring GPU Tensor Cores are used\n", "\n", "As mentioned previously, modern NVIDIA GPUs use a special hardware unit called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8. In the examples below, an argument is bold if and only if it needs to be a multiple of 8 for Tensor Cores to be used.\n", "\n", "- tf.keras.layers.Dense(**units=64**)\n", "- tf.keras.layers.Conv2d(**filters=48**, kernel_size=7, stride=3)\n", "    - And similarly for other convolutional layers, such as tf.keras.layers.Conv3d\n", "- tf.keras.layers.LSTM(**units=64**)\n", "    - And similarly for other RNNs, such as tf.keras.layers.GRU\n", "- tf.keras.Model.fit(epochs=2, **batch_size=128**)\n", "\n", "You should try to use Tensor Cores when possible. If you want to learn more, the [NVIDIA deep learning performance guide](https://docs.nvidia.com/deeplearning/sdk/dl-performance-guide/index.html) describes the exact requirements for using Tensor Cores, as well as other Tensor Core-related performance information.\n", "\n", "### XLA\n", "\n", "XLA is a compiler that can further increase mixed precision performance, as well as float32 performance to a lesser extent. Refer to the [XLA guide](https://tensorflow.google.cn/xla) for details." ] }, { "cell_type": "markdown", "metadata": { "id": "2tFDX8fm6o_3" }, "source": [ "## Cloud TPU performance tips\n", "\n", "As with GPUs, you should try doubling your batch size, because bfloat16 tensors also use half the memory. Doubling the batch size may increase training throughput.\n", "\n", "TPUs do not require any other mixed precision-specific tuning to get optimal performance. They already require the use of XLA, and they benefit from having certain dimensions be multiples of $128$, but this applies equally to float32 as it does to mixed precision. Check the [Cloud TPU performance guide](https://cloud.google.com/tpu/docs/performance-guide) for general TPU performance tips, which apply to mixed precision as well as float32 tensors." ] }, { "cell_type": "markdown", "metadata": { "id": "--wSEU91wO9w" }, "source": [ "## Summary\n", "\n", "- You should use mixed precision if you use TPUs, or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.\n", "\n", "- You can use mixed precision with the following lines:\n", "\n", "  ```python\n", "  # On TPUs, use 'mixed_bfloat16' instead\n", "  mixed_precision.set_global_policy('mixed_float16')\n", "  ```\n", "\n", "- If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.\n", "- If you use a custom training loop with `mixed_float16`, in addition to the above lines, you need to wrap your optimizer with a `tf.keras.mixed_precision.LossScaleOptimizer`. Then call `optimizer.get_scaled_loss` to scale the loss, and `optimizer.get_unscaled_gradients` to unscale the gradients.\n", "- Double the training batch size if it does not reduce evaluation accuracy.\n", "- On GPUs, ensure most tensor dimensions are a multiple of $8$ to maximize performance.\n", "\n", "For an example of mixed precision using the `tf.keras.mixed_precision` API 
in action, check the [functions and classes related to training performance](https://github.com/tensorflow/models/blob/master/official/modeling/performance.py). Also check out the official models, such as [Transformer](https://github.com/tensorflow/models/blob/master/official/nlp/modeling/layers/transformer_encoder_block.py), for details.\n" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "mixed_precision.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 0 }