{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oL9KopJirB2g"
   },
   "source": [
    "##### Copyright 2018 The TensorFlow Authors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "cellView": "form",
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:43.555565Z",
     "iopub.status.busy": "2023-11-07T23:56:43.554954Z",
     "iopub.status.idle": "2023-11-07T23:56:43.559302Z",
     "shell.execute_reply": "2023-11-07T23:56:43.558625Z"
    },
    "id": "SKaX3Hd3ra6C"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AAK88XQ9Pm9N"
   },
   "source": [
    "# Unicode 字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0TD5ZrvEMbhZ"
   },
   "source": [
    "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://tensorflow.google.cn/tutorials/load_data/unicode\">\n",
    "    <img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\" />\n",
    "    在 tensorFlow.google.cn 上查看</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/tutorials/load_data/unicode.ipynb\">\n",
    "    <img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\" />\n",
    "    在 Google Colab 中运行</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/tutorials/load_data/unicode.ipynb\">\n",
    "    <img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\" />\n",
    "    在 GitHub 上查看源代码</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/tutorials/load_data/unicode.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\" />下载 notebook</a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LrHJrKYis06U"
   },
   "source": [
    "## 简介\n",
    "\n",
    "处理自然语言的模型通常使用不同的字符集来处理不同的语言。*Unicode* 是一种标准的编码系统，用于表示几乎所有语言的字符。每个字符使用 <code>0</code> 和 `0x10FFFF` 之间的唯一整数[码位](https://en.wikipedia.org/wiki/Code_point)进行编码。*Unicode 字符串*是由零个或更多码位组成的序列。\n",
    "\n",
    "本教程介绍了如何在 TensorFlow 中表示 Unicode 字符串，以及如何使用标准字符串运算的 Unicode 等效项对其进行操作。它会根据字符体系检测将 Unicode 字符串划分为不同词例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:43.563317Z",
     "iopub.status.busy": "2023-11-07T23:56:43.562744Z",
     "iopub.status.idle": "2023-11-07T23:56:46.215361Z",
     "shell.execute_reply": "2023-11-07T23:56:46.214317Z"
    },
    "id": "OIKHl5Lvn4gh"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-11-07 23:56:44.044088: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2023-11-07 23:56:44.044147: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2023-11-07 23:56:44.045783: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
     ]
    }
   ],
   "source": [
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "n-LkcI-vtWNj"
   },
   "source": [
    "## `tf.string` 数据类型\n",
    "\n",
    "您可以使用基本的 TensorFlow `tf.string` `dtype` 构建字节字符串张量。Unicode 字符串默认使用 UTF-8 编码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:46.219810Z",
     "iopub.status.busy": "2023-11-07T23:56:46.219339Z",
     "iopub.status.idle": "2023-11-07T23:56:48.563325Z",
     "shell.execute_reply": "2023-11-07T23:56:48.562576Z"
    },
    "id": "3yo-Qv6ntaFr"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \\xf0\\x9f\\x98\\x8a'>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.constant(u\"Thanks 😊\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2kA1ziG2tyCT"
   },
   "source": [
    "`tf.string` 张量可以容纳不同长度的字节字符串，因为字节字符串会被视为原子单元。字符串长度不包括在张量维度中。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.567477Z",
     "iopub.status.busy": "2023-11-07T23:56:48.566889Z",
     "iopub.status.idle": "2023-11-07T23:56:48.572214Z",
     "shell.execute_reply": "2023-11-07T23:56:48.571519Z"
    },
    "id": "eyINCmTztyyS"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TensorShape([2])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.constant([u\"You're\", u\"welcome!\"]).shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jsMPnjb6UDJ1"
   },
   "source": [
    "注：使用 Python 构造字符串时，v2 和 v3 对 Unicode 的处理方式有所不同。在 v2 中，Unicode 字符串用前缀“u”表示（如上所示）。在 v3 中，字符串默认使用 Unicode 编码。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hUFZ7B1Lk-uj"
   },
   "source": [
    "## 表示 Unicode\n",
    "\n",
    "在 TensorFlow 中有两种表示 Unicode 字符串的标准方式：\n",
    "\n",
    "- `string` 标量 - 使用已知[字符编码](https://en.wikipedia.org/wiki/Character_encoding)对码位序列进行编码。\n",
    "- `int32` 向量 - 每个位置包含单个码位。\n",
    "\n",
    "例如，以下三个值均表示 Unicode 字符串 `\"语言处理\"`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.576112Z",
     "iopub.status.busy": "2023-11-07T23:56:48.575409Z",
     "iopub.status.idle": "2023-11-07T23:56:48.580709Z",
     "shell.execute_reply": "2023-11-07T23:56:48.580031Z"
    },
    "id": "cjQIkfJWvC_u"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(), dtype=string, numpy=b'\\xe8\\xaf\\xad\\xe8\\xa8\\x80\\xe5\\xa4\\x84\\xe7\\x90\\x86'>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Unicode string, represented as a UTF-8 encoded string scalar.\n",
    "text_utf8 = tf.constant(u\"语言处理\")\n",
    "text_utf8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.584178Z",
     "iopub.status.busy": "2023-11-07T23:56:48.583519Z",
     "iopub.status.idle": "2023-11-07T23:56:48.589431Z",
     "shell.execute_reply": "2023-11-07T23:56:48.588742Z"
    },
    "id": "yQqcUECcvF2r"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(), dtype=string, numpy=b'\\x8b\\xed\\x8a\\x00Y\\x04t\\x06'>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Unicode string, represented as a UTF-16-BE encoded string scalar.\n",
    "text_utf16be = tf.constant(u\"语言处理\".encode(\"UTF-16-BE\"))\n",
    "text_utf16be"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.592886Z",
     "iopub.status.busy": "2023-11-07T23:56:48.592314Z",
     "iopub.status.idle": "2023-11-07T23:56:48.599532Z",
     "shell.execute_reply": "2023-11-07T23:56:48.598843Z"
    },
    "id": "ExdBr1t7vMuS"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Unicode string, represented as a vector of Unicode code points.\n",
    "text_chars = tf.constant([ord(char) for char in u\"语言处理\"])\n",
    "text_chars"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "B8czv4JNpBnZ"
   },
   "source": [
    "### 在不同表示之间进行转换\n",
    "\n",
    "TensorFlow 提供了在下列不同表示之间进行转换的运算：\n",
    "\n",
    "- `tf.strings.unicode_decode`：将编码的字符串标量转换为码位的向量。\n",
    "- `tf.strings.unicode_encode`：将码位的向量转换为编码的字符串标量。\n",
    "- `tf.strings.unicode_transcode`：将编码的字符串标量转换为其他编码。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.603109Z",
     "iopub.status.busy": "2023-11-07T23:56:48.602532Z",
     "iopub.status.idle": "2023-11-07T23:56:48.611138Z",
     "shell.execute_reply": "2023-11-07T23:56:48.610473Z"
    },
    "id": "qb-UQ_oLpAJg"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_decode(text_utf8,\n",
    "                          input_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.614199Z",
     "iopub.status.busy": "2023-11-07T23:56:48.613941Z",
     "iopub.status.idle": "2023-11-07T23:56:48.625675Z",
     "shell.execute_reply": "2023-11-07T23:56:48.624970Z"
    },
    "id": "kEBUcunnp-9n"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(), dtype=string, numpy=b'\\xe8\\xaf\\xad\\xe8\\xa8\\x80\\xe5\\xa4\\x84\\xe7\\x90\\x86'>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode(text_chars,\n",
    "                          output_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.629113Z",
     "iopub.status.busy": "2023-11-07T23:56:48.628517Z",
     "iopub.status.idle": "2023-11-07T23:56:48.634218Z",
     "shell.execute_reply": "2023-11-07T23:56:48.633575Z"
    },
    "id": "0MLhWcLZrph-"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(), dtype=string, numpy=b'\\x8b\\xed\\x8a\\x00Y\\x04t\\x06'>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_transcode(text_utf8,\n",
    "                             input_encoding='UTF8',\n",
    "                             output_encoding='UTF-16-BE')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QVeLeVohqN7I"
   },
   "source": [
    "### 批次维度\n",
    "\n",
    "解码多个字符串时，每个字符串中的字符数可能不相等。返回结果是 [`tf.RaggedTensor`](../../guide/ragged_tensor.ipynb)，其中最里面的维度的长度会根据每个字符串中的字符数而变化："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.637701Z",
     "iopub.status.busy": "2023-11-07T23:56:48.637158Z",
     "iopub.status.idle": "2023-11-07T23:56:48.643048Z",
     "shell.execute_reply": "2023-11-07T23:56:48.642388Z"
    },
    "id": "N2jVzPymr_Mm"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[104, 195, 108, 108, 111]\n",
      "[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]\n",
      "[71, 246, 246, 100, 110, 105, 103, 104, 116]\n",
      "[128522]\n"
     ]
    }
   ],
   "source": [
    "# A batch of Unicode strings, each represented as a UTF8-encoded string.\n",
    "batch_utf8 = [s.encode('UTF-8') for s in\n",
    "              [u'hÃllo',  u'What is the weather tomorrow',  u'Göödnight', u'😊']]\n",
    "batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,\n",
    "                                               input_encoding='UTF-8')\n",
    "for sentence_chars in batch_chars_ragged.to_list():\n",
    "  print(sentence_chars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iRh3n1hPsJ9v"
   },
   "source": [
    "您可以直接使用此 `tf.RaggedTensor`，也可以使用 `tf.RaggedTensor.to_tensor` 和 `tf.RaggedTensor.to_sparse` 方法将其转换为带有填充的密集 `tf.Tensor` 或 `tf.SparseTensor`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.646514Z",
     "iopub.status.busy": "2023-11-07T23:56:48.645877Z",
     "iopub.status.idle": "2023-11-07T23:56:48.652905Z",
     "shell.execute_reply": "2023-11-07T23:56:48.652196Z"
    },
    "id": "yz17yeSMsUid"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[   104    195    108    108    111     -1     -1     -1     -1     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1]\n",
      " [    87    104     97    116     32    105    115     32    116    104\n",
      "     101     32    119    101     97    116    104    101    114     32\n",
      "     116    111    109    111    114    114    111    119]\n",
      " [    71    246    246    100    110    105    103    104    116     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1]\n",
      " [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1\n",
      "      -1     -1     -1     -1     -1     -1     -1     -1]]\n"
     ]
    }
   ],
   "source": [
    "batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)\n",
    "print(batch_chars_padded.numpy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.655890Z",
     "iopub.status.busy": "2023-11-07T23:56:48.655616Z",
     "iopub.status.idle": "2023-11-07T23:56:48.659866Z",
     "shell.execute_reply": "2023-11-07T23:56:48.659146Z"
    },
    "id": "kBjsPQp3rhfm"
   },
   "outputs": [],
   "source": [
    "batch_chars_sparse = batch_chars_ragged.to_sparse()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GCCkZh-nwlbL"
   },
   "source": [
    "在对多个具有相同长度的字符串进行编码时，可以将 `tf.Tensor` 用作输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.663296Z",
     "iopub.status.busy": "2023-11-07T23:56:48.662743Z",
     "iopub.status.idle": "2023-11-07T23:56:48.690260Z",
     "shell.execute_reply": "2023-11-07T23:56:48.689566Z"
    },
    "id": "_lP62YUAwjK9"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]],\n",
    "                          output_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "w58CMRg9tamW"
   },
   "source": [
    "当对多个具有不同长度的字符串进行编码时，应将 `tf.RaggedTensor` 用作输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.694307Z",
     "iopub.status.busy": "2023-11-07T23:56:48.693648Z",
     "iopub.status.idle": "2023-11-07T23:56:48.698918Z",
     "shell.execute_reply": "2023-11-07T23:56:48.698251Z"
    },
    "id": "d7GtOtrltaMl"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(4,), dtype=string, numpy=\n",
       "array([b'h\\xc3\\x83llo', b'What is the weather tomorrow',\n",
       "       b'G\\xc3\\xb6\\xc3\\xb6dnight', b'\\xf0\\x9f\\x98\\x8a'], dtype=object)>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "T2Nh5Aj9xob3"
   },
   "source": [
    "如果您的张量具有填充或稀疏格式的多个字符串，请在调用 `unicode_encode` 之前将其转换为 `tf.RaggedTensor`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:48.702466Z",
     "iopub.status.busy": "2023-11-07T23:56:48.701838Z",
     "iopub.status.idle": "2023-11-07T23:56:49.108248Z",
     "shell.execute_reply": "2023-11-07T23:56:49.107481Z"
    },
    "id": "R2bYCYl0u-Ue"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(4,), dtype=string, numpy=\n",
       "array([b'h\\xc3\\x83llo', b'What is the weather tomorrow',\n",
       "       b'G\\xc3\\xb6\\xc3\\xb6dnight', b'\\xf0\\x9f\\x98\\x8a'], dtype=object)>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode(\n",
    "    tf.RaggedTensor.from_sparse(batch_chars_sparse),\n",
    "    output_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.111700Z",
     "iopub.status.busy": "2023-11-07T23:56:49.111425Z",
     "iopub.status.idle": "2023-11-07T23:56:49.954286Z",
     "shell.execute_reply": "2023-11-07T23:56:49.953539Z"
    },
    "id": "UlV2znh_u_zm"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tf.Tensor: shape=(4,), dtype=string, numpy=\n",
       "array([b'h\\xc3\\x83llo', b'What is the weather tomorrow',\n",
       "       b'G\\xc3\\xb6\\xc3\\xb6dnight', b'\\xf0\\x9f\\x98\\x8a'], dtype=object)>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode(\n",
    "    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),\n",
    "    output_encoding='UTF-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hQOOGkscvDpc"
   },
   "source": [
    "## Unicode 运算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NkmtsA_yvMB0"
   },
   "source": [
    "### 字符长度\n",
    "\n",
    "`tf.strings.length` 运算具有 `unit` 参数，该参数表示计算长度的方式。`unit` 默认为 `\"BYTE\"`，但也可以将其设置为其他值（例如 `\"UTF8_CHAR\"` 或 `\"UTF16_CHAR\"`），以确定每个已编码 `string` 中的 Unicode 码位数量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.958120Z",
     "iopub.status.busy": "2023-11-07T23:56:49.957859Z",
     "iopub.status.idle": "2023-11-07T23:56:49.964526Z",
     "shell.execute_reply": "2023-11-07T23:56:49.963778Z"
    },
    "id": "1ZzMe59mvLHr"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "11 bytes; 8 UTF-8 characters\n"
     ]
    }
   ],
   "source": [
    "# Note that the final character takes up 4 bytes in UTF8.\n",
    "thanks = u'Thanks 😊'.encode('UTF-8')\n",
    "num_bytes = tf.strings.length(thanks).numpy()\n",
    "num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()\n",
    "print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fHG85gxlvVU0"
   },
   "source": [
    "### 字符子字符串\n",
    "\n",
    "类似地，`tf.strings.substr` 运算会接受 \"`unit`\" 参数，并用它来确定 \"`pos`\" 和 \"`len`\" 参数包含的偏移类型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.967677Z",
     "iopub.status.busy": "2023-11-07T23:56:49.967441Z",
     "iopub.status.idle": "2023-11-07T23:56:49.972712Z",
     "shell.execute_reply": "2023-11-07T23:56:49.972096Z"
    },
    "id": "WlWRLV-4xWYq"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'\\xf0'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# default: unit='BYTE'. With len=1, we return a single byte.\n",
    "tf.strings.substr(thanks, pos=7, len=1).numpy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.975960Z",
     "iopub.status.busy": "2023-11-07T23:56:49.975477Z",
     "iopub.status.idle": "2023-11-07T23:56:49.980273Z",
     "shell.execute_reply": "2023-11-07T23:56:49.979560Z"
    },
    "id": "JfNUVDPwxkCS"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'\\xf0\\x9f\\x98\\x8a'\n"
     ]
    }
   ],
   "source": [
    "# Specifying unit='UTF8_CHAR', we return a single character, which in this case\n",
    "# is 4 bytes.\n",
    "print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "zJUEsVSyeIa3"
   },
   "source": [
    "### 拆分 Unicode 字符串\n",
    "\n",
    "`tf.strings.unicode_split` 运算会将 Unicode 字符串拆分为单个字符的子字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.983479Z",
     "iopub.status.busy": "2023-11-07T23:56:49.983021Z",
     "iopub.status.idle": "2023-11-07T23:56:49.993301Z",
     "shell.execute_reply": "2023-11-07T23:56:49.992686Z"
    },
    "id": "dDjkh5G1ejMt"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\\xf0\\x9f\\x98\\x8a'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_split(thanks, 'UTF-8').numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HQqEEZEbdG9O"
   },
   "source": [
    "### 字符的字节偏移量\n",
    "\n",
    "为了将 `tf.strings.unicode_decode` 生成的字符张量与原始字符串对齐，了解每个字符开始位置的偏移量很有用。方法 `tf.strings.unicode_decode_with_offsets` 与 `unicode_decode` 类似，不同的是它会返回包含每个字符起始偏移量的第二张量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:49.996362Z",
     "iopub.status.busy": "2023-11-07T23:56:49.996120Z",
     "iopub.status.idle": "2023-11-07T23:56:50.003223Z",
     "shell.execute_reply": "2023-11-07T23:56:50.002424Z"
    },
    "id": "Cug7cmwYdowd"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At byte offset 0: codepoint 127880\n",
      "At byte offset 4: codepoint 127881\n",
      "At byte offset 8: codepoint 127882\n"
     ]
    }
   ],
   "source": [
    "codepoints, offsets = tf.strings.unicode_decode_with_offsets(u\"🎈🎉🎊\", 'UTF-8')\n",
    "\n",
    "for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):\n",
    "  print(\"At byte offset {}: codepoint {}\".format(offset, codepoint))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2ZnCNxOvx66T"
   },
   "source": [
    "## Unicode 字符体系"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nRRHqkqNyGZ6"
   },
   "source": [
    "每个 Unicode 码位都属于某个码位集合，这些集合被称作[字符体系](https://en.wikipedia.org/wiki/Script_%28Unicode%29)。某个字符的字符体系有助于确定该字符可能所属的语言。例如，已知 'Б' 属于西里尔字符体系，表明包含该字符的现代文本很可能来自某个斯拉夫语种（如俄语或乌克兰语）。\n",
    "\n",
    "TensorFlow 提供了 `tf.strings.unicode_script` 运算来确定某一给定码位使用的是哪个字符体系。字符体系代码是对应于[国际 Unicode 组件](http://site.icu-project.org/home) (ICU) [`UScriptCode`](http://icu-project.org/apiref/icu4c/uscript_8h.html) 值的 `int32` 值。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.006495Z",
     "iopub.status.busy": "2023-11-07T23:56:50.006239Z",
     "iopub.status.idle": "2023-11-07T23:56:50.010853Z",
     "shell.execute_reply": "2023-11-07T23:56:50.010229Z"
    },
    "id": "K7DeYHrRyFPy"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[17  8]\n"
     ]
    }
   ],
   "source": [
    "uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']\n",
    "\n",
    "print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2fW992a1lIY6"
   },
   "source": [
    "`tf.strings.unicode_script` 运算还可以应用于码位的多维 `tf.Tensor` 或 `tf.RaggedTensor`："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.014063Z",
     "iopub.status.busy": "2023-11-07T23:56:50.013818Z",
     "iopub.status.idle": "2023-11-07T23:56:50.018227Z",
     "shell.execute_reply": "2023-11-07T23:56:50.017571Z"
    },
    "id": "uR7b8meLlFnp"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tf.RaggedTensor [[25, 25, 25, 25, 25],\n",
      " [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25,\n",
      "  0, 25, 25, 25, 25, 25, 25, 25, 25]                                      ,\n",
      " [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>\n"
     ]
    }
   ],
   "source": [
    "print(tf.strings.unicode_script(batch_chars_ragged))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "mx7HEFpBzEsB"
   },
   "source": [
    "## 示例：简单分词\n",
    "\n",
    "分词是将文本拆分为类似单词的单元的任务。当使用空格字符分隔单词时，这通常很容易，但是某些语言（如中文和日语）不使用空格，而某些语言（如德语）中存在长复合词，必须进行拆分才能分析其含义。在网页文本中，不同语言和字符体系常常混合在一起，例如“NY株価”（纽约证券交易所）。\n",
    "\n",
    "我们可以利用字符体系的变化进行粗略分词（不实现任何 ML 模型），从而估算词边界。这对类似上面“NY株価”示例的字符串都有效。这种方法对大多数使用空格的语言也都有效，因为各种字符体系中的空格字符都归类为 USCRIPT_COMMON，这是一种特殊的字符体系代码，不同于任何实际文本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.021617Z",
     "iopub.status.busy": "2023-11-07T23:56:50.021387Z",
     "iopub.status.idle": "2023-11-07T23:56:50.024529Z",
     "shell.execute_reply": "2023-11-07T23:56:50.023894Z"
    },
    "id": "grsvFiC4BoPb"
   },
   "outputs": [],
   "source": [
    "# dtype: string; shape: [num_sentences]\n",
    "#\n",
    "# The sentences to process.  Edit this line to try out different inputs!\n",
    "sentence_texts = [u'Hello, world.', u'世界こんにちは']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CapnbShuGU8i"
   },
   "source": [
    "首先，我们将句子解码为字符码位，然后查找每个字符的字符体系标识符。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.028353Z",
     "iopub.status.busy": "2023-11-07T23:56:50.027686Z",
     "iopub.status.idle": "2023-11-07T23:56:50.033545Z",
     "shell.execute_reply": "2023-11-07T23:56:50.032903Z"
    },
    "id": "ReQVcDQh1MB8"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46],\n",
      " [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>\n",
      "<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],\n",
      " [17, 17, 20, 20, 20, 20, 20]]>\n"
     ]
    }
   ],
   "source": [
    "# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
    "#\n",
    "# sentence_char_codepoint[i, j] is the codepoint for the j'th character in\n",
    "# the i'th sentence.\n",
    "sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')\n",
    "print(sentence_char_codepoint)\n",
    "\n",
    "# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]\n",
    "#\n",
    "# sentence_char_scripts[i, j] is the unicode script of the j'th character in\n",
    "# the i'th sentence.\n",
    "sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)\n",
    "print(sentence_char_script)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O2fapF5UGcUc"
   },
   "source": [
    "接下来，我们使用这些字符体系标识符来确定添加词边界的位置。我们在每个句子的开头添加一个词边界；如果某个字符与前一个字符属于不同的字符体系，也为该字符添加词边界。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.036958Z",
     "iopub.status.busy": "2023-11-07T23:56:50.036462Z",
     "iopub.status.idle": "2023-11-07T23:56:50.131880Z",
     "shell.execute_reply": "2023-11-07T23:56:50.131051Z"
    },
    "id": "7v5W6MOr1Rlc"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)\n"
     ]
    }
   ],
   "source": [
    "# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]\n",
    "#\n",
    "# sentence_char_starts_word[i, j] is True if the j'th character in the i'th\n",
    "# sentence is the start of a word.\n",
    "sentence_char_starts_word = tf.concat(\n",
    "    [tf.fill([sentence_char_script.nrows(), 1], True),\n",
    "     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],\n",
    "    axis=1)\n",
    "\n",
    "# dtype: int64; shape: [num_words]\n",
    "#\n",
    "# word_starts[i] is the index of the character that starts the i'th word (in\n",
    "# the flattened list of characters from all sentences).\n",
    "word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)\n",
    "print(word_starts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LAwh-1QkGuC9"
   },
   "source": [
    "然后，我们可以使用这些起始偏移量来构建 `RaggedTensor`，它包含了所有批次的单词列表："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.135186Z",
     "iopub.status.busy": "2023-11-07T23:56:50.134916Z",
     "iopub.status.idle": "2023-11-07T23:56:50.145700Z",
     "shell.execute_reply": "2023-11-07T23:56:50.145016Z"
    },
    "id": "bNiA1O_eBBCL"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46],\n",
      " [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>\n"
     ]
    }
   ],
   "source": [
    "# dtype: int32; shape: [num_words, (num_chars_per_word)]\n",
    "#\n",
    "# word_char_codepoint[i, j] is the codepoint for the j'th character in the\n",
    "# i'th word.\n",
    "word_char_codepoint = tf.RaggedTensor.from_row_starts(\n",
    "    values=sentence_char_codepoint.values,\n",
    "    row_starts=word_starts)\n",
    "print(word_char_codepoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "66a2ZnYmG2ao"
   },
   "source": [
    "最后，我们可以将词码位 `RaggedTensor` 划分回句子中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.149166Z",
     "iopub.status.busy": "2023-11-07T23:56:50.148528Z",
     "iopub.status.idle": "2023-11-07T23:56:50.166730Z",
     "shell.execute_reply": "2023-11-07T23:56:50.166050Z"
    },
    "id": "NCfwcqLSEjZb"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]],\n",
      " [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>\n"
     ]
    }
   ],
   "source": [
    "# dtype: int64; shape: [num_sentences]\n",
    "#\n",
    "# sentence_num_words[i] is the number of words in the i'th sentence.\n",
    "sentence_num_words = tf.reduce_sum(\n",
    "    tf.cast(sentence_char_starts_word, tf.int64),\n",
    "    axis=1)\n",
    "\n",
    "# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]\n",
    "#\n",
    "# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character\n",
    "# in the j'th word in the i'th sentence.\n",
    "sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(\n",
    "    values=word_char_codepoint,\n",
    "    row_lengths=sentence_num_words)\n",
    "print(sentence_word_char_codepoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xWaX8WcbHyqY"
   },
   "source": [
    "为了使最终结果更易于阅读，我们可以将其重新编码为 UTF-8 字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-11-07T23:56:50.169905Z",
     "iopub.status.busy": "2023-11-07T23:56:50.169648Z",
     "iopub.status.idle": "2023-11-07T23:56:50.175039Z",
     "shell.execute_reply": "2023-11-07T23:56:50.174389Z"
    },
    "id": "HSivquOgFr3C"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[b'Hello', b', ', b'world', b'.'],\n",
       " [b'\\xe4\\xb8\\x96\\xe7\\x95\\x8c',\n",
       "  b'\\xe3\\x81\\x93\\xe3\\x82\\x93\\xe3\\x81\\xab\\xe3\\x81\\xa1\\xe3\\x81\\xaf']]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "oL9KopJirB2g"
   ],
   "name": "unicode.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}