{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "ACbjNjyO4f_8" }, "source": [ "##### Copyright 2019 The TensorFlow Hub Authors.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:28:58.132180Z", "iopub.status.busy": "2021-08-13T20:28:58.131598Z", "iopub.status.idle": "2021-08-13T20:28:58.134681Z", "shell.execute_reply": "2021-08-13T20:28:58.134173Z" }, "id": "MCM50vaM4jiK" }, "outputs": [], "source": [ "# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "# ==============================================================================" ] }, { "cell_type": "markdown", "metadata": { "id": "9qOVy-_vmuUP" }, "source": [ "# Semantic Search with Approximate Nearest Neighbors and Text Embeddings\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "7Hks9F5qq6m2" }, "source": [ "本教程演示如何在给定输入数据的情况下,从 [TensorFlow Hub](https://tfhub.dev) (TF-Hub) 模块生成嵌入向量,并使用提取的嵌入向量构建近似最近邻 (ANN) 索引。随后,可以将该索引用于实时相似度匹配和检索。\n", "\n", "在处理包含大量数据的语料库时,通过扫描整个存储库实时查找与给定查询最相似的条目来执行精确匹配的效率不高。因此,我们使用一种近似相似度匹配算法。利用这种算法,我们在查找精确的最近邻匹配时会牺牲一点准确率,但是可以显著提高速度。\n", "\n", "在本教程中,我们将展示一个示例,在新闻标题语料库上进行实时文本搜索,以查找与查询最相似的标题。与关键字搜索不同,此过程会捕获在文本嵌入向量中编码的语义相似度。\n", "\n", "本教程的操作步骤如下:\n", "\n", "1. 下载样本数据\n", "2. 使用 TF-Hub 模块为数据生成嵌入向量\n", "3. 为嵌入向量构建 ANN 索引\n", "4. 使用索引进行相似度匹配\n", "\n", "我们将 [Apache Beam](https://beam.apache.org/documentation/programming-guide/) 与 [TensorFlow Transform](https://tensorflow.google.cn/tfx/tutorials/transform/simple) (TF-Transform) 结合使用,从 TF-Hub 模块生成嵌入向量。我们还使用 Spotify 的 [ANNOY](https://github.com/spotify/annoy) 库来构建近似最近邻索引。您可以在此 [Github 仓库](https://github.com/erikbern/ann-benchmarks)中找到 ANN 框架的基准测试。\n", "\n", "本教程使用 TensorFlow 1.0,并且仅适用于 TF-Hub 中的 TF1 [Hub 模块](https://tensorflow.google.cn/hub/tf1_hub_module)。请参阅本教程更新后的 [TF2 版本](https://github.com/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb)。" ] }, { "cell_type": "markdown", "metadata": { "id": "Q0jr0QK9qO5P" }, "source": [ "## 设置" ] }, { "cell_type": "markdown", "metadata": { "id": "whMRj9qeqed4" }, "source": [ "安装所需的库。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:28:58.145234Z", "iopub.status.busy": "2021-08-13T20:28:58.144592Z", "iopub.status.idle": "2021-08-13T20:29:22.216844Z", "shell.execute_reply": "2021-08-13T20:29:22.217184Z" }, "id": "qmXkLPoaqS--" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ 
"\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] } ], "source": [ "!pip install -q apache_beam\n", "!pip install -q 'scikit_learn~=0.23.0' # For gaussian_random_matrix.\n", "!pip install -q annoy" ] }, { "cell_type": "markdown", "metadata": { "id": "A-vBZiCCqld0" }, "source": [ "导入所需的库。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:29:22.222866Z", "iopub.status.busy": "2021-08-13T20:29:22.220177Z", "iopub.status.idle": "2021-08-13T20:29:25.434891Z", "shell.execute_reply": "2021-08-13T20:29:25.435270Z" }, "id": "6NTYbdWcseuK" }, "outputs": [], "source": [ "import os\n", "import sys\n", "import pathlib\n", "import pickle\n", "from collections import namedtuple\n", "from datetime import datetime\n", "\n", "import numpy as np\n", "import apache_beam as beam\n", "import annoy\n", "from sklearn.random_projection import gaussian_random_matrix\n", "\n", "import tensorflow.compat.v1 as tf\n", "import tensorflow_hub as hub" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:29:25.446651Z", "iopub.status.busy": "2021-08-13T20:29:25.445842Z", "iopub.status.idle": "2021-08-13T20:30:00.007430Z", "shell.execute_reply": "2021-08-13T20:30:00.007946Z" }, "id": "_GF0GnLqGdPQ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the 
'/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] } ], "source": [ "# TFT needs to be installed afterwards\n", "!pip install -q tensorflow_transform==0.24\n", "import tensorflow_transform as tft\n", "import tensorflow_transform.beam as tft_beam" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:00.013939Z", "iopub.status.busy": "2021-08-13T20:30:00.013272Z", "iopub.status.idle": "2021-08-13T20:30:00.016125Z", "shell.execute_reply": "2021-08-13T20:30:00.015627Z" }, "id": "tx0SZa6-7b-f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TF version: 2.6.0\n", "TF-Hub version: 0.12.0\n", "TF-Transform version: 0.24.0\n", "Apache Beam version: 2.31.0\n" ] } ], "source": [ "print('TF version: {}'.format(tf.__version__))\n", "print('TF-Hub version: {}'.format(hub.__version__))\n", "print('TF-Transform version: {}'.format(tft.__version__))\n", "print('Apache Beam version: {}'.format(beam.__version__))" ] }, { "cell_type": "markdown", "metadata": { "id": "P6Imq876rLWx" }, "source": [ "## 1. Download sample data\n", "\n", "The [A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) dataset contains news headlines published over a period of 15 years by the reputable Australian Broadcasting Corporation (ABC). It is a summarised historical record of noteworthy events around the world from early 2003 to the end of 2017, with a more granular focus on Australia.\n", "\n", "**Format**: Tab-separated two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:00.026864Z", "iopub.status.busy": "2021-08-13T20:30:00.025665Z", "iopub.status.idle": "2021-08-13T20:30:06.353838Z", "shell.execute_reply": "2021-08-13T20:30:06.353250Z" }, "id": "OpF57n8e5C9D" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-08-13 20:30:00-- https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true\r\n", "Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 
" ] }, { "name": "stdout", "output_type": "stream", "text": [ "54.162.175.159, 72.44.40.54, 18.211.119.52\r\n", "Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|54.162.175.159|:443... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "connected.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "HTTP request sent, awaiting response... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "200 OK\r\n", "Length: 57600231 (55M) [text/tab-separated-values]\r\n", "Saving to: ‘raw.tsv’\r\n", "\r\n", "\r", "raw.tsv 0%[ ] 0 --.-KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 0%[ ] 97.56K 263KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 0%[ ] 417.56K 563KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 3%[ ] 1.67M 1.50MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 9%[> ] 5.34M 4.07MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 17%[==> ] 9.34M 5.60MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 25%[====> ] 14.08M 7.53MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 33%[=====> ] 18.31M 8.23MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 41%[=======> ] 23.02M 9.49MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 49%[========> ] 27.28M 9.81MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 58%[==========> ] 32.00M 10.7MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 65%[============> ] 36.25M 10.9MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 74%[=============> ] 40.70M 11.5MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 82%[===============> ] 45.22M 12.2MB/s eta 2s " ] }, { "name": 
"stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 90%[=================> ] 49.59M 13.3MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 98%[==================> ] 54.20M 14.5MB/s eta 0s \r", "raw.tsv 100%[===================>] 54.93M 14.7MB/s in 4.5s \r\n", "\r\n", "2021-08-13 20:30:05 (12.3 MB/s) - ‘raw.tsv’ saved [57600231/57600231]\r\n", "\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1103664 raw.tsv\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "publish_date\theadline_text\r\n", "20030219\t\"aba decides against community broadcasting licence\"\r\n", "20030219\t\"act fire witnesses must be aware of defamation\"\r\n", "20030219\t\"a g calls for infrastructure protection summit\"\r\n", "20030219\t\"air nz staff in aust strike for pay rise\"\r\n", "20030219\t\"air nz strike to affect australian travellers\"\r\n", "20030219\t\"ambitious olsson wins triple jump\"\r\n", "20030219\t\"antic delighted with record breaking barca\"\r\n", "20030219\t\"aussie qualifier stosur wastes four memphis match\"\r\n", "20030219\t\"aust addresses un security council over iraq\"\r\n" ] } ], "source": [ "!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv\n", "!wc -l raw.tsv\n", "!head raw.tsv" ] }, { "cell_type": "markdown", "metadata": { "id": "Reeoc9z0zTxJ" }, "source": [ "为了简单起见,我们仅保留标题文本并移除发布日期。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:06.362458Z", "iopub.status.busy": "2021-08-13T20:30:06.361795Z", "iopub.status.idle": "2021-08-13T20:30:07.384036Z", "shell.execute_reply": "2021-08-13T20:30:07.383347Z" }, "id": "INPWa4upv_yJ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'corpus': No such file or directory\r\n" ] } ], "source": [ "!rm -r corpus\n", "!mkdir corpus\n", "\n", "with open('corpus/text.txt', 'w') as out_file:\n", " 
with open('raw.tsv', 'r') as in_file:\n", "    for line in in_file:\n", "      headline = line.split('\\t')[1].strip().strip('\"')\n", "      out_file.write(headline+\"\\n\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.395609Z", "iopub.status.busy": "2021-08-13T20:30:07.394904Z", "iopub.status.idle": "2021-08-13T20:30:07.508773Z", "shell.execute_reply": "2021-08-13T20:30:07.508299Z" }, "id": "5-oedX40z6o2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "severe storms forecast for nye in south east queensland\r\n", "snake catcher pleads for people not to kill reptiles\r\n", "south australia prepares for party to welcome new year\r\n", "strikers cool off the heat with big win in adelaide\r\n", "stunning images from the sydney to hobart yacht\r\n", "the ashes smiths warners near miss liven up boxing day test\r\n", "timelapse: brisbanes new year fireworks\r\n", "what 2017 meant to the kids of australia\r\n", "what the papodopoulos meeting may mean for ausus\r\n", "who is george papadopoulos the former trump campaign aide\r\n" ] } ], "source": [ "!tail corpus/text.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "ls0Zh7kYz3PM" }, "source": [ "## Helper function to load a TF-Hub module" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.515528Z", "iopub.status.busy": "2021-08-13T20:30:07.514857Z", "iopub.status.idle": "2021-08-13T20:30:07.517190Z", "shell.execute_reply": "2021-08-13T20:30:07.516685Z" }, "id": "vSt_jmyKz3Xp" }, "outputs": [], "source": [ "def load_module(module_url):\n", "  embed_module = hub.Module(module_url)\n", "  placeholder = tf.placeholder(dtype=tf.string)\n", "  embed = embed_module(placeholder)\n", "  session = tf.Session()\n", "  session.run([tf.global_variables_initializer(), tf.tables_initializer()])\n", "  print('TF-Hub module is loaded.')\n", "\n", "  def _embeddings_fn(sentences):\n", "    
computed_embeddings = session.run(\n", "        embed, feed_dict={placeholder: sentences})\n", "    return computed_embeddings\n", "\n", "  return _embeddings_fn" ] }, { "cell_type": "markdown", "metadata": { "id": "2AngMtH50jNb" }, "source": [ "## 2. Generate embeddings for the data\n", "\n", "In this tutorial, we use the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/2) to generate embeddings for the headline data. The sentence embeddings can then be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam and TF-Transform." ] }, { "cell_type": "markdown", "metadata": { "id": "F_DvXnDB1pEX" }, "source": [ "### Embedding extraction method" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.522547Z", "iopub.status.busy": "2021-08-13T20:30:07.521908Z", "iopub.status.idle": "2021-08-13T20:30:07.524250Z", "shell.execute_reply": "2021-08-13T20:30:07.523849Z" }, "id": "yL7OEY1E0A35" }, "outputs": [], "source": [ "encoder = None\n", "\n", "def embed_text(text, module_url, random_projection_matrix):\n", "  # Beam will run this function in different processes that need to\n", "  # import hub and load embed_fn (if not previously loaded)\n", "  global encoder\n", "  if not encoder:\n", "    encoder = hub.Module(module_url)\n", "  embedding = encoder(text)\n", "  if random_projection_matrix is not None:\n", "    # Perform random projection for the embedding\n", "    embedding = tf.matmul(\n", "        embedding, tf.cast(random_projection_matrix, embedding.dtype))\n", "  return embedding\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_don5gXy9D59" }, "source": [ "### Make the TFT preprocess_fn method" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.529667Z", "iopub.status.busy": "2021-08-13T20:30:07.528982Z", "iopub.status.idle": "2021-08-13T20:30:07.531019Z", "shell.execute_reply": "2021-08-13T20:30:07.531380Z" }, "id": "fwYlrzzK9ECE" }, "outputs": [], "source": [ "def make_preprocess_fn(module_url, random_projection_matrix=None):\n", "  '''Makes a tft 
preprocess_fn'''\n", "\n", "  def _preprocess_fn(input_features):\n", "    '''tft preprocess_fn'''\n", "    text = input_features['text']\n", "    # Generate the embedding for the input text\n", "    embedding = embed_text(text, module_url, random_projection_matrix)\n", "\n", "    output_features = {\n", "        'text': text,\n", "        'embedding': embedding\n", "    }\n", "\n", "    return output_features\n", "\n", "  return _preprocess_fn" ] }, { "cell_type": "markdown", "metadata": { "id": "SQ492LN7A-NZ" }, "source": [ "### Create dataset metadata" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.536351Z", "iopub.status.busy": "2021-08-13T20:30:07.535552Z", "iopub.status.idle": "2021-08-13T20:30:07.537587Z", "shell.execute_reply": "2021-08-13T20:30:07.537927Z" }, "id": "d2D4332VA-2V" }, "outputs": [], "source": [ "def create_metadata():\n", "  '''Creates metadata for the raw data'''\n", "  from tensorflow_transform.tf_metadata import dataset_metadata\n", "  from tensorflow_transform.tf_metadata import schema_utils\n", "  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}\n", "  schema = schema_utils.schema_from_feature_spec(feature_spec)\n", "  metadata = dataset_metadata.DatasetMetadata(schema)\n", "  return metadata" ] }, { "cell_type": "markdown", "metadata": { "id": "5zlSLPzRBm6H" }, "source": [ "### Beam pipeline" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.545619Z", "iopub.status.busy": "2021-08-13T20:30:07.544994Z", "iopub.status.idle": "2021-08-13T20:30:07.547037Z", "shell.execute_reply": "2021-08-13T20:30:07.547396Z" }, "id": "jCGUIB172m2G" }, "outputs": [], "source": [ "def run_hub2emb(args):\n", "  '''Runs the embedding generation pipeline'''\n", "\n", "  options = beam.options.pipeline_options.PipelineOptions(**args)\n", "  args = namedtuple(\"options\", args.keys())(*args.values())\n", "\n", "  raw_metadata = create_metadata()\n", "  converter 
= tft.coders.CsvCoder(\n", "      column_names=['text'], schema=raw_metadata.schema)\n", "\n", "  with beam.Pipeline(args.runner, options=options) as pipeline:\n", "    with tft_beam.Context(args.temporary_dir):\n", "      # Read the sentences from the input file\n", "      sentences = (\n", "          pipeline\n", "          | 'Read sentences from files' >> beam.io.ReadFromText(\n", "              file_pattern=args.data_dir)\n", "          | 'Convert to dictionary' >> beam.Map(converter.decode)\n", "      )\n", "\n", "      sentences_dataset = (sentences, raw_metadata)\n", "      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)\n", "      # Generate the embeddings for the sentence using the TF-Hub module\n", "      embeddings_dataset, _ = (\n", "          sentences_dataset\n", "          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)\n", "      )\n", "\n", "      embeddings, transformed_metadata = embeddings_dataset\n", "      # Write the embeddings to TFRecords files\n", "      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(\n", "          file_path_prefix='{}/emb'.format(args.output_dir),\n", "          file_name_suffix='.tfrecords',\n", "          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))" ] }, { "cell_type": "markdown", "metadata": { "id": "uHbq4t2gCDAG" }, "source": [ "### Generate the random projection weight matrix\n", "\n", "[Random projection](https://en.wikipedia.org/wiki/Random_projection) is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).\n", "\n", "Reducing the dimensionality of the embeddings with random projection shortens the time needed to build and query the ANN index.\n", "\n", "In this tutorial, we use [Gaussian random projection](https://en.wikipedia.org/wiki/Random_projection#Gaussian_random_projection) from the [Scikit-learn](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) library." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.552502Z", "iopub.status.busy": "2021-08-13T20:30:07.551852Z", "iopub.status.idle": "2021-08-13T20:30:07.554075Z", 
"shell.execute_reply": "2021-08-13T20:30:07.553676Z" }, "id": "T1aYPeOUCDIP" }, "outputs": [], "source": [ "def generate_random_projection_weights(original_dim, projected_dim):\n", " random_projection_matrix = None\n", " if projected_dim and original_dim > projected_dim:\n", " random_projection_matrix = gaussian_random_matrix(\n", " n_components=projected_dim, n_features=original_dim).T\n", " print(\"A Gaussian random weight matrix was creates with shape of {}\".format(random_projection_matrix.shape))\n", " print('Storing random projection matrix to disk...')\n", " with open('random_projection_matrix', 'wb') as handle:\n", " pickle.dump(random_projection_matrix, \n", " handle, protocol=pickle.HIGHEST_PROTOCOL)\n", " \n", " return random_projection_matrix" ] }, { "cell_type": "markdown", "metadata": { "id": "CHxZX2Z3Nk64" }, "source": [ "### 设置参数\n", "\n", "如果要使用原始嵌入向量空间构建索引而不进行随机投影,请将 `projected_dim` 参数设置为 `None`。请注意,这会减慢高维嵌入向量的索引编制步骤。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-08-13T20:30:07.557899Z", "iopub.status.busy": "2021-08-13T20:30:07.557238Z", "iopub.status.idle": "2021-08-13T20:30:07.559668Z", "shell.execute_reply": "2021-08-13T20:30:07.559205Z" }, "id": "feMVXFL0NlIM" }, "outputs": [], "source": [ "module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2' #@param {type:\"string\"}\n", "projected_dim = 64 #@param {type:\"number\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "On-MbzD922kb" }, "source": [ "### 运行流水线" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:07.565965Z", "iopub.status.busy": "2021-08-13T20:30:07.565250Z", "iopub.status.idle": "2021-08-13T20:30:32.958911Z", "shell.execute_reply": "2021-08-13T20:30:32.959368Z" }, "id": "Y3I1Wv4i21yY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no 
variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:30:26.966963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:26.976093: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:26.977097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:26.979352: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2021-08-13 20:30:26.979875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:26.980866: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:26.981900: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "2021-08-13 20:30:27.530786: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:27.531809: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:27.532757: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:27.533694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TF-Hub module is loaded.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "A Gaussian random weight matrix was creates with shape of (512, 64)\n", "Storing random projection matrix to disk...\n", "Pipeline args are set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/plain": [ "{'job_name': 'hub2emb-210813-203032',\n", " 'runner': 'DirectRunner',\n", " 'batch_size': 1024,\n", " 'data_dir': 'corpus/*.txt',\n", " 'output_dir': PosixPath('/tmp/tmpkesovm9_'),\n", " 'temporary_dir': PosixPath('/tmp/tmpqpz2pkha'),\n", " 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',\n", " 'random_projection_matrix': 
array([[-2.97584214e-01, 5.90328172e-02, 7.48115269e-02, ...,\n", " -1.42816723e-01, -2.40606602e-01, 5.00410557e-02],\n", " [ 1.80695381e-01, -9.91138130e-02, 5.89191257e-02, ...,\n", " 7.68998767e-03, -3.91882684e-02, 1.71986674e-01],\n", " [ 4.96522147e-02, -2.27708372e-04, -2.94756524e-02, ...,\n", " 6.39973185e-02, 1.11058183e-01, -3.29520942e-03],\n", " ...,\n", " [ 1.58865772e-01, -7.22440178e-02, 9.41307834e-02, ...,\n", " 1.09094549e-01, 4.02851134e-03, -7.77274763e-02],\n", " [-8.11898743e-02, -4.25131494e-03, -2.09521004e-01, ...,\n", " 3.53013693e-02, 5.40856036e-03, -1.84767115e-01],\n", " [ 7.55975990e-02, -1.03924361e-01, -3.53450446e-01, ...,\n", " -1.40783240e-01, -1.23743172e-01, 5.55453961e-02]])}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tempfile\n", "\n", "output_dir = pathlib.Path(tempfile.mkdtemp())\n", "temporary_dir = pathlib.Path(tempfile.mkdtemp())\n", "\n", "g = tf.Graph()\n", "with g.as_default():\n", "  original_dim = load_module(module_url)(['']).shape[1]\n", "  random_projection_matrix = None\n", "\n", "  if projected_dim:\n", "    random_projection_matrix = generate_random_projection_weights(\n", "        original_dim, projected_dim)\n", "\n", "args = {\n", "    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),\n", "    'runner': 'DirectRunner',\n", "    'batch_size': 1024,\n", "    'data_dir': 'corpus/*.txt',\n", "    'output_dir': output_dir,\n", "    'temporary_dir': temporary_dir,\n", "    'module_url': module_url,\n", "    'random_projection_matrix': random_projection_matrix,\n", "}\n", "\n", "print(\"Pipeline args are set.\")\n", "args" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:30:32.987251Z", "iopub.status.busy": "2021-08-13T20:30:32.986200Z", "iopub.status.idle": "2021-08-13T20:33:01.303576Z", "shell.execute_reply": "2021-08-13T20:33:01.303118Z" }, "id": "iS9obmeP4ZOA" }, "outputs": [ { "name": 
"stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Running pipeline...\n" ] }, { "data": { "application/javascript": [ "\n", " if (typeof window.interactive_beam_jquery == 'undefined') {\n", " var jqueryScript = document.createElement('script');\n", " jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n", " jqueryScript.type = 'text/javascript';\n", " jqueryScript.onload = function() {\n", " var datatableScript = document.createElement('script');\n", " datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n", " datatableScript.type = 'text/javascript';\n", " datatableScript.onload = function() {\n", " window.interactive_beam_jquery = jQuery.noConflict(true);\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }\n", " document.head.appendChild(datatableScript);\n", " };\n", " document.head.appendChild(jqueryScript);\n", " } else {\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 
\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "ename": "ModuleNotFoundError", "evalue": "No module named 'pyarrow.vendored'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasAPIShim._check_import\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasAPIShim._import_pandas\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pyarrow.vendored'" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal'\n", "Traceback (most recent call last):\n", " File 
\"pyarrow/pandas-shim.pxi\", line 110, in pyarrow.lib._PandasAPIShim._check_import\n", " File \"pyarrow/pandas-shim.pxi\", line 56, in pyarrow.lib._PandasAPIShim._import_pandas\n", "ModuleNotFoundError: No module named 'pyarrow.vendored'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:30:34.260887: W tensorflow/core/common_runtime/graph_constructor.cc:1511] Importing a graph with a lower producer version 26 into an existing graph with producer version 808. Shape inference will have run different parts of the graph with different producer versions.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:30:35.866330: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:35.866868: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:35.867284: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:35.867742: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:35.868131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value 
(-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:35.868477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets added to graph.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets added to graph.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:No assets to write.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:No assets to write.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:SavedModel written to: /tmp/tmpqpz2pkha/tftransform_tmp/8c319032fd204c15a7484c4003fb2e0e/saved_model.pb\n" 
] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:SavedModel written to: /tmp/tmpqpz2pkha/tftransform_tmp/8c319032fd204c15a7484c4003fb2e0e/saved_model.pb\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use ref() instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use ref() instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/tmp/tmpjloz3vzz.json', '--HistoryManager.hist_file=:memory:']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.options.pipeline_options:Discarding invalid overrides: {'batch_size': 1024, 'data_dir': 'corpus/*.txt', 'output_dir': PosixPath('/tmp/tmpkesovm9_'), 'temporary_dir': PosixPath('/tmp/tmpqpz2pkha'), 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2', 'random_projection_matrix': array([[-2.97584214e-01, 5.90328172e-02, 7.48115269e-02, ...,\n", " -1.42816723e-01, -2.40606602e-01, 5.00410557e-02],\n", " [ 1.80695381e-01, -9.91138130e-02, 5.89191257e-02, ...,\n", " 7.68998767e-03, -3.91882684e-02, 1.71986674e-01],\n", " [ 4.96522147e-02, -2.27708372e-04, -2.94756524e-02, ...,\n", " 6.39973185e-02, 1.11058183e-01, -3.29520942e-03],\n", " ...,\n", " [ 1.58865772e-01, -7.22440178e-02, 9.41307834e-02, ...,\n", " 1.09094549e-01, 4.02851134e-03, -7.77274763e-02],\n", " [-8.11898743e-02, -4.25131494e-03, -2.09521004e-01, ...,\n", " 3.53013693e-02, 5.40856036e-03, -1.84767115e-01],\n", " [ 7.55975990e-02, -1.03924361e-01, -3.53450446e-01, ...,\n", " -1.40783240e-01, -1.23743172e-01, 5.55453961e-02]])}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:30:40.090969: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:40.091471: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value 
(-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:40.091794: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:40.092228: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:40.092582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:40.092861: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:30:46.963424: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:46.963926: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:46.964231: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:46.964663: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 
20:30:46.965020: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:30:46.965304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 36s, sys: 6.8 s, total: 2min 43s\n", "Wall time: 2min 28s\n", "Pipeline is done.\n" ] } ], "source": [ "!rm -r {output_dir}\n", "!rm -r {temporary_dir}\n", "\n", "print(\"Running pipeline...\")\n", "%time run_hub2emb(args)\n", "print(\"Pipeline is done.\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:33:01.336797Z", "iopub.status.busy": "2021-08-13T20:33:01.307842Z", "iopub.status.idle": "2021-08-13T20:33:01.481689Z", "shell.execute_reply": "2021-08-13T20:33:01.482051Z" }, "id": "JAwOo7gQWvVd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emb-00000-of-00001.tfrecords\r\n" ] } ], "source": [ "!ls {output_dir}" ] }, { "cell_type": "markdown", "metadata": { "id": "HVnee4e6U90u" }, "source": [ "Read some of the generated embeddings..." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:33:01.489628Z", "iopub.status.busy": "2021-08-13T20:33:01.488699Z", "iopub.status.idle": "2021-08-13T20:33:01.502056Z", "shell.execute_reply": "2021-08-13T20:33:01.502433Z" }, "id": "-K7pGXlXOj1N" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From 
/tmp/ipykernel_30829/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use eager execution and: \n", "`tf.data.TFRecordDataset(path)`\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmp/ipykernel_30829/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use eager execution and: \n", "`tf.data.TFRecordDataset(path)`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Embedding dimensions: 64\n", "[b'headline_text']: [ 0.00262176 -0.04697324 0.13821325 0.0233497 -0.02620244 -0.10388613\n", " 0.21338759 -0.02209234 -0.17964238 0.09275205]\n", "Embedding dimensions: 64\n", "[b'aba decides against community broadcasting licence']: [ 0.05847587 -0.07534308 -0.20445269 0.06922759 0.11247684 -0.00068962\n", " -0.06814004 0.05918114 -0.056692 -0.056681 ]\n", "Embedding dimensions: 64\n", "[b'act fire witnesses must be aware of defamation']: [ 0.19353217 0.02340996 -0.14971143 0.06321372 0.17323506 -0.02091776\n", " 0.18536443 -0.09348775 -0.0891809 -0.00271657]\n", "Embedding dimensions: 64\n", "[b'a g calls for infrastructure protection summit']: [-0.04273582 -0.01241589 0.13310218 -0.10297301 0.23529018 -0.07574648\n", " -0.14112787 0.17638578 -0.14110327 0.10147867]\n", "Embedding dimensions: 64\n", "[b'air nz staff in aust strike for pay rise']: [ 0.08407466 -0.19696297 0.0993693 -0.00812466 0.20142363 0.01679755\n", " -0.11149033 0.0273495 -0.08338891 -0.06217552]\n" ] } ], "source": [ "import itertools\n", "\n", "embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')\n", "sample = 5\n", "record_iterator = tf.io.tf_record_iterator(path=embed_file)\n", "for string_record in itertools.islice(record_iterator, sample):\n", "  example = tf.train.Example()\n", "  example.ParseFromString(string_record)\n", "  text = example.features.feature['text'].bytes_list.value\n", "  embedding = np.array(example.features.feature['embedding'].float_list.value)\n", "  print(\"Embedding dimensions: {}\".format(embedding.shape[0]))\n", "  print(\"{}: {}\".format(text, embedding[:10]))\n" ] },
{ "cell_type": "markdown", "metadata": { "id": "agGoaMSgY8wN" }, "source": [ "## 3. Build the ANN Index for the Embeddings\n", "\n", "[ANNOY](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings that searches for points in space close to a given query point. It also creates large read-only, file-based data structures that are memory-mapped. It is built by [Spotify](https://www.spotify.com) and used there for music recommendations." ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:33:01.513524Z", "iopub.status.busy": "2021-08-13T20:33:01.512518Z", "iopub.status.idle": "2021-08-13T20:33:01.514585Z", "shell.execute_reply": "2021-08-13T20:33:01.514939Z" }, "id": "UcPDspU3WjgH" }, "outputs": [], "source": [ "def build_index(embedding_files_pattern, index_filename, vector_length,\n", "    metric='angular', num_trees=100):\n", "  '''Builds an ANNOY index'''\n", "\n", "  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)\n", "  # Mapping between the item and its identifier in the index\n", "  mapping = {}\n", "\n", "  embed_files = tf.gfile.Glob(embedding_files_pattern)\n", "  print('Found {} embedding file(s).'.format(len(embed_files)))\n", "\n", "  item_counter = 0\n", "  for f, embed_file in enumerate(embed_files):\n", "    print('Loading embeddings in file {} of {}...'.format(\n", "        f+1, len(embed_files)))\n", "    record_iterator = tf.io.tf_record_iterator(\n", "        path=embed_file)\n", "\n", "    for string_record in record_iterator:\n", "      example = tf.train.Example()\n", "      example.ParseFromString(string_record)\n", "      text = example.features.feature['text'].bytes_list.value[0].decode(\"utf-8\")\n", "      mapping[item_counter] = text\n", "      embedding = np.array(\n", "          example.features.feature['embedding'].float_list.value)\n", "      annoy_index.add_item(item_counter, embedding)\n", "      item_counter += 1\n", "      if item_counter % 100000 == 0:\n", "        print('{} items loaded to the index'.format(item_counter))\n", "\n", "  print('A total of {} items added to the index'.format(item_counter))\n", "\n", "  print('Building the index with {} trees...'.format(num_trees))\n", "  annoy_index.build(n_trees=num_trees)\n", "  print('Index is successfully built.')\n", "\n", "  print('Saving index to disk...')\n", "  annoy_index.save(index_filename)\n", "  print('Index is saved to disk.')\n", "  print(\"Index file size: {} GB\".format(\n", "      round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))\n", "  annoy_index.unload()\n", "\n", "  print('Saving mapping to disk...')\n", "  with open(index_filename + '.mapping', 'wb') as handle:\n", "    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "  print('Mapping is saved to disk.')\n", "  print(\"Mapping file size: {} MB\".format(\n", "      round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:33:01.549278Z", "iopub.status.busy": "2021-08-13T20:33:01.520570Z", "iopub.status.idle": "2021-08-13T20:34:36.960506Z", "shell.execute_reply": "2021-08-13T20:34:36.959958Z" }, "id": "AgyOQhUq6FNE" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'index': No such file or directory\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'index.mapping': No such file or directory\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 1 embedding file(s).\n", "Loading embeddings in file 1 of 1...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "100000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "200000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "300000 
items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "400000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "500000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "600000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "700000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "800000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "900000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1000000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1100000 items loaded to the index\n", "A total of 1103664 items added to the index\n", "Building the index with 100 trees...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Index is successfully built.\n", "Saving index to disk...\n", "Index is saved to disk.\n", "Index file size: 1.68 GB\n", "Saving mapping to disk...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Mapping is saved to disk.\n", "Mapping file size: 50.61 MB\n", "CPU times: user 6min 4s, sys: 4.39 s, total: 6min 9s\n", "Wall time: 1min 35s\n" ] } ], "source": [ "embedding_files = \"{}/emb-*.tfrecords\".format(output_dir)\n", "embedding_dimension = projected_dim\n", "index_filename = \"index\"\n", "\n", "!rm {index_filename}\n", "!rm {index_filename}.mapping\n", "\n", "%time build_index(embedding_files, index_filename, embedding_dimension)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:34:36.966244Z", "iopub.status.busy": "2021-08-13T20:34:36.965597Z", "iopub.status.idle": "2021-08-13T20:34:37.146669Z", "shell.execute_reply": "2021-08-13T20:34:37.145948Z" }, "id": "Ic31Tm5cgAd5" }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "corpus\tindex.mapping\t\t raw.tsv\r\n", "index\trandom_projection_matrix semantic_approximate_nearest_neighbors.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": { "id": "maGxDl8ufP-p" }, "source": [ "## 4. Use the Index for Similarity Matching\n", "\n", "Now we can use the ANN index to find news headlines that are semantically close to an input query." ] }, { "cell_type": "markdown", "metadata": { "id": "_dIs8W78fYPp" }, "source": [ "### Load the index and the mapping files" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:34:37.222912Z", "iopub.status.busy": "2021-08-13T20:34:37.152569Z", "iopub.status.idle": "2021-08-13T20:34:37.523033Z", "shell.execute_reply": "2021-08-13T20:34:37.522552Z" }, "id": "jlTTrbQHayvb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annoy index is loaded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/kbuilder/.local/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The default argument for metric will be removed in future version of Annoy. 
Please pass metric='angular' explicitly.\n", "  \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Mapping file is loaded.\n" ] } ], "source": [ "index = annoy.AnnoyIndex(embedding_dimension, metric='angular')\n", "index.load(index_filename, prefault=True)\n", "print('Annoy index is loaded.')\n", "with open(index_filename + '.mapping', 'rb') as handle:\n", "  mapping = pickle.load(handle)\n", "print('Mapping file is loaded.')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "y6liFMSUh08J" }, "source": [ "### Similarity matching method" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:34:37.528481Z", "iopub.status.busy": "2021-08-13T20:34:37.527833Z", "iopub.status.idle": "2021-08-13T20:34:37.530235Z", "shell.execute_reply": "2021-08-13T20:34:37.529800Z" }, "id": "mUxjTag8hc16" }, "outputs": [], "source": [ "def find_similar_items(embedding, num_matches=5):\n", "  '''Finds similar items to a given embedding in the ANN index'''\n", "  ids = index.get_nns_by_vector(\n", "      embedding, num_matches, search_k=-1, include_distances=False)\n", "  items = [mapping[i] for i in ids]\n", "  return items" ] }, { "cell_type": "markdown", "metadata": { "id": "hjerNpmZja0A" }, "source": [ "### Extract an embedding from a given query" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:34:37.536467Z", "iopub.status.busy": "2021-08-13T20:34:37.535850Z", "iopub.status.idle": "2021-08-13T20:34:43.636272Z", "shell.execute_reply": "2021-08-13T20:34:43.636671Z" }, "id": "a0IIXzfBjZ19" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading the TF-Hub module...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no 
variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-13 20:34:39.584652: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:34:39.585218: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:34:39.585533: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:34:39.586016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:34:39.586389: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-13 20:34:39.586691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TF-Hub module is loaded.\n", "TF-Hub module is loaded.\n", "Loading random projection matrix...\n", "random projection matrix is loaded.\n" ] } ], "source": [ "# Load the TF-Hub module\n", "print(\"Loading the TF-Hub module...\")\n", "g = tf.Graph()\n", "with g.as_default():\n", "  embed_fn = load_module(module_url)\n", "print(\"TF-Hub module is loaded.\")\n", "\n", "random_projection_matrix = None\n", "if 
os.path.exists('random_projection_matrix'):\n", "  print(\"Loading random projection matrix...\")\n", "  with open('random_projection_matrix', 'rb') as handle:\n", "    random_projection_matrix = pickle.load(handle)\n", "  print('random projection matrix is loaded.')\n", "\n", "def extract_embeddings(query):\n", "  '''Generates the embedding for the query'''\n", "  query_embedding = embed_fn([query])[0]\n", "  if random_projection_matrix is not None:\n", "    query_embedding = query_embedding.dot(random_projection_matrix)\n", "  return query_embedding" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2021-08-13T20:34:43.641385Z", "iopub.status.busy": "2021-08-13T20:34:43.640781Z", "iopub.status.idle": "2021-08-13T20:34:43.866533Z", "shell.execute_reply": "2021-08-13T20:34:43.867258Z" }, "id": "kCoCNROujEIO" }, "outputs": [ { "data": { "text/plain": [ "array([-0.04585221, -0.14095478, 0.22087142, -0.17864118, -0.02164789,\n", " -0.06688423, -0.2533522 , 0.21252237, -0.04564023, 0.19766541])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_embeddings(\"Hello Machine Learning!\")[:10]" ] }, { "cell_type": "markdown", "metadata": { "id": "nE_Q60nCk_ZB" }, "source": [ "### Enter a query to find the most similar items" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-08-13T20:34:43.872334Z", "iopub.status.busy": "2021-08-13T20:34:43.871203Z", "iopub.status.idle": "2021-08-13T20:34:43.892228Z", "shell.execute_reply": "2021-08-13T20:34:43.893041Z" }, "id": "wC0uLjvfk5nB" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating embedding for the query...\n", "CPU times: user 10.3 ms, sys: 31 ms, total: 41.3 ms\n", "Wall time: 5.22 ms\n", "\n", "Finding relevant items in the index...\n", "CPU times: user 6.04 ms, sys: 210 µs, total: 6.25 ms\n", "Wall time: 789 µs\n", "\n", "Results:\n", "=========\n", 
"confronting global challenges\n", "confidence boost in tasmanian economy\n", "global bid to exploit food technologys potential\n", "global response\n", "global financial woes spark local fears\n", "challenges to austs future\n", "economic downturn hampers aids battle\n", "environment centre getting global reputation\n", "aus markets down on global fears\n", "globalisation research wins society award\n" ] } ], "source": [ "#@title { run: \"auto\" }\n", "query = \"confronting global challenges\" #@param {type:\"string\"}\n", "print(\"Generating embedding for the query...\")\n", "%time query_embedding = extract_embeddings(query)\n", "\n", "print(\"\")\n", "print(\"Finding relevant items in the index...\")\n", "%time items = find_similar_items(query_embedding, 10)\n", "\n", "print(\"\")\n", "print(\"Results:\")\n", "print(\"=========\")\n", "for item in items:\n", "  print(item)" ] }, { "cell_type": "markdown", "metadata": { "id": "wwtMtyOeDKwt" }, "source": [ "## Learn more\n", "\n", "You can learn more about TensorFlow at [tensorflow.org](https://tensorflow.google.cn/) and see the TF-Hub API documentation at [tensorflow.org/hub](https://tensorflow.google.cn/hub/). You can also find the available TensorFlow Hub modules at [tfhub.dev](https://tfhub.dev/), including more text embedding modules and image feature vector modules.\n", "\n", "Also check out the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/), Google's fast-paced, practical introduction to machine learning." ] } ], "metadata": { "colab": { "collapsed_sections": [ "ls0Zh7kYz3PM", "_don5gXy9D59", "SQ492LN7A-NZ" ], "name": "semantic_approximate_nearest_neighbors.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }