{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "ACbjNjyO4f_8" }, "source": [ "##### Copyright 2019 The TensorFlow Hub Authors.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:01:10.752807Z", "iopub.status.busy": "2021-08-14T06:01:10.752134Z", "iopub.status.idle": "2021-08-14T06:01:10.756230Z", "shell.execute_reply": "2021-08-14T06:01:10.755677Z" }, "id": "MCM50vaM4jiK" }, "outputs": [], "source": [ "# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# http://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License.\n", "# ==============================================================================" ] }, { "cell_type": "markdown", "metadata": { "id": "9qOVy-_vmuUP" }, "source": [ "# Semantic Search with Approximate Nearest Neighbors and Text Embeddings\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "7Hks9F5qq6m2" }, "source": [ "このチュートリアルでは、[TensorFlow Hub](https://tfhub.dev)(TF-Hub)が提供する入力データから埋め込みを生成し、抽出された埋め込みを使用して最近傍(ANN)インデックスを構築する方法を説明します。構築されたインデックスは、リアルタイムに類似性の一致と検索を行うために使用できます。\n", "\n", "大規模なコーパスのデータを取り扱う場合、特定のクエリに対して最も類似するアイテムをリアルタイムで見つけるために、レポジトリ全体をスキャンして完全一致を行うというのは、効率的ではありません。そのため、おおよその類似性一致アルゴリズムを使用することで、正確な最近傍の一致を見つける際の精度を少しだけ犠牲にし、速度を大幅に向上させることができます。\n", "\n", "このチュートリアルでは、ニュースの見出しのコーパスに対してリアルタイムテキスト検索を行い、クエリに最も類似する見出しを見つけ出す例を示します。この検索はキーワード検索とは異なり、テキスト埋め込みにエンコードされた意味的類似性をキャプチャします。\n", "\n", "このチュートリアルの手順は次のとおりです。\n", "\n", "1. サンプルデータをダウンロードする。\n", "2. TF-Hub モジュールを使用して、データの埋め込みを生成する。\n", "3. 埋め込みの ANN インデックスを構築する。\n", "4. インデックスを使って、類似性の一致を実施する。\n", "\n", "TF-Hub モジュールから埋め込みを生成するには、[TensorFlow Transform](https://beam.apache.org/documentation/programming-guide/)(TF-Transform)を使った [Apache Beam](https://www.tensorflow.org/tfx/tutorials/transform/simple) を使用します。また、最近傍インデックスの構築には、Spotify の [ANNOY](https://github.com/spotify/annoy) ライブラリを使用します。ANN フレームワークのベンチマークは、こちらの [Github リポジトリ](https://github.com/erikbern/ann-benchmarks)をご覧ください。\n", "\n", "このチュートリアルでは TensorFlow 1.0 を使用し、TF1 の [Hub モジュール](https://www.tensorflow.org/hub/tf1_hub_module)のみと連携します。更新版は、[このチュートリアルの TF2 バージョン](https://github.com/tensorflow/hub/blob/master/examples/colab/tf2_semantic_approximate_nearest_neighbors.ipynb)をご覧ください。" ] }, { "cell_type": "markdown", "metadata": { "id": "Q0jr0QK9qO5P" }, "source": [ "## セットアップ" ] }, { "cell_type": "markdown", "metadata": { "id": "whMRj9qeqed4" }, "source": [ "必要なライブラリをインストールします。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:01:10.767111Z", "iopub.status.busy": "2021-08-14T06:01:10.763677Z", "iopub.status.idle": "2021-08-14T06:01:33.677559Z", "shell.execute_reply": "2021-08-14T06:01:33.676989Z" }, "id": "qmXkLPoaqS--" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are 
using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] } ], "source": [ "!pip install -q apache_beam\n", "!pip install -q 'scikit_learn~=0.23.0' # For gaussian_random_matrix.\n", "!pip install -q annoy" ] }, { "cell_type": "markdown", "metadata": { "id": "A-vBZiCCqld0" }, "source": [ "Import the required libraries." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:01:33.683521Z", "iopub.status.busy": "2021-08-14T06:01:33.682868Z", "iopub.status.idle": "2021-08-14T06:01:36.762170Z", "shell.execute_reply": "2021-08-14T06:01:36.761575Z" }, "id": "6NTYbdWcseuK" }, "outputs": [], "source": [ "import os\n", "import sys\n", "import pathlib\n", "import pickle\n", "from collections import namedtuple\n", "from datetime import datetime\n", "\n", "import numpy as np\n", "import apache_beam as beam\n", "import annoy\n", "from sklearn.random_projection import gaussian_random_matrix\n", "\n", "import tensorflow.compat.v1 as tf\n", "import tensorflow_hub as hub" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:01:36.772627Z", "iopub.status.busy": "2021-08-14T06:01:36.766397Z", "iopub.status.idle": "2021-08-14T06:02:10.663211Z", "shell.execute_reply": 
"2021-08-14T06:02:10.663599Z" }, "id": "_GF0GnLqGdPQ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.\r\n", "You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.\u001b[0m\r\n" ] } ], "source": [ "# TFT needs to be installed afterwards\n", "!pip install -q tensorflow_transform==0.24\n", "import tensorflow_transform as tft\n", "import tensorflow_transform.beam as tft_beam" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:10.669006Z", "iopub.status.busy": "2021-08-14T06:02:10.668393Z", "iopub.status.idle": "2021-08-14T06:02:10.670941Z", "shell.execute_reply": "2021-08-14T06:02:10.670510Z" }, "id": "tx0SZa6-7b-f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TF version: 2.6.0\n", "TF-Hub version: 0.12.0\n", "TF-Transform version: 0.24.0\n", "Apache Beam version: 2.31.0\n" ] } ], "source": [ "print('TF version: {}'.format(tf.__version__))\n", "print('TF-Hub version: {}'.format(hub.__version__))\n", "print('TF-Transform version: {}'.format(tft.__version__))\n", "print('Apache Beam version: {}'.format(beam.__version__))" ] }, { "cell_type": "markdown", "metadata": { "id": "P6Imq876rLWx" }, "source": [ "## 1. 
Download sample data\n", "\n", "The [A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) dataset contains news headlines published over a period of 15 years, sourced from the reputable Australian Broadcasting Corporation (ABC). It records noteworthy events around the world from early 2003 to the end of 2017, with a more granular focus on Australia.\n", "\n", "**Format**: Tab-separated data with two columns: 1) publication date and 2) headline text. This tutorial only uses the headline text.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:10.681680Z", "iopub.status.busy": "2021-08-14T06:02:10.679893Z", "iopub.status.idle": "2021-08-14T06:02:16.877284Z", "shell.execute_reply": "2021-08-14T06:02:16.876734Z" }, "id": "OpF57n8e5C9D" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-08-14 06:02:10-- https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true\r\n", "Resolving dataverse.harvard.edu (dataverse.harvard.edu)... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "72.44.40.54, 18.211.119.52, 54.162.175.159\r\n", "Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|72.44.40.54|:443... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "connected.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "HTTP request sent, awaiting response... 
" ] }, { "name": "stdout", "output_type": "stream", "text": [ "200 OK\r\n", "Length: 57600231 (55M) [text/tab-separated-values]\r\n", "Saving to: ‘raw.tsv’\r\n", "\r\n", "\r", "raw.tsv 0%[ ] 0 --.-KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 0%[ ] 89.54K 244KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 0%[ ] 409.54K 558KB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 3%[ ] 1.65M 1.50MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 10%[=> ] 5.54M 4.26MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 16%[==> ] 9.29M 5.62MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 26%[====> ] 14.43M 7.79MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 33%[=====> ] 18.27M 8.29MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 42%[=======> ] 23.33M 9.70MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 49%[========> ] 27.26M 9.88MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 58%[==========> ] 32.38M 10.9MB/s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 65%[============> ] 36.21M 11.4MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 71%[=============> ] 39.21M 11.2MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 79%[==============> ] 43.90M 11.9MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 87%[================> ] 48.13M 12.9MB/s eta 2s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 93%[=================> ] 51.13M 13.9MB/s eta 0s " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "raw.tsv 100%[===================>] 54.93M 14.8MB/s in 4.4s \r\n", "\r\n", 
"2021-08-14 06:02:16 (12.4 MB/s) - ‘raw.tsv’ saved [57600231/57600231]\r\n", "\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1103664 raw.tsv\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "publish_date\theadline_text\r\n", "20030219\t\"aba decides against community broadcasting licence\"\r\n", "20030219\t\"act fire witnesses must be aware of defamation\"\r\n", "20030219\t\"a g calls for infrastructure protection summit\"\r\n", "20030219\t\"air nz staff in aust strike for pay rise\"\r\n", "20030219\t\"air nz strike to affect australian travellers\"\r\n", "20030219\t\"ambitious olsson wins triple jump\"\r\n", "20030219\t\"antic delighted with record breaking barca\"\r\n", "20030219\t\"aussie qualifier stosur wastes four memphis match\"\r\n", "20030219\t\"aust addresses un security council over iraq\"\r\n" ] } ], "source": [ "!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv\n", "!wc -l raw.tsv\n", "!head raw.tsv" ] }, { "cell_type": "markdown", "metadata": { "id": "Reeoc9z0zTxJ" }, "source": [ "単純化するため、見出しのテキストのみを維持し、発行日は削除します。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:16.889441Z", "iopub.status.busy": "2021-08-14T06:02:16.884012Z", "iopub.status.idle": "2021-08-14T06:02:17.863566Z", "shell.execute_reply": "2021-08-14T06:02:17.863016Z" }, "id": "INPWa4upv_yJ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'corpus': No such file or directory\r\n" ] } ], "source": [ "!rm -r corpus\n", "!mkdir corpus\n", "\n", "with open('corpus/text.txt', 'w') as out_file:\n", " with open('raw.tsv', 'r') as in_file:\n", " for line in in_file:\n", " headline = line.split('\\t')[1].strip().strip('\"')\n", " out_file.write(headline+\"\\n\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:17.874388Z", 
"iopub.status.busy": "2021-08-14T06:02:17.873751Z", "iopub.status.idle": "2021-08-14T06:02:17.987553Z", "shell.execute_reply": "2021-08-14T06:02:17.987900Z" }, "id": "5-oedX40z6o2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "severe storms forecast for nye in south east queensland\r\n", "snake catcher pleads for people not to kill reptiles\r\n", "south australia prepares for party to welcome new year\r\n", "strikers cool off the heat with big win in adelaide\r\n", "stunning images from the sydney to hobart yacht\r\n", "the ashes smiths warners near miss liven up boxing day test\r\n", "timelapse: brisbanes new year fireworks\r\n", "what 2017 meant to the kids of australia\r\n", "what the papodopoulos meeting may mean for ausus\r\n", "who is george papadopoulos the former trump campaign aide\r\n" ] } ], "source": [ "!tail corpus/text.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "ls0Zh7kYz3PM" }, "source": [ "## TF-Hub モジュールを読み込むためのヘルパー関数" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:17.993983Z", "iopub.status.busy": "2021-08-14T06:02:17.993363Z", "iopub.status.idle": "2021-08-14T06:02:17.995661Z", "shell.execute_reply": "2021-08-14T06:02:17.995260Z" }, "id": "vSt_jmyKz3Xp" }, "outputs": [], "source": [ "def load_module(module_url):\n", " embed_module = hub.Module(module_url)\n", " placeholder = tf.placeholder(dtype=tf.string)\n", " embed = embed_module(placeholder)\n", " session = tf.Session()\n", " session.run([tf.global_variables_initializer(), tf.tables_initializer()])\n", " print('TF-Hub module is loaded.')\n", "\n", " def _embeddings_fn(sentences):\n", " computed_embeddings = session.run(\n", " embed, feed_dict={placeholder: sentences})\n", " return computed_embeddings\n", "\n", " return _embeddings_fn" ] }, { "cell_type": "markdown", "metadata": { "id": "2AngMtH50jNb" }, "source": [ "## 2. 
Generate embeddings for the data\n", "\n", "In this tutorial, we use the [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/2) to generate embeddings for the headline data. The sentence embeddings can then easily be used to compute sentence-level semantic similarity. We run the embedding generation process using Apache Beam and TF-Transform." ] }, { "cell_type": "markdown", "metadata": { "id": "F_DvXnDB1pEX" }, "source": [ "### Embedding extraction method" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.000705Z", "iopub.status.busy": "2021-08-14T06:02:18.000125Z", "iopub.status.idle": "2021-08-14T06:02:18.002442Z", "shell.execute_reply": "2021-08-14T06:02:18.002053Z" }, "id": "yL7OEY1E0A35" }, "outputs": [], "source": [ "encoder = None\n", "\n", "def embed_text(text, module_url, random_projection_matrix):\n", "  # Beam will run this function in different processes that need to\n", "  # import hub and load embed_fn (if not previously loaded)\n", "  global encoder\n", "  if not encoder:\n", "    encoder = hub.Module(module_url)\n", "  embedding = encoder(text)\n", "  if random_projection_matrix is not None:\n", "    # Perform random projection for the embedding\n", "    embedding = tf.matmul(\n", "        embedding, tf.cast(random_projection_matrix, embedding.dtype))\n", "  return embedding\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_don5gXy9D59" }, "source": [ "### Make TFT preprocess_fn method" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.007427Z", "iopub.status.busy": "2021-08-14T06:02:18.006871Z", "iopub.status.idle": "2021-08-14T06:02:18.009207Z", "shell.execute_reply": "2021-08-14T06:02:18.008665Z" }, "id": "fwYlrzzK9ECE" }, "outputs": [], "source": [ "def make_preprocess_fn(module_url, random_projection_matrix=None):\n", "  '''Makes a tft preprocess_fn'''\n", "\n", "  def _preprocess_fn(input_features):\n", "    '''tft preprocess_fn'''\n", "    text = input_features['text']\n", "    # Generate the embedding for the input text\n", "    embedding = 
embed_text(text, module_url, random_projection_matrix)\n", "\n", "    output_features = {\n", "        'text': text,\n", "        'embedding': embedding\n", "    }\n", "\n", "    return output_features\n", "\n", "  return _preprocess_fn" ] }, { "cell_type": "markdown", "metadata": { "id": "SQ492LN7A-NZ" }, "source": [ "### Create dataset metadata" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.013616Z", "iopub.status.busy": "2021-08-14T06:02:18.013065Z", "iopub.status.idle": "2021-08-14T06:02:18.015254Z", "shell.execute_reply": "2021-08-14T06:02:18.014779Z" }, "id": "d2D4332VA-2V" }, "outputs": [], "source": [ "def create_metadata():\n", "  '''Creates metadata for the raw data'''\n", "  from tensorflow_transform.tf_metadata import dataset_metadata\n", "  from tensorflow_transform.tf_metadata import schema_utils\n", "  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}\n", "  schema = schema_utils.schema_from_feature_spec(feature_spec)\n", "  metadata = dataset_metadata.DatasetMetadata(schema)\n", "  return metadata" ] }, { "cell_type": "markdown", "metadata": { "id": "5zlSLPzRBm6H" }, "source": [ "### Beam pipeline" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.022629Z", "iopub.status.busy": "2021-08-14T06:02:18.021999Z", "iopub.status.idle": "2021-08-14T06:02:18.023652Z", "shell.execute_reply": "2021-08-14T06:02:18.024022Z" }, "id": "jCGUIB172m2G" }, "outputs": [], "source": [ "def run_hub2emb(args):\n", "  '''Runs the embedding generation pipeline'''\n", "\n", "  options = beam.options.pipeline_options.PipelineOptions(**args)\n", "  args = namedtuple(\"options\", args.keys())(*args.values())\n", "\n", "  raw_metadata = create_metadata()\n", "  converter = tft.coders.CsvCoder(\n", "      column_names=['text'], schema=raw_metadata.schema)\n", "\n", "  with beam.Pipeline(args.runner, options=options) as pipeline:\n", "    with 
tft_beam.Context(args.temporary_dir):\n", "      # Read the sentences from the input file\n", "      sentences = (\n", "          pipeline\n", "          | 'Read sentences from files' >> beam.io.ReadFromText(\n", "              file_pattern=args.data_dir)\n", "          | 'Convert to dictionary' >> beam.Map(converter.decode)\n", "      )\n", "\n", "      sentences_dataset = (sentences, raw_metadata)\n", "      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)\n", "      # Generate the embeddings for the sentence using the TF-Hub module\n", "      embeddings_dataset, _ = (\n", "          sentences_dataset\n", "          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)\n", "      )\n", "\n", "      embeddings, transformed_metadata = embeddings_dataset\n", "      # Write the embeddings to TFRecords files\n", "      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(\n", "          file_path_prefix='{}/emb'.format(args.output_dir),\n", "          file_name_suffix='.tfrecords',\n", "          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))" ] }, { "cell_type": "markdown", "metadata": { "id": "uHbq4t2gCDAG" }, "source": [ "### Generate the random projection weight matrix\n", "\n", "[Random projection](https://en.wikipedia.org/wiki/Random_projection) is a simple yet powerful technique used to reduce the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).\n", "\n", "Reducing the dimensionality of the embeddings with random projection means that less time is needed to build and query the ANN index.\n", "\n", "In this tutorial we use [Gaussian Random Projection](https://en.wikipedia.org/wiki/Random_projection#Gaussian_random_projection) from the [Scikit-learn](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) library." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.028894Z", "iopub.status.busy": "2021-08-14T06:02:18.028273Z", "iopub.status.idle": "2021-08-14T06:02:18.030182Z", "shell.execute_reply": "2021-08-14T06:02:18.030510Z" }, 
"id": "T1aYPeOUCDIP" }, "outputs": [], "source": [ "def generate_random_projection_weights(original_dim, projected_dim):\n", " random_projection_matrix = None\n", " if projected_dim and original_dim > projected_dim:\n", " random_projection_matrix = gaussian_random_matrix(\n", " n_components=projected_dim, n_features=original_dim).T\n", " print(\"A Gaussian random weight matrix was creates with shape of {}\".format(random_projection_matrix.shape))\n", " print('Storing random projection matrix to disk...')\n", " with open('random_projection_matrix', 'wb') as handle:\n", " pickle.dump(random_projection_matrix, \n", " handle, protocol=pickle.HIGHEST_PROTOCOL)\n", " \n", " return random_projection_matrix" ] }, { "cell_type": "markdown", "metadata": { "id": "CHxZX2Z3Nk64" }, "source": [ "### パラメータの設定\n", "\n", "ランダムプロジェクションを使用せずに、元の埋め込み空間を使用してインデックスを構築する場合は、`projected_dim` パラメータを `None` に設定します。これにより、高次元埋め込みのインデックス作成ステップが減速することに注意してください。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-08-14T06:02:18.034269Z", "iopub.status.busy": "2021-08-14T06:02:18.033701Z", "iopub.status.idle": "2021-08-14T06:02:18.035599Z", "shell.execute_reply": "2021-08-14T06:02:18.035155Z" }, "id": "feMVXFL0NlIM" }, "outputs": [], "source": [ "module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2' #@param {type:\"string\"}\n", "projected_dim = 64 #@param {type:\"number\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "On-MbzD922kb" }, "source": [ "### パイプラインの実行" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:18.041231Z", "iopub.status.busy": "2021-08-14T06:02:18.040692Z", "iopub.status.idle": "2021-08-14T06:02:40.893101Z", "shell.execute_reply": "2021-08-14T06:02:40.893461Z" }, "id": "Y3I1Wv4i21yY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no 
variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:02:35.233546: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.242186: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.243204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.245094: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2021-08-14 06:02:35.245692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.246601: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.247558: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n" ] }, { "name": "stderr", "output_type": 
"stream", "text": [ "2021-08-14 06:02:35.765545: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.766516: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.767498: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:35.768326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TF-Hub module is loaded.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "A Gaussian random weight matrix was creates with shape of (512, 64)\n", "Storing random projection matrix to disk...\n", "Pipeline args are set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/sklearn/utils/deprecation.py:86: FutureWarning: Function gaussian_random_matrix is deprecated; gaussian_random_matrix is deprecated in 0.22 and will be removed in version 0.24.\n", " warnings.warn(msg, category=FutureWarning)\n" ] }, { "data": { "text/plain": [ "{'job_name': 'hub2emb-210814-060240',\n", " 'runner': 'DirectRunner',\n", " 'batch_size': 1024,\n", " 'data_dir': 'corpus/*.txt',\n", " 'output_dir': PosixPath('/tmp/tmpc6afrtky'),\n", " 'temporary_dir': PosixPath('/tmp/tmpm8qt0b58'),\n", " 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',\n", " 'random_projection_matrix': array([[ 
1.02620476e-01, 1.42609552e-01, 2.31834089e-02, ...,\n", " 3.34482362e-02, -9.16320501e-02, 6.08857313e-03],\n", " [-3.83626651e-02, -4.28119068e-03, 2.33800471e-01, ...,\n", " 2.09980119e-04, 1.35151201e-01, 2.27909783e-01],\n", " [ 2.45493073e-02, -7.33840377e-02, -2.39605360e-01, ...,\n", " 3.41644499e-02, -7.02873932e-02, -2.84315778e-01],\n", " ...,\n", " [-1.95808735e-01, -5.37650104e-02, 1.04212784e-01, ...,\n", " 9.01655723e-02, -1.15924190e-01, 8.84887858e-02],\n", " [ 1.11187595e-02, 1.64015586e-01, -3.21288737e-01, ...,\n", " 1.05356039e-01, -1.62969901e-01, 2.51348063e-01],\n", " [-1.42711265e-01, -9.61572580e-02, 1.09919747e-01, ...,\n", " 9.04922650e-02, -2.12339462e-01, 4.21877595e-02]])}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tempfile\n", "\n", "output_dir = pathlib.Path(tempfile.mkdtemp())\n", "temporary_dir = pathlib.Path(tempfile.mkdtemp())\n", "\n", "g = tf.Graph()\n", "with g.as_default():\n", "  original_dim = load_module(module_url)(['']).shape[1]\n", "  random_projection_matrix = None\n", "\n", "  if projected_dim:\n", "    random_projection_matrix = generate_random_projection_weights(\n", "        original_dim, projected_dim)\n", "\n", "args = {\n", "    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),\n", "    'runner': 'DirectRunner',\n", "    'batch_size': 1024,\n", "    'data_dir': 'corpus/*.txt',\n", "    'output_dir': output_dir,\n", "    'temporary_dir': temporary_dir,\n", "    'module_url': module_url,\n", "    'random_projection_matrix': random_projection_matrix,\n", "}\n", "\n", "print(\"Pipeline args are set.\")\n", "args" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:02:40.920761Z", "iopub.status.busy": "2021-08-14T06:02:40.919591Z", "iopub.status.idle": "2021-08-14T06:05:04.326139Z", "shell.execute_reply": "2021-08-14T06:05:04.325679Z" }, "id": "iS9obmeP4ZOA" }, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Running pipeline...\n" ] }, { "data": { "application/javascript": [ "\n", " if (typeof window.interactive_beam_jquery == 'undefined') {\n", " var jqueryScript = document.createElement('script');\n", " jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n", " jqueryScript.type = 'text/javascript';\n", " jqueryScript.onload = function() {\n", " var datatableScript = document.createElement('script');\n", " datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n", " datatableScript.type = 'text/javascript';\n", " datatableScript.onload = function() {\n", " window.interactive_beam_jquery = jQuery.noConflict(true);\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }\n", " document.head.appendChild(datatableScript);\n", " };\n", " document.head.appendChild(jqueryScript);\n", " } else {\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 
\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "ename": "ModuleNotFoundError", "evalue": "No module named 'pyarrow.vendored'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasAPIShim._check_import\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pyarrow/pandas-shim.pxi\u001b[0m in \u001b[0;36mpyarrow.lib._PandasAPIShim._import_pandas\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pyarrow.vendored'" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Exception ignored in: 'pyarrow.lib._PandasAPIShim._have_pandas_internal'\n", "Traceback (most recent call last):\n", " File 
\"pyarrow/pandas-shim.pxi\", line 110, in pyarrow.lib._PandasAPIShim._check_import\n", " File \"pyarrow/pandas-shim.pxi\", line 56, in pyarrow.lib._PandasAPIShim._import_pandas\n", "ModuleNotFoundError: No module named 'pyarrow.vendored'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:02:42.144849: W tensorflow/core/common_runtime/graph_constructor.cc:1511] Importing a graph with a lower producer version 26 into an existing graph with producer version 808. Shape inference will have run different parts of the graph with different producer versions.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:02:43.603852: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:43.604464: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:43.604885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:43.605359: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:43.605762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value 
(-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:43.606097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets added to graph.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets added to graph.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:No assets to write.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:No assets to write.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:SavedModel written to: /tmp/tmpm8qt0b58/tftransform_tmp/56663095425d4d018925728b505aa39e/saved_model.pb\n" 
] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:SavedModel written to: /tmp/tmpm8qt0b58/tftransform_tmp/56663095425d4d018925728b505aa39e/saved_model.pb\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use ref() instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.7/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use ref() instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Tensorflow version (2.6.0) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. 
Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['-f', '/tmp/tmpfer191yo.json', '--HistoryManager.hist_file=:memory:']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.options.pipeline_options:Discarding invalid overrides: {'batch_size': 1024, 'data_dir': 'corpus/*.txt', 'output_dir': PosixPath('/tmp/tmpc6afrtky'), 'temporary_dir': PosixPath('/tmp/tmpm8qt0b58'), 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2', 'random_projection_matrix': array([[ 1.02620476e-01, 1.42609552e-01, 2.31834089e-02, ...,\n", " 3.34482362e-02, -9.16320501e-02, 6.08857313e-03],\n", " [-3.83626651e-02, -4.28119068e-03, 2.33800471e-01, ...,\n", " 2.09980119e-04, 1.35151201e-01, 2.27909783e-01],\n", " [ 2.45493073e-02, -7.33840377e-02, -2.39605360e-01, ...,\n", " 3.41644499e-02, -7.02873932e-02, -2.84315778e-01],\n", " ...,\n", " [-1.95808735e-01, -5.37650104e-02, 1.04212784e-01, ...,\n", " 9.01655723e-02, -1.15924190e-01, 8.84887858e-02],\n", " [ 1.11187595e-02, 1.64015586e-01, -3.21288737e-01, ...,\n", " 1.05356039e-01, -1.62969901e-01, 2.51348063e-01],\n", " [-1.42711265e-01, -9.61572580e-02, 1.09919747e-01, ...,\n", " 9.04922650e-02, -2.12339462e-01, 4.21877595e-02]])}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:02:47.727963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:47.728440: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), 
but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:47.728776: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:47.729173: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:47.729496: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:47.729764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:02:54.170460: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:54.171105: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:54.171542: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:54.172069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 
06:02:54.172515: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:02:54.172875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2min 31s, sys: 6.18 s, total: 2min 37s\n", "Wall time: 2min 23s\n", "Pipeline is done.\n" ] } ], "source": [ "!rm -r {output_dir}\n", "!rm -r {temporary_dir}\n", "\n", "print(\"Running pipeline...\")\n", "%time run_hub2emb(args)\n", "print(\"Pipeline is done.\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:05:04.362675Z", "iopub.status.busy": "2021-08-14T06:05:04.362089Z", "iopub.status.idle": "2021-08-14T06:05:04.505610Z", "shell.execute_reply": "2021-08-14T06:05:04.505138Z" }, "id": "JAwOo7gQWvVd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "emb-00000-of-00001.tfrecords\r\n" ] } ], "source": [ "!ls {output_dir}" ] }, { "cell_type": "markdown", "metadata": { "id": "HVnee4e6U90u" }, "source": [ "Read some of the generated embeddings." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:05:04.512652Z", "iopub.status.busy": "2021-08-14T06:05:04.512044Z", "iopub.status.idle": "2021-08-14T06:05:04.527064Z", "shell.execute_reply": "2021-08-14T06:05:04.527393Z" }, "id": "-K7pGXlXOj1N" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"WARNING:tensorflow:From /tmp/ipykernel_16377/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use eager execution and: \n", "`tf.data.TFRecordDataset(path)`\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmp/ipykernel_16377/2258356591.py:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use eager execution and: \n", "`tf.data.TFRecordDataset(path)`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Embedding dimensions: 64\n", "[b'headline_text']: [ 0.14941755 0.05681387 0.01837291 0.15602173 -0.04690704 0.08429583\n", " -0.0878481 -0.24873284 0.10639744 0.12141651]\n", "Embedding dimensions: 64\n", "[b'aba decides against community broadcasting licence']: [-0.12688878 -0.09361811 0.23090497 -0.12106405 -0.16310519 0.09448653\n", " -0.03058668 0.01884805 -0.01119653 0.04711347]\n", "Embedding dimensions: 64\n", "[b'act fire witnesses must be aware of defamation']: [ 0.06108518 -0.24045897 0.00934747 -0.03547037 0.02692597 0.11331488\n", " -0.09989329 0.38848352 -0.02332795 0.07045569]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Embedding dimensions: 64\n", "[b'a g calls for infrastructure protection summit']: [ 0.06097336 -0.23125464 0.20088762 0.05249105 -0.06265712 0.07405787\n", " -0.04854955 -0.06437907 0.03590827 -0.01955524]\n", "Embedding dimensions: 64\n", "[b'air nz staff in aust strike for pay rise']: [ 0.01049893 -0.03635513 0.11150192 -0.07005765 0.14374056 0.11521669\n", " -0.20018959 -0.01112215 -0.09252568 -0.07241055]\n" ] } ], "source": [ "import itertools\n", "\n", "embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')\n", "sample = 5\n", "record_iterator = tf.io.tf_record_iterator(path=embed_file)\n", "for string_record in 
itertools.islice(record_iterator, sample):\n", "  example = tf.train.Example()\n", "  example.ParseFromString(string_record)\n", "  text = example.features.feature['text'].bytes_list.value\n", "  embedding = np.array(example.features.feature['embedding'].float_list.value)\n", "  print(\"Embedding dimensions: {}\".format(embedding.shape[0]))\n", "  print(\"{}: {}\".format(text, embedding[:10]))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "agGoaMSgY8wN" }, "source": [ "## 3. Build the ANN index for the embeddings\n", "\n", "[ANNOY](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for finding points in a space that are close to a given query point. It also creates large, read-only, file-based data structures that are memory-mapped. It was built by [Spotify](https://www.spotify.com), which uses it for music recommendations." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:05:04.538569Z", "iopub.status.busy": "2021-08-14T06:05:04.537935Z", "iopub.status.idle": "2021-08-14T06:05:04.539668Z", "shell.execute_reply": "2021-08-14T06:05:04.540047Z" }, "id": "UcPDspU3WjgH" }, "outputs": [], "source": [ "def build_index(embedding_files_pattern, index_filename, vector_length,\n", "    metric='angular', num_trees=100):\n", "  '''Builds an ANNOY index'''\n", "\n", "  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)\n", "  # Mapping between the item and its identifier in the index\n", "  mapping = {}\n", "\n", "  embed_files = tf.gfile.Glob(embedding_files_pattern)\n", "  print('Found {} embedding file(s).'.format(len(embed_files)))\n", "\n", "  item_counter = 0\n", "  for f, embed_file in enumerate(embed_files):\n", "    print('Loading embeddings in file {} of {}...'.format(\n", "      f+1, len(embed_files)))\n", "    record_iterator = tf.io.tf_record_iterator(\n", "      path=embed_file)\n", "\n", "    for string_record in record_iterator:\n", "      example = tf.train.Example()\n", "      example.ParseFromString(string_record)\n", "      text = example.features.feature['text'].bytes_list.value[0].decode(\"utf-8\")\n", "      mapping[item_counter] = text\n", "      embedding = np.array(\n", "        example.features.feature['embedding'].float_list.value)\n", "      annoy_index.add_item(item_counter, embedding)\n", "      item_counter += 1\n", "      if item_counter % 100000 == 0:\n", "        print('{} items loaded to the index'.format(item_counter))\n", "\n", "  print('A total of {} items added to the index'.format(item_counter))\n", "\n", "  print('Building the index with {} trees...'.format(num_trees))\n", "  annoy_index.build(n_trees=num_trees)\n", "  print('Index is successfully built.')\n", "\n", "  print('Saving index to disk...')\n", "  annoy_index.save(index_filename)\n", "  print('Index is saved to disk.')\n", "  print(\"Index file size: {} GB\".format(\n", "    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))\n", "  annoy_index.unload()\n", "\n", "  print('Saving mapping to disk...')\n", "  with open(index_filename + '.mapping', 'wb') as handle:\n", "    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "  print('Mapping is saved to disk.')\n", "  print(\"Mapping file size: {} MB\".format(\n", "    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:05:04.545835Z", "iopub.status.busy": "2021-08-14T06:05:04.545257Z", "iopub.status.idle": "2021-08-14T06:06:36.362469Z", "shell.execute_reply": "2021-08-14T06:06:36.362878Z" }, "id": "AgyOQhUq6FNE" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'index': No such file or directory\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "rm: cannot remove 'index.mapping': No such file or directory\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Found 1 embedding file(s).\n", "Loading embeddings in file 1 of 1...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ 
"100000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "200000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "300000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "400000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "500000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "600000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "700000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "800000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "900000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1000000 items loaded to the index\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1100000 items loaded to the index\n", "A total of 1103664 items added to the index\n", "Building the index with 100 trees...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Index is successfully built.\n", "Saving index to disk...\n", "Index is saved to disk.\n", "Index file size: 1.66 GB\n", "Saving mapping to disk...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Mapping is saved to disk.\n", "Mapping file size: 50.61 MB\n", "CPU times: user 5min 46s, sys: 3.83 s, total: 5min 50s\n", "Wall time: 1min 31s\n" ] } ], "source": [ "embedding_files = \"{}/emb-*.tfrecords\".format(output_dir)\n", "embedding_dimension = projected_dim\n", "index_filename = \"index\"\n", "\n", "!rm {index_filename}\n", "!rm {index_filename}.mapping\n", "\n", "%time build_index(embedding_files, index_filename, embedding_dimension)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:06:36.401788Z", "iopub.status.busy": 
"2021-08-14T06:06:36.367380Z", "iopub.status.idle": "2021-08-14T06:06:36.546005Z", "shell.execute_reply": "2021-08-14T06:06:36.546391Z" }, "id": "Ic31Tm5cgAd5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "corpus\tindex.mapping\t\t raw.tsv\r\n", "index\trandom_projection_matrix semantic_approximate_nearest_neighbors.ipynb\r\n" ] } ], "source": [ "!ls" ] }, { "cell_type": "markdown", "metadata": { "id": "maGxDl8ufP-p" }, "source": [ "## 4. インデックスを使って、類似性の一致を実施する\n", "\n", "ANN インデックスを使用して、入力クエリに意味的に近いニュースの見出しを検索できるようになりました。" ] }, { "cell_type": "markdown", "metadata": { "id": "_dIs8W78fYPp" }, "source": [ "### インデックスとマッピングファイルを読み込む" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:06:36.614660Z", "iopub.status.busy": "2021-08-14T06:06:36.551843Z", "iopub.status.idle": "2021-08-14T06:06:36.889932Z", "shell.execute_reply": "2021-08-14T06:06:36.889445Z" }, "id": "jlTTrbQHayvb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Annoy index is loaded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/kbuilder/.local/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: The default argument for metric will be removed in future version of Annoy. 
Please pass metric='angular' explicitly.\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Mapping file is loaded.\n" ] } ], "source": [ "# Pass the metric explicitly; it must match the metric the index was built with.\n", "index = annoy.AnnoyIndex(embedding_dimension, metric='angular')\n", "index.load(index_filename, prefault=True)\n", "print('Annoy index is loaded.')\n", "with open(index_filename + '.mapping', 'rb') as handle:\n", "  mapping = pickle.load(handle)\n", "print('Mapping file is loaded.')\n" ] }, { "cell_type": "markdown", "metadata": { "id": "y6liFMSUh08J" }, "source": [ "### Similarity matching method" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:06:36.895127Z", "iopub.status.busy": "2021-08-14T06:06:36.894520Z", "iopub.status.idle": "2021-08-14T06:06:36.896711Z", "shell.execute_reply": "2021-08-14T06:06:36.896300Z" }, "id": "mUxjTag8hc16" }, "outputs": [], "source": [ "def find_similar_items(embedding, num_matches=5):\n", "  '''Finds similar items to a given embedding in the ANN index'''\n", "  ids = index.get_nns_by_vector(\n", "    embedding, num_matches, search_k=-1, include_distances=False)\n", "  items = [mapping[i] for i in ids]\n", "  return items" ] }, { "cell_type": "markdown", "metadata": { "id": "hjerNpmZja0A" }, "source": [ "### Extract the embedding for a given query" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:06:36.902643Z", "iopub.status.busy": "2021-08-14T06:06:36.902011Z", "iopub.status.idle": "2021-08-14T06:06:42.490125Z", "shell.execute_reply": "2021-08-14T06:06:42.490455Z" }, "id": "a0IIXzfBjZ19" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading the TF-Hub module...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Saver not created because there are 
no variables in the graph to restore\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-08-14 06:06:38.697281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:06:38.697812: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:06:38.698135: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:06:38.698541: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:06:38.698887: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero\n", "2021-08-14 06:06:38.699208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory: -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TF-Hub module is loaded.\n", "TF-Hub module is loaded.\n", "Loading random projection matrix...\n", "random projection matrix is loaded.\n" ] } ], "source": [ "# Load the TF-Hub module\n", "print(\"Loading the TF-Hub module...\")\n", "g = tf.Graph()\n", "with g.as_default():\n", " embed_fn = load_module(module_url)\n", "print(\"TF-Hub module is loaded.\")\n", "\n", "random_projection_matrix = None\n", "if 
os.path.exists('random_projection_matrix'):\n", "  print(\"Loading random projection matrix...\")\n", "  with open('random_projection_matrix', 'rb') as handle:\n", "    random_projection_matrix = pickle.load(handle)\n", "  print('random projection matrix is loaded.')\n", "\n", "def extract_embeddings(query):\n", "  '''Generates the embedding for the query'''\n", "  query_embedding = embed_fn([query])[0]\n", "  if random_projection_matrix is not None:\n", "    query_embedding = query_embedding.dot(random_projection_matrix)\n", "  return query_embedding" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2021-08-14T06:06:42.495008Z", "iopub.status.busy": "2021-08-14T06:06:42.494398Z", "iopub.status.idle": "2021-08-14T06:06:42.700661Z", "shell.execute_reply": "2021-08-14T06:06:42.701321Z" }, "id": "kCoCNROujEIO" }, "outputs": [ { "data": { "text/plain": [ "array([-0.0145482 , -0.06719958, 0.07120819, -0.14009826, -0.04229673,\n", " 0.08441613, -0.1013097 , -0.19552414, -0.05008004, -0.02184109])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "extract_embeddings(\"Hello Machine Learning!\")[:10]" ] }, { "cell_type": "markdown", "metadata": { "id": "nE_Q60nCk_ZB" }, "source": [ "### Enter a query to find the most similar items" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-08-14T06:06:42.705824Z", "iopub.status.busy": "2021-08-14T06:06:42.704894Z", "iopub.status.idle": "2021-08-14T06:06:42.724141Z", "shell.execute_reply": "2021-08-14T06:06:42.724859Z" }, "id": "wC0uLjvfk5nB" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating embedding for the query...\n", "CPU times: user 16.8 ms, sys: 16.2 ms, total: 33 ms\n", "Wall time: 4.16 ms\n", "\n", "Finding relevant items in the index...\n", "CPU times: user 5.4 ms, sys: 183 µs, total: 5.59 ms\n", "Wall time: 706 µs\n", "\n", "Results:\n", 
"=========\n", "confronting global challenges\n", "nff challenges social media interpretation of\n", "hopes for craft project to boost social cohesion\n", "act closer to integrated planning system\n", "ai analysis of the next stage\n", "kambalda boom helping neighbours\n", "origin plots solar revolution\n", "uni given access to qlds biggest supercomputer\n", "ozasia meeting performance targets\n", "riverland adopts suicide prevention scheme\n" ] } ], "source": [ "#@title { run: \"auto\" }\n", "query = \"confronting global challenges\" #@param {type:\"string\"}\n", "print(\"Generating embedding for the query...\")\n", "%time query_embedding = extract_embeddings(query)\n", "\n", "print(\"\")\n", "print(\"Finding relevant items in the index...\")\n", "%time items = find_similar_items(query_embedding, 10)\n", "\n", "print(\"\")\n", "print(\"Results:\")\n", "print(\"=========\")\n", "for item in items:\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": { "id": "wwtMtyOeDKwt" }, "source": [ "## 今後の学習\n", "\n", "[tensorflow.org/hub](https://www.tensorflow.org/) では、TensorFlow についてさらに学習し、TF-Hub API ドキュメントを確認することができます。また、[tfhub.dev](https://www.tensorflow.org/hub/) では、その他のテキスト埋め込みモジュールや画像特徴量ベクトルモジュールなど、利用可能な TensorFlow Hub モジュールを検索することができます。\n", "\n", "さらに、Google の [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/) もご覧ください。機械学習の実用的な導入をテンポよく学習できます。" ] } ], "metadata": { "colab": { "collapsed_sections": [ "ls0Zh7kYz3PM", "_don5gXy9D59", "SQ492LN7A-NZ" ], "name": "semantic_approximate_nearest_neighbors.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 0 }