##### Copyright 2019 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [1]:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# Semantic Search with Approximate Nearest Neighbors and Text Embeddings


<table class="tfo-notebook-buttons" align="left">
  <td>     <a target="_blank" href="https://tensorflow.google.cn/hub/tutorials/tf2_semantic_approximate_nearest_neighbors"><img src="https://tensorflow.google.cn/images/tf_logo_32px.png">View on TensorFlow.org</a> </td>
  <td>     <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/hub/tutorials/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://tensorflow.google.cn/images/colab_logo_32px.png">Run in Google Colab</a> </td>
  <td>     <a target="_blank" href="https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/hub/tutorials/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://tensorflow.google.cn/images/GitHub-Mark-32px.png">View source on GitHub</a> </td>
  <td>     <a href="https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/hub/tutorials/tf2_semantic_approximate_nearest_neighbors.ipynb"><img src="https://tensorflow.google.cn/images/download_logo_32px.png">Download notebook</a> </td>
  <td>     <a href="https://tfhub.dev/google/nnlm-en-dim128/2"><img src="https://tensorflow.google.cn/images/hub_logo_32px.png">See TF Hub model</a> </td>
</table>

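This tutorial shows how to generate embeddings from a [TensorFlow Hub](https://tfhub.dev) (TF-Hub) model for a given set of input data, and how to build an approximate nearest neighbors (ANN) index from the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

With a large corpus, exact matching is inefficient: finding the items most similar to a query in real time would require scanning the entire repository. We therefore use an approximate similarity matching algorithm, which gives up a little accuracy in finding the exact nearest neighbors in exchange for a significant gain in speed.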
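To make the trade-off concrete, the sketch below shows the exact, brute-force alternative: score the query against every stored vector, then sort. The function name, array names, and toy sizes are illustrative assumptions and are not part of this tutorial's pipeline. The work grows linearly with the corpus size, which is the cost an ANN index is meant to avoid.

```python
import numpy as np

def exact_top_k(corpus_vectors, query_vector, k=5):
    """Return indices of the k corpus rows most similar to the query (cosine)."""
    # Normalize rows so that a dot product equals cosine similarity.
    corpus_norm = corpus_vectors / np.linalg.norm(
        corpus_vectors, axis=1, keepdims=True)
    query_norm = query_vector / np.linalg.norm(query_vector)
    # One score per corpus item: O(N * dim) work for every single query.
    scores = corpus_norm.dot(query_norm)
    return np.argsort(-scores)[:k]

# Hypothetical toy data standing in for real embeddings.
rng = np.random.default_rng(0)
corpus_vectors = rng.normal(size=(1000, 64))
query_vector = corpus_vectors[42] + 0.01 * rng.normal(size=64)

top_ids = exact_top_k(corpus_vectors, query_vector, k=3)
print(top_ids)
```

Because the query is a slightly perturbed copy of item 42, that item should rank first.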

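In this tutorial, we walk through an example of real-time text search over a corpus of news headlines, finding the headlines most similar to a query. Unlike keyword search, this approach captures the semantic similarity encoded in the text embeddings.

The steps of this tutorial are:

1. Download sample data.
2. Generate embeddings for the data using a TF-Hub model.
3. Build an ANN index for the embeddings.
4. Use the index for similarity matching.

We use [Apache Beam](https://beam.apache.org/documentation/programming-guide/) to generate the embeddings from the TF-Hub model. We also use Spotify's [ANNOY](https://github.com/spotify/annoy) library to build the approximate nearest neighbors index.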

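### More models

For models that have the same architecture but were trained on a different language, refer to [this](https://tfhub.dev/google/collections/nnlm/1) collection. [Here](https://tfhub.dev/s?module-type=text-embedding) you can find all text embeddings currently hosted on [tfhub.dev](https://tfhub.dev).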

## Setup

Install the required libraries.

In [2]:
!pip install apache_beam
!pip install 'scikit_learn~=0.23.0'  # For gaussian_random_matrix.
!pip install annoy

Collecting apache_beam
  Using cached apache_beam-2.43.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.3 MB)
Collecting httplib2<0.21.0,>=0.8
  Using cached httplib2-0.20.4-py3-none-any.whl (96 kB)
Collecting dill<0.3.2,>=0.3.1.1
  Using cached dill-0.3.1.1-py3-none-any.whl
Collecting pyarrow<10.0.0,>=0.15.1
  Using cached pyarrow-9.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.3 MB)
Collecting fastavro<2,>=0.23.6
  Using cached fastavro-1.7.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
Collecting cloudpickle~=2.2.0
  Using cached cloudpickle-2.2.0-py3-none-any.whl (25 kB)
Collecting crcmod<2.0,>=1.7
  Using cached crcmod-1.7-cp39-cp39-linux_x86_64.whl
Collecting orjson<4.0
  Using cached orjson-3.8.3-cp39-cp39-manylinux_2_28_x86_64.whl (144 kB)
Collecting fasteners<1.0,>=0.3
  Using cached fasteners-0.18-py3-none-any.whl (18 kB)
Collecting hdfs<3.0.0,>=2.1.0
  Using cached hdfs-2.7.0-py3-none-any.whl (34 kB)
Collecting regex>=2020.6.8
  Using cached regex-2022.10.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
Collecting proto-plus<2,>=1.7.1
  Using cached proto_plus-1.22.1-py3-none-any.whl (47 kB)
Collecting numpy<1.23.0,>=1.14.3
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting objsize<0.6.0,>=0.5.2
  Using cached objsize-0.5.2-py3-none-any.whl (8.2 kB)
Collecting zstandard<1,>=0.18.0
  Using cached zstandard-0.19.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Collecting pymongo<4.0.0,>=3.8.0
  Using cached pymongo-3.13.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (515 kB)
Collecting docopt
  Using cached docopt-0.6.2-py2.py3-none-any.whl
Installing collected packages: docopt, crcmod, zstandard, regex, pymongo, proto-plus, orjson, objsize, numpy, httplib2, fasteners, fastavro, dill, cloudpickle, pyarrow, hdfs, apache_beam
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.0rc2
    Uninstalling numpy-1.24.0rc2:
      Successfully uninstalled numpy-1.24.0rc2
  Attempting uninstall: dill
    Found existing installation: dill 0.3.6
    Uninstalling dill-0.3.6:
      Successfully uninstalled dill-0.3.6
Successfully installed apache_beam-2.43.0 cloudpickle-2.2.0 crcmod-1.7 dill-0.3.1.1 docopt-0.6.2 fastavro-1.7.0 fasteners-0.18 hdfs-2.7.0 httplib2-0.20.4 numpy-1.22.4 objsize-0.5.2 orjson-3.8.3 proto-plus-1.22.1 pyarrow-9.0.0 pymongo-3.13.0 regex-2022.10.31 zstandard-0.19.0

Collecting scikit_learn~=0.23.0
  Using cached scikit_learn-0.23.2-cp39-cp39-linux_x86_64.whl
Installing collected packages: scikit_learn
  Attempting uninstall: scikit_learn
    Found existing installation: scikit-learn 1.2.0
    Uninstalling scikit-learn-1.2.0:
      Successfully uninstalled scikit-learn-1.2.0
Successfully installed scikit_learn-0.23.2

Collecting annoy
  Using cached annoy-1.17.1-cp39-cp39-linux_x86_64.whl
Installing collected packages: annoy
Successfully installed annoy-1.17.1


Import the required libraries.

In [3]:
import os
import sys
import pickle
from collections import namedtuple
from datetime import datetime
import numpy as np
import apache_beam as beam
from apache_beam.transforms import util
import tensorflow as tf
import tensorflow_hub as hub
import annoy
from sklearn.random_projection import gaussian_random_matrix

2022-12-14 21:52:16.005809: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-12-14 21:52:16.005909: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory


In [4]:
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

TF version: 2.11.0
TF-Hub version: 0.12.0
Apache Beam version: 2.43.0


## 1. Download sample data

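The [A Million News Headlines](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL#) dataset contains news headlines published over a period of 15 years by the Australian Broadcasting Corporation (ABC). It provides a summarized historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more detailed focus on Australia.

**Format**: Tab-separated, two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.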
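As a small illustration of this layout, the sketch below parses a header row and one record with Python's standard `csv` module. The in-memory `sample` string is a stand-in for the downloaded file, and the variable names are assumptions made for this example only.

```python
import csv
import io

# A header row plus one record, laid out like the downloaded TSV file.
sample = (
    'publish_date\theadline_text\n'
    '20030219\t"ambitious olsson wins triple jump"\n'
)

reader = csv.reader(io.StringIO(sample), delimiter='\t', quotechar='"')
header = next(reader)                    # first row holds the column names
headlines = [row[1] for row in reader]   # keep only the second column

print(header)
print(headlines)
```

Reading the header row separately keeps the column names out of the list of headlines.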


In [5]:
!wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
!wc -l raw.tsv
!head raw.tsv

--2022-12-14 21:52:17--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)... 52.54.15.150, 44.213.44.146, 52.23.87.139
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)|52.54.15.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

2022-12-14 21:52:18 (68.6 MB/s) - ‘raw.tsv’ saved [57600231/57600231]



1103664 raw.tsv


publish_date	headline_text
20030219	"aba decides against community broadcasting licence"
20030219	"act fire witnesses must be aware of defamation"
20030219	"a g calls for infrastructure protection summit"
20030219	"air nz staff in aust strike for pay rise"
20030219	"air nz strike to affect australian travellers"
20030219	"ambitious olsson wins triple jump"
20030219	"antic delighted with record breaking barca"
20030219	"aussie qualifier stosur wastes four memphis match"
20030219	"aust addresses un security council over iraq"


For simplicity, we keep only the headline text and remove the publication date.

In [6]:
!rm -r corpus
!mkdir corpus

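# Keep only the headline column (the second tab-separated field).
# Note: the header row of raw.tsv is not skipped, so 'headline_text'
# is written as the first line of the corpus.
with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline+"\n")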

rm: cannot remove 'corpus': No such file or directory


In [7]:
!tail corpus/text.txt

severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide


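## 2. Generate embeddings for the data

In this tutorial, we use the [Neural Network Language Model (NNLM)](https://tfhub.dev/google/nnlm-en-dim128/2) to generate embeddings for the headline data. The resulting sentence embeddings make it easy to compute sentence-level similarity in meaning. We run the embedding generation process with Apache Beam.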
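One common way to compare two sentence embeddings is cosine similarity. The sketch below is a minimal illustration with hand-written 4-dimensional vectors standing in for real NNLM embeddings; the function name and the vectors are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors standing in for sentence embeddings.
a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([2.0, 4.0, 0.0, 2.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0, 0.0])   # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score close to 0.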

### Embedding extraction method

In [8]:
embed_fn = None

def generate_embeddings(text, model_url, random_projection_matrix=None):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global embed_fn
  if embed_fn is None:
    embed_fn = hub.load(model_url)
  embedding = embed_fn(text).numpy()
  if random_projection_matrix is not None:
    embedding = embedding.dot(random_projection_matrix)
  return text, embedding


### Convert to tf.Example method

In [9]:
def to_tf_example(entries):
  examples = []

  text_list, embedding_list = entries
  for i in range(len(text_list)):
    text = text_list[i]
    embedding = embedding_list[i]

    features = {
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=embedding.tolist()))
    }
  
    example = tf.train.Example(
        features=tf.train.Features(
            feature=features)).SerializeToString(deterministic=True)
  
    examples.append(example)
  
  return examples

### Beam pipeline

In [10]:
def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  with beam.Pipeline(args.runner, options=options) as pipeline:
    (
        pipeline
        | 'Read sentences from files' >> beam.io.ReadFromText(
            file_pattern=args.data_dir)
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.model_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{}/emb'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )

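### Generate the random projection weight matrix

[Random projection](https://en.wikipedia.org/wiki/Random_projection) is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the [Johnson-Lindenstrauss lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma).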
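For reference, a standard formulation of the lemma is given here, where $\varepsilon$, $n$, $d$, and $k$ are the usual textbook symbols rather than names used in this tutorial. For any $0 < \varepsilon < 1$ and any set $X$ of $n$ points in $\mathbb{R}^d$, there is a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2} \log n)$ such that, for all $u, v \in X$:

$$
(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert f(u) - f(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2
$$

That is, pairwise distances are preserved up to a factor controlled by $\varepsilon$, and the target dimension $k$ depends on the number of points, not on the original dimension $d$.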

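Reducing the dimensionality of the embeddings with random projection shortens the time needed to build and query the ANN index.

In this tutorial, we use [Gaussian random projection](https://en.wikipedia.org/wiki/Random_projection#Gaussian_random_projection) from the [Scikit-learn](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) library.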
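The idea can be sketched with plain NumPy. This is an illustration rather than the scikit-learn implementation: it draws matrix entries from N(0, 1/projected_dim), the scaling described in the scikit-learn documentation for Gaussian random projection, and checks how well one pairwise distance survives the projection. The vectors `u` and `v` are hypothetical stand-ins for embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
original_dim, projected_dim = 128, 64

# Entries drawn from N(0, 1/projected_dim), so squared lengths are
# preserved in expectation after projection.
projection = rng.normal(
    loc=0.0, scale=1.0 / np.sqrt(projected_dim),
    size=(original_dim, projected_dim))

# Two hypothetical embeddings in the original space.
u = rng.normal(size=original_dim)
v = rng.normal(size=original_dim)

original_distance = np.linalg.norm(u - v)
projected_distance = np.linalg.norm(u.dot(projection) - v.dot(projection))

# The ratio is expected to be close to 1, but not exactly 1.
ratio = projected_distance / original_distance
print(projection.shape, ratio)
```

The projected distance is only approximately equal to the original one, which is one source of the small accuracy loss accepted in exchange for a smaller, faster index.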

In [11]:
def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = gaussian_random_matrix(
      n_components=projected_dim, n_features=original_dim).T
  print("A Gaussian random weight matrix was created with shape of {}".format(random_projection_matrix.shape))
  print('Storing random projection matrix to disk...')
  with open('random_projection_matrix', 'wb') as handle:
    pickle.dump(random_projection_matrix, 
                handle, protocol=pickle.HIGHEST_PROTOCOL)
        
  return random_projection_matrix

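### Set parameters

To build the index in the original embedding space without random projection, set the `projected_dim` parameter to `None`. Note that this slows down the indexing step for high-dimensional embeddings.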

In [12]:
model_url = 'https://tfhub.dev/google/nnlm-en-dim128/2' #@param {type:"string"}
projected_dim = 64  #@param {type:"number"}

### Run the pipeline

In [13]:
import tempfile

output_dir = tempfile.mkdtemp()
original_dim = hub.load(model_url)(['']).shape[1]
random_projection_matrix = None

if projected_dim:
  random_projection_matrix = generate_random_projection_weights(
      original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'model_url': model_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args

A Gaussian random weight matrix was created with shape of (128, 64)
Storing random projection matrix to disk...
Pipeline args are set.




{'job_name': 'hub2emb-221214-215229',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': '/tmpfs/tmp/tmp8mwf4nm1',
 'model_url': 'https://tfhub.dev/google/nnlm-en-dim128/2',
 'random_projection_matrix': array([[ 0.04652396,  0.08763902,  0.23433171, ..., -0.27297541,
          0.03003381,  0.03201392],
        [ 0.03699847,  0.13023345,  0.21451349, ..., -0.00677001,
         -0.14845663,  0.04364772],
        [-0.21373535, -0.05965295,  0.08734345, ...,  0.1977847 ,
         -0.19067334, -0.07191868],
        ...,
        [ 0.16077312,  0.11088097, -0.11435093, ..., -0.27603424,
          0.01509658, -0.01286358],
        [-0.06534427,  0.31156683,  0.05260638, ...,  0.15128775,
          0.04908765,  0.06651652],
        [ 0.10027883,  0.18285818,  0.00615748, ..., -0.08596848,
          0.18062248, -0.03001226]])}

In [14]:
print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")



Running pipeline...








CPU times: user 32min 6s, sys: 35min 44s, total: 1h 7min 50s
Wall time: 2min 37s
Pipeline is done.


In [15]:
!ls {output_dir}

emb-00000-of-00001.tfrecords


Read some of the generated embeddings...

In [16]:
embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5

# Create a description of the features.
feature_description = {
    'text': tf.io.FixedLenFeature([], tf.string),
    'embedding': tf.io.FixedLenFeature([projected_dim], tf.float32)
}

def _parse_example(example):
  # Parse the input `tf.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example, feature_description)

dataset = tf.data.TFRecordDataset(embed_file)
for record in dataset.take(sample).map(_parse_example):
  print("{}: {}".format(record['text'].numpy().decode('utf-8'), record['embedding'].numpy()[:10]))


headline_text: [ 0.2082475  -0.0758548  -0.31624907 -0.03358706  0.17254627  0.01766025
  0.12841697  0.04604644  0.09919017 -0.33217654]
aba decides against community broadcasting licence: [ 0.09103882  0.15510604  0.04679758 -0.034989   -0.09605023  0.2866751
  0.14425828  0.10986     0.23642978 -0.07701185]
act fire witnesses must be aware of defamation: [-0.0675336  -0.05577297  0.3920365  -0.13575198  0.15832202 -0.08768208
  0.20531057  0.09696919  0.20545112 -0.00981951]


a g calls for infrastructure protection summit: [ 0.02449628  0.11818574  0.06907529 -0.1773398  -0.01997611  0.12673114
  0.01616809 -0.01666613  0.23377973  0.00869564]
air nz staff in aust strike for pay rise: [-0.06299734  0.21203837  0.16526641  0.09273504 -0.04289411  0.24168277
 -0.01821963  0.24795456 -0.11762056 -0.13406506]


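## 3. Build the ANN index for the embeddings

[ANNOY](https://github.com/spotify/annoy) (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for searching for points in space that are close to a given query point. It also creates large, read-only, file-based data structures that are memory-mapped. It is built by [Spotify](https://www.spotify.com) and used for music recommendations. If you are interested, you can try alternatives to ANNOY such as [NGT](https://github.com/yahoojapan/NGT) and [FAISS](https://github.com/facebookresearch/faiss).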
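To give an intuition for how tree-based ANN libraries such as ANNOY narrow down a search, here is a toy NumPy sketch of a forest of random-hyperplane trees. It is not ANNOY's actual implementation: the function names, the dictionary-based tree layout, and the parameter values are all assumptions made for illustration. Each tree recursively splits the points with a hyperplane equidistant from two randomly chosen items, and a query only inspects the leaves it falls into.

```python
import numpy as np

def build_tree(points, ids, rng, leaf_size=8):
    """Recursively split `ids` with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Pick two distinct items; the hyperplane is equidistant from both.
    a, b = points[rng.choice(ids, size=2, replace=False)]
    normal = a - b
    offset = normal.dot((a + b) / 2.0)
    side = points[ids].dot(normal) > offset
    left, right = ids[side], ids[~side]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, rng, leaf_size),
            "right": build_tree(points, right, rng, leaf_size)}

def query_tree(tree, vector):
    """Walk down one tree and return the ids stored in the leaf reached."""
    while "leaf" not in tree:
        go_left = vector.dot(tree["normal"]) > tree["offset"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

def search(forest, points, vector, k=3):
    """Collect candidates from every tree, then rank them by exact distance."""
    candidates = np.unique(
        np.concatenate([query_tree(tree, vector) for tree in forest]))
    distances = np.linalg.norm(points[candidates] - vector, axis=1)
    return candidates[np.argsort(distances)[:k]]

# Hypothetical toy data standing in for projected embeddings.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 16))
all_ids = np.arange(len(points))
forest = [build_tree(points, all_ids, rng) for _ in range(5)]

nearest = search(forest, points, points[5], k=3)
print(nearest)
```

Using more trees enlarges the candidate set, which generally improves recall at the cost of a larger index and more work per query. This mirrors the role of the `num_trees` argument in the tutorial's `build_index` function.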

In [17]:
def build_index(embedding_files_pattern, index_filename, vector_length, 
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.io.gfile.glob(embedding_files_pattern)
  num_files = len(embed_files)
  print('Found {} embedding file(s).'.format(num_files))

  item_counter = 0
  for i, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(i+1, num_files))
    dataset = tf.data.TFRecordDataset(embed_file)
    for record in dataset.map(_parse_example):
      text = record['text'].numpy().decode("utf-8")
      embedding = record['embedding'].numpy()
      mapping[item_counter] = text
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')
  
  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))
  annoy_index.unload()

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))

In [18]:
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)

rm: cannot remove 'index': No such file or directory
rm: cannot remove 'index.mapping': No such file or directory
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.6 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 9min 31s, sys: 55.7 s, total: 10min 27s
Wall time: 3min 44s


In [19]:
!ls

corpus	       random_projection_matrix
index	       raw.tsv
index.mapping  tf2_semantic_approximate_nearest_neighbors.ipynb


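## 4. Use the index for similarity matching

Now we can use the ANN index to find news headlines that are semantically close to an input query.

### Load the index and the mapping files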

In [20]:
# Pass the metric explicitly; it must match the one used in build_index.
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')


Annoy index is loaded.


Mapping file is loaded.


### Similarity matching method

In [21]:
def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items

### Extract the embedding from a given query

In [22]:
# Load the TF-Hub model
print("Loading the TF-Hub model...")
%time embed_fn = hub.load(model_url)
print("TF-Hub model is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding =  embed_fn([query])[0].numpy()
  if random_projection_matrix is not None:
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding


Loading the TF-Hub model...


CPU times: user 546 ms, sys: 329 ms, total: 875 ms
Wall time: 904 ms
TF-Hub model is loaded.
Loading random projection matrix...
random projection matrix is loaded.


In [23]:
extract_embeddings("Hello Machine Learning!")[:10]

array([ 0.035803  , -0.08583559, -0.35316336,  0.14571103, -0.02016695,
        0.07517727,  0.03061715, -0.21651856, -0.26817065, -0.05891859])

### Enter a query to find the most similar items

In [24]:
#@title { run: "auto" }
query = "confronting global challenges" #@param {type:"string"}

print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("")
print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

print("")
print("Results:")
print("=========")
for item in items:
  print(item)

Generating embedding for the query...
CPU times: user 4.96 ms, sys: 0 ns, total: 4.96 ms
Wall time: 2.27 ms

Finding relevant items in the index...
CPU times: user 475 µs, sys: 0 ns, total: 475 µs
Wall time: 488 µs

Results:
confronting global challenges
european leaders meet over global economic crisis
pacific island nations facing business challenges
conference examines challenges facing major cities
climate changing our landscape research
global warming carbon emissions emerging economies
the meat industry is facing a growing global push
global approach urged in tropics development
economic pressures on emerging markets


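## Want to learn more?

You can learn more about TensorFlow at [tensorflow.org](https://tensorflow.google.cn/) and see the TF-Hub API documentation at [tensorflow.org/hub](https://tensorflow.google.cn/hub/). You can also find the available TensorFlow Hub models at [tfhub.dev](https://tfhub.dev/), including more text embedding models and image feature vector models.

Also check out the [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/), Google's fast-paced, practical introduction to machine learning.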