{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "IZBRUaiBBEpa" }, "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-07-19T12:44:24.268611Z", "iopub.status.busy": "2024-07-19T12:44:24.268037Z", "iopub.status.idle": "2024-07-19T12:44:24.272037Z", "shell.execute_reply": "2024-07-19T12:44:24.271487Z" }, "id": "YS3NA-i6nAFC" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "7SN5USFEIIK3" }, "source": [ "# Word embeddings" ] }, { "cell_type": "markdown", "metadata": { "id": "Aojnnc7sXrab" }, "source": [ "
\n",
"
Model: \"sequential\"\n",
"
\n"
],
"text/plain": [
"\u001b[1mModel: \"sequential\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n", "┃ Layer (type) ┃ Output Shape ┃ Param # ┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n", "│ text_vectorization │ (None, 100) │ 0 │\n", "│ (TextVectorization) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ embedding (Embedding) │ (None, 100, 16) │ 160,000 │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ global_average_pooling1d │ (None, 16) │ 0 │\n", "│ (GlobalAveragePooling1D) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ dense (Dense) │ (None, 16) │ 272 │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ dense_1 (Dense) │ (None, 1) │ 17 │\n", "└─────────────────────────────────┴────────────────────────┴───────────────┘\n", "\n" ], "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n", "│ text_vectorization │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m100\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n", "│ (\u001b[38;5;33mTextVectorization\u001b[0m) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ embedding (\u001b[38;5;33mEmbedding\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m100\u001b[0m, \u001b[38;5;34m16\u001b[0m) │ \u001b[38;5;34m160,000\u001b[0m │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ global_average_pooling1d │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m16\u001b[0m) │ \u001b[38;5;34m0\u001b[0m │\n", "│ 
(\u001b[38;5;33mGlobalAveragePooling1D\u001b[0m) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ dense (\u001b[38;5;33mDense\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m16\u001b[0m) │ \u001b[38;5;34m272\u001b[0m │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ dense_1 (\u001b[38;5;33mDense\u001b[0m) │ (\u001b[38;5;45mNone\u001b[0m, \u001b[38;5;34m1\u001b[0m) │ \u001b[38;5;34m17\u001b[0m │\n", "└─────────────────────────────────┴────────────────────────┴───────────────┘\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Total params: 480,869 (1.83 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Total params: \u001b[0m\u001b[38;5;34m480,869\u001b[0m (1.83 MB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Trainable params: 160,289 (626.13 KB)\n", "\n" ], "text/plain": [ "\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m160,289\u001b[0m (626.13 KB)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Non-trainable params: 0 (0.00 B)\n", "\n" ], "text/plain": [ "\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Optimizer params: 320,580 (1.22 MB)\n", "\n" ], "text/plain": [ "\u001b[1m Optimizer params: \u001b[0m\u001b[38;5;34m320,580\u001b[0m (1.22 MB)\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model.summary()" ] }, { "cell_type": "markdown", "metadata": { "id": "hiQbOJZ2WBFY" }, "source": [ "Visualize the model metrics in TensorBoard." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Uanp2YH8RzU" }, "outputs": [], "source": [ "#docs_infra: no_execute\n", "%load_ext tensorboard\n", "%tensorboard --logdir logs" ] }, { "cell_type": "markdown", "metadata": { "id": "QvURkGVpXDOy" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "KCoA6qwqP836" }, "source": [ "## Retrieve the trained word embeddings and save them to disk\n", "\n", "Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape `(vocab_size, embedding_dimension)`." ] }, { "cell_type": "markdown", "metadata": { "id": "Zp5rv01WG2YA" }, "source": [ "Obtain the weights from the model using `get_layer()` and `get_weights()`. The `get_vocabulary()` function provides the vocabulary to build a metadata file with one token per line. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T12:45:19.105586Z", "iopub.status.busy": "2024-07-19T12:45:19.105246Z", "iopub.status.idle": "2024-07-19T12:45:19.132262Z", "shell.execute_reply": "2024-07-19T12:45:19.131674Z" }, "id": "_Uamp1YH8RzU" }, "outputs": [], "source": [ "weights = model.get_layer('embedding').get_weights()[0]\n", "vocab = vectorize_layer.get_vocabulary()" ] }, { "cell_type": "markdown", "metadata": { "id": "J8MiCA77X8B8" }, "source": [ "Write the weights to disk. 
To use the [Embedding Projector](http://projector.tensorflow.org), you will upload two files in tab-separated format: a file of vectors (containing the embeddings) and a file of metadata (containing the words)." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T12:45:19.135514Z", "iopub.status.busy": "2024-07-19T12:45:19.135277Z", "iopub.status.idle": "2024-07-19T12:45:19.241723Z", "shell.execute_reply": "2024-07-19T12:45:19.241121Z" }, "id": "VLIahl9s53XT" }, "outputs": [], "source": [ "out_v = io.open('vectors.tsv', 'w', encoding='utf-8')\n", "out_m = io.open('metadata.tsv', 'w', encoding='utf-8')\n", "\n", "for index, word in enumerate(vocab):\n", " if index == 0:\n", " continue # skip 0, it's padding.\n", " vec = weights[index]\n", " out_v.write('\\t'.join([str(x) for x in vec]) + \"\\n\")\n", " out_m.write(word + \"\\n\")\n", "out_v.close()\n", "out_m.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "JQyMZWyxYjMr" }, "source": [ "If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine (or use the file browser, *View -> Table of contents -> File browser*)."
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T12:45:19.244794Z", "iopub.status.busy": "2024-07-19T12:45:19.244570Z", "iopub.status.idle": "2024-07-19T12:45:19.247926Z", "shell.execute_reply": "2024-07-19T12:45:19.247324Z" }, "id": "lUsjQOKMIV2z" }, "outputs": [], "source": [ "try:\n", " from google.colab import files\n", " files.download('vectors.tsv')\n", " files.download('metadata.tsv')\n", "except Exception:\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "PXLfFA54Yz-o" }, "source": [ "## Visualize the embeddings\n", "\n", "To visualize the embeddings, upload them to the embedding projector.\n", "\n", "Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).\n", "\n", "* Click on \"Load data\".\n", "\n", "* Upload the two files you created above: `vectors.tsv` and `metadata.tsv`.\n", "\n", "The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for \"beautiful\". You may see neighbors like \"wonderful\".\n", "\n", "Note: Experimentally, you may be able to produce more interpretable embeddings by using a simpler model. Try deleting the `Dense(16)` layer, retraining the model, and visualizing the embeddings again.\n", "\n", "Note: Typically, a much larger dataset is needed to train more interpretable word embeddings. This tutorial uses a small IMDb dataset for the purpose of demonstration.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "wvKiEHjramNh" }, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": { "id": "BSgAZpwF5xF_" }, "source": [ "This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.\n", "\n", "* To train word embeddings using the Word2Vec algorithm, try the [Word2Vec](https://www.tensorflow.org/tutorials/text/word2vec) tutorial. 
\n", "\n", "* To learn more about advanced text processing, read the [Transformer model for language understanding](https://www.tensorflow.org/text/tutorials/transformer)." ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "word_embeddings.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 0 }