{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "X80i_girFR2o" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2022-12-14T12:28:39.659452Z", "iopub.status.busy": "2022-12-14T12:28:39.658897Z", "iopub.status.idle": "2022-12-14T12:28:39.662685Z", "shell.execute_reply": "2022-12-14T12:28:39.662094Z" }, "id": "bB8gHCR3FVC0" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "kCeYA79m1DEX" }, "source": [ "# Using side features: feature preprocessing\n", "\n", "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "TFJUp0Vdu-TG" }, "source": [ "One of the great advantages of using a deep learning framework to build recommender models is the freedom to build rich, flexible feature representations.\n", "\n", "The first step in doing so is preparing the features, as raw features will usually not be immediately usable in a model.\n", "\n", "For example:\n", "\n", "- User and item ids may be strings (titles, usernames) or large, noncontiguous integers (database IDs).\n", "- Item descriptions could be raw text.\n", "- Interaction timestamps could be raw Unix timestamps.\n", "\n", "These need to be appropriately transformed in order to be useful in building models:\n", "\n", "- User and item ids have to be translated into embedding vectors: high-dimensional numerical representations that are adjusted during training to help the model predict its objective better.\n", "- Raw text needs to be tokenized (split into smaller parts such as individual words) and translated into embeddings.\n", "- Numerical features need to be normalized so that their values lie in a small interval around 0.\n", "\n", "Fortunately, by using TensorFlow we can make such preprocessing part of our model rather than a separate preprocessing step. This is not only convenient, but also ensures that our pre-processing is exactly the same during training and during serving. This makes it safe and easy to deploy models that include even very sophisticated pre-processing.\n", "\n", "In this tutorial, we are going to focus on recommenders and the preprocessing we need to do on the [MovieLens dataset](https://grouplens.org/datasets/movielens/). If you're interested in a larger tutorial without a recommender system focus, have a look at the full [Keras preprocessing guide](https://www.tensorflow.org/guide/keras/preprocessing_layers). 
" ] }, { "cell_type": "markdown", "metadata": { "id": "dh8vCHpi52gD" }, "source": [ "## The MovieLens dataset\n", "\n", "Let's first have a look at what features we can use from the MovieLens dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:28:39.666030Z", "iopub.status.busy": "2022-12-14T12:28:39.665501Z", "iopub.status.idle": "2022-12-14T12:28:41.251580Z", "shell.execute_reply": "2022-12-14T12:28:41.250528Z" }, "id": "N3oCG2SE-dgf" }, "outputs": [], "source": [ "!pip install -q --upgrade tensorflow-datasets" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:28:41.255797Z", "iopub.status.busy": "2022-12-14T12:28:41.255541Z", "iopub.status.idle": "2022-12-14T12:28:48.052434Z", "shell.execute_reply": "2022-12-14T12:28:48.051715Z" }, "id": "BxQ_hy7xPH3N" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 12:28:42.209347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 12:28:42.209434: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory\n", "2022-12-14 12:28:42.209443: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. 
If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "{'bucketized_user_age': 45.0,\n", " 'movie_genres': array([7]),\n", " 'movie_id': b'357',\n", " 'movie_title': b\"One Flew Over the Cuckoo's Nest (1975)\",\n", " 'raw_user_age': 46.0,\n", " 'timestamp': 879024327,\n", " 'user_gender': True,\n", " 'user_id': b'138',\n", " 'user_occupation_label': 4,\n", " 'user_occupation_text': b'doctor',\n", " 'user_rating': 4.0,\n", " 'user_zip_code': b'53211'}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-12-14 12:28:48.039834: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.\n" ] } ], "source": [ "import pprint\n", "\n", "import tensorflow_datasets as tfds\n", "\n", "ratings = tfds.load(\"movielens/100k-ratings\", split=\"train\")\n", "\n", "for x in ratings.take(1).as_numpy_iterator():\n", " pprint.pprint(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "_6ypp_nVub8J" }, "source": [ "There are a couple of key features here:\n", "\n", "- Movie title is useful as a movie identifier.\n", "- User id is useful as a user identifier.\n", "- Timestamps will allow us to model the effect of time.\n", "\n", "The first two are categorical features; timestamps are a continuous feature." 
] }, { "cell_type": "markdown", "metadata": { "id": "cp2rd--gvW9w" }, "source": [ "## Turning categorical features into embeddings\n", "\n", "A [categorical feature](https://en.wikipedia.org/wiki/Categorical_variable) is a feature that does not express a continuous quantity, but rather takes on one of a set of fixed values.\n", "\n", "Most deep learning models express these features by turning them into high-dimensional vectors. During model training, the values of these vectors are adjusted to help the model predict its objective better.\n", "\n", "For example, suppose that our goal is to predict which user is going to watch which movie. To do that, we represent each user and each movie by an embedding vector. Initially, these embeddings will take on random values, but during training, we will adjust them so that embeddings of users and the movies they watch end up closer together.\n", "\n", "Taking raw categorical features and turning them into embeddings is normally a two-step process:\n", "\n", "1. Firstly, we need to translate the raw values into a range of contiguous integers, normally by building a mapping (called a \"vocabulary\") that maps raw values (\"Star Wars\") to integers (say, 15).\n", "2. Secondly, we need to take these integers and turn them into embeddings." ] }, { "cell_type": "markdown", "metadata": { "id": "aa-7so1D_9B2" }, "source": [ "### Defining the vocabulary\n", "\n", "The first step is to define a vocabulary. We can do this easily using Keras preprocessing layers." 
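, "\n", "\n", "To make the first step concrete, here is a minimal plain-Python sketch of what a vocabulary lookup does conceptually. It is not how `StringLookup` is implemented: `build_vocabulary`, `lookup` and `OOV_TOKEN` are hypothetical names introduced for illustration, and ordering the vocabulary by frequency with id 0 reserved for unknown values is an assumption of this sketch.\n", "\n", "
```python
# Conceptual sketch only: a plain-Python stand-in for a vocabulary lookup.
# `build_vocabulary` and `lookup` are hypothetical helpers, not Keras APIs.
from collections import Counter

OOV_TOKEN = '[UNK]'


def build_vocabulary(values):
    # Most frequent values get the smallest ids; id 0 is reserved for
    # out-of-vocabulary values (an assumption of this sketch).
    counts = Counter(values)
    ordered = [value for value, _ in counts.most_common()]
    return {token: index for index, token in enumerate([OOV_TOKEN] + ordered)}


def lookup(vocabulary, value):
    # Unknown values fall back to the OOV id instead of raising an error.
    return vocabulary.get(value, vocabulary[OOV_TOKEN])


titles = ['Star Wars (1977)', 'Contact (1997)', 'Star Wars (1977)']
vocabulary = build_vocabulary(titles)

star_wars_id = lookup(vocabulary, 'Star Wars (1977)')
unseen_id = lookup(vocabulary, 'A Movie We Never Saw')
print(star_wars_id, unseen_id)
```

In this sketch the most frequent title receives id 1, and any title missing from the vocabulary falls back to id 0.
"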
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:28:48.056564Z", "iopub.status.busy": "2022-12-14T12:28:48.056078Z", "iopub.status.idle": "2022-12-14T12:28:48.068062Z", "shell.execute_reply": "2022-12-14T12:28:48.067464Z" }, "id": "IkA1HOXKyaEo" }, "outputs": [], "source": [ "import numpy as np\n", "import tensorflow as tf\n", "\n", "movie_title_lookup = tf.keras.layers.StringLookup()" ] }, { "cell_type": "markdown", "metadata": { "id": "7We60Iduy2SP" }, "source": [ "The layer itself does not have a vocabulary yet, but we can build it using our data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:28:48.071422Z", "iopub.status.busy": "2022-12-14T12:28:48.070982Z", "iopub.status.idle": "2022-12-14T12:30:25.819901Z", "shell.execute_reply": "2022-12-14T12:30:25.819211Z" }, "id": "GKluOy3ly7Pg" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. 
https://github.com/tensorflow/tensorflow/issues/56089\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary: ['[UNK]', 'Star Wars (1977)', 'Contact (1997)']\n" ] } ], "source": [ "movie_title_lookup.adapt(ratings.map(lambda x: x[\"movie_title\"]))\n", "\n", "print(f\"Vocabulary: {movie_title_lookup.get_vocabulary()[:3]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "1cH2Je_KBQZy" }, "source": [ "Once we have this, we can use the layer to translate raw tokens to embedding ids:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.823946Z", "iopub.status.busy": "2022-12-14T12:30:25.823350Z", "iopub.status.idle": "2022-12-14T12:30:25.832186Z", "shell.execute_reply": "2022-12-14T12:30:25.831634Z" }, "id": "zXYpfmWDBVOq" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_title_lookup([\"Star Wars (1977)\", \"One Flew Over the Cuckoo's Nest (1975)\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "PYXiq04dzTaq" }, "source": [ "Note that the layer's vocabulary includes one (or more!) unknown (or \"out of vocabulary\", OOV) tokens. This is really handy: it means that the layer can handle categorical values that are not in the vocabulary. In practical terms, this means that the model can continue to learn about and make recommendations even for feature values that were not seen during vocabulary construction." ] }, { "cell_type": "markdown", "metadata": { "id": "qseZxzmBBJvv" }, "source": [ "### Using feature hashing\n", "\n", "In fact, the `StringLookup` layer allows us to configure multiple OOV indices. If we do that, any raw value that is not in the vocabulary will be deterministically hashed to one of the OOV indices. The more such indices we have, the less likely it is that two different raw feature values will hash to the same OOV index. 
Consequently, if we have enough such indices, the model should be able to train about as well as a model with an explicit vocabulary, without the disadvantage of having to maintain the token list." ] }, { "cell_type": "markdown", "metadata": { "id": "t0gOaMjJAC17" }, "source": [ "We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the `tf.keras.layers.Hashing` layer." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.835793Z", "iopub.status.busy": "2022-12-14T12:30:25.835346Z", "iopub.status.idle": "2022-12-14T12:30:25.839136Z", "shell.execute_reply": "2022-12-14T12:30:25.838569Z" }, "id": "1Os5gwGxzSaG" }, "outputs": [], "source": [ "# We set up a large number of bins to reduce the chance of hash collisions.\n", "num_hashing_bins = 200_000\n", "\n", "movie_title_hashing = tf.keras.layers.Hashing(\n", "    num_bins=num_hashing_bins\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "rvcVNCzNB8GE" }, "source": [ "We can do the lookup as before without the need to build vocabularies:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.842321Z", "iopub.status.busy": "2022-12-14T12:30:25.841916Z", "iopub.status.idle": "2022-12-14T12:30:25.847845Z", "shell.execute_reply": "2022-12-14T12:30:25.847271Z" }, "id": "OkEWdeflCAY6" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_title_hashing([\"Star Wars (1977)\", \"One Flew Over the Cuckoo's Nest (1975)\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "-QFinPDA0LxM" }, "source": [ "### Defining the embeddings\n", "\n", "Now that we have integer ids, we can use the [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer to turn those into embeddings.\n", "\n", "An 
embedding layer has two dimensions: the first dimension tells us how many distinct categories we can embed; the second tells us how large the vector representing each of them is.\n", "\n", "When creating the embedding layer for movie titles, we are going to set the first value to the size of our title vocabulary (or the number of hashing bins). The second is up to us: the larger it is, the higher the capacity of the model, but the slower it is to fit and serve." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.850537Z", "iopub.status.busy": "2022-12-14T12:30:25.850330Z", "iopub.status.idle": "2022-12-14T12:30:25.856258Z", "shell.execute_reply": "2022-12-14T12:30:25.855697Z" }, "id": "RUftFomv0nGO" }, "outputs": [], "source": [ "movie_title_embedding = tf.keras.layers.Embedding(\n", "    # Use the explicit vocabulary lookup; vocabulary_size() also counts the OOV token.\n", "    input_dim=movie_title_lookup.vocabulary_size(),\n", "    output_dim=32\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "8JNyTTQq1RIw" }, "source": [ "We can put the two together into a single layer which takes raw text in and yields embeddings." 
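, "\n", "\n", "Conceptually, the combined layer is a dictionary lookup followed by selecting a row from a table of vectors. The plain-Python sketch below illustrates that idea only; it is not the Keras implementation. `make_embedding_table` and `embed` are hypothetical helpers, and the toy vocabulary, the 4-dimensional vectors and the uniform initialization range are assumptions made for illustration.\n", "\n", "
```python
# Conceptual sketch only: an embedding layer viewed as a table of vectors.
# `make_embedding_table` and `embed` are hypothetical helpers, not Keras APIs.
import random


def make_embedding_table(input_dim, output_dim, seed=0):
    # One row per category id, each row a small random vector. In a real
    # model these values are trainable parameters.
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]


def embed(table, ids):
    # Embedding an id amounts to selecting the matching row.
    return [table[i] for i in ids]


vocabulary = {'[UNK]': 0, 'Star Wars (1977)': 1, 'Contact (1997)': 2}
table = make_embedding_table(input_dim=len(vocabulary), output_dim=4)

# Raw string -> integer id -> vector, with unknown titles mapped to id 0.
ids = [vocabulary.get(title, 0) for title in ['Contact (1997)', 'Unknown']]
vectors = embed(table, ids)
print(len(vectors), len(vectors[0]))
```

The first argument of the table plays the role of `input_dim` (one row per vocabulary entry) and the second the role of `output_dim` (the length of each vector).
"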
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.859395Z", "iopub.status.busy": "2022-12-14T12:30:25.859187Z", "iopub.status.idle": "2022-12-14T12:30:25.865126Z", "shell.execute_reply": "2022-12-14T12:30:25.864524Z" }, "id": "RSbQd_mn1YYe" }, "outputs": [], "source": [ "movie_title_model = tf.keras.Sequential([movie_title_lookup, movie_title_embedding])" ] }, { "cell_type": "markdown", "metadata": { "id": "4QoA9YHw1gQc" }, "source": [ "Just like that, we can directly get the embeddings for our movie titles:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.867882Z", "iopub.status.busy": "2022-12-14T12:30:25.867666Z", "iopub.status.idle": "2022-12-14T12:30:25.889261Z", "shell.execute_reply": "2022-12-14T12:30:25.888714Z" }, "id": "T-s6uPqM1fZz" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs=['Star Wars (1977)']. Consider rewriting this model with the Functional API.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor. Received: inputs=['Star Wars (1977)']. 
Consider rewriting this model with the Functional API.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_title_model([\"Star Wars (1977)\"])" ] }, { "cell_type": "markdown", "metadata": { "id": "2chJv4jTSg04" }, "source": [ "We can do the same with user embeddings:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:30:25.892479Z", "iopub.status.busy": "2022-12-14T12:30:25.891927Z", "iopub.status.idle": "2022-12-14T12:32:01.167602Z", "shell.execute_reply": "2022-12-14T12:32:01.167001Z" }, "id": "3ot3bfX8SgWT" }, "outputs": [], "source": [ "user_id_lookup = tf.keras.layers.StringLookup()\n", "user_id_lookup.adapt(ratings.map(lambda x: x[\"user_id\"]))\n", "\n", "user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32)\n", "\n", "user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])" ] }, { "cell_type": "markdown", "metadata": { "id": "abZNsN3oDf1F" }, "source": [ "## Normalizing continuous features\n", "\n", "Continuous features also need normalization. 
For example, the `timestamp` feature is far too large to be used directly in a deep model:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:32:01.171495Z", "iopub.status.busy": "2022-12-14T12:32:01.170836Z", "iopub.status.idle": "2022-12-14T12:32:01.275583Z", "shell.execute_reply": "2022-12-14T12:32:01.274895Z" }, "id": "GGcKKOyLDsEY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Timestamp: 879024327.\n", "Timestamp: 875654590.\n", "Timestamp: 882075110.\n" ] } ], "source": [ "for x in ratings.take(3).as_numpy_iterator():\n", " print(f\"Timestamp: {x['timestamp']}.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "4aL_GMuaEBy0" }, "source": [ "We need to process it before we can use it. While there are many ways in which we can do this, discretization and standardization are two common ones." ] }, { "cell_type": "markdown", "metadata": { "id": "iCe-ch7eENNR" }, "source": [ "### Standardization\n", "\n", "[Standardization](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)) rescales features to normalize their range by subtracting the feature's mean and dividing by its standard deviation. 
It is a common preprocessing transformation.\n", "\n", "This can be easily accomplished using the [`tf.keras.layers.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/Normalization) layer:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:32:01.279139Z", "iopub.status.busy": "2022-12-14T12:32:01.278862Z", "iopub.status.idle": "2022-12-14T12:32:03.001557Z", "shell.execute_reply": "2022-12-14T12:32:03.000780Z" }, "id": "WxPsx6iSLGrp" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Normalized timestamp: [-0.8429372].\n", "Normalized timestamp: [-1.4735202].\n", "Normalized timestamp: [-0.27203265].\n" ] } ], "source": [ "timestamp_normalization = tf.keras.layers.Normalization(\n", " axis=None\n", ")\n", "timestamp_normalization.adapt(ratings.map(lambda x: x[\"timestamp\"]).batch(1024))\n", "\n", "for x in ratings.take(3).as_numpy_iterator():\n", " print(f\"Normalized timestamp: {timestamp_normalization(x['timestamp'])}.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "zW1B974ZPn71" }, "source": [ "### Discretization\n", "\n", "Another common transformation is to turn a continuous feature into a number of categorical features. This makes good sense if we have reasons to suspect that a feature's effect is non-continuous.\n", "\n", "To do this, we first need to establish the boundaries of the buckets we will use for discretization. 
The easiest way is to identify the minimum and maximum value of the feature, and divide the resulting interval equally:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:32:03.005174Z", "iopub.status.busy": "2022-12-14T12:32:03.004645Z", "iopub.status.idle": "2022-12-14T12:32:36.064747Z", "shell.execute_reply": "2022-12-14T12:32:36.063988Z" }, "id": "YlJK0rYyQGEf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Buckets: [8.74724710e+08 8.74743291e+08 8.74761871e+08]\n" ] } ], "source": [ "max_timestamp = ratings.map(lambda x: x[\"timestamp\"]).reduce(\n", " tf.cast(0, tf.int64), tf.maximum).numpy().max()\n", "min_timestamp = ratings.map(lambda x: x[\"timestamp\"]).reduce(\n", " np.int64(1e9), tf.minimum).numpy().min()\n", "\n", "timestamp_buckets = np.linspace(\n", " min_timestamp, max_timestamp, num=1000)\n", "\n", "print(f\"Buckets: {timestamp_buckets[:3]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "iPS3fh5JQhkO" }, "source": [ "Given the bucket boundaries we can transform timestamps into embeddings:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:32:36.068608Z", "iopub.status.busy": "2022-12-14T12:32:36.067971Z", "iopub.status.idle": "2022-12-14T12:32:36.250735Z", "shell.execute_reply": "2022-12-14T12:32:36.250012Z" }, "id": "VCizNzPkQmwK" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Timestamp embedding: [[-0.00684877 -0.00895538 0.00957515 -0.01513529 0.0387269 0.00683295\n", " 0.02544342 -0.02451316 -0.04256866 0.01276667 -0.04428785 0.02050248\n", " -0.01246307 0.0345451 0.02073885 -0.01192726 -0.01197611 0.0368802\n", " 0.01981271 0.04876235 -0.00646602 -0.01923322 -0.00054507 0.03711143\n", " 0.04613707 0.0188375 0.04404927 -0.00687717 0.01918397 0.03958556\n", " -0.01230479 -0.02550827]].\n" ] } ], "source": [ "timestamp_embedding_model = 
tf.keras.Sequential([\n", "  tf.keras.layers.Discretization(timestamp_buckets.tolist()),\n", "  tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32)\n", "])\n", "\n", "for timestamp in ratings.take(1).map(lambda x: x[\"timestamp\"]).batch(1).as_numpy_iterator():\n", "  print(f\"Timestamp embedding: {timestamp_embedding_model(timestamp)}.\")" ] }, { "cell_type": "markdown", "metadata": { "id": "BWOg0NlGEeWh" }, "source": [ "## Processing text features\n", "\n", "We may also want to add text features to our model. Usually, things like product descriptions are free-form text, and we can hope that our model can learn to use the information they contain to make better recommendations, especially in a cold-start or long-tail scenario.\n", "\n", "While the MovieLens dataset does not give us rich textual features, we can still use movie titles. This may help us capture the fact that movies with very similar titles are likely to belong to the same series.\n", "\n", "The first transformation we need to apply to text is tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.\n", "\n", "The Keras [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer can do the first two steps for us:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:32:36.254287Z", "iopub.status.busy": "2022-12-14T12:32:36.253666Z", "iopub.status.idle": "2022-12-14T12:35:15.929596Z", "shell.execute_reply": "2022-12-14T12:35:15.928655Z" }, "id": "TdRa-_BXF7IJ" }, "outputs": [], "source": [ "title_text = tf.keras.layers.TextVectorization()\n", "title_text.adapt(ratings.map(lambda x: x[\"movie_title\"]))" ] }, { "cell_type": "markdown", "metadata": { "id": "rJkYkgMQGxHL" }, "source": [ "Let's try it out:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { 
"iopub.execute_input": "2022-12-14T12:35:15.934252Z", "iopub.status.busy": "2022-12-14T12:35:15.933696Z", "iopub.status.idle": "2022-12-14T12:35:16.108329Z", "shell.execute_reply": "2022-12-14T12:35:16.107627Z" }, "id": "YAIj7TGOHAXs" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor([[ 32 266 162 2 267 265 53]], shape=(1, 7), dtype=int64)\n" ] } ], "source": [ "for row in ratings.batch(1).map(lambda x: x[\"movie_title\"]).take(1):\n", "  print(title_text(row))" ] }, { "cell_type": "markdown", "metadata": { "id": "CsQi_QGSH0it" }, "source": [ "Each title is translated into a sequence of tokens, one for each piece we've tokenized.\n", "\n", "We can check the learned vocabulary to verify that the layer is using the correct tokenization:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:35:16.112368Z", "iopub.status.busy": "2022-12-14T12:35:16.111715Z", "iopub.status.idle": "2022-12-14T12:35:16.120517Z", "shell.execute_reply": "2022-12-14T12:35:16.119913Z" }, "id": "0gkJtiNyHzKq" }, "outputs": [ { "data": { "text/plain": [ "['first', '1998', '1977', '1971', 'monty']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title_text.get_vocabulary()[40:45]" ] }, { "cell_type": "markdown", "metadata": { "id": "V_v-HFg0ICQS" }, "source": [ "This looks correct: the layer is tokenizing titles into individual words.\n", "\n", "To finish the processing, we now need to embed the text. Because each title contains multiple words, we will get multiple embeddings for each title. For use in a downstream model, these are usually compressed into a single embedding. Models like RNNs or Transformers are useful here, but averaging all the words' embeddings together is a good starting point." 
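, "\n", "\n", "The averaging step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than what the Keras layers do internally: `tokenize` and `masked_average` are hypothetical helpers, the toy vocabulary and random vectors are made up, and the tokenizer here only lowercases and splits on whitespace. Skipping the padding id mirrors the role of `mask_zero=True` combined with `GlobalAveragePooling1D` in the movie model later in this tutorial.\n", "\n", "
```python
# Conceptual sketch only: averaging word embeddings while ignoring padding.
# `tokenize` and `masked_average` are hypothetical helpers, not Keras APIs.
import random

PAD_ID = 0


def tokenize(title, vocabulary, length):
    # Lowercase, split on whitespace, map words to ids (1 = unknown word),
    # then pad with PAD_ID so that every title has the same length.
    ids = [vocabulary.get(word, 1) for word in title.lower().split()]
    return (ids + [PAD_ID] * length)[:length]


def masked_average(table, ids):
    # Average only the vectors of real tokens so padding does not dilute it.
    rows = [table[i] for i in ids if i != PAD_ID]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(column) / len(rows) for column in zip(*rows)]


vocabulary = {'': 0, '[UNK]': 1, 'star': 2, 'wars': 3, 'trek': 4}
rng = random.Random(0)
table = [[rng.uniform(-0.05, 0.05) for _ in range(4)] for _ in vocabulary]

ids = tokenize('Star Wars', vocabulary, length=5)
title_vector = masked_average(table, ids)
print(ids, len(title_vector))
```

However many words a title has, the result is a single vector with the same length as one word embedding.
"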
] }, { "cell_type": "markdown", "metadata": { "id": "RomTZJ6N-z3Y" }, "source": [ "## Putting it all together\n", "\n", "With these components in place, we can build a model that does all the preprocessing together." ] }, { "cell_type": "markdown", "metadata": { "id": "HMukupD2ggQh" }, "source": [ "### User model\n", "\n", "The full user model may look like the following:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:35:16.124094Z", "iopub.status.busy": "2022-12-14T12:35:16.123649Z", "iopub.status.idle": "2022-12-14T12:35:16.129122Z", "shell.execute_reply": "2022-12-14T12:35:16.128555Z" }, "id": "TL_eYNyD-80t" }, "outputs": [], "source": [ "class UserModel(tf.keras.Model):\n", "\n", "  def __init__(self):\n", "    super().__init__()\n", "\n", "    self.user_embedding = tf.keras.Sequential([\n", "        user_id_lookup,\n", "        tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32),\n", "    ])\n", "    self.timestamp_embedding = tf.keras.Sequential([\n", "        tf.keras.layers.Discretization(timestamp_buckets.tolist()),\n", "        # N bucket boundaries produce N + 1 buckets.\n", "        tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32)\n", "    ])\n", "    self.normalized_timestamp = tf.keras.layers.Normalization(\n", "        axis=None\n", "    )\n", "\n", "  def call(self, inputs):\n", "\n", "    # Take the input dictionary, pass it through each input layer,\n", "    # and concatenate the result.\n", "    return tf.concat([\n", "        self.user_embedding(inputs[\"user_id\"]),\n", "        self.timestamp_embedding(inputs[\"timestamp\"]),\n", "        tf.reshape(self.normalized_timestamp(inputs[\"timestamp\"]), (-1, 1))\n", "    ], axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "6brsz6mnDZV2" }, "source": [ "Let's try it out:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:35:16.132299Z", "iopub.status.busy": "2022-12-14T12:35:16.131868Z", "iopub.status.idle": "2022-12-14T12:35:17.938711Z", "shell.execute_reply": 
"2022-12-14T12:35:17.937975Z" }, "id": "LJlCFMgTDdC4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computed representations: [-0.02608593 0.0144566 0.01150807]\n" ] } ], "source": [ "user_model = UserModel()\n", "\n", "user_model.normalized_timestamp.adapt(\n", "    ratings.map(lambda x: x[\"timestamp\"]).batch(128))\n", "\n", "for row in ratings.batch(1).take(1):\n", "  print(f\"Computed representations: {user_model(row)[0, :3]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "F_-_kmurEN4E" }, "source": [ "### Movie model\n", "\n", "We can do the same for the movie model:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:35:17.942451Z", "iopub.status.busy": "2022-12-14T12:35:17.941822Z", "iopub.status.idle": "2022-12-14T12:35:17.947189Z", "shell.execute_reply": "2022-12-14T12:35:17.946561Z" }, "id": "n5k7fGmcETPl" }, "outputs": [], "source": [ "class MovieModel(tf.keras.Model):\n", "\n", "  def __init__(self):\n", "    super().__init__()\n", "\n", "    max_tokens = 10_000\n", "\n", "    self.title_embedding = tf.keras.Sequential([\n", "        movie_title_lookup,\n", "        tf.keras.layers.Embedding(movie_title_lookup.vocabulary_size(), 32)\n", "    ])\n", "    self.title_text_embedding = tf.keras.Sequential([\n", "        tf.keras.layers.TextVectorization(max_tokens=max_tokens),\n", "        tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),\n", "        # We average the embeddings of individual words to get one embedding\n", "        # vector per title.\n", "        tf.keras.layers.GlobalAveragePooling1D(),\n", "    ])\n", "\n", "  def call(self, inputs):\n", "    return tf.concat([\n", "        self.title_embedding(inputs[\"movie_title\"]),\n", "        self.title_text_embedding(inputs[\"movie_title\"]),\n", "    ], axis=1)" ] }, 
{ "cell_type": "markdown", "metadata": { "id": "QzXelC5kJbsQ" }, "source": [ "Let's try it out:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2022-12-14T12:35:17.950310Z", "iopub.status.busy": "2022-12-14T12:35:17.949846Z", "iopub.status.idle": "2022-12-14T12:37:56.827713Z", "shell.execute_reply": "2022-12-14T12:37:56.827001Z" }, "id": "Tq3BWpzhJapY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Computed representations: [ 0.00771372 -0.03204167 -0.02703585]\n" ] } ], "source": [ "movie_model = MovieModel()\n", "\n", "movie_model.title_text_embedding.layers[0].adapt(\n", "    ratings.map(lambda x: x[\"movie_title\"]))\n", "\n", "for row in ratings.batch(1).take(1):\n", "  print(f\"Computed representations: {movie_model(row)[0, :3]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "-2dK71mPKoTw" }, "source": [ "## Next steps\n", "\n", "With the two models above, we've taken the first steps toward representing rich features in a recommender model. To take this further and explore how these can be used to build an effective deep recommender model, take a look at our Deep Recommenders tutorial." 
] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "featurization.ipynb", "private_outputs": true, "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 0 }