{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "l-23gBrt4x2B" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-01-11T18:05:45.764481Z", "iopub.status.busy": "2024-01-11T18:05:45.763856Z", "iopub.status.idle": "2024-01-11T18:05:45.767505Z", "shell.execute_reply": "2024-01-11T18:05:45.766949Z" }, "id": "HMUDt0CiUJk9" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "77z2OchJTk0l" }, "source": [ "# `tf.feature_column` を Keras 前処理レイヤーに移行する\n", "\n", "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "-5jGPDA2PDPI" }, "source": [ "通常、モデルのトレーニングには、特に構造化データを扱う場合に、特徴量の前処理が必要となることがあります。TensorFlow 1 で `tf.estimator.Estimator` をトレーニングする場合、通常、`tf.feature_column` API を使用して特徴量の前処理を実行します。TensorFlow 2 では、Keras 前処理レイヤーで直接実行できます。\n", "\n", "この移行ガイドでは、特徴量カラムと前処理レイヤーの両方を使用した一般的な特徴量変換を紹介し、両方の API を使用して完全なモデルをトレーニングします。\n", "\n", "まず、必要なものをインポートします。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:45.771011Z", "iopub.status.busy": "2024-01-11T18:05:45.770443Z", "iopub.status.idle": "2024-01-11T18:05:48.121107Z", "shell.execute_reply": "2024-01-11T18:05:48.120251Z" }, "id": "iE0vSfMXumKI" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-01-11 18:05:46.202664: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-01-11 18:05:46.202711: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-01-11 18:05:46.204283: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import tensorflow as tf\n", "import tensorflow.compat.v1 as tf1\n", "import math" ] }, { "cell_type": "markdown", "metadata": { "id": "NVPYTQAWtDwH" }, "source": [ "次に、デモのために特徴量カラムを呼び出すためのユーティリティ関数を追加します。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:48.125470Z", "iopub.status.busy": "2024-01-11T18:05:48.125073Z", "iopub.status.idle": "2024-01-11T18:05:48.129007Z", "shell.execute_reply": "2024-01-11T18:05:48.128401Z" }, "id": "LAaifuuytJjM" }, "outputs": [], "source": [ 
"def call_feature_columns(feature_columns, inputs):\n", " # This is a convenient way to call a `feature_column` outside of an estimator\n", " # to display its output.\n", " feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)\n", " return feature_layer(inputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZJnw07hYDGYt" }, "source": [ "## 入力処理\n", "\n", "Estimator で特徴量カラムを使用するには、モデル入力は常にテンソルのディクショナリであることが期待されます。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:48.132393Z", "iopub.status.busy": "2024-01-11T18:05:48.131949Z", "iopub.status.idle": "2024-01-11T18:05:50.374346Z", "shell.execute_reply": "2024-01-11T18:05:50.373315Z" }, "id": "y0WUpQxsKEzf" }, "outputs": [], "source": [ "input_dict = {\n", " 'foo': tf.constant([1]),\n", " 'bar': tf.constant([0]),\n", " 'baz': tf.constant([-1])\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "xYsC6H_BJ8l3" }, "source": [ "各特徴量カラムは、ソースデータにインデックスを付けるためのキーを使用して作成する必要があります。すべての特徴量カラムの出力は連結され、Estimator モデルによって使用されます。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.377998Z", "iopub.status.busy": "2024-01-11T18:05:50.377732Z", "iopub.status.idle": "2024-01-11T18:05:50.424070Z", "shell.execute_reply": "2024-01-11T18:05:50.423431Z" }, "id": "3fvIe3V8Ffjt" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/3124623333.py:2: numeric_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. 
Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = [\n", " tf1.feature_column.numeric_column('foo'),\n", " tf1.feature_column.numeric_column('bar'),\n", " tf1.feature_column.numeric_column('baz'),\n", "]\n", "call_feature_columns(columns, input_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "hvPfCK2XGTyl" }, "source": [ "In Keras, model input is more flexible. A `tf.keras.Model` can handle a single tensor input, a list of tensor features, or a dictionary of tensor features. You can handle dictionary input by passing a dictionary of `tf.keras.Input` objects when creating the model. Inputs are not concatenated automatically, which lets you use them in more flexible ways. When you do need to combine them, use `tf.keras.layers.Concatenate`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.427278Z", "iopub.status.busy": "2024-01-11T18:05:50.427031Z", "iopub.status.idle": "2024-01-11T18:05:50.461759Z", "shell.execute_reply": "2024-01-11T18:05:50.461057Z" }, "id": "5sYWENkgLWJ2" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = {\n", " 'foo': tf.keras.Input(shape=()),\n", " 'bar': tf.keras.Input(shape=()),\n", " 'baz': tf.keras.Input(shape=()),\n", "}\n", "# Inputs are typically transformed by preprocessing layers before concatenation.\n", "outputs = tf.keras.layers.Concatenate()(inputs.values())\n", "model = tf.keras.Model(inputs=inputs, outputs=outputs)\n", "model(input_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "GXkmiuwXTS-B" }, "source": [ "## One-hot encoding integer IDs\n", "\n", "A common feature transformation is one-hot encoding integer inputs of a known range. Here is an example using feature columns." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.465346Z", "iopub.status.busy": "2024-01-11T18:05:50.464673Z", 
"iopub.status.idle": "2024-01-11T18:05:50.507583Z", "shell.execute_reply": "2024-01-11T18:05:50.506926Z" }, "id": "XasXzOgatgRF" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/1369923821.py:1: categorical_column_with_identity (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/1369923821.py:3: indicator_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. 
Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", " 'type', num_buckets=3)\n", "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n", "call_feature_columns(indicator_col, {'type': [0, 1, 2]})" ] }, { "cell_type": "markdown", "metadata": { "id": "iSCkJEQ6U-ru" }, "source": [ "With Keras preprocessing layers, these columns can be replaced by a single `tf.keras.layers.CategoryEncoding` layer with `output_mode` set to `'one_hot'`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.511027Z", "iopub.status.busy": "2024-01-11T18:05:50.510435Z", "iopub.status.idle": "2024-01-11T18:05:50.868218Z", "shell.execute_reply": "2024-01-11T18:05:50.867564Z" }, "id": "799lbMNNuAVz" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot_layer = tf.keras.layers.CategoryEncoding(\n", " num_tokens=3, output_mode='one_hot')\n", "one_hot_layer([0, 1, 2])" ] }, { "cell_type": "markdown", "metadata": { "id": "kNzRtESU7tga" }, "source": [ "Note: For large one-hot encodings, it is much more efficient to use a sparse representation of the output. If you pass `sparse=True` to the `CategoryEncoding` layer, the output of the layer is a `tf.sparse.SparseTensor`, which can be handled efficiently as input to a `tf.keras.layers.Dense` layer." ] }, { "cell_type": "markdown", "metadata": { "id": "Zf7kjhTiAErK" }, "source": [ "## Normalizing numeric features\n", "\n", "When handling continuous, floating-point features with feature columns, you need to use a `tf.feature_column.numeric_column`. If the input is already normalized, converting this to Keras is trivial: feed a `tf.keras.Input` directly into your model, as shown above.\n", "\n", "A `numeric_column` can also be used to normalize the input." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.871831Z", 
"iopub.status.busy": "2024-01-11T18:05:50.871561Z", "iopub.status.idle": "2024-01-11T18:05:50.883812Z", "shell.execute_reply": "2024-01-11T18:05:50.883188Z" }, "id": "HbTMGB9XctGx" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def normalize(x):\n", " mean, variance = (2.0, 1.0)\n", " return (x - mean) / math.sqrt(variance)\n", "numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)\n", "call_feature_columns(numeric_col, {'col': tf.constant([[0.], [1.], [2.]])})" ] }, { "cell_type": "markdown", "metadata": { "id": "M9cyhPR_drOz" }, "source": [ "対照的に、Keras では、この正規化は `tf.keras.layers.Normalization` で実行できます。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:50.887116Z", "iopub.status.busy": "2024-01-11T18:05:50.886886Z", "iopub.status.idle": "2024-01-11T18:05:51.307106Z", "shell.execute_reply": "2024-01-11T18:05:51.306357Z" }, "id": "8bcgG-yOdqUH" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "normalization_layer = tf.keras.layers.Normalization(mean=2.0, variance=1.0)\n", "normalization_layer(tf.constant([[0.], [1.], [2.]]))" ] }, { "cell_type": "markdown", "metadata": { "id": "d1InD_4QLKU-" }, "source": [ "## 数値特徴量のバケット化と One-hot エンコーディング" ] }, { "cell_type": "markdown", "metadata": { "id": "k5e0b8iOLRzd" }, "source": [ "連続する浮動小数点の入力を変換するもう 1 つの一般的な方法は、固定範囲の整数にバケット化することです。\n", "\n", "特徴量カラムでは、`tf.feature_column.bucketized_column` を使用します。" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:51.310854Z", "iopub.status.busy": "2024-01-11T18:05:51.310592Z", "iopub.status.idle": "2024-01-11T18:05:51.323208Z", "shell.execute_reply": "2024-01-11T18:05:51.322578Z" }, "id": "_rbx6qQ-LQx7" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/3043215186.py:2: bucketized_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numeric_col = tf1.feature_column.numeric_column('col')\n", "bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1, 4, 5])\n", "call_feature_columns(bucketized_col, {'col': tf.constant([1., 2., 3., 4., 5.])})\n" ] }, { "cell_type": "markdown", "metadata": { "id": "PCYu-XtwXahx" }, "source": [ "Keras では、これを `tf.keras.layers.Discretization` に置き換えます。" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:51.327033Z", "iopub.status.busy": "2024-01-11T18:05:51.326397Z", "iopub.status.idle": "2024-01-11T18:05:51.991159Z", "shell.execute_reply": "2024-01-11T18:05:51.990478Z" }, "id": "QK1WOG2uVVsL" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1, 4, 5])\n", "one_hot_layer = tf.keras.layers.CategoryEncoding(\n", " num_tokens=4, output_mode='one_hot')\n", "one_hot_layer(discretization_layer([1., 2., 3., 4., 5.]))" ] }, { "cell_type": "markdown", "metadata": { "id": "5bm9tJZAgpt4" }, "source": [ "## 語彙を使用した文字列データの One-hot エンコーディング\n", "\n", "文字列の特徴量を処理するには、多くの場合、文字列をインデックスに変換するために語彙の検索が必要です。特徴量カラムを使用して文字列を検索し、インデックスを One-hot エンコードする例を次に示します。" ] }, { "cell_type": "code", "execution_count": 13, 
"metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:51.994654Z", "iopub.status.busy": "2024-01-11T18:05:51.994375Z", "iopub.status.idle": "2024-01-11T18:05:52.018917Z", "shell.execute_reply": "2024-01-11T18:05:52.018294Z" }, "id": "3fG_igjhukCO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/2845961037.py:1: categorical_column_with_vocabulary_list (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n", " 'sizes',\n", " vocabulary_list=['small', 'medium', 'large'],\n", " num_oov_buckets=0)\n", "indicator_col = tf1.feature_column.indicator_column(vocab_col)\n", "call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})" ] }, { "cell_type": "markdown", "metadata": { "id": "8rBgllRtY738" }, "source": [ "Keras 前処理レイヤーを使用して、`output_mode` を `'one_hot'` に設定して `tf.keras.layers.StringLookup` レイヤーを使用します。" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.022060Z", "iopub.status.busy": "2024-01-11T18:05:52.021834Z", "iopub.status.idle": "2024-01-11T18:05:52.040228Z", "shell.execute_reply": "2024-01-11T18:05:52.039571Z" }, "id": "arnPlSrWvDMe" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_lookup_layer = tf.keras.layers.StringLookup(\n", " vocabulary=['small', 
'medium', 'large'],\n", " num_oov_indices=0,\n", " output_mode='one_hot')\n", "string_lookup_layer(['small', 'medium', 'large'])" ] }, { "cell_type": "markdown", "metadata": { "id": "f76MVVYO8LB5" }, "source": [ "注意: 大規模な One-hot エンコーディングの場合、出力のスパース表現を使用する方がはるかに効率的です。`sparse=True` を `StringLookup` レイヤーに渡すと、レイヤーの出力は `tf.sparse.SparseTensor` になり、効率的に `tf.keras.layers.Dense` レイヤーへの入力として処理されます。" ] }, { "cell_type": "markdown", "metadata": { "id": "c1CmfSXQZHE5" }, "source": [ "## 語彙を使用した文字列データの埋め込み\n", "\n", "より大きな語彙の場合、パフォーマンスを向上させるために埋め込みが必要になることがよくあります。特徴量カラムを使用して文字列特徴量を埋め込む例を次に示します。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.043576Z", "iopub.status.busy": "2024-01-11T18:05:52.043113Z", "iopub.status.idle": "2024-01-11T18:05:52.236862Z", "shell.execute_reply": "2024-01-11T18:05:52.236198Z" }, "id": "C3RK4HFazxlU" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/999372599.py:5: embedding_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. 
Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n", " 'col',\n", " vocabulary_list=['small', 'medium', 'large'],\n", " num_oov_buckets=0)\n", "embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)\n", "call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})" ] }, { "cell_type": "markdown", "metadata": { "id": "3aTRVJ6qZZH0" }, "source": [ "With Keras preprocessing layers, this can be achieved by combining a `tf.keras.layers.StringLookup` layer with a `tf.keras.layers.Embedding` layer. The default output of `StringLookup` is integer indices, which can be fed directly into an embedding.\n", "\n", "Note: The `Embedding` layer contains trainable parameters. While the `StringLookup` layer can be applied to data inside or outside of a model, the `Embedding` layer must always be part of a trainable Keras model to train correctly." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.240365Z", "iopub.status.busy": "2024-01-11T18:05:52.240103Z", "iopub.status.idle": "2024-01-11T18:05:52.260883Z", "shell.execute_reply": "2024-01-11T18:05:52.260208Z" }, "id": "8resGZPo0Fho" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string_lookup_layer = tf.keras.layers.StringLookup(\n", " vocabulary=['small', 'medium', 'large'], num_oov_indices=0)\n", "embedding = tf.keras.layers.Embedding(3, 4)\n", "embedding(string_lookup_layer(['small', 'medium', 'large']))" ] }, { "cell_type": "markdown", "metadata": { "id": "UwqvADV6HRdC" }, "source": [ "## Summing weighted categorical data\n", "\n", "In some cases, you need to handle categorical data where each occurrence of a category comes with an associated weight. With feature columns, this is handled by `tf.feature_column.weighted_categorical_column`. When paired with an `indicator_column`, this has the effect of summing the weights per category." ] }, { "cell_type": "code", 
"execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.264369Z", "iopub.status.busy": "2024-01-11T18:05:52.263753Z", "iopub.status.idle": "2024-01-11T18:05:52.328897Z", "shell.execute_reply": "2024-01-11T18:05:52.328285Z" }, "id": "02HqjPLMRxWn" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/3529191023.py:6: weighted_categorical_column (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4033: sparse_merge (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "No similar op available at this time.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = tf.constant([[5, 11, 5, 17, 17]])\n", "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", "\n", "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", " 'ids', num_buckets=20)\n", "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", " categorical_col, 'weights')\n", "indicator_col = tf1.feature_column.indicator_column(weighted_categorical_col)\n", "call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})" ] }, { "cell_type": "markdown", "metadata": { "id": "98jaq7Q3S9aG" }, "source": [ "Keras では、これは `output_mode='count'` で `count_weights` 入力を 
`tf.keras.layers.CategoryEncoding` に渡すことで実行できます。" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.332223Z", "iopub.status.busy": "2024-01-11T18:05:52.331834Z", "iopub.status.idle": "2024-01-11T18:05:52.349397Z", "shell.execute_reply": "2024-01-11T18:05:52.348742Z" }, "id": "JsoYUUgRS7hu" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = tf.constant([[5, 11, 5, 17, 17]])\n", "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", "\n", "# Using sparse output is more efficient when `num_tokens` is large.\n", "count_layer = tf.keras.layers.CategoryEncoding(\n", " num_tokens=20, output_mode='count', sparse=True)\n", "tf.sparse.to_dense(count_layer(ids, count_weights=weights))" ] }, { "cell_type": "markdown", "metadata": { "id": "gBJxb6y2GasI" }, "source": [ "## 重み付きカテゴリカルデータの埋め込み\n", "\n", "または、重み付きカテゴリカル入力を埋め込みたい場合もあります。特徴量カラムでは、`embedding_column` に `combiner` 引数が含まれています。サンプルにカテゴリの複数のエントリが含まれている場合、それらは引数の設定(デフォルトでは `'mean'`)に従って結合されます。" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.352589Z", "iopub.status.busy": "2024-01-11T18:05:52.352354Z", "iopub.status.idle": "2024-01-11T18:05:52.424190Z", "shell.execute_reply": "2024-01-11T18:05:52.423415Z" }, "id": "AjOt1wgmT5mM" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = tf.constant([[5, 11, 5, 17, 17]])\n", "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", "\n", "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", " 'ids', num_buckets=20)\n", "weighted_categorical_col = tf1.feature_column.weighted_categorical_column(\n", " categorical_col, 'weights')\n", "embedding_col = tf1.feature_column.embedding_column(\n", " weighted_categorical_col, 4, 
combiner='mean')\n", "call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})" ] }, { "cell_type": "markdown", "metadata": { "id": "fd6eluARXndC" }, "source": [ "Keras では、`tf.keras.layers.Embedding` に対する `combiner` オプションはありませんが、`tf.keras.layers.Dense` で同じ効果を実現できます。上記の `embedding_column` は、カテゴリの重みに従って埋め込みベクトルを単純に線形結合しています。一見明らかではありませんが、カテゴリカル入力をサイズ `(num_tokens)` の疎な重みベクトルとして表し、形状 `(embedding_size, num_tokens)` の `Dense` カーネルを掛けるのとまったく同じです。" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.427560Z", "iopub.status.busy": "2024-01-11T18:05:52.427326Z", "iopub.status.idle": "2024-01-11T18:05:52.447926Z", "shell.execute_reply": "2024-01-11T18:05:52.447351Z" }, "id": "Y-vZvPyiYilE" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = tf.constant([[5, 11, 5, 17, 17]])\n", "weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])\n", "\n", "# For `combiner='mean'`, normalize your weights to sum to 1. 
Removing this line\n", "# would be equivalent to an `embedding_column` with `combiner='sum'`.\n", "weights = weights / tf.reduce_sum(weights, axis=-1, keepdims=True)\n", "\n", "count_layer = tf.keras.layers.CategoryEncoding(\n", " num_tokens=20, output_mode='count', sparse=True)\n", "embedding_layer = tf.keras.layers.Dense(4, use_bias=False)\n", "embedding_layer(count_layer(ids, count_weights=weights))" ] }, { "cell_type": "markdown", "metadata": { "id": "3I5loEx80MVm" }, "source": [ "## 完全なトレーニングサンプル\n", "\n", "完全なトレーニングワークフローでは、まず、異なる型の 3 つの特徴量を含むいくつかのデータを準備します。" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.450941Z", "iopub.status.busy": "2024-01-11T18:05:52.450721Z", "iopub.status.idle": "2024-01-11T18:05:52.454604Z", "shell.execute_reply": "2024-01-11T18:05:52.453887Z" }, "id": "D_7nyBee0ZBV" }, "outputs": [], "source": [ "features = {\n", " 'type': [0, 1, 1],\n", " 'size': ['small', 'small', 'medium'],\n", " 'weight': [2.7, 1.8, 1.6],\n", "}\n", "labels = [1, 1, 0]\n", "predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}" ] }, { "cell_type": "markdown", "metadata": { "id": "e_4Xx2c37lqD" }, "source": [ "TensorFlow 1 と TensorFlow 2 の両方のワークフローに共通する定数をいくつか定義します。" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:52.457593Z", "iopub.status.busy": "2024-01-11T18:05:52.457346Z", "iopub.status.idle": "2024-01-11T18:05:52.460640Z", "shell.execute_reply": "2024-01-11T18:05:52.460023Z" }, "id": "3cyfQZ7z8jZh" }, "outputs": [], "source": [ "vocab = ['small', 'medium', 'large']\n", "one_hot_dims = 3\n", "embedding_dims = 4\n", "weight_mean = 2.0\n", "weight_variance = 1.0" ] }, { "cell_type": "markdown", "metadata": { "id": "ywCgU7CMIfTH" }, "source": [ "### 特徴量カラムを使用する\n", "\n", "特徴量カラムは、作成時に Estimator にリストとして渡す必要があり、トレーニング中に暗黙的に呼び出されます。" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { 
"execution": { "iopub.execute_input": "2024-01-11T18:05:52.463569Z", "iopub.status.busy": "2024-01-11T18:05:52.463349Z", "iopub.status.idle": "2024-01-11T18:05:54.999273Z", "shell.execute_reply": "2024-01-11T18:05:54.998487Z" }, "id": "Wsdhlm-uipr1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/tmp/ipykernel_36989/1892339471.py:17: DNNClassifier.__init__ (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:807: Estimator.__init__ (from tensorflow_estimator.python.estimator.estimator) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1844: RunConfig.__init__ (from tensorflow_estimator.python.estimator.run_config) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using default config.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Using temporary folder as model directory: /tmpfs/tmp/tmpf3rx3u6r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using config: {'_model_dir': '/tmpfs/tmp/tmpf3rx3u6r', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true\n", "graph_options {\n", " rewrite_options {\n", " meta_optimizer_iterations: ONE\n", " }\n", 
"}\n", ", '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/dnn.py:446: dnn_logit_fn_builder (from tensorflow_estimator.python.estimator.canned.dnn) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/adagrad.py:138: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Call initializer instance with the dtype argument instead of passing it to the constructor\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/model_fn.py:250: EstimatorSpec.__new__ (from tensorflow_estimator.python.estimator.model_fn) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" 
] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1416: NanTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1419: LoggingTensorHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/basic_session_run_hooks.py:232: SecondOrStepTimer.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py:1456: CheckpointSaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Create CheckpointSaverHook.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:579: StepCounterHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a 
future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:586: SummarySaverHook.__init__ (from tensorflow.python.training.basic_session_run_hooks) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Graph was finalized.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done running local_init_op.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2024-01-11 18:05:53.152605: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. 
Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:\n", "type_id: TFT_OPTIONAL\n", "args {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_INT64\n", " }\n", " }\n", "}\n", " is neither a subtype nor a supertype of the combined inputs preceding it:\n", "type_id: TFT_OPTIONAL\n", "args {\n", " type_id: TFT_PRODUCT\n", " args {\n", " type_id: TFT_TENSOR\n", " args {\n", " type_id: TFT_INT32\n", " }\n", " }\n", "}\n", "\n", "\tfor Tuple type infernce function 0\n", "\twhile inferring type of node 'dnn/zero_fraction/cond/output/_18'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saving checkpoints for 0 into /tmpfs/tmp/tmpf3rx3u6r/model.ckpt.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1455: SessionRunArgs.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1454: SessionRunContext.__init__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py:1474: 
SessionRunValues.__new__ (from tensorflow.python.training.session_run_hook) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:loss = 1.0011392, step = 0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 3...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Saving checkpoints for 3 into /tmpfs/tmp/tmpf3rx3u6r/model.ckpt.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 3...\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Loss for final step: 0.73061395.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical_col = tf1.feature_column.categorical_column_with_identity(\n", " 'type', num_buckets=one_hot_dims)\n", "# Convert index to one-hot; e.g. [2] -> [0,0,1].\n", "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n", "\n", "# Convert strings to indices; e.g. ['small'] -> [1].\n", "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n", " 'size', vocabulary_list=vocab, num_oov_buckets=1)\n", "# Embed the indices.\n", "embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dims)\n", "\n", "normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)\n", "# Normalize the numeric inputs; e.g. 
[2.0] -> [0.0].\n", "numeric_col = tf1.feature_column.numeric_column(\n", " 'weight', normalizer_fn=normalizer_fn)\n", "\n", "estimator = tf1.estimator.DNNClassifier(\n", " feature_columns=[indicator_col, embedding_col, numeric_col],\n", " hidden_units=[1])\n", "\n", "def _input_fn():\n", " return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n", "\n", "estimator.train(_input_fn)" ] }, { "cell_type": "markdown", "metadata": { "id": "qPIeG_YtfNV1" }, "source": [ "The feature columns are also used to transform input data when running inference on the model." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:55.003488Z", "iopub.status.busy": "2024-01-11T18:05:55.003220Z", "iopub.status.idle": "2024-01-11T18:05:56.306394Z", "shell.execute_reply": "2024-01-11T18:05:56.305589Z" }, "id": "K-AIIB8CfSqt" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:596: ClassificationOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1307: RegressionOutput.__init__ (from tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/canned/head.py:1309: PredictOutput.__init__ (from 
tensorflow.python.saved_model.model_utils.export_output) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.keras instead.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done calling model_fn.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Graph was finalized.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Restoring parameters from /tmpfs/tmp/tmpf3rx3u6r/model.ckpt-3\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Running local_init_op.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Done running local_init_op.\n" ] }, { "data": { "text/plain": [ "{'logits': array([-2.0540094], dtype=float32),\n", " 'logistic': array([0.11364788], dtype=float32),\n", " 'probabilities': array([0.88635206, 0.11364787], dtype=float32),\n", " 'class_ids': array([0]),\n", " 'classes': array([b'0'], dtype=object),\n", " 'all_class_ids': array([0, 1], dtype=int32),\n", " 'all_classes': array([b'0', b'1'], dtype=object)}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def _predict_fn():\n", "    return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)\n", "\n", "next(estimator.predict(_predict_fn))" ] }, { "cell_type": "markdown", "metadata": { "id": "baMA01cBIivo" }, "source": [ "### Using Keras preprocessing layers\n", "\n", "Keras preprocessing layers are more flexible in where they can be called. A layer can be applied directly to tensors, used inside a `tf.data` input pipeline, or built directly into a trainable Keras model.\n", "\n", "This example applies the preprocessing layers inside a `tf.data` input pipeline. To do this, define a separate `tf.keras.Model` that preprocesses the input features. This model is not trainable, but it is a convenient way to group the preprocessing layers." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:56.310258Z", "iopub.status.busy": "2024-01-11T18:05:56.309979Z", "iopub.status.idle": "2024-01-11T18:05:56.354830Z", "shell.execute_reply": 
"2024-01-11T18:05:56.354156Z" }, "id": "NMz8RfMQdCZf" }, "outputs": [], "source": [ "inputs = {\n", "    'type': tf.keras.Input(shape=(), dtype='int64'),\n", "    'size': tf.keras.Input(shape=(), dtype='string'),\n", "    'weight': tf.keras.Input(shape=(), dtype='float32'),\n", "}\n", "# Convert index to one-hot; e.g. [2] -> [0,0,1].\n", "type_output = tf.keras.layers.CategoryEncoding(\n", "    one_hot_dims, output_mode='one_hot')(inputs['type'])\n", "# Convert size strings to indices; e.g. ['small'] -> [1].\n", "size_output = tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size'])\n", "# Normalize the numeric inputs; e.g. [2.0] -> [0.0].\n", "weight_output = tf.keras.layers.Normalization(\n", "    axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight'])\n", "outputs = {\n", "    'type': type_output,\n", "    'size': size_output,\n", "    'weight': weight_output,\n", "}\n", "preprocessing_model = tf.keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "NRfISnj3NGlW" }, "source": [ "Note: As an alternative to supplying a vocabulary and normalization statistics when creating a layer, many preprocessing layers provide an `adapt()` method that learns the layer state directly from the input data. See the [preprocessing guide](https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method) for details.\n", "\n", "You can now apply this model inside a call to `tf.data.Dataset.map`. Note that the function passed to `map` is automatically converted into a `tf.function`, so the usual caveats for writing `tf.function` code apply (notably, the function should have no side effects)." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:56.358254Z", "iopub.status.busy": "2024-01-11T18:05:56.358013Z", "iopub.status.idle": "2024-01-11T18:05:56.450047Z", "shell.execute_reply": "2024-01-11T18:05:56.449323Z" }, "id": "c_6xAUnbNREh" }, "outputs": [ { "data": { "text/plain": [ "({'type': array([[1., 0., 0.]], dtype=float32),\n", " 'size': array([1]),\n", " 'weight': array([0.70000005], dtype=float32)},\n", " array([1], dtype=int32))" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 
Apply the preprocessing in tf.data.Dataset.map.\n", "dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n", "dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),\n", "                      num_parallel_calls=tf.data.AUTOTUNE)\n", "# Display a preprocessed input sample.\n", "next(dataset.take(1).as_numpy_iterator())" ] }, { "cell_type": "markdown", "metadata": { "id": "8_4u3J4NdJ8R" }, "source": [ "Next, define a separate `Model` containing the trainable layers. Note how the inputs to this model reflect the types and shapes of the preprocessed features." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:56.453119Z", "iopub.status.busy": "2024-01-11T18:05:56.452865Z", "iopub.status.idle": "2024-01-11T18:05:56.487526Z", "shell.execute_reply": "2024-01-11T18:05:56.486907Z" }, "id": "kC9OZO5ldmP-" }, "outputs": [], "source": [ "inputs = {\n", "    'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32'),\n", "    'size': tf.keras.Input(shape=(), dtype='int64'),\n", "    'weight': tf.keras.Input(shape=(), dtype='float32'),\n", "}\n", "# Since the embedding is trainable, it needs to be part of the training model.\n", "# StringLookup reserves index 0 for out-of-vocabulary values, so the indices\n", "# range over [0, len(vocab)] and the embedding needs len(vocab) + 1 rows.\n", "embedding = tf.keras.layers.Embedding(len(vocab) + 1, embedding_dims)\n", "outputs = tf.keras.layers.Concatenate()([\n", "    inputs['type'],\n", "    embedding(inputs['size']),\n", "    tf.expand_dims(inputs['weight'], -1),\n", "])\n", "outputs = tf.keras.layers.Dense(1)(outputs)\n", "training_model = tf.keras.Model(inputs, outputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "ir-cn2H_d5R7" }, "source": [ "You can now train the `training_model` with `tf.keras.Model.fit`." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:56.490589Z", "iopub.status.busy": "2024-01-11T18:05:56.490333Z", "iopub.status.idle": "2024-01-11T18:05:57.610793Z", "shell.execute_reply": "2024-01-11T18:05:57.610056Z" }, "id": "6TS3YJ2vnvlW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", 
"1/3 [=========>....................] - ETA: 2s - loss: 0.7099" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "3/3 [==============================] - 1s 5ms/step - loss: 0.7808\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1704996357.417202 37160 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Train on the preprocessed data.\n", "training_model.compile(\n", "    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))\n", "training_model.fit(dataset)" ] }, { "cell_type": "markdown", "metadata": { "id": "pSaEbOE4ecsy" }, "source": [ "Finally, at inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:57.614160Z", "iopub.status.busy": "2024-01-11T18:05:57.613888Z", "iopub.status.idle": "2024-01-11T18:05:57.771865Z", "shell.execute_reply": "2024-01-11T18:05:57.771193Z" }, "id": "QHjbIZYneboO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/1 [==============================] - ETA: 0s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "1/1 [==============================] - 0s 103ms/step\n" ] }, { "data": { "text/plain": [ "array([[1.0852278]], dtype=float32)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = preprocessing_model.input\n", "outputs = training_model(preprocessing_model(inputs))\n", "inference_model = 
tf.keras.Model(inputs, outputs)\n", "\n", "predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)\n", "inference_model.predict(predict_dataset)" ] }, { "cell_type": "markdown", "metadata": { "id": "O01VQIxCWBxU" }, "source": [ "This composed model can be saved as a `.keras` file for later use." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-01-11T18:05:57.775309Z", "iopub.status.busy": "2024-01-11T18:05:57.775039Z", "iopub.status.idle": "2024-01-11T18:05:58.043735Z", "shell.execute_reply": "2024-01-11T18:05:58.043132Z" }, "id": "6tsyVZgh7Pve" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r", "1/1 [==============================] - ETA: 0s" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "1/1 [==============================] - 0s 80ms/step\n" ] }, { "data": { "text/plain": [ "array([[1.0852278]], dtype=float32)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inference_model.save('model.keras')\n", "restored_model = tf.keras.models.load_model('model.keras')\n", "restored_model.predict(predict_dataset)" ] }, { "cell_type": "markdown", "metadata": { "id": "IXMBwzggwUjI" }, "source": [ "Note: Because preprocessing layers are not trainable, they can be applied *asynchronously* using `tf.data`. This has performance benefits: preprocessed batches can be prefetched, and accelerators are freed up to focus on the differentiable parts of the model (see the *Prefetching* section of the Better performance with the `tf.data` API guide for details). As this guide shows, separating preprocessing during training and composing it during inference is a flexible way to take advantage of these performance gains. However, if your model is small or the preprocessing time is negligible, it may be simpler to build preprocessing into a complete model from the start. To do this, build a single model that starts with `tf.keras.Input`, followed by the preprocessing layers, followed by the trainable layers." ] }, { "cell_type": "markdown", "metadata": { "id": "2pjp7Z18gRCQ" }, "source": [ "## Keras layers corresponding to feature columns\n", "\n", "For reference, here are the Keras preprocessing layers that approximately correspond to each feature column.\n", "\n", "\n", "
<table>\n", "  <tr>\n", "    <th>Feature column</th>\n", "    <th>Keras layer</th>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.bucketized_column`</td>\n", "    <td>`tf.keras.layers.Discretization`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.categorical_column_with_hash_bucket`</td>\n", "    <td>`tf.keras.layers.Hashing`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.categorical_column_with_identity`</td>\n", "    <td>`tf.keras.layers.CategoryEncoding`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.categorical_column_with_vocabulary_file`</td>\n", "    <td>`tf.keras.layers.StringLookup` or `tf.keras.layers.IntegerLookup`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.categorical_column_with_vocabulary_list`</td>\n", "    <td>`tf.keras.layers.StringLookup` or `tf.keras.layers.IntegerLookup`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.crossed_column`</td>\n", "    <td>`tf.keras.layers.experimental.preprocessing.HashedCrossing`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.embedding_column`</td>\n", "    <td>`tf.keras.layers.Embedding`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.indicator_column`</td>\n", "    <td>`output_mode='one_hot'` or `output_mode='multi_hot'`*</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.numeric_column`</td>\n", "    <td>`tf.keras.layers.Normalization`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.sequence_categorical_column_with_hash_bucket`</td>\n", "    <td>`tf.keras.layers.Hashing`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.sequence_categorical_column_with_identity`</td>\n", "    <td>`tf.keras.layers.CategoryEncoding`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.sequence_categorical_column_with_vocabulary_file`</td>\n", "    <td>`tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, or `tf.keras.layers.TextVectorization`†</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.sequence_categorical_column_with_vocabulary_list`</td>\n", "    <td>`tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, or `tf.keras.layers.TextVectorization`†</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.sequence_numeric_column`</td>\n", "    <td>`tf.keras.layers.Normalization`</td>\n", "  </tr>\n",
"  <tr>\n", "    <td>`tf.feature_column.weighted_categorical_column`</td>\n", "    <td>`tf.keras.layers.CategoryEncoding`</td>\n", "  </tr>\n",
"</table>
\n", "\n", "\\* `output_mode` can be passed to `tf.keras.layers.CategoryEncoding`, `tf.keras.layers.StringLookup`, `tf.keras.layers.IntegerLookup`, and `tf.keras.layers.TextVectorization`.\n", "\n", "† `tf.keras.layers.TextVectorization` can handle free-form text input directly (for example, entire sentences or paragraphs). It is not a one-to-one replacement for categorical sequence handling in TensorFlow 1, but it may offer a convenient replacement for ad hoc text preprocessing.\n", "\n", "Note: Linear estimators, such as `tf.estimator.LinearClassifier`, can handle direct categorical input (integer indices) without an `embedding_column` or `indicator_column`. However, integer indices cannot be passed directly to `tf.keras.layers.Dense` or `tf.keras.experimental.LinearModel`. These inputs should first be encoded with `tf.keras.layers.CategoryEncoding` using `output_mode='count'` (and `sparse=True` if the category sizes are large) before calling `Dense` or `LinearModel`." ] }, { "cell_type": "markdown", "metadata": { "id": "AQCJ6lM3YDq_" }, "source": [ "## Next steps\n", "\n", "- For more information on Keras preprocessing layers, see the [Working with preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide.\n", "- For a more in-depth example of applying preprocessing layers to structured data, see the [Classify structured data using Keras preprocessing layers](../../tutorials/structured_data/preprocessing_layers.ipynb) tutorial." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "migrating_feature_columns.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }