##### Copyright 2020 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Making predictions

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/decision_forests/tutorials/predict_colab"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/decision-forests/blob/main/documentation/tutorials/predict_colab.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/decision-forests/blob/main/documentation/tutorials/predict_colab.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/decision-forests/documentation/tutorials/predict_colab.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>




Welcome to the **Prediction Colab** for **TensorFlow Decision Forests** (**TF-DF**).
In this colab, you will learn about different ways to generate predictions with a previously trained **TF-DF** model using the **Python API**.

<i><b>Remark:</b> The Python API shown in this Colab is simple to use and well suited for experimentation. However, other APIs, such as TensorFlow Serving and the C++ API, are better suited for production systems as they are faster and more stable. The exhaustive list of all Serving APIs is available [here](https://ydf.readthedocs.io/en/latest/serving_apis.html).</i>

In this colab, you will:

1. Use the `model.predict()` function on a TensorFlow Dataset created with `pd_dataframe_to_tf_dataset`.
1. Use the `model.predict()` function on a TensorFlow Dataset created manually.
1. Use the `model.predict()` function on NumPy arrays.
1. Make predictions with the CLI API.
1. Benchmark the inference speed of a model with the CLI API.




## Important remark

The dataset used for predictions should have the **same feature names and types** as the dataset used for training. Failing to do so will likely raise errors.

For example, training a model with two features `f1` and `f2`, and trying to generate predictions on a dataset without `f2` will fail. Note that it is okay to set (some or all) feature values as "missing". Similarly, training a model where `f2` is a numerical feature (e.g., float32), and applying this model on a dataset where `f2` is a text (e.g., string) feature will fail. 

While abstracted by the Keras API, a model instantiated in Python (e.g., with
`tfdf.keras.RandomForestModel()`) and a model loaded from disk (e.g., with
`tf_keras.models.load_model()`) can behave differently. Notably, a model
instantiated in Python automatically applies necessary type conversions. For
example, if a `float64` feature is fed to a model expecting a `float32` feature,
this conversion is performed implicitly. However, such a conversion is not
possible for models loaded from disk. It is therefore important that the
training data and the inference data always have the exact same type.
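
One way to guard against such mismatches is to align the serving data with the training schema before calling a loaded model. The following is a minimal sketch using only Pandas. The helper name `align_to_training_schema`, the recorded `training_dtypes` mapping, and the example columns are hypothetical and not part of TF-DF.

```python
import numpy as np
import pandas as pd


def align_to_training_schema(serving_df, training_dtypes):
    """Return a copy of serving_df with the training columns and dtypes.

    training_dtypes maps each feature name to the dtype used at training time.
    Raises a ValueError if a training feature is absent from serving_df.
    """
    missing = [name for name in training_dtypes if name not in serving_df.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Keep only the training features, in the training order, then cast.
    return serving_df[list(training_dtypes)].astype(training_dtypes)


# Hypothetical schema recorded at training time.
training_dtypes = {"f1": np.float32, "f2": np.float32}

# Serving data that arrives as float64 with an extra column.
serving_df = pd.DataFrame({"f1": [0.1, 0.2], "f2": [1.0, 2.0], "extra": [7, 8]})
aligned = align_to_training_schema(serving_df, training_dtypes)
print(aligned.dtypes)
```

Raising an error on a missing feature, rather than silently adding it, keeps the failure close to its cause.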

## Setup

First, we install TensorFlow Decision Forests...

In [2]:
!pip install tensorflow_decision_forests

Collecting tensorflow_decision_forests
  Using cached tensorflow_decision_forests-1.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.0 kB)
Collecting wurlitzer (from tensorflow_decision_forests)
  Using cached wurlitzer-3.0.3-py3-none-any.whl.metadata (1.9 kB)
Using cached tensorflow_decision_forests-1.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.5 MB)
Using cached wurlitzer-3.0.3-py3-none-any.whl (7.3 kB)
Installing collected packages: wurlitzer, tensorflow_decision_forests
Successfully installed tensorflow_decision_forests-1.9.0 wurlitzer-3.0.3


... , and import the libraries used in this example.

In [3]:
import tensorflow_decision_forests as tfdf

import os
import numpy as np
import pandas as pd
import tensorflow as tf
import math

## `model.predict(...)` and `pd_dataframe_to_tf_dataset` function

TensorFlow Decision Forests implements the [Keras](https://keras.io/) model API.
As such, TF-DF models have a `predict` function to make predictions. This function takes as input a [TensorFlow Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and outputs a prediction array.
The simplest way to create a TensorFlow dataset is to use [Pandas](https://pandas.pydata.org/) and the `tfdf.keras.pd_dataframe_to_tf_dataset(...)` function.

The next example shows how to create a TensorFlow dataset using `pd_dataframe_to_tf_dataset`.

In [4]:
pd_dataset = pd.DataFrame({
    "feature_1": [1,2,3],
    "feature_2": ["a", "b", "c"],
    "label": [0, 1, 0],
})

pd_dataset

Unnamed: 0,feature_1,feature_2,label
0,1,a,0
1,2,b,1
2,3,c,0


In [5]:
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_dataset, label="label")

for features, label in tf_dataset:
  print("Features:",features)
  print("label:", label)

Features: {'feature_1': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1, 2, 3])>, 'feature_2': <tf.Tensor: shape=(3,), dtype=string, numpy=array([b'a', b'b', b'c'], dtype=object)>}
label: tf.Tensor([0 1 0], shape=(3,), dtype=int64)


2024-04-20 11:14:51.301980: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


<i>**Note:** "pd_" stands for "pandas". "tf_" stands for "TensorFlow".</i>

A TensorFlow Dataset is a function that outputs a sequence of values. Those values can be simple arrays (called Tensors) or arrays organized into a structure (for example, arrays organized in a dictionary).


The following example shows the training and inference (using `predict`) on a toy dataset:

In [6]:
# Creating a training dataset in Pandas
pd_train_dataset = pd.DataFrame({
    "feature_1": np.random.rand(1000),
    "feature_2": np.random.rand(1000),
})
pd_train_dataset["label"] = pd_train_dataset["feature_1"] > pd_train_dataset["feature_2"] 

pd_train_dataset

Unnamed: 0,feature_1,feature_2,label
0,0.008157,0.222233,False
1,0.449456,0.972803,False
2,0.508560,0.140373,True
3,0.729689,0.163511,True
4,0.418973,0.910654,False
...,...,...,...
995,0.509904,0.179585,True
996,0.027352,0.224753,False
997,0.637239,0.129554,True
998,0.336822,0.213333,True


In [7]:
# Creating a serving dataset with Pandas
pd_serving_dataset = pd.DataFrame({
    "feature_1": np.random.rand(500),
    "feature_2": np.random.rand(500),
})

pd_serving_dataset

Unnamed: 0,feature_1,feature_2
0,0.766815,0.783505
1,0.665322,0.696025
2,0.173268,0.150162
3,0.008918,0.415814
4,0.147896,0.553871
...,...,...
495,0.176380,0.067891
496,0.685740,0.411424
497,0.213908,0.715414
498,0.630376,0.720302


Let's convert the Pandas dataframes into TensorFlow datasets:

In [8]:
tf_train_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label")
tf_serving_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(pd_serving_dataset)

We can now train a model on `tf_train_dataset`:

In [9]:
model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_train_dataset)

[INFO 24-04-20 11:14:55.1176 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmpdosbv775/model/ with prefix 951a85e27c8d4048
[INFO 24-04-20 11:14:55.1550 UTC decision_forest.cc:734] Model loaded with 300 root(s), 12674 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:55.1551 UTC abstract_model.cc:1344] Engine "RandomForestOptPred" built
[INFO 24-04-20 11:14:55.1551 UTC kernel.cc:1061] Use fast generic engine


<tf_keras.src.callbacks.History at 0x7f96c017a7f0>

And then generate predictions on `tf_serving_dataset`:

In [10]:
# Print the first 10 predictions.
model.predict(tf_serving_dataset, verbose=0)[:10]

array([[0.57999957],
       [0.13666661],
       [0.68666613],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00333333]], dtype=float32)
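
For a binary classification model such as this one, each returned value can be read as the predicted probability of the positive class, so a class decision requires a threshold. The following is a minimal sketch with NumPy, using the first values printed above. The 0.5 threshold is an assumption and should be tuned to the application.

```python
import numpy as np

# First predictions copied from the output above; shape (n, 1).
predictions = np.array(
    [[0.57999957], [0.13666661], [0.68666613], [0.0]], dtype=np.float32)

# Assumed decision threshold.
threshold = 0.5

# Drop the trailing dimension and compare against the threshold.
predicted_labels = predictions[:, 0] >= threshold
print(predicted_labels)
```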

## `model.predict(...)` and manual TF datasets

In the previous section, we showed how to create a TF dataset using the `pd_dataframe_to_tf_dataset` function. This option is simple but poorly suited for large datasets. Instead, TensorFlow offers [several options](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) to create a TensorFlow dataset.
The next example shows how to create a dataset using the `tf.data.Dataset.from_tensor_slices()` function.

In [11]:
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5])

for value in dataset:
  print("value:", value.numpy())

value: 1
value: 2
value: 3
value: 4
value: 5


2024-04-20 11:14:59.117255: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


TensorFlow models are trained with mini-batching: Instead of being fed one at a time, examples are grouped in "batches". For Neural Networks, the batch size impacts the quality of the model, and the optimal value needs to be determined by the user during training. For Decision Forests, the batch size has no impact on the model. However, for compatibility reasons, **TensorFlow Decision Forests expects the dataset to be batched**. Batching is done with the `batch()` function.

In [12]:
dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4,5]).batch(2)

for value in dataset:
  print("value:", value.numpy())

value: [1 2]
value: [3 4]
value: [5]


2024-04-20 11:14:59.134734: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
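
As the output shows, `batch(2)` groups consecutive examples and leaves a smaller final batch when the dataset size is not a multiple of the batch size. The pure-Python sketch below mimics that grouping to make the arithmetic explicit. `batch_examples` is a hypothetical helper, not a TensorFlow function.

```python
import math


def batch_examples(examples, batch_size):
    """Yield consecutive groups of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = [1, 2, 3, 4, 5]
batches = list(batch_examples(examples, batch_size=2))
print(batches)

# The number of batches is ceil(num_examples / batch_size).
num_batches = math.ceil(len(examples) / 2)
print(num_batches)
```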


TensorFlow Decision Forests expects the dataset to be of one of two structures:

- features, label
- features, label, weights

The features can be a single two-dimensional array (where each column is a feature and each row is an example), or a dictionary of arrays.

The following are two examples of datasets compatible with TensorFlow Decision Forests:

In [13]:
# A dataset with a single 2d array.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ([[1,2],[3,4],[5,6]], # Features
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: tf.Tensor(
[[1 2]
 [3 4]], shape=(2, 2), dtype=int32)
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: tf.Tensor([[5 6]], shape=(1, 2), dtype=int32)
label: tf.Tensor([0], shape=(1,), dtype=int32)


2024-04-20 11:14:59.152655: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [14]:
# A dataset with a dictionary of features.
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": [1,2,3],
    "feature_2": [4,5,6],
    },
    [0,1,0], # Label
    )).batch(2)

for features, label in tf_dataset:
  print("features:", features)
  print("label:", label)

features: {'feature_1': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(2,), dtype=int32, numpy=array([4, 5], dtype=int32)>}
label: tf.Tensor([0 1], shape=(2,), dtype=int32)
features: {'feature_1': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([3], dtype=int32)>, 'feature_2': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([6], dtype=int32)>}
label: tf.Tensor([0], shape=(1,), dtype=int32)


2024-04-20 11:14:59.171912: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
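
The second structure listed above (features, label, weights) follows the same pattern, with a third element in the tuple. The sketch below builds such a dataset. The weight values are arbitrary and chosen only for illustration.

```python
import tensorflow as tf

# A dataset with a dictionary of features, a label, and per-example weights.
tf_weighted_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": [1, 2, 3],
    "feature_2": [4, 5, 6],
    },
    [0, 1, 0],  # Label
    [1.0, 0.5, 2.0],  # Weights (arbitrary values)
    )).batch(2)

for features, label, weights in tf_weighted_dataset:
  print("features:", features)
  print("label:", label)
  print("weights:", weights)
```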


Let's train a model with this second option.

In [15]:
tf_dataset = tf.data.Dataset.from_tensor_slices(
    ({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    },
    np.random.rand(100) >= 0.5, # Label
    )).batch(2)

model = tfdf.keras.RandomForestModel(verbose=0)
model.fit(tf_dataset)

[INFO 24-04-20 11:14:59.3750 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp208me4tj/model/ with prefix b7fe9aaae54944c5
[INFO 24-04-20 11:14:59.3979 UTC decision_forest.cc:734] Model loaded with 300 root(s), 7574 node(s), and 2 input feature(s).
[INFO 24-04-20 11:14:59.3979 UTC kernel.cc:1061] Use fast generic engine


<tf_keras.src.callbacks.History at 0x7f968c13e8e0>

The `predict` function can be used directly on the training dataset:

In [16]:
# The first 10 predictions.
model.predict(tf_dataset, verbose=0)[:10]

array([[0.9366659 ],
       [0.42999968],
       [0.9266659 ],
       [0.31999978],
       [0.70999944],
       [0.2133332 ],
       [0.13333328],
       [0.836666  ],
       [0.10666663],
       [0.53333294]], dtype=float32)

## `model.predict(...)` and `model.predict_on_batch()` on dictionaries

In some cases, the `predict` function can be used with an array (or a dictionary of arrays) instead of a TensorFlow Dataset.

The following example uses the previously trained model with a dictionary of NumPy arrays.

In [17]:
# The first 10 predictions.
model.predict({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    }, verbose=0)[:10]

array([[0.5366663 ],
       [0.19666655],
       [0.2233332 ],
       [0.99999917],
       [0.3233331 ],
       [0.3866664 ],
       [0.71999943],
       [0.40666637],
       [0.73333275],
       [0.10999996]], dtype=float32)

In the previous example, the arrays are automatically batched. Alternatively, the `predict_on_batch` function can be used to make sure that all the examples are run in the same batch.

In [18]:
# The first 10 predictions.
model.predict_on_batch({
    "feature_1": np.random.rand(100),
    "feature_2": np.random.rand(100),
    })[:10]

array([[0.3433331 ],
       [0.42333302],
       [0.9466659 ],
       [0.38333306],
       [0.21666653],
       [0.10999996],
       [0.09333331],
       [0.23999985],
       [0.13999994],
       [0.36999974]], dtype=float32)


**Note:** If `predict` does not work on raw data such as in the example above, try to use the `predict_on_batch` function or convert the raw data into a TensorFlow Dataset. 

## Inference with the YDF format

This example shows how to run a TF-DF model with the CLI API ([one of the other Serving APIs](https://ydf.readthedocs.io/en/latest/serving_apis.html)). We will also use the Benchmark tool to measure the inference speed of the model.

Let's start by training and saving a model:

In [19]:
model = tfdf.keras.GradientBoostedTreesModel(verbose=0)
model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(pd_train_dataset, label="label"))
model.save("my_model")



[INFO 24-04-20 11:15:00.4645 UTC kernel.cc:1233] Loading model from path /tmpfs/tmp/tmp_gpxt9u3/model/ with prefix 307d0dfd7bcd4058
[INFO 24-04-20 11:15:00.4725 UTC quick_scorer_extended.cc:911] The binary was compiled without AVX2 support, but your CPU supports it. Enable it for faster model inference.
[INFO 24-04-20 11:15:00.4729 UTC kernel.cc:1061] Use fast generic engine


INFO:tensorflow:Assets written to: my_model/assets




Let's also export the dataset to a CSV file:

In [20]:
pd_serving_dataset.to_csv("dataset.csv")

Let's download and extract the [Yggdrasil Decision Forests](https://ydf.readthedocs.io/en/latest/index.html) CLI tools. 

In [21]:
!wget https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
!unzip cli_linux.zip

--2024-04-20 11:15:01--  https://github.com/google/yggdrasil-decision-forests/releases/download/1.0.0/cli_linux.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.


HTTP request sent, awaiting response... 

302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=360444739&response-content-disposition=attachment%3B%20filename%3Dcli_linux.zip&response-content-type=application%2Foctet-stream [following]
--2024-04-20 11:15:01--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/360444739/bfcd0b9d-5cbc-42a8-be0a-02131875f9a6?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240420T111501Z&X-Amz-Expires=300&X-Amz-Signature=01381b3c5a69d831a4be54e2fef635b848ca9b5aaeeac6822698c6acf5f93240&X-Amz-SignedHeaders=host&acto

200 OK
Length: 31516027 (30M) [application/octet-stream]
Saving to: ‘cli_linux.zip’

cli_linux.zip         0%[                    ]       0  --.-KB/s               


2024-04-20 11:15:01 (174 MB/s) - ‘cli_linux.zip’ saved [31516027/31516027]



Archive:  cli_linux.zip
  inflating: README                  
  inflating: cli.txt                 
  inflating: train
  inflating: show_model
  inflating: show_dataspec
  inflating: predict
  inflating: infer_dataspec
  inflating: evaluate
  inflating: convert_dataset
  inflating: benchmark_inference
  inflating: edit_model
  inflating: synthetic_dataset
  inflating: grpc_worker_main
  inflating: LICENSE
  inflating: CHANGELOG.md


Finally, let's make predictions:

**Remarks:**


- TensorFlow Decision Forests (TF-DF) is based on the [Yggdrasil Decision Forests](https://ydf.readthedocs.io/en/latest/index.html) (YDF) library, and a TF-DF model always contains a YDF model internally. When a TF-DF model is saved to disk, the model directory contains an `assets` sub-directory that holds the YDF model. This YDF model can be used with all [YDF tools](https://ydf.readthedocs.io/en/latest/cli_commands.html). In the next example, we will use the `predict` and `benchmark_inference` tools. See the [model format documentation](https://ydf.readthedocs.io/en/latest/convert_model.html) for more details.
- YDF tools assume that the type of the dataset is specified using a prefix, e.g. `csv:`. See the [YDF user manual](https://ydf.readthedocs.io/en/latest/cli_user_manual.html#dataset-path-and-format) for more details.

In [22]:
!./predict --model=my_model/assets --dataset=csv:dataset.csv --output=csv:predictions.csv

[INFO abstract_model.cc:1296] Engine "GradientBoostedTreesQuickScorerExtended" built
[INFO predict.cc:133] Run predictions with semi-fast engine


We can now look at the predictions:

In [23]:
pd.read_csv("predictions.csv")

Unnamed: 0,1,2
0,0.644487,0.355513
1,0.957421,0.042579
2,0.037222,0.962777
3,0.995186,0.004814
4,0.994248,0.005752
...,...,...
495,0.021515,0.978485
496,0.003711,0.996289
497,0.995865,0.004135
498,0.992290,0.007710
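
Each row of `predictions.csv` holds one probability per class, with one column per class. The following is a minimal sketch for reducing those probabilities to the most probable column per example, using a few rows copied from the output above. Which label value each column name stands for is not shown in this output, so the sketch reports the column name only.

```python
import io

import pandas as pd

# A few rows in the same layout as predictions.csv.
csv_text = """1,2
0.644487,0.355513
0.957421,0.042579
0.037222,0.962777
"""
predictions = pd.read_csv(io.StringIO(csv_text))

# For each example, keep the name of the column with the highest probability.
most_probable = predictions.idxmax(axis=1)
print(most_probable.tolist())
```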


The speed of inference of a model can be measured with the [benchmark inference](https://ydf.readthedocs.io/en/latest/benchmark_inference.html) tool.

**Note:** Prior to YDF version 1.1.0, the dataset used in the benchmark inference needs to have a `__LABEL` column.

In [24]:
# Create the empty label column.
pd_serving_dataset["__LABEL"] = 0
pd_serving_dataset.to_csv("dataset.csv")

In [25]:
!./benchmark_inference \
  --model=my_model/assets \
  --dataset=csv:dataset.csv \
  --batch_size=100 \
  --warmup_runs=10 \
  --num_runs=50

[INFO benchmark_inference.cc:245] Loading model
[INFO benchmark_inference.cc:248] The model is of type: GRADIENT_BOOSTED_TREES
[INFO benchmark_inference.cc:250] Loading dataset
[INFO benchmark_inference.cc:259] Found 3 compatible fast engines.
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesGeneric
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).


[INFO benchmark_inference.cc:262] Running GradientBoostedTreesQuickScorerExtended
[INFO benchmark_inference.cc:262] Running GradientBoostedTreesOptPred
[INFO decision_forest.cc:639] Model loaded with 49 root(s), 2661 node(s), and 2 input feature(s).


[INFO benchmark_inference.cc:268] Running the slow generic engine


batch_size : 100  num_runs : 50
time/example(us)  time/batch(us)  method
----------------------------------------
         0.44275          44.275  GradientBoostedTreesQuickScorerExtended [virtual interface]
         0.79825          79.825  GradientBoostedTreesOptPred [virtual interface]
           1.877           187.7  GradientBoostedTreesGeneric [virtual interface]
          4.4463          444.62  Generic slow engine
----------------------------------------


In this benchmark, we see the inference speed of the different inference engines. For example, "time/example(us) = 0.44275" (the value changes between runs) indicates that the inference of one example takes about 0.44 microseconds. That is, the model can run roughly 2.3 million times per second.
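
The conversion from time per example to throughput is a simple reciprocal. As a sketch, using the fastest timing from the table above (the number will differ on other machines and runs):

```python
# Time per example for the fastest engine in the table above, in microseconds.
time_per_example_us = 0.44275

# One second is 1e6 microseconds.
examples_per_second = 1e6 / time_per_example_us
print(f"{examples_per_second:,.0f} examples per second")
```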

**Note:** TF-DF and the other APIs always automatically select the fastest inference engine available.