{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "DjUA6S30k52h" }, "source": [ "##### Copyright 2021 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:35:53.439441Z", "iopub.status.busy": "2024-05-08T09:35:53.438951Z", "iopub.status.idle": "2024-05-08T09:35:53.443052Z", "shell.execute_reply": "2024-05-08T09:35:53.442432Z" }, "id": "SpNWyqewk8fE" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "6x1ypzczQCwy" }, "source": [ "# Data validation using TFX Pipeline and TensorFlow Data Validation" ] }, { "cell_type": "markdown", "metadata": { "id": "HU9YYythm0dx" }, "source": [ "Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click \"Run in Google Colab\".\n", "\n", "
\n", "\n", "\n", "\n", "\n", "
\n", "View on TensorFlow.org\n", "Run in Google Colab\n", "View source on GitHubDownload notebook
" ] }, { "cell_type": "markdown", "metadata": { "id": "_VuwrlnvQJ5k" }, "source": [ "In this notebook-based tutorial, we will create and run TFX pipelines\n", "to validate input data and create an ML model. This notebook is based on the\n", "TFX pipeline we built in\n", "[Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple).\n", "If you have not read that tutorial yet, you should read it before proceeding\n", "with this notebook.\n", "\n", "The first task in any data science or ML project is to understand and clean\n", "the data, which includes:\n", "- Understanding the data types, distributions, and other information (e.g.,\n", "mean value, or number of uniques) about each feature\n", "- Generating a preliminary schema that describes the data\n", "- Identifying anomalies and missing values in the data with respect to given\n", "schema\n", "\n", "In this tutorial, we will create two TFX pipelines.\n", "\n", "First, we will create a pipeline to analyze the dataset and generate a\n", "preliminary schema of the given dataset. This pipeline will include two new\n", "components, `StatisticsGen` and `SchemaGen`.\n", "\n", "Once we have a proper schema of the data, we will create a pipeline to train\n", "an ML classification model based on the pipeline from the previous tutorial.\n", "In this pipeline, we will use the schema from the first pipeline and a\n", "new component, `ExampleValidator`, to validate the input data.\n", "\n", "The three new components, StatisticsGen, SchemaGen and ExampleValidator, are\n", "TFX components for data analysis and validation, and they are implemented\n", "using the\n", "[TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) library.\n", "\n", "Please see\n", "[Understanding TFX Pipelines](https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines)\n", "to learn more about various concepts in TFX." ] }, { "cell_type": "markdown", "metadata": { "id": "Fmgi8ZvQkScg" }, "source": [ "## Set Up\n", "We first need to install the TFX Python package and download\n", "the dataset which we will use for our model.\n", "\n", "### Upgrade Pip\n", "\n", "To avoid upgrading Pip in a system when running locally,\n", "check to make sure that we are running in Colab.\n", "Local systems can of course be upgraded separately." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:35:53.446819Z", "iopub.status.busy": "2024-05-08T09:35:53.446572Z", "iopub.status.idle": "2024-05-08T09:35:53.455206Z", "shell.execute_reply": "2024-05-08T09:35:53.454582Z" }, "id": "as4OTe2ukSqm" }, "outputs": [], "source": [ "try:\n", " import colab\n", " !pip install --upgrade pip\n", "except:\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "id": "MZOYTt1RW4TK" }, "source": [ "### Install TFX\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:35:53.458666Z", "iopub.status.busy": "2024-05-08T09:35:53.458125Z", "iopub.status.idle": "2024-05-08T09:36:04.224297Z", "shell.execute_reply": "2024-05-08T09:36:04.223343Z" }, "id": "iyQtljP-qPHY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tfx in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (1.15.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: ml-pipelines-sdk==1.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.15.0)\r\n", "Requirement already satisfied: absl-py<2.0.0,>=0.9 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.4.0)\r\n", "Requirement already satisfied: ml-metadata<1.16.0,>=1.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.15.0)\r\n", "Requirement already satisfied: packaging>=22 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (24.0)\r\n", "Requirement already satisfied: portpicker<2,>=1.3.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.6.0)\r\n", "Requirement already satisfied: protobuf<5,>=3.20.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (3.20.3)\r\n", "Requirement already satisfied: docker<5,>=4.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (4.4.4)\r\n", "Requirement already satisfied: google-apitools<1,>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (0.5.31)\r\n", "Requirement already satisfied: google-api-python-client<2,>=1.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.12.11)\r\n", "Requirement already satisfied: jinja2<4,>=2.7.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (3.1.4)\r\n", "Requirement already satisfied: typing-extensions<5,>=3.10.0.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (4.11.0)\r\n", "Requirement already satisfied: apache-beam<3,>=2.47 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.56.0)\r\n", "Requirement already satisfied: attrs<24,>=19.3.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (23.2.0)\r\n", "Requirement already satisfied: click<9,>=7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (8.1.7)\r\n", "Requirement already satisfied: google-api-core<3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (2.19.0)\r\n", "Requirement already satisfied: google-cloud-aiplatform<2,>=1.6.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.50.0)\r\n", "Requirement already satisfied: google-cloud-bigquery<4,>=3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (3.22.0)\r\n", "Requirement already satisfied: grpcio<2,>=1.28.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.63.0)\r\n", "Requirement already satisfied: 
keras-tuner!=1.4.0,!=1.4.1,<2,>=1.0.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.4.7)\r\n", "Requirement already satisfied: kubernetes<13,>=10.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (12.0.1)\r\n", "Requirement already satisfied: numpy<2,>=1.16 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.26.4)\r\n", "Requirement already satisfied: pyarrow<11,>=10 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (10.0.1)\r\n", "Requirement already satisfied: scipy<1.13 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.12.0)\r\n", "Requirement already satisfied: pyyaml<7,>=6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (6.0.1)\r\n", "Requirement already satisfied: tensorflow<2.16,>=2.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (2.15.1)\r\n", "Requirement already satisfied: tensorflow-hub<0.16,>=0.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (0.15.0)\r\n", "Requirement already satisfied: tensorflow-data-validation<1.16.0,>=1.15.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.15.1)\r\n", "Requirement already satisfied: tensorflow-model-analysis<0.47.0,>=0.46.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (0.46.0)\r\n", "Requirement already satisfied: tensorflow-serving-api<2.16,>=2.15 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (2.15.1)\r\n", "Requirement already satisfied: tensorflow-transform<1.16.0,>=1.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.15.0)\r\n", "Requirement already satisfied: tfx-bsl<1.16.0,>=1.15.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tfx) (1.15.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: crcmod<2.0,>=1.7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (1.7)\r\n", "Requirement already satisfied: orjson<4,>=3.9.7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (3.10.3)\r\n", "Requirement already satisfied: dill<0.3.2,>=0.3.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.3.1.1)\r\n", "Requirement already satisfied: cloudpickle~=2.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.2.1)\r\n", "Requirement already satisfied: fastavro<2,>=0.23.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (1.9.4)\r\n", "Requirement already satisfied: fasteners<1.0,>=0.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.19)\r\n", "Requirement already satisfied: hdfs<3.0.0,>=2.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.7.3)\r\n", "Requirement already satisfied: httplib2<0.23.0,>=0.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.22.0)\r\n", "Requirement already satisfied: jsonschema<5.0.0,>=4.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (4.22.0)\r\n", "Requirement already satisfied: jsonpickle<4.0.0,>=3.0.0 in 
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (3.0.4)\r\n", "Requirement already satisfied: objsize<0.8.0,>=0.6.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.7.0)\r\n", "Requirement already satisfied: pymongo<5.0.0,>=3.8.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (4.7.2)\r\n", "Requirement already satisfied: proto-plus<2,>=1.7.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (1.23.0)\r\n", "Requirement already satisfied: pydot<2,>=1.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (1.4.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: python-dateutil<3,>=2.8.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2018.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2024.1)\r\n", "Requirement already satisfied: redis<6,>=5.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (5.0.4)\r\n", "Requirement already satisfied: regex>=2020.6.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2024.4.28)\r\n", "Requirement already satisfied: requests<3.0.0,>=2.24.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.31.0)\r\n", "Requirement already satisfied: zstandard<1,>=0.18.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.22.0)\r\n", "Requirement already satisfied: pyarrow-hotfix<1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.6)\r\n", "Requirement already satisfied: js2py<1,>=0.74 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.74)\r\n", "Requirement already satisfied: cachetools<6,>=3.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (5.3.3)\r\n", "Requirement already satisfied: google-auth<3,>=1.18.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.29.0)\r\n", "Requirement already satisfied: google-auth-httplib2<0.3.0,>=0.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (0.2.0)\r\n", "Requirement already satisfied: google-cloud-datastore<3,>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.19.0)\r\n", "Requirement already satisfied: google-cloud-pubsub<3,>=2.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.21.1)\r\n", "Requirement already satisfied: google-cloud-pubsublite<2,>=1.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (1.10.0)\r\n", "Requirement already satisfied: google-cloud-storage<3,>=2.14.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.16.0)\r\n", "Requirement already satisfied: 
google-cloud-bigquery-storage<3,>=2.6.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.25.0)\r\n", "Requirement already satisfied: google-cloud-core<3,>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.4.1)\r\n", "Requirement already satisfied: google-cloud-bigtable<3,>=2.19.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.23.1)\r\n", "Requirement already satisfied: google-cloud-spanner<4,>=3.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (3.46.0)\r\n", "Requirement already satisfied: google-cloud-dlp<4,>=3.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (3.17.0)\r\n", "Requirement already satisfied: google-cloud-language<3,>=2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.13.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: google-cloud-videointelligence<3,>=2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (2.13.3)\r\n", "Requirement already satisfied: google-cloud-vision<4,>=2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (3.7.2)\r\n", "Requirement already satisfied: google-cloud-recommendations-ai<0.11.0,>=0.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from apache-beam[gcp]<3,>=2.47->tfx) (0.10.10)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: six>=1.4.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from docker<5,>=4.1->tfx) (1.16.0)\r\n", "Requirement already satisfied: websocket-client>=0.32.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from docker<5,>=4.1->tfx) (1.8.0)\r\n", "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-api-core<3->tfx) (1.63.0)\r\n", "Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-api-python-client<2,>=1.8->tfx) (3.0.1)\r\n", "Requirement already satisfied: oauth2client>=1.4.12 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-apitools<1,>=0.5->tfx) (4.1.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: google-cloud-resource-manager<3.0.0dev,>=1.3.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-aiplatform<2,>=1.6.2->tfx) (1.12.3)\r\n", "Requirement already satisfied: shapely<3.0.0dev in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-aiplatform<2,>=1.6.2->tfx) (2.0.4)\r\n", "Requirement already satisfied: pydantic<3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-aiplatform<2,>=1.6.2->tfx) (1.10.15)\r\n", "Requirement already satisfied: docstring-parser<1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-aiplatform<2,>=1.6.2->tfx) (0.16)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: google-resumable-media<3.0dev,>=0.6.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-bigquery<4,>=3->tfx) (2.7.0)\r\n", "Requirement already satisfied: MarkupSafe>=2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jinja2<4,>=2.7.3->tfx) (2.1.5)\r\n" ] }, 
{ "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: keras in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from keras-tuner!=1.4.0,!=1.4.1,<2,>=1.0.4->tfx) (2.15.0)\r\n", "Requirement already satisfied: kt-legacy in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from keras-tuner!=1.4.0,!=1.4.1,<2,>=1.0.4->tfx) (1.0.5)\r\n", "Requirement already satisfied: certifi>=14.05.14 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from kubernetes<13,>=10.0.1->tfx) (2024.2.2)\r\n", "Requirement already satisfied: setuptools>=21.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from kubernetes<13,>=10.0.1->tfx) (69.5.1)\r\n", "Requirement already satisfied: requests-oauthlib in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from kubernetes<13,>=10.0.1->tfx) (2.0.0)\r\n", "Requirement already satisfied: urllib3>=1.24.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from kubernetes<13,>=10.0.1->tfx) (1.26.18)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: psutil in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from portpicker<2,>=1.3.1->tfx) (5.9.8)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: astunparse>=1.6.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (1.6.3)\r\n", "Requirement already satisfied: flatbuffers>=23.5.26 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (24.3.25)\r\n", "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (0.5.4)\r\n", "Requirement already satisfied: google-pasta>=0.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (0.2.0)\r\n", "Requirement already satisfied: h5py>=2.9.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (3.11.0)\r\n", "Requirement already satisfied: libclang>=13.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (18.1.1)\r\n", "Requirement already satisfied: ml-dtypes~=0.3.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (0.3.2)\r\n", "Requirement already satisfied: opt-einsum>=2.3.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (3.3.0)\r\n", "Requirement already satisfied: termcolor>=1.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (2.4.0)\r\n", "Requirement already satisfied: wrapt<1.15,>=1.11.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (1.14.1)\r\n", "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (0.37.0)\r\n", "Requirement already satisfied: tensorboard<2.16,>=2.15 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (2.15.2)\r\n", "Requirement already satisfied: tensorflow-estimator<2.16,>=2.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow<2.16,>=2.15.0->tfx) (2.15.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: joblib>=1.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-data-validation<1.16.0,>=1.15.1->tfx) 
(1.4.2)\r\n", "Requirement already satisfied: pandas<2,>=1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-data-validation<1.16.0,>=1.15.1->tfx) (1.5.3)\r\n", "Requirement already satisfied: pyfarmhash<0.4,>=0.2.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-data-validation<1.16.0,>=1.15.1->tfx) (0.3.2)\r\n", "Requirement already satisfied: tensorflow-metadata<1.16,>=1.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-data-validation<1.16.0,>=1.15.1->tfx) (1.15.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: ipython<8,>=7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (7.34.0)\r\n", "Requirement already satisfied: ipywidgets<8,>=7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (7.8.1)\r\n", "Requirement already satisfied: pillow>=9.4.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (10.3.0)\r\n", "Requirement already satisfied: rouge-score<2,>=0.1.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.1.2)\r\n", "Requirement already satisfied: sacrebleu<4,>=2.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.4.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: wheel<1.0,>=0.23.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from astunparse>=1.6.0->tensorflow<2.16,>=2.15.0->tfx) (0.43.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: grpcio-status<2.0.dev0,>=1.33.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.34.1->google-cloud-aiplatform<2,>=1.6.2->tfx) (1.48.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyasn1-modules>=0.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth<3,>=1.18.0->apache-beam[gcp]<3,>=2.47->tfx) (0.4.0)\r\n", "Requirement already satisfied: rsa<5,>=3.1.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-auth<3,>=1.18.0->apache-beam[gcp]<3,>=2.47->tfx) (4.9)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: grpc-google-iam-v1<1.0.0dev,>=0.12.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-bigtable<3,>=2.19.0->apache-beam[gcp]<3,>=2.47->tfx) (0.13.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: overrides<8.0.0,>=6.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-pubsublite<2,>=1.2.0->apache-beam[gcp]<3,>=2.47->tfx) (7.7.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: sqlparse>=0.4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-spanner<4,>=3.0.0->apache-beam[gcp]<3,>=2.47->tfx) (0.5.0)\r\n", "Requirement already satisfied: grpc-interceptor>=0.15.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from google-cloud-spanner<4,>=3.0.0->apache-beam[gcp]<3,>=2.47->tfx) (0.15.4)\r\n", "Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from 
google-cloud-storage<3,>=2.14.0->apache-beam[gcp]<3,>=2.47->tfx) (1.5.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: docopt in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.6.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from httplib2<0.23.0,>=0.8->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (3.1.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: jedi>=0.16 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.19.1)\r\n", "Requirement already satisfied: decorator in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (5.1.1)\r\n", "Requirement already satisfied: pickleshare in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.7.5)\r\n", "Requirement already satisfied: traitlets>=4.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (5.14.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (3.0.43)\r\n", "Requirement already satisfied: pygments in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.18.0)\r\n", "Requirement already satisfied: backcall in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.2.0)\r\n", "Requirement already satisfied: matplotlib-inline in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.1.7)\r\n", "Requirement already satisfied: pexpect>4.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.9.0)\r\n", "Requirement already satisfied: comm>=0.1.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.2.2)\r\n", "Requirement already satisfied: ipython-genutils~=0.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.2.0)\r\n", "Requirement already satisfied: widgetsnbextension~=3.6.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (3.6.6)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: jupyterlab-widgets<3,>=1.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.1.7)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tzlocal>=1.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from js2py<1,>=0.74->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (5.2)\r\n", "Requirement already satisfied: pyjsparser>=2.5.1 in 
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from js2py<1,>=0.74->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.7.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema<5.0.0,>=4.0.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2023.12.1)\r\n", "Requirement already satisfied: referencing>=0.28.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema<5.0.0,>=4.0.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.35.1)\r\n", "Requirement already satisfied: rpds-py>=0.7.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema<5.0.0,>=4.0.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (0.18.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyasn1>=0.1.7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from oauth2client>=1.4.12->google-apitools<1,>=0.5->tfx) (0.6.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: dnspython<3.0.0,>=1.16.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pymongo<5.0.0,>=3.8.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (2.6.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: async-timeout>=4.0.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from redis<6,>=5.0.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (4.0.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: charset-normalizer<4,>=2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3.0.0,>=2.24.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (3.3.2)\r\n", "Requirement already satisfied: idna<4,>=2.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3.0.0,>=2.24.0->apache-beam<3,>=2.47->apache-beam[gcp]<3,>=2.47->tfx) (3.7)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: nltk in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from rouge-score<2,>=0.1.2->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (3.8.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: portalocker in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from sacrebleu<4,>=2.3->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.8.2)\r\n", "Requirement already satisfied: tabulate>=0.8.9 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from sacrebleu<4,>=2.3->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.9.0)\r\n", "Requirement already satisfied: colorama in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from sacrebleu<4,>=2.3->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.4.6)\r\n", "Requirement already satisfied: lxml in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from sacrebleu<4,>=2.3->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (5.2.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: google-auth-oauthlib<2,>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (1.2.0)\r\n", "Requirement already satisfied: markdown>=2.6.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (3.6)\r\n", "Requirement already satisfied: 
tensorboard-data-server<0.8.0,>=0.7.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (0.7.2)\r\n", "Requirement already satisfied: werkzeug>=1.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (3.0.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: oauthlib>=3.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests-oauthlib->kubernetes<13,>=10.0.1->tfx) (3.2.2)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: parso<0.9.0,>=0.8.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jedi>=0.16->ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.8.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: importlib-metadata>=4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from markdown>=2.6.8->tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (7.1.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: ptyprocess>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pexpect>4.3->ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.7.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: wcwidth in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.2.13)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: notebook>=4.4.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (7.1.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tqdm in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nltk->rouge-score<2,>=0.1.2->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.66.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: zipp>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.16,>=2.15->tensorflow<2.16,>=2.15.0->tfx) (3.18.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: jupyter-server<3,>=2.4.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.14.0)\r\n", "Requirement already satisfied: jupyterlab-server<3,>=2.22.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.27.1)\r\n", "Requirement already satisfied: jupyterlab<4.2,>=4.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.1.8)\r\n", "Requirement already satisfied: notebook-shim<0.3,>=0.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.2.4)\r\n", "Requirement already satisfied: tornado>=6.2.0 in 
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (6.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: anyio>=3.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.3.0)\r\n", "Requirement already satisfied: argon2-cffi>=21.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (23.1.0)\r\n", "Requirement already satisfied: jupyter-client>=7.4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (8.6.1)\r\n", "Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (5.7.2)\r\n", "Requirement already satisfied: jupyter-events>=0.9.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.10.0)\r\n", "Requirement already satisfied: jupyter-server-terminals>=0.4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.5.3)\r\n", "Requirement already satisfied: nbconvert>=6.4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (7.16.4)\r\n", "Requirement already satisfied: nbformat>=5.3.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (5.10.4)\r\n", "Requirement already satisfied: prometheus-client>=0.9 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.20.0)\r\n", "Requirement already satisfied: pyzmq>=24 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (26.0.3)\r\n", "Requirement already satisfied: send2trash>=1.8.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.8.3)\r\n", "Requirement already satisfied: terminado>=0.8.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.18.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: async-lru>=1.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from 
jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.0.4)\r\n", "Requirement already satisfied: httpx>=0.25.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.27.0)\r\n", "Requirement already satisfied: ipykernel>=6.5.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (6.29.4)\r\n", "Requirement already satisfied: jupyter-lsp>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.2.5)\r\n", "Requirement already satisfied: tomli>=1.2.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.0.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: babel>=2.10 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab-server<3,>=2.22.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.15.0)\r\n", "Requirement already satisfied: json5>=0.9.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyterlab-server<3,>=2.22.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.9.25)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: sniffio>=1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from anyio>=3.1.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.3.1)\r\n", "Requirement already satisfied: exceptiongroup>=1.0.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from anyio>=3.1.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.2.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: argon2-cffi-bindings in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (21.2.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: httpcore==1.* in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from httpx>=0.25.0->jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.0.5)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: h11<0.15,>=0.13 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from httpcore==1.*->httpx>=0.25.0->jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.14.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: debugpy>=1.6.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from 
ipykernel>=6.5.0->jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.8.1)\r\n", "Requirement already satisfied: nest-asyncio in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from ipykernel>=6.5.0->jupyterlab<4.2,>=4.1.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.6.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: platformdirs>=2.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-core!=5.0.*,>=4.12->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.2.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: python-json-logger>=2.0.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.0.7)\r\n", "Requirement already satisfied: rfc3339-validator in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.1.4)\r\n", "Requirement already satisfied: rfc3986-validator>=0.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.1.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: beautifulsoup4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (4.12.3)\r\n", "Requirement already satisfied: bleach!=5.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (6.1.0)\r\n", "Requirement already satisfied: defusedxml in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.7.1)\r\n", "Requirement already satisfied: jupyterlab-pygments in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.3.0)\r\n", "Requirement already satisfied: mistune<4,>=2.0.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (3.0.2)\r\n", "Requirement already satisfied: nbclient>=0.5.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.10.0)\r\n", "Requirement already satisfied: pandocfilters>=1.4.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from 
nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.5.1)\r\n", "Requirement already satisfied: tinycss2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.3.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: fastjsonschema>=2.15 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from nbformat>=5.3.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.19.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: webencodings in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from bleach!=5.0.0->nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (0.5.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: fqdn in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.5.1)\r\n", "Requirement already satisfied: isoduration in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (20.11.0)\r\n", "Requirement already satisfied: jsonpointer>1.13 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.4)\r\n", "Requirement already satisfied: uri-template in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.3.0)\r\n", "Requirement already satisfied: webcolors>=1.11 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.13)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: cffi>=1.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.16.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: soupsieve>1.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from beautifulsoup4->nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.5)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pycparser in 
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.22)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: arrow>=0.15.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from isoduration->jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (1.3.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: types-python-dateutil>=2.8.10 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from arrow>=0.15.0->isoduration->jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets<8,>=7->tensorflow-model-analysis<0.47.0,>=0.46.0->tfx) (2.9.0.20240316)\r\n" ] } ], "source": [ "!pip install -U tfx" ] }, { "cell_type": "markdown", "metadata": { "id": "EwT0nov5QO1M" }, "source": [ "### Did you restart the runtime?\n", "\n", "If you are using Google Colab, the first time that you run\n", "the cell above, you must restart the runtime by clicking\n", "above \"RESTART RUNTIME\" button or using \"Runtime > Restart\n", "runtime ...\" menu. This is because of the way that Colab\n", "loads packages." ] }, { "cell_type": "markdown", "metadata": { "id": "BDnPgN8UJtzN" }, "source": [ "Check the TensorFlow and TFX versions." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:04.228851Z", "iopub.status.busy": "2024-05-08T09:36:04.228555Z", "iopub.status.idle": "2024-05-08T09:36:10.264916Z", "shell.execute_reply": "2024-05-08T09:36:10.264118Z" }, "id": "6jh7vKSRqPHb" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-05-08 09:36:04.670322: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-05-08 09:36:04.670389: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-05-08 09:36:04.671916: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TensorFlow version: 2.15.1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "TFX version: 1.15.0\n" ] } ], "source": [ "import tensorflow as tf\n", "print('TensorFlow version: {}'.format(tf.__version__))\n", "from tfx import v1 as tfx\n", "print('TFX version: {}'.format(tfx.__version__))" ] }, { "cell_type": "markdown", "metadata": { "id": "aDtLdSkvqPHe" }, "source": [ "### Set up variables\n", "\n", "There are some variables used to define a pipeline. You can customize these\n", "variables as you want. By default all output from the pipeline will be\n", "generated under the current directory." 
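, "\n", "\n", "The `metadata.db` paths below point to SQLite databases used by ML Metadata (MLMD) to record\n", "pipeline runs and their artifacts. As an optional, hedged aside, such a store can be inspected\n", "with the MLMD client after a pipeline has run; the sketch below assumes the schema pipeline\n", "defined later in this notebook has already been executed at least once, and `metadata_handler`\n", "is just an illustrative name.\n", "\n", "```python\n", "from tfx.orchestration import metadata\n", "\n", "# Open the MLMD store written by a pipeline run. SCHEMA_METADATA_PATH is\n", "# defined in the next cell; this is read-only inspection, not pipeline code.\n", "connection_config = metadata.sqlite_metadata_connection_config(SCHEMA_METADATA_PATH)\n", "with metadata.Metadata(connection_config) as metadata_handler:\n", "  artifact_types = metadata_handler.store.get_artifact_types()\n", "  print([t.name for t in artifact_types])\n", "```"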
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:10.268791Z", "iopub.status.busy": "2024-05-08T09:36:10.268375Z", "iopub.status.idle": "2024-05-08T09:36:10.274079Z", "shell.execute_reply": "2024-05-08T09:36:10.273392Z" }, "id": "EcUseqJaE2XN" }, "outputs": [], "source": [ "import os\n", "\n", "# We will create two pipelines. One for schema generation and one for training.\n", "SCHEMA_PIPELINE_NAME = \"penguin-tfdv-schema\"\n", "PIPELINE_NAME = \"penguin-tfdv\"\n", "\n", "# Output directory to store artifacts generated from the pipeline.\n", "SCHEMA_PIPELINE_ROOT = os.path.join('pipelines', SCHEMA_PIPELINE_NAME)\n", "PIPELINE_ROOT = os.path.join('pipelines', PIPELINE_NAME)\n", "# Path to a SQLite DB file to use as an MLMD storage.\n", "SCHEMA_METADATA_PATH = os.path.join('metadata', SCHEMA_PIPELINE_NAME,\n", " 'metadata.db')\n", "METADATA_PATH = os.path.join('metadata', PIPELINE_NAME, 'metadata.db')\n", "\n", "# Output directory where created models from the pipeline will be exported.\n", "SERVING_MODEL_DIR = os.path.join('serving_model', PIPELINE_NAME)\n", "\n", "from absl import logging\n", "logging.set_verbosity(logging.INFO) # Set default logging level." ] }, { "cell_type": "markdown", "metadata": { "id": "qsO0l5F3dzOr" }, "source": [ "### Prepare example data\n", "We will download the example dataset for use in our TFX pipeline. The dataset\n", "we are using is\n", "[Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/intro.html)\n", "which is also used in other\n", "[TFX examples](https://github.com/tensorflow/tfx/tree/master/tfx/examples/penguin).\n", "\n", "There are four numeric features in this dataset:\n", "\n", "- culmen_length_mm\n", "- culmen_depth_mm\n", "- flipper_length_mm\n", "- body_mass_g\n", "\n", "All features were already normalized to have range [0,1]. We will build a\n", "classification model which predicts the `species` of penguins." ] }, { "cell_type": "markdown", "metadata": { "id": "IjE8MkZidzO0" }, "source": [ "Because the TFX ExampleGen component reads inputs from a directory, we need\n", "to create a directory and copy the dataset to it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:10.277427Z", "iopub.status.busy": "2024-05-08T09:36:10.276833Z", "iopub.status.idle": "2024-05-08T09:36:10.393409Z", "shell.execute_reply": "2024-05-08T09:36:10.392804Z" }, "id": "ZSfs6qFgdzO1" }, "outputs": [ { "data": { "text/plain": [ "('/tmpfs/tmp/tfx-dataj_6ovg52/data.csv',\n", " )" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import urllib.request\n", "import tempfile\n", "\n", "DATA_ROOT = tempfile.mkdtemp(prefix='tfx-data') # Create a temporary directory.\n", "_data_url = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/penguin/data/labelled/penguins_processed.csv'\n", "_data_filepath = os.path.join(DATA_ROOT, \"data.csv\")\n", "urllib.request.urlretrieve(_data_url, _data_filepath)" ] }, { "cell_type": "markdown", "metadata": { "id": "n5s3wGpndzO1" }, "source": [ "Take a quick look at the CSV file." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:10.396861Z", "iopub.status.busy": "2024-05-08T09:36:10.396145Z", "iopub.status.idle": "2024-05-08T09:36:10.532791Z", "shell.execute_reply": "2024-05-08T09:36:10.531931Z" }, "id": "nLn9ith2dzO1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\r\n", "0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\r\n", "0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\r\n", "0,0.29818181818181805,0.5833333333333334,0.3898305084745763,0.1527777777777778\r\n", "0,0.16727272727272732,0.7380952380952381,0.3559322033898305,0.20833333333333334\r\n", "0,0.26181818181818167,0.892857142857143,0.3050847457627119,0.2638888888888889\r\n", "0,0.24727272727272717,0.5595238095238096,0.15254237288135594,0.2569444444444444\r\n", "0,0.25818181818181823,0.773809523809524,0.3898305084745763,0.5486111111111112\r\n", "0,0.32727272727272727,0.5357142857142859,0.1694915254237288,0.1388888888888889\r\n", "0,0.23636363636363636,0.9642857142857142,0.3220338983050847,0.3055555555555556\r\n" ] } ], "source": [ "!head {_data_filepath}" ] }, { "cell_type": "markdown", "metadata": { "id": "z8EOfCy1dzO2" }, "source": [ "You should be able to see five feature columns. `species` is one of 0, 1 or 2,\n", "and all other features should have values between 0 and 1. We will create a TFX\n", "pipeline to analyze this dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "ePhfeYv0fVu1" }, "source": [ "## Generate a preliminary schema\n", "\n", "TFX pipelines are defined using Python APIs. We will create a pipeline to\n", "generate a schema from the input examples automatically. This schema can be\n", "reviewed by a human and adjusted as needed. Once the schema is finalized it can\n", "be used for training and example validation in later tasks.\n", "\n", "In addition to `CsvExampleGen` which is used in\n", "[Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple),\n", "we will use `StatisticsGen` and `SchemaGen`:\n", "\n", "- [StatisticsGen](https://www.tensorflow.org/tfx/guide/statsgen) calculates\n", "statistics for the dataset.\n", "- [SchemaGen](https://www.tensorflow.org/tfx/guide/schemagen) examines the\n", "statistics and creates an initial data schema.\n", "\n", "See the guides for each component or\n", "[TFX components tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/components_keras)\n", "to learn more on these components." ] }, { "cell_type": "markdown", "metadata": { "id": "JUFq55kCgwsm" }, "source": [ "### Write a pipeline definition\n", "\n", "We define a function to create a TFX pipeline. A `Pipeline` object\n", "represents a TFX pipeline which can be run using one of pipeline\n", "orchestration systems that TFX supports." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:10.536692Z", "iopub.status.busy": "2024-05-08T09:36:10.536403Z", "iopub.status.idle": "2024-05-08T09:36:10.542512Z", "shell.execute_reply": "2024-05-08T09:36:10.541918Z" }, "id": "GfQ6FAk9gxJ2" }, "outputs": [], "source": [ "def _create_schema_pipeline(pipeline_name: str,\n", " pipeline_root: str,\n", " data_root: str,\n", " metadata_path: str) -> tfx.dsl.Pipeline:\n", " \"\"\"Creates a pipeline for schema generation.\"\"\"\n", " # Brings data into the pipeline.\n", " example_gen = tfx.components.CsvExampleGen(input_base=data_root)\n", "\n", " # NEW: Computes statistics over data for visualization and schema generation.\n", " statistics_gen = tfx.components.StatisticsGen(\n", " examples=example_gen.outputs['examples'])\n", "\n", " # NEW: Generates schema based on the generated statistics.\n", " schema_gen = tfx.components.SchemaGen(\n", " statistics=statistics_gen.outputs['statistics'], infer_feature_shape=True)\n", "\n", " components = [\n", " example_gen,\n", " statistics_gen,\n", " schema_gen,\n", " ]\n", "\n", " return tfx.dsl.Pipeline(\n", " pipeline_name=pipeline_name,\n", " pipeline_root=pipeline_root,\n", " metadata_connection_config=tfx.orchestration.metadata\n", " .sqlite_metadata_connection_config(metadata_path),\n", " components=components)" ] }, { "cell_type": "markdown", "metadata": { "id": "RuKFLI_Og2xr" }, "source": [ "### Run the pipeline\n", "\n", "We will use `LocalDagRunner` as in the previous tutorial." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:10.545637Z", "iopub.status.busy": "2024-05-08T09:36:10.545386Z", "iopub.status.idle": "2024-05-08T09:36:15.188385Z", "shell.execute_reply": "2024-05-08T09:36:15.187625Z" }, "id": "BQspf0ajg9AO" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Excluding no splits because exclude_splits is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Excluding no splits because exclude_splits is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Using deployment config:\n", " executor_specs {\n", " key: \"CsvExampleGen\"\n", " value {\n", " beam_executable_spec {\n", " python_executor_spec {\n", " class_path: \"tfx.components.example_gen.csv_example_gen.executor.Executor\"\n", " }\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: \"SchemaGen\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.schema_gen.executor.Executor\"\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: \"StatisticsGen\"\n", " value {\n", " beam_executable_spec {\n", " python_executor_spec {\n", " class_path: \"tfx.components.statistics_gen.executor.Executor\"\n", " }\n", " }\n", " }\n", "}\n", "custom_driver_specs {\n", " key: \"CsvExampleGen\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.example_gen.driver.FileBasedDriver\"\n", " }\n", " }\n", "}\n", "metadata_connection_config {\n", " database_connection_config {\n", " sqlite {\n", " filename_uri: \"metadata/penguin-tfdv-schema/metadata.db\"\n", " connection_mode: READWRITE_OPENCREATE\n", " }\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Using connection config:\n", " sqlite {\n", " filename_uri: \"metadata/penguin-tfdv-schema/metadata.db\"\n", " connection_mode: READWRITE_OPENCREATE\n", "}\n", 
"\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component CsvExampleGen is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.example_gen.csv_example_gen.component.CsvExampleGen\"\n", " }\n", " id: \"CsvExampleGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.CsvExampleGen\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"examples\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Examples\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " properties {\n", " key: \"version\"\n", " value: INT\n", " }\n", " base_type: DATASET\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"input_base\"\n", " value {\n", " field_value {\n", " string_value: \"/tmpfs/tmp/tfx-dataj_6ovg52\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"input_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"splits\\\": [\\n {\\n \\\"name\\\": \\\"single_split\\\",\\n \\\"pattern\\\": \\\"*\\\"\\n }\\n ]\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"split_config\\\": {\\n \\\"splits\\\": [\\n {\\n \\\"hash_buckets\\\": 2,\\n \\\"name\\\": \\\"train\\\"\\n },\\n {\\n \\\"hash_buckets\\\": 1,\\n \\\"name\\\": \\\"eval\\\"\\n }\\n ]\\n }\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_data_format\"\n", " value {\n", " field_value {\n", " int_value: 6\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_file_format\"\n", " value {\n", " field_value {\n", " int_value: 5\n", " }\n", " }\n", " }\n", "}\n", "downstream_nodes: \"StatisticsGen\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[CsvExampleGen] Resolved inputs: ({},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:select span and version = (0, None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:latest span and version = (0, None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=1, input_dict={}, output_dict=defaultdict(, {'examples': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1\"\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: 
\"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", ", artifact_type: name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}), exec_properties={'output_data_format': 6, 'output_file_format': 5, 'input_config': '{\\n \"splits\": [\\n {\\n \"name\": \"single_split\",\\n \"pattern\": \"*\"\\n }\\n ]\\n}', 'output_config': '{\\n \"split_config\": {\\n \"splits\": [\\n {\\n \"hash_buckets\": 2,\\n \"name\": \"train\"\\n },\\n {\\n \"hash_buckets\": 1,\\n \"name\": \"eval\"\\n }\\n ]\\n }\\n}', 'input_base': '/tmpfs/tmp/tfx-dataj_6ovg52', 'span': 0, 'version': None, 'input_fingerprint': 'split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970'}, execution_output_uri='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/executor_execution/1/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/stateful_working_dir/d65151e8-a6c7-4b12-8076-f56938dd89f4', tmp_dir='pipelines/penguin-tfdv-schema/CsvExampleGen/.system/executor_execution/1/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.example_gen.csv_example_gen.component.CsvExampleGen\"\n", " }\n", " id: \"CsvExampleGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.CsvExampleGen\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"examples\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Examples\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " properties {\n", " key: \"version\"\n", " value: INT\n", " }\n", " base_type: DATASET\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"input_base\"\n", " value {\n", " field_value {\n", " string_value: \"/tmpfs/tmp/tfx-dataj_6ovg52\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"input_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"splits\\\": [\\n {\\n \\\"name\\\": \\\"single_split\\\",\\n \\\"pattern\\\": \\\"*\\\"\\n }\\n ]\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"split_config\\\": {\\n \\\"splits\\\": [\\n {\\n \\\"hash_buckets\\\": 2,\\n \\\"name\\\": \\\"train\\\"\\n },\\n {\\n \\\"hash_buckets\\\": 1,\\n \\\"name\\\": \\\"eval\\\"\\n }\\n ]\\n }\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_data_format\"\n", " value {\n", " field_value {\n", " int_value: 6\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_file_format\"\n", " value {\n", " field_value {\n", " int_value: 5\n", " }\n", " }\n", " }\n", "}\n", "downstream_nodes: \"StatisticsGen\"\n", 
"execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv-schema\"\n", ", pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating examples.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.\n" ] }, { "data": { "application/javascript": [ "\n", " if (typeof window.interactive_beam_jquery == 'undefined') {\n", " var jqueryScript = document.createElement('script');\n", " jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n", " jqueryScript.type = 'text/javascript';\n", " jqueryScript.onload = function() {\n", " var datatableScript = document.createElement('script');\n", " datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n", " datatableScript.type = 'text/javascript';\n", " datatableScript.onload = function() {\n", " window.interactive_beam_jquery = jQuery.noConflict(true);\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }\n", " document.head.appendChild(datatableScript);\n", " };\n", " document.head.appendChild(jqueryScript);\n", " } else {\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Processing input csv data /tmpfs/tmp/tfx-dataj_6ovg52/* to TFExample.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Examples generated.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Value type of key version in exec_properties is not supported, going to drop it\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Value type of key _beam_pipeline_args in exec_properties is not supported, going to drop it\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 1 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv-schema/CsvExampleGen/.system/stateful_working_dir/d65151e8-a6c7-4b12-8076-f56938dd89f4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'examples': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1\"\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", ", artifact_type: name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " 
value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}) for execution 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component CsvExampleGen is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component StatisticsGen is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.statistics_gen.component.StatisticsGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"StatisticsGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.StatisticsGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"statistics\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleStatistics\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " base_type: STATISTICS\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "downstream_nodes: \"SchemaGen\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[StatisticsGen] Resolved inputs: ({'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: \"pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: 
\"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160971690\n", "last_update_time_since_epoch: 1715160971690\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=2, input_dict={'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: \"pipelines/penguin-tfdv-schema/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: \"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160971690\n", "last_update_time_since_epoch: 1715160971690\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}, output_dict=defaultdict(, {'statistics': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2\"\n", ", artifact_type: name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}), exec_properties={'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv-schema/StatisticsGen/.system/executor_execution/2/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/StatisticsGen/.system/stateful_working_dir/3dc1ed50-c155-41f6-8457-b71a0b0ebe51', 
tmp_dir='pipelines/penguin-tfdv-schema/StatisticsGen/.system/executor_execution/2/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.statistics_gen.component.StatisticsGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"StatisticsGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.StatisticsGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"statistics\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleStatistics\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " base_type: STATISTICS\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "downstream_nodes: \"SchemaGen\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv-schema\"\n", ", pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating statistics for split train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Statistics for split train written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/Split-train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating statistics for split eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2/Split-eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 2 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir 
pipelines/penguin-tfdv-schema/StatisticsGen/.system/stateful_working_dir/3dc1ed50-c155-41f6-8457-b71a0b0ebe51\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'statistics': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2\"\n", ", artifact_type: name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}) for execution 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component StatisticsGen is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component SchemaGen is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.schema_gen.component.SchemaGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"SchemaGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.SchemaGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"statistics\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"StatisticsGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.StatisticsGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"ExampleStatistics\"\n", " base_type: STATISTICS\n", " }\n", " }\n", " output_key: \"statistics\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"schema\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"infer_feature_shape\"\n", " value {\n", " field_value {\n", " int_value: 1\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"StatisticsGen\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[SchemaGen] Resolved inputs: ({'statistics': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: 
\"pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2\"\n", "properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"stats_dashboard_link\"\n", " value {\n", " string_value: \"\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"ExampleStatistics\"\n", "create_time_since_epoch: 1715160975131\n", "last_update_time_since_epoch: 1715160975131\n", ", artifact_type: id: 17\n", "name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=3, input_dict={'statistics': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: \"pipelines/penguin-tfdv-schema/StatisticsGen/statistics/2\"\n", "properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"stats_dashboard_link\"\n", " value {\n", " string_value: \"\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"ExampleStatistics\"\n", "create_time_since_epoch: 1715160975131\n", "last_update_time_since_epoch: 1715160975131\n", ", artifact_type: id: 17\n", "name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}, output_dict=defaultdict(, {'schema': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/SchemaGen/schema/3\"\n", ", artifact_type: name: \"Schema\"\n", ")]}), exec_properties={'infer_feature_shape': 1, 'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv-schema/SchemaGen/.system/executor_execution/3/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv-schema/SchemaGen/.system/stateful_working_dir/9cb63ad1-17a3-4aaa-a3d3-5059f958bf6f', tmp_dir='pipelines/penguin-tfdv-schema/SchemaGen/.system/executor_execution/3/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.schema_gen.component.SchemaGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"SchemaGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " 
field_value {\n", " string_value: \"penguin-tfdv-schema.SchemaGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"statistics\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"StatisticsGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:10.555564\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv-schema.StatisticsGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"ExampleStatistics\"\n", " base_type: STATISTICS\n", " }\n", " }\n", " output_key: \"statistics\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"schema\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"infer_feature_shape\"\n", " value {\n", " field_value {\n", " int_value: 1\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"StatisticsGen\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv-schema\"\n", ", pipeline_run_id='2024-05-08T09:36:10.555564', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Processing schema from statistics for split train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Processing schema from statistics for split eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Schema written to pipelines/penguin-tfdv-schema/SchemaGen/schema/3/schema.pbtxt.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 3 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv-schema/SchemaGen/.system/stateful_working_dir/9cb63ad1-17a3-4aaa-a3d3-5059f958bf6f\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'schema': [Artifact(artifact: uri: \"pipelines/penguin-tfdv-schema/SchemaGen/schema/3\"\n", ", artifact_type: name: \"Schema\"\n", ")]}) for execution 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component SchemaGen is finished.\n" ] } ], "source": [ "tfx.orchestration.LocalDagRunner().run(\n", " _create_schema_pipeline(\n", " pipeline_name=SCHEMA_PIPELINE_NAME,\n", " pipeline_root=SCHEMA_PIPELINE_ROOT,\n", " data_root=DATA_ROOT,\n", " metadata_path=SCHEMA_METADATA_PATH))" ] }, { "cell_type": "markdown", "metadata": { "id": "VD4LsLHBi2O4" }, "source": [ "You should see \"INFO:absl:Component SchemaGen is finished.\" if the pipeline\n", "finished 
successfully.\n", "\n", "We will examine the output of the pipeline to understand our dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "lWpckstgg9Zs" }, "source": [ "### Review outputs of the pipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "tL1wWoDh5wkj" }, "source": [ "As explained in the previous tutorial, a TFX pipeline produces two kinds of\n", "outputs: artifacts and a\n", "[metadata DB (MLMD)](https://www.tensorflow.org/tfx/guide/mlmd) which contains\n", "metadata of artifacts and pipeline executions. We defined the location of\n", "these outputs in the above cells. By default, artifacts are stored under\n", "the `pipelines` directory and metadata is stored as an SQLite database\n", "under the `metadata` directory.\n", "\n", "You can use MLMD APIs to locate these outputs programmatically. First, we will\n", "define some utility functions to search for the output artifacts that were just\n", "produced.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.192511Z", "iopub.status.busy": "2024-05-08T09:36:15.191794Z", "iopub.status.idle": "2024-05-08T09:36:15.199393Z", "shell.execute_reply": "2024-05-08T09:36:15.198794Z" }, "id": "K0i_jTvOI8mv" }, "outputs": [], "source": [ "from ml_metadata.proto import metadata_store_pb2\n", "# Non-public APIs, just for showcase.\n", "from tfx.orchestration.portable.mlmd import execution_lib\n", "\n", "# TODO(b/171447278): Move these functions into the TFX library.\n", "\n", "def get_latest_artifacts(metadata, pipeline_name, component_id):\n", "  \"\"\"Output artifacts of the latest run of the component.\"\"\"\n", "  context = metadata.store.get_context_by_type_and_name(\n", "      'node', f'{pipeline_name}.{component_id}')\n", "  executions = metadata.store.get_executions_by_context(context.id)\n", "  latest_execution = max(executions,\n", "                         key=lambda e: e.last_update_time_since_epoch)\n", "  return execution_lib.get_output_artifacts(metadata, latest_execution.id)\n", "\n", "# Non-public APIs, just for showcase.\n", "from tfx.orchestration.experimental.interactive import visualizations\n", "\n", "def visualize_artifacts(artifacts):\n", "  \"\"\"Visualizes artifacts using standard visualization modules.\"\"\"\n", "  for artifact in artifacts:\n", "    visualization = visualizations.get_registry().get_visualization(\n", "        artifact.type_name)\n", "    if visualization:\n", "      visualization.display(artifact)\n", "\n", "from tfx.orchestration.experimental.interactive import standard_visualizations\n", "standard_visualizations.register_standard_visualizations()" ] }, { "cell_type": "markdown", "metadata": { "id": "2CE1dk_3irPL" }, "source": [ "Now we can examine the outputs from the pipeline execution."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.203046Z", "iopub.status.busy": "2024-05-08T09:36:15.202475Z", "iopub.status.idle": "2024-05-08T09:36:15.213054Z", "shell.execute_reply": "2024-05-08T09:36:15.212464Z" }, "id": "hRKSjXzsiqh0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] } ], "source": [ "# Non-public APIs, just for showcase.\n", "from tfx.orchestration.metadata import Metadata\n", "from tfx.types import standard_component_specs\n", "\n", "metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(\n", " SCHEMA_METADATA_PATH)\n", "\n", "with Metadata(metadata_connection_config) as metadata_handler:\n", " # Find output artifacts from MLMD.\n", " stat_gen_output = get_latest_artifacts(metadata_handler, SCHEMA_PIPELINE_NAME,\n", " 'StatisticsGen')\n", " stats_artifacts = stat_gen_output[standard_component_specs.STATISTICS_KEY]\n", "\n", " schema_gen_output = get_latest_artifacts(metadata_handler,\n", " SCHEMA_PIPELINE_NAME, 'SchemaGen')\n", " schema_artifacts = schema_gen_output[standard_component_specs.SCHEMA_KEY]" ] }, { "cell_type": "markdown", "metadata": { "id": "9e8i0K-Aiqh-" }, "source": [ "It is time to examine the outputs from each component. As described above,\n", "[Tensorflow Data Validation(TFDV)](https://www.tensorflow.org/tfx/data_validation/get_started)\n", "is used in `StatisticsGen` and `SchemaGen`, and TFDV also\n", "provides visualization of the outputs from these components.\n", "\n", "In this tutorial, we will use the visualization helper methods in TFX which\n", "use TFDV internally to show the visualization." ] }, { "cell_type": "markdown", "metadata": { "id": "GRGC4X1Ziqh-" }, "source": [ "#### Examine the output from StatisticsGen\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3StnKm04iqh-", "scrolled": true }, "outputs": [], "source": [ "# docs-infra: no-execute\n", "visualize_artifacts(stats_artifacts)" ] }, { "cell_type": "markdown", "metadata": { "id": "JPfVPFTW0Jh2" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "yS1XXFtfiqh-" }, "source": [ "You can see various stats for the input data. These statistics are supplied to\n", "`SchemaGen` to construct an initial schema of data automatically.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "20HK9JS7iqh-" }, "source": [ "#### Examine the output from SchemaGen\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.217148Z", "iopub.status.busy": "2024-05-08T09:36:15.216531Z", "iopub.status.idle": "2024-05-08T09:36:15.230856Z", "shell.execute_reply": "2024-05-08T09:36:15.230241Z" }, "id": "MVmlot5ziqh_" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypePresenceValencyDomain
Feature name
'body_mass_g'FLOATrequired-
'culmen_depth_mm'FLOATrequired-
'culmen_length_mm'FLOATrequired-
'flipper_length_mm'FLOATrequired-
'species'INTrequired-
\n", "
" ], "text/plain": [ " Type Presence Valency Domain\n", "Feature name \n", "'body_mass_g' FLOAT required -\n", "'culmen_depth_mm' FLOAT required -\n", "'culmen_length_mm' FLOAT required -\n", "'flipper_length_mm' FLOAT required -\n", "'species' INT required -" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualize_artifacts(schema_artifacts)" ] }, { "cell_type": "markdown", "metadata": { "id": "8ldXsv2iiqh_" }, "source": [ "This schema is automatically inferred from the output of StatisticsGen. You\n", "should be able to see 4 FLOAT features and 1 INT feature." ] }, { "cell_type": "markdown", "metadata": { "id": "bKpFPwEWhCoB" }, "source": [ "### Export the schema for future use\n", "\n", "We need to review and refine the generated schema. The reviewed schema needs\n", "to be persisted to be used in subsequent pipelines for ML model training. In\n", "other words, you might want to add the schema file to your version control\n", "system for actual use cases. In this tutorial, we will just copy the schema\n", "to a predefined filesystem path for simplicity.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.234486Z", "iopub.status.busy": "2024-05-08T09:36:15.233933Z", "iopub.status.idle": "2024-05-08T09:36:15.239486Z", "shell.execute_reply": "2024-05-08T09:36:15.238813Z" }, "id": "0Pyi0oaKmRTg" }, "outputs": [ { "data": { "text/plain": [ "'schema/schema.pbtxt'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import shutil\n", "\n", "_schema_filename = 'schema.pbtxt'\n", "SCHEMA_PATH = 'schema'\n", "\n", "os.makedirs(SCHEMA_PATH, exist_ok=True)\n", "_generated_path = os.path.join(schema_artifacts[0].uri, _schema_filename)\n", "\n", "# Copy the 'schema.pbtxt' file from the artifact uri to a predefined path.\n", "shutil.copy(_generated_path, SCHEMA_PATH)" ] }, { "cell_type": "markdown", "metadata": { "id": "05U8uQ6dnlB4" }, "source": [ "The schema file uses\n", "[Protocol Buffer text format](https://googleapis.dev/python/protobuf/latest/google/protobuf/text_format.html)\n", "and an instance of\n", "[TensorFlow Metadata Schema proto](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto)." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.242779Z", "iopub.status.busy": "2024-05-08T09:36:15.242507Z", "iopub.status.idle": "2024-05-08T09:36:15.383833Z", "shell.execute_reply": "2024-05-08T09:36:15.382703Z" }, "id": "uwHO7-HfnlWs" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Schema at schema-----\n", "feature {\r\n", " name: \"body_mass_g\"\r\n", " type: FLOAT\r\n", " presence {\r\n", " min_fraction: 1.0\r\n", " min_count: 1\r\n", " }\r\n", " shape {\r\n", " dim {\r\n", " size: 1\r\n", " }\r\n", " }\r\n", "}\r\n", "feature {\r\n", " name: \"culmen_depth_mm\"\r\n", " type: FLOAT\r\n", " presence {\r\n", " min_fraction: 1.0\r\n", " min_count: 1\r\n", " }\r\n", " shape {\r\n", " dim {\r\n", " size: 1\r\n", " }\r\n", " }\r\n", "}\r\n", "feature {\r\n", " name: \"culmen_length_mm\"\r\n", " type: FLOAT\r\n", " presence {\r\n", " min_fraction: 1.0\r\n", " min_count: 1\r\n", " }\r\n", " shape {\r\n", " dim {\r\n", " size: 1\r\n", " }\r\n", " }\r\n", "}\r\n", "feature {\r\n", " name: \"flipper_length_mm\"\r\n", " type: FLOAT\r\n", " presence {\r\n", " min_fraction: 1.0\r\n", " min_count: 1\r\n", " }\r\n", " shape {\r\n", " dim {\r\n", " size: 1\r\n", " }\r\n", " }\r\n", "}\r\n", "feature {\r\n", " name: \"species\"\r\n", " type: INT\r\n", " presence {\r\n", " min_fraction: 1.0\r\n", " min_count: 1\r\n", " }\r\n", " shape {\r\n", " dim {\r\n", " size: 1\r\n", " }\r\n", " }\r\n", "}\r\n" ] } ], "source": [ "print(f'Schema at {SCHEMA_PATH}-----')\n", "!cat {SCHEMA_PATH}/*" ] }, { "cell_type": "markdown", "metadata": { "id": "BjKigLTNos4F" }, "source": [ "You should be sure to review and possibly edit the schema definition as\n", "needed. In this tutorial, we will just use the generated schema unchanged.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "nH6gizcpSwWV" }, "source": [ "## Validate input examples and train an ML model\n", "\n", "We will go back to the pipeline that we created in\n", "[Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple),\n", "to train an ML model and use the generated schema for writing the model\n", "training code.\n", "\n", "We will also add an\n", "[ExampleValidator](https://www.tensorflow.org/tfx/guide/exampleval)\n", "component which will look for anomalies and missing values in the incoming\n", "dataset with respect to the schema.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lOjDv93eS5xV" }, "source": [ "### Write model training code\n", "\n", "We need to write the model code as we did in\n", "[Simple TFX Pipeline Tutorial](https://www.tensorflow.org/tfx/tutorials/tfx/penguin_simple).\n", "\n", "The model itself is the same as in the previous tutorial, but this time we will\n", "use the schema generated from the previous pipeline instead of specifying\n", "features manually. Most of the code was not changed. The only difference is\n", "that we do not need to specify the names and types of features in this file.\n", "Instead, we read them from the *schema* file." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.388383Z", "iopub.status.busy": "2024-05-08T09:36:15.387768Z", "iopub.status.idle": "2024-05-08T09:36:15.391739Z", "shell.execute_reply": "2024-05-08T09:36:15.391103Z" }, "id": "aES7Hv5QTDK3" }, "outputs": [], "source": [ "_trainer_module_file = 'penguin_trainer.py'" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.395022Z", "iopub.status.busy": "2024-05-08T09:36:15.394527Z", "iopub.status.idle": "2024-05-08T09:36:15.401498Z", "shell.execute_reply": "2024-05-08T09:36:15.400872Z" }, "id": "Gnc67uQNTDfW" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing penguin_trainer.py\n" ] } ], "source": [ "%%writefile {_trainer_module_file}\n", "\n", "from typing import List\n", "from absl import logging\n", "import tensorflow as tf\n", "from tensorflow import keras\n", "from tensorflow_transform.tf_metadata import schema_utils\n", "\n", "from tfx import v1 as tfx\n", "from tfx_bsl.public import tfxio\n", "from tensorflow_metadata.proto.v0 import schema_pb2\n", "\n", "# We don't need to specify _FEATURE_KEYS and _FEATURE_SPEC any more.\n", "# Those information can be read from the given schema file.\n", "\n", "_LABEL_KEY = 'species'\n", "\n", "_TRAIN_BATCH_SIZE = 20\n", "_EVAL_BATCH_SIZE = 10\n", "\n", "def _input_fn(file_pattern: List[str],\n", " data_accessor: tfx.components.DataAccessor,\n", " schema: schema_pb2.Schema,\n", " batch_size: int = 200) -> tf.data.Dataset:\n", " \"\"\"Generates features and label for training.\n", "\n", " Args:\n", " file_pattern: List of paths or patterns of input tfrecord files.\n", " data_accessor: DataAccessor for converting input to RecordBatch.\n", " schema: schema of the input data.\n", " batch_size: representing the number of consecutive elements of returned\n", " dataset to combine in a single batch\n", "\n", " Returns:\n", " A dataset that contains (features, indices) tuple where features is a\n", " dictionary of Tensors, and indices is a single Tensor of label indices.\n", " \"\"\"\n", " return data_accessor.tf_dataset_factory(\n", " file_pattern,\n", " tfxio.TensorFlowDatasetOptions(\n", " batch_size=batch_size, label_key=_LABEL_KEY),\n", " schema=schema).repeat()\n", "\n", "\n", "def _build_keras_model(schema: schema_pb2.Schema) -> tf.keras.Model:\n", " \"\"\"Creates a DNN Keras model for classifying penguin data.\n", "\n", " Returns:\n", " A Keras Model.\n", " \"\"\"\n", " # The model below is built with Functional API, please refer to\n", " # https://www.tensorflow.org/guide/keras/overview for all API options.\n", "\n", " # ++ Changed code: Uses all features in the schema except the label.\n", " feature_keys = [f.name for f in schema.feature if f.name != _LABEL_KEY]\n", " inputs = [keras.layers.Input(shape=(1,), name=f) for f in feature_keys]\n", " # ++ End of the changed code.\n", "\n", " d = keras.layers.concatenate(inputs)\n", " for _ in range(2):\n", " d = keras.layers.Dense(8, activation='relu')(d)\n", " outputs = keras.layers.Dense(3)(d)\n", "\n", " model = keras.Model(inputs=inputs, outputs=outputs)\n", " model.compile(\n", " optimizer=keras.optimizers.Adam(1e-2),\n", " loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),\n", " metrics=[keras.metrics.SparseCategoricalAccuracy()])\n", "\n", " model.summary(print_fn=logging.info)\n", " return model\n", "\n", "\n", "# TFX Trainer will call this function.\n", "def 
run_fn(fn_args: tfx.components.FnArgs):\n", " \"\"\"Train the model based on given args.\n", "\n", " Args:\n", " fn_args: Holds args used to train the model as name/value pairs.\n", " \"\"\"\n", "\n", " # ++ Changed code: Reads in schema file passed to the Trainer component.\n", " schema = tfx.utils.parse_pbtxt_file(fn_args.schema_path, schema_pb2.Schema())\n", " # ++ End of the changed code.\n", "\n", " train_dataset = _input_fn(\n", " fn_args.train_files,\n", " fn_args.data_accessor,\n", " schema,\n", " batch_size=_TRAIN_BATCH_SIZE)\n", " eval_dataset = _input_fn(\n", " fn_args.eval_files,\n", " fn_args.data_accessor,\n", " schema,\n", " batch_size=_EVAL_BATCH_SIZE)\n", "\n", " model = _build_keras_model(schema)\n", " model.fit(\n", " train_dataset,\n", " steps_per_epoch=fn_args.train_steps,\n", " validation_data=eval_dataset,\n", " validation_steps=fn_args.eval_steps)\n", "\n", " # The result of the training should be saved in `fn_args.serving_model_dir`\n", " # directory.\n", " model.save(fn_args.serving_model_dir, save_format='tf')" ] }, { "cell_type": "markdown", "metadata": { "id": "blaw0rs-emEf" }, "source": [ "Now you have completed all preparation steps to build a TFX pipeline for\n", "model training." ] }, { "cell_type": "markdown", "metadata": { "id": "w3OkNz3gTLwM" }, "source": [ "### Write a pipeline definition\n", "\n", "We will add two new components, `Importer` and `ExampleValidator`. Importer\n", "brings an external file into the TFX pipeline. In this case, it is a file\n", "containing schema definition. ExampleValidator will examine\n", "the input data and validate whether all input data conforms the data schema\n", "we provided.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.404961Z", "iopub.status.busy": "2024-05-08T09:36:15.404416Z", "iopub.status.idle": "2024-05-08T09:36:15.412284Z", "shell.execute_reply": "2024-05-08T09:36:15.411662Z" }, "id": "M49yYVNBTPd4" }, "outputs": [], "source": [ "def _create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str,\n", " schema_path: str, module_file: str, serving_model_dir: str,\n", " metadata_path: str) -> tfx.dsl.Pipeline:\n", " \"\"\"Creates a pipeline using predefined schema with TFX.\"\"\"\n", " # Brings data into the pipeline.\n", " example_gen = tfx.components.CsvExampleGen(input_base=data_root)\n", "\n", " # Computes statistics over data for visualization and example validation.\n", " statistics_gen = tfx.components.StatisticsGen(\n", " examples=example_gen.outputs['examples'])\n", "\n", " # NEW: Import the schema.\n", " schema_importer = tfx.dsl.Importer(\n", " source_uri=schema_path,\n", " artifact_type=tfx.types.standard_artifacts.Schema).with_id(\n", " 'schema_importer')\n", "\n", " # NEW: Performs anomaly detection based on statistics and data schema.\n", " example_validator = tfx.components.ExampleValidator(\n", " statistics=statistics_gen.outputs['statistics'],\n", " schema=schema_importer.outputs['result'])\n", "\n", " # Uses user-provided Python function that trains a model.\n", " trainer = tfx.components.Trainer(\n", " module_file=module_file,\n", " examples=example_gen.outputs['examples'],\n", " schema=schema_importer.outputs['result'], # Pass the imported schema.\n", " train_args=tfx.proto.TrainArgs(num_steps=100),\n", " eval_args=tfx.proto.EvalArgs(num_steps=5))\n", "\n", " # Pushes the model to a filesystem destination.\n", " pusher = tfx.components.Pusher(\n", " model=trainer.outputs['model'],\n", " 
push_destination=tfx.proto.PushDestination(\n", " filesystem=tfx.proto.PushDestination.Filesystem(\n", " base_directory=serving_model_dir)))\n", "\n", " components = [\n", " example_gen,\n", "\n", " # NEW: Following three components were added to the pipeline.\n", " statistics_gen,\n", " schema_importer,\n", " example_validator,\n", "\n", " trainer,\n", " pusher,\n", " ]\n", "\n", " return tfx.dsl.Pipeline(\n", " pipeline_name=pipeline_name,\n", " pipeline_root=pipeline_root,\n", " metadata_connection_config=tfx.orchestration.metadata\n", " .sqlite_metadata_connection_config(metadata_path),\n", " components=components)" ] }, { "cell_type": "markdown", "metadata": { "id": "mJbq07THU2GV" }, "source": [ "### Run the pipeline\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:15.415259Z", "iopub.status.busy": "2024-05-08T09:36:15.414859Z", "iopub.status.idle": "2024-05-08T09:36:28.322883Z", "shell.execute_reply": "2024-05-08T09:36:28.322141Z" }, "id": "fAtfOZTYWJu-" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Excluding no splits because exclude_splits is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Excluding no splits because exclude_splits is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating ephemeral wheel package for '/tmpfs/src/temp/docs/tutorials/tfx/penguin_trainer.py' (including modules: ['penguin_trainer']).\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:User module package has hash fingerprint version 000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Executing: ['/tmpfs/src/tf_docs_env/bin/python', '/tmpfs/tmp/tmpw96a2pj7/_tfx_generated_setup.py', 'bdist_wheel', '--bdist-dir', '/tmpfs/tmp/tmp42iap5mu', '--dist-dir', '/tmpfs/tmp/tmpx8p04zcg']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/tmpfs/src/tf_docs_env/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.\n", "!!\n", "\n", " ********************************************************************************\n", " Please avoid running ``setup.py`` directly.\n", " Instead, use pypa/build, pypa/installer or other\n", " standards-based tools.\n", "\n", " See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.\n", " ********************************************************************************\n", "\n", "!!\n", " self.initialize_options()\n", "INFO:absl:Successfully built user code wheel distribution at 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'; target user module is 'penguin_trainer'.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Full user module path is 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Using deployment config:\n", " executor_specs {\n", " key: \"CsvExampleGen\"\n", " value {\n", " beam_executable_spec {\n", " python_executor_spec {\n", " class_path: \"tfx.components.example_gen.csv_example_gen.executor.Executor\"\n", " }\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: 
\"ExampleValidator\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.example_validator.executor.Executor\"\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: \"Pusher\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.pusher.executor.Executor\"\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: \"StatisticsGen\"\n", " value {\n", " beam_executable_spec {\n", " python_executor_spec {\n", " class_path: \"tfx.components.statistics_gen.executor.Executor\"\n", " }\n", " }\n", " }\n", "}\n", "executor_specs {\n", " key: \"Trainer\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.trainer.executor.GenericExecutor\"\n", " }\n", " }\n", "}\n", "custom_driver_specs {\n", " key: \"CsvExampleGen\"\n", " value {\n", " python_class_executable_spec {\n", " class_path: \"tfx.components.example_gen.driver.FileBasedDriver\"\n", " }\n", " }\n", "}\n", "metadata_connection_config {\n", " database_connection_config {\n", " sqlite {\n", " filename_uri: \"metadata/penguin-tfdv/metadata.db\"\n", " connection_mode: READWRITE_OPENCREATE\n", " }\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Using connection config:\n", " sqlite {\n", " filename_uri: \"metadata/penguin-tfdv/metadata.db\"\n", " connection_mode: READWRITE_OPENCREATE\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component CsvExampleGen is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.example_gen.csv_example_gen.component.CsvExampleGen\"\n", " }\n", " id: \"CsvExampleGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"examples\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Examples\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " properties {\n", " key: \"version\"\n", " value: INT\n", " }\n", " base_type: DATASET\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"input_base\"\n", " value {\n", " field_value {\n", " string_value: \"/tmpfs/tmp/tfx-dataj_6ovg52\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"input_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"splits\\\": [\\n {\\n \\\"name\\\": \\\"single_split\\\",\\n \\\"pattern\\\": \\\"*\\\"\\n }\\n ]\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"split_config\\\": {\\n \\\"splits\\\": [\\n {\\n \\\"hash_buckets\\\": 2,\\n \\\"name\\\": \\\"train\\\"\\n },\\n {\\n \\\"hash_buckets\\\": 1,\\n \\\"name\\\": \\\"eval\\\"\\n }\\n ]\\n }\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_data_format\"\n", " value {\n", " field_value {\n", " int_value: 6\n", " }\n", " }\n", 
" }\n", " parameters {\n", " key: \"output_file_format\"\n", " value {\n", " field_value {\n", " int_value: 5\n", " }\n", " }\n", " }\n", "}\n", "downstream_nodes: \"StatisticsGen\"\n", "downstream_nodes: \"Trainer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[CsvExampleGen] Resolved inputs: ({},)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "running bdist_wheel\n", "running build\n", "running build_py\n", "creating build\n", "creating build/lib\n", "copying penguin_trainer.py -> build/lib\n", "installing to /tmpfs/tmp/tmp42iap5mu\n", "running install\n", "running install_lib\n", "copying build/lib/penguin_trainer.py -> /tmpfs/tmp/tmp42iap5mu\n", "running install_egg_info\n", "running egg_info\n", "creating tfx_user_code_Trainer.egg-info\n", "writing tfx_user_code_Trainer.egg-info/PKG-INFO\n", "writing dependency_links to tfx_user_code_Trainer.egg-info/dependency_links.txt\n", "writing top-level names to tfx_user_code_Trainer.egg-info/top_level.txt\n", "writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'\n", "reading manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'\n", "writing manifest file 'tfx_user_code_Trainer.egg-info/SOURCES.txt'\n", "Copying tfx_user_code_Trainer.egg-info to /tmpfs/tmp/tmp42iap5mu/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3.9.egg-info\n", "running install_scripts\n", "creating /tmpfs/tmp/tmp42iap5mu/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/WHEEL\n", "creating '/tmpfs/tmp/tmpx8p04zcg/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl' and adding '/tmpfs/tmp/tmp42iap5mu' to it\n", "adding 'penguin_trainer.py'\n", "adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/METADATA'\n", "adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/WHEEL'\n", "adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/top_level.txt'\n", "adding 'tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2.dist-info/RECORD'\n", "removing /tmpfs/tmp/tmp42iap5mu\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:select span and version = (0, None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:latest span and version = (0, None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=1, input_dict={}, output_dict=defaultdict(, {'examples': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", ", artifact_type: 
name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}), exec_properties={'output_file_format': 5, 'input_config': '{\\n \"splits\": [\\n {\\n \"name\": \"single_split\",\\n \"pattern\": \"*\"\\n }\\n ]\\n}', 'output_data_format': 6, 'output_config': '{\\n \"split_config\": {\\n \"splits\": [\\n {\\n \"hash_buckets\": 2,\\n \"name\": \"train\"\\n },\\n {\\n \"hash_buckets\": 1,\\n \"name\": \"eval\"\\n }\\n ]\\n }\\n}', 'input_base': '/tmpfs/tmp/tfx-dataj_6ovg52', 'span': 0, 'version': None, 'input_fingerprint': 'split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970'}, execution_output_uri='pipelines/penguin-tfdv/CsvExampleGen/.system/executor_execution/1/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/CsvExampleGen/.system/stateful_working_dir/0858d568-1a97-401a-afc6-a9932ff9a1e3', tmp_dir='pipelines/penguin-tfdv/CsvExampleGen/.system/executor_execution/1/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.example_gen.csv_example_gen.component.CsvExampleGen\"\n", " }\n", " id: \"CsvExampleGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"examples\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Examples\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " properties {\n", " key: \"version\"\n", " value: INT\n", " }\n", " base_type: DATASET\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"input_base\"\n", " value {\n", " field_value {\n", " string_value: \"/tmpfs/tmp/tfx-dataj_6ovg52\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"input_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"splits\\\": [\\n {\\n \\\"name\\\": \\\"single_split\\\",\\n \\\"pattern\\\": \\\"*\\\"\\n }\\n ]\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_config\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"split_config\\\": {\\n \\\"splits\\\": [\\n {\\n \\\"hash_buckets\\\": 2,\\n \\\"name\\\": \\\"train\\\"\\n },\\n {\\n \\\"hash_buckets\\\": 1,\\n \\\"name\\\": \\\"eval\\\"\\n }\\n ]\\n }\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_data_format\"\n", " value {\n", " field_value {\n", " int_value: 6\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_file_format\"\n", " value {\n", " field_value {\n", " int_value: 5\n", " }\n", " }\n", " }\n", "}\n", "downstream_nodes: \"StatisticsGen\"\n", "downstream_nodes: \"Trainer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv\"\n", ", pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "INFO:absl:Generating examples.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Processing input csv data /tmpfs/tmp/tfx-dataj_6ovg52/* to TFExample.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Examples generated.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Value type of key version in exec_properties is not supported, going to drop it\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Value type of key _beam_pipeline_args in exec_properties is not supported, going to drop it\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 1 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/CsvExampleGen/.system/stateful_working_dir/0858d568-1a97-401a-afc6-a9932ff9a1e3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'examples': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", ", artifact_type: name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}) for execution 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component CsvExampleGen is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component schema_importer is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.dsl.components.common.importer.Importer\"\n", " }\n", " id: \"schema_importer\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.schema_importer\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"result\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"artifact_uri\"\n", " value {\n", " field_value {\n", " string_value: \"schema\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"output_key\"\n", " value {\n", " field_value {\n", " string_value: \"result\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"reimport\"\n", " value {\n", " 
field_value {\n", " int_value: 0\n", " }\n", " }\n", " }\n", "}\n", "downstream_nodes: \"ExampleValidator\"\n", "downstream_nodes: \"Trainer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running as an importer node.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Processing source uri: schema, properties: {}, custom_properties: {}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component schema_importer is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component StatisticsGen is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.statistics_gen.component.StatisticsGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"StatisticsGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.StatisticsGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"statistics\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleStatistics\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " base_type: STATISTICS\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "downstream_nodes: \"ExampleValidator\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[StatisticsGen] Resolved inputs: ({'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: 
\"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: \"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160976759\n", "last_update_time_since_epoch: 1715160976759\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=3, input_dict={'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: \"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: \"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160976759\n", "last_update_time_since_epoch: 1715160976759\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}, output_dict=defaultdict(, {'statistics': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/StatisticsGen/statistics/3\"\n", ", artifact_type: name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}), exec_properties={'exclude_splits': '[]'}, 
execution_output_uri='pipelines/penguin-tfdv/StatisticsGen/.system/executor_execution/3/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/StatisticsGen/.system/stateful_working_dir/110d150d-d7b8-4a54-9e4b-de96d2f275fe', tmp_dir='pipelines/penguin-tfdv/StatisticsGen/.system/executor_execution/3/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.statistics_gen.component.StatisticsGen\"\n", " base_type: PROCESS\n", " }\n", " id: \"StatisticsGen\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.StatisticsGen\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"statistics\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleStatistics\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " base_type: STATISTICS\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "downstream_nodes: \"ExampleValidator\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv\"\n", ", pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating statistics for split train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Statistics for split train written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/Split-train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Generating statistics for split eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Statistics for split eval written to pipelines/penguin-tfdv/StatisticsGen/statistics/3/Split-eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 3 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution 
info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/StatisticsGen/.system/stateful_working_dir/110d150d-d7b8-4a54-9e4b-de96d2f275fe\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'statistics': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/StatisticsGen/statistics/3\"\n", ", artifact_type: name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}) for execution 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component StatisticsGen is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component Trainer is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.trainer.component.Trainer\"\n", " base_type: TRAIN\n", " }\n", " id: \"Trainer\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Trainer\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", " inputs {\n", " key: \"schema\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"schema_importer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.schema_importer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " output_key: \"result\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"model\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Model\"\n", " base_type: MODEL\n", " }\n", " }\n", " }\n", 
" }\n", " outputs {\n", " key: \"model_run\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ModelRun\"\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"custom_config\"\n", " value {\n", " field_value {\n", " string_value: \"null\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"eval_args\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"num_steps\\\": 5\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"module_path\"\n", " value {\n", " field_value {\n", " string_value: \"penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"train_args\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"num_steps\\\": 100\\n}\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "upstream_nodes: \"schema_importer\"\n", "downstream_nodes: \"Pusher\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[Trainer] Resolved inputs: ({'schema': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: \"schema\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 1\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Schema\"\n", "create_time_since_epoch: 1715160976782\n", "last_update_time_since_epoch: 1715160976782\n", ", artifact_type: id: 17\n", "name: \"Schema\"\n", ")], 'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: \"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: \"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160976759\n", "last_update_time_since_epoch: 1715160976759\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ 
"INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=4, input_dict={'schema': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: \"schema\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 1\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Schema\"\n", "create_time_since_epoch: 1715160976782\n", "last_update_time_since_epoch: 1715160976782\n", ", artifact_type: id: 17\n", "name: \"Schema\"\n", ")], 'examples': [Artifact(artifact: id: 1\n", "type_id: 15\n", "uri: \"pipelines/penguin-tfdv/CsvExampleGen/examples/1\"\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"file_format\"\n", " value {\n", " string_value: \"tfrecords_gzip\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"input_fingerprint\"\n", " value {\n", " string_value: \"split:single_split,num_files:1,total_bytes:25648,xor_checksum:1715160970,sum_checksum:1715160970\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"payload_format\"\n", " value {\n", " string_value: \"FORMAT_TF_EXAMPLE\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Examples\"\n", "create_time_since_epoch: 1715160976759\n", "last_update_time_since_epoch: 1715160976759\n", ", artifact_type: id: 15\n", "name: \"Examples\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "properties {\n", " key: \"version\"\n", " value: INT\n", "}\n", "base_type: DATASET\n", ")]}, output_dict=defaultdict(, {'model': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Trainer/model/4\"\n", ", artifact_type: name: \"Model\"\n", "base_type: MODEL\n", ")], 'model_run': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Trainer/model_run/4\"\n", ", artifact_type: name: \"ModelRun\"\n", ")]}), exec_properties={'custom_config': 'null', 'train_args': '{\\n \"num_steps\": 100\\n}', 'eval_args': '{\\n \"num_steps\": 5\\n}', 'module_path': 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'}, execution_output_uri='pipelines/penguin-tfdv/Trainer/.system/executor_execution/4/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/Trainer/.system/stateful_working_dir/d3fcfa17-7c4e-4e63-a48c-deb3cc064ab5', tmp_dir='pipelines/penguin-tfdv/Trainer/.system/executor_execution/4/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.trainer.component.Trainer\"\n", " base_type: TRAIN\n", " }\n", " id: \"Trainer\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value 
{\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Trainer\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"examples\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"CsvExampleGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.CsvExampleGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Examples\"\n", " base_type: DATASET\n", " }\n", " }\n", " output_key: \"examples\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", " inputs {\n", " key: \"schema\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"schema_importer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.schema_importer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " output_key: \"result\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"model\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"Model\"\n", " base_type: MODEL\n", " }\n", " }\n", " }\n", " }\n", " outputs {\n", " key: \"model_run\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ModelRun\"\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"custom_config\"\n", " value {\n", " field_value {\n", " string_value: \"null\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"eval_args\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"num_steps\\\": 5\\n}\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"module_path\"\n", " value {\n", " field_value {\n", " string_value: \"penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"train_args\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"num_steps\\\": 100\\n}\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"CsvExampleGen\"\n", "upstream_nodes: \"schema_importer\"\n", "downstream_nodes: \"Pusher\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv\"\n", ", pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Train on the 'train' split when train_args.splits is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Evaluate on the 'eval' split when eval_args.splits 
is not set.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:udf_utils.get_fn {'custom_config': 'null', 'train_args': '{\\n \"num_steps\": 100\\n}', 'eval_args': '{\\n \"num_steps\": 5\\n}', 'module_path': 'penguin_trainer@pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'} 'run_fn'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Installing 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl' to a temporary directory.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Executing: ['/tmpfs/src/tf_docs_env/bin/python', '-m', 'pip', 'install', '--target', '/tmpfs/tmp/tmp8javf_ly', 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl']\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing ./pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Successfully installed 'pipelines/penguin-tfdv/_wheels/tfx_user_code_Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2-py3-none-any.whl'.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Training model.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature body_mass_g has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_depth_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature flipper_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature species has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: tfx-user-code-Trainer\n", "Successfully installed tfx-user-code-Trainer-0.0+000876a22093ec764e3751d5a3ed939f1b107d1d6ade133f954ea2a767b8dfb2\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx_bsl/tfxio/tf_example_record.py:343: parse_example_dataset (from tensorflow.python.data.experimental.ops.parsing_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use `tf.data.Dataset.map(tf.io.parse_example(...))` instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tfx_bsl/tfxio/tf_example_record.py:343: parse_example_dataset (from tensorflow.python.data.experimental.ops.parsing_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use `tf.data.Dataset.map(tf.io.parse_example(...))` instead.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature body_mass_g has a shape dim {\n", " size: 1\n", "}\n", ". 
Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_depth_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature flipper_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature species has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature body_mass_g has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_depth_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature flipper_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature species has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature body_mass_g has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_depth_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature culmen_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature flipper_length_mm has a shape dim {\n", " size: 1\n", "}\n", ". Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Feature species has a shape dim {\n", " size: 1\n", "}\n", ". 
Setting to DenseTensor.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Model: \"model\"\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:__________________________________________________________________________________________________\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: Layer (type) Output Shape Param # Connected to \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:==================================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: body_mass_g (InputLayer) [(None, 1)] 0 [] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: culmen_depth_mm (InputLaye [(None, 1)] 0 [] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: r) \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: culmen_length_mm (InputLay [(None, 1)] 0 [] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: er) \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: flipper_length_mm (InputLa [(None, 1)] 0 [] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: yer) \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: concatenate (Concatenate) (None, 4) 0 ['body_mass_g[0][0]', \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: 'culmen_depth_mm[0][0]', \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: 'culmen_length_mm[0][0]', \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: 'flipper_length_mm[0][0]'] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: dense (Dense) (None, 8) 40 ['concatenate[0][0]'] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: dense_1 (Dense) (None, 8) 72 ['dense[0][0]'] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: dense_2 (Dense) (None, 3) 27 ['dense_1[0][0]'] \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:==================================================================================================\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Total params: 139 (556.00 Byte)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Trainable params: 139 (556.00 Byte)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Non-trainable params: 0 (0.00 Byte)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:__________________________________________________________________________________________________\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 
00:00:1715160986.539652 29188 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", " 1/100 [..............................] - ETA: 2:21 - loss: 1.2532 - sparse_categorical_accuracy: 0.2000" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 19/100 [====>.........................] - ETA: 0s - loss: 1.0039 - sparse_categorical_accuracy: 0.4447 " ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 38/100 [==========>...................] - ETA: 0s - loss: 0.8593 - sparse_categorical_accuracy: 0.6526" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 58/100 [================>.............] - ETA: 0s - loss: 0.7123 - sparse_categorical_accuracy: 0.7379" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 77/100 [======================>.......] - ETA: 0s - loss: 0.6136 - sparse_categorical_accuracy: 0.7805" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", " 97/100 [============================>.] - ETA: 0s - loss: 0.5297 - sparse_categorical_accuracy: 0.8180" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "100/100 [==============================] - 2s 5ms/step - loss: 0.5183 - sparse_categorical_accuracy: 0.8225 - val_loss: 0.1883 - val_sparse_categorical_accuracy: 0.9000\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Function `_wrapped_model` contains input name(s) resource with unsupported characters which will be renamed to model_dense_2_biasadd_readvariableop_resource in the SavedModel.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/assets\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:tensorflow:Assets written to: pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/assets\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Writing fingerprint to pipelines/penguin-tfdv/Trainer/model/4/Format-Serving/fingerprint.pb\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Training complete. 
Model written to pipelines/penguin-tfdv/Trainer/model/4/Format-Serving. ModelRun written to pipelines/penguin-tfdv/Trainer/model_run/4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 4 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/Trainer/.system/stateful_working_dir/d3fcfa17-7c4e-4e63-a48c-deb3cc064ab5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'model': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Trainer/model/4\"\n", ", artifact_type: name: \"Model\"\n", "base_type: MODEL\n", ")], 'model_run': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Trainer/model_run/4\"\n", ", artifact_type: name: \"ModelRun\"\n", ")]}) for execution 4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component Trainer is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component ExampleValidator is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.example_validator.component.ExampleValidator\"\n", " }\n", " id: \"ExampleValidator\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.ExampleValidator\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"schema\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"schema_importer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.schema_importer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " output_key: \"result\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", " inputs {\n", " key: \"statistics\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"StatisticsGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name 
{\n", " field_value {\n", " string_value: \"penguin-tfdv.StatisticsGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"ExampleStatistics\"\n", " base_type: STATISTICS\n", " }\n", " }\n", " output_key: \"statistics\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"anomalies\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleAnomalies\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"StatisticsGen\"\n", "upstream_nodes: \"schema_importer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[ExampleValidator] Resolved inputs: ({'schema': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: \"schema\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 1\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Schema\"\n", "create_time_since_epoch: 1715160976782\n", "last_update_time_since_epoch: 1715160976782\n", ", artifact_type: id: 17\n", "name: \"Schema\"\n", ")], 'statistics': [Artifact(artifact: id: 3\n", "type_id: 19\n", "uri: \"pipelines/penguin-tfdv/StatisticsGen/statistics/3\"\n", "properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"stats_dashboard_link\"\n", " value {\n", " string_value: \"\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"ExampleStatistics\"\n", "create_time_since_epoch: 1715160979570\n", "last_update_time_since_epoch: 1715160979570\n", ", artifact_type: id: 19\n", "name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=5, input_dict={'schema': [Artifact(artifact: id: 2\n", "type_id: 17\n", "uri: \"schema\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 1\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: 
\"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Schema\"\n", "create_time_since_epoch: 1715160976782\n", "last_update_time_since_epoch: 1715160976782\n", ", artifact_type: id: 17\n", "name: \"Schema\"\n", ")], 'statistics': [Artifact(artifact: id: 3\n", "type_id: 19\n", "uri: \"pipelines/penguin-tfdv/StatisticsGen/statistics/3\"\n", "properties {\n", " key: \"span\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value {\n", " string_value: \"[\\\"train\\\", \\\"eval\\\"]\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"stats_dashboard_link\"\n", " value {\n", " string_value: \"\"\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"ExampleStatistics\"\n", "create_time_since_epoch: 1715160979570\n", "last_update_time_since_epoch: 1715160979570\n", ", artifact_type: id: 19\n", "name: \"ExampleStatistics\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", "base_type: STATISTICS\n", ")]}, output_dict=defaultdict(, {'anomalies': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/ExampleValidator/anomalies/5\"\n", ", artifact_type: name: \"ExampleAnomalies\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", ")]}), exec_properties={'exclude_splits': '[]'}, execution_output_uri='pipelines/penguin-tfdv/ExampleValidator/.system/executor_execution/5/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/ExampleValidator/.system/stateful_working_dir/37093a55-ba7a-42c7-a4fc-388b5f69b7d8', tmp_dir='pipelines/penguin-tfdv/ExampleValidator/.system/executor_execution/5/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.example_validator.component.ExampleValidator\"\n", " }\n", " id: \"ExampleValidator\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.ExampleValidator\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"schema\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"schema_importer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.schema_importer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Schema\"\n", " }\n", " }\n", " output_key: \"result\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", " inputs {\n", " key: \"statistics\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: 
\"StatisticsGen\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.StatisticsGen\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"ExampleStatistics\"\n", " base_type: STATISTICS\n", " }\n", " }\n", " output_key: \"statistics\"\n", " }\n", " min_count: 1\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"anomalies\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"ExampleAnomalies\"\n", " properties {\n", " key: \"span\"\n", " value: INT\n", " }\n", " properties {\n", " key: \"split_names\"\n", " value: STRING\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"exclude_splits\"\n", " value {\n", " field_value {\n", " string_value: \"[]\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"StatisticsGen\"\n", "upstream_nodes: \"schema_importer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv\"\n", ", pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Validating schema against the computed statistics for split train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Anomalies alerts created for split train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Validation complete for split train. Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/Split-train.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Validating schema against the computed statistics for split eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Anomalies alerts created for split eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Validation complete for split eval. 
Anomalies written to pipelines/penguin-tfdv/ExampleValidator/anomalies/5/Split-eval.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 5 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/ExampleValidator/.system/stateful_working_dir/37093a55-ba7a-42c7-a4fc-388b5f69b7d8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'anomalies': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/ExampleValidator/anomalies/5\"\n", ", artifact_type: name: \"ExampleAnomalies\"\n", "properties {\n", " key: \"span\"\n", " value: INT\n", "}\n", "properties {\n", " key: \"split_names\"\n", " value: STRING\n", "}\n", ")]}) for execution 5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component ExampleValidator is finished.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component Pusher is running.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Running launcher for node_info {\n", " type {\n", " name: \"tfx.components.pusher.component.Pusher\"\n", " base_type: DEPLOY\n", " }\n", " id: \"Pusher\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Pusher\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"model\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"Trainer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Trainer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Model\"\n", " base_type: MODEL\n", " }\n", " }\n", " output_key: \"model\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"pushed_model\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"PushedModel\"\n", " base_type: MODEL\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"custom_config\"\n", " value {\n", " field_value {\n", " string_value: \"null\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"push_destination\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"filesystem\\\": {\\n \\\"base_directory\\\": \\\"serving_model/penguin-tfdv\\\"\\n }\\n}\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"Trainer\"\n", 
"execution_options {\n", " caching_options {\n", " }\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:ArtifactQuery.property_predicate is not supported.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:[Pusher] Resolved inputs: ({'model': [Artifact(artifact: id: 4\n", "type_id: 21\n", "uri: \"pipelines/penguin-tfdv/Trainer/model/4\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Model\"\n", "create_time_since_epoch: 1715160988205\n", "last_update_time_since_epoch: 1715160988205\n", ", artifact_type: id: 21\n", "name: \"Model\"\n", "base_type: MODEL\n", ")]},)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution 6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Going to run a new execution: ExecutionInfo(execution_id=6, input_dict={'model': [Artifact(artifact: id: 4\n", "type_id: 21\n", "uri: \"pipelines/penguin-tfdv/Trainer/model/4\"\n", "custom_properties {\n", " key: \"is_external\"\n", " value {\n", " int_value: 0\n", " }\n", "}\n", "custom_properties {\n", " key: \"tfx_version\"\n", " value {\n", " string_value: \"1.15.0\"\n", " }\n", "}\n", "state: LIVE\n", "type: \"Model\"\n", "create_time_since_epoch: 1715160988205\n", "last_update_time_since_epoch: 1715160988205\n", ", artifact_type: id: 21\n", "name: \"Model\"\n", "base_type: MODEL\n", ")]}, output_dict=defaultdict(, {'pushed_model': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Pusher/pushed_model/6\"\n", ", artifact_type: name: \"PushedModel\"\n", "base_type: MODEL\n", ")]}), exec_properties={'custom_config': 'null', 'push_destination': '{\\n \"filesystem\": {\\n \"base_directory\": \"serving_model/penguin-tfdv\"\\n }\\n}'}, execution_output_uri='pipelines/penguin-tfdv/Pusher/.system/executor_execution/6/executor_output.pb', stateful_working_dir='pipelines/penguin-tfdv/Pusher/.system/stateful_working_dir/5146eed9-4c4d-4bef-b849-ce87e44956ad', tmp_dir='pipelines/penguin-tfdv/Pusher/.system/executor_execution/6/.temp/', pipeline_node=node_info {\n", " type {\n", " name: \"tfx.components.pusher.component.Pusher\"\n", " base_type: DEPLOY\n", " }\n", " id: \"Pusher\"\n", "}\n", "contexts {\n", " contexts {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " contexts {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Pusher\"\n", " }\n", " }\n", " }\n", "}\n", "inputs {\n", " inputs {\n", " key: \"model\"\n", " value {\n", " channels {\n", " producer_node_query {\n", " id: \"Trainer\"\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"pipeline_run\"\n", " }\n", " name {\n", 
" field_value {\n", " string_value: \"2024-05-08T09:36:15.816321\"\n", " }\n", " }\n", " }\n", " context_queries {\n", " type {\n", " name: \"node\"\n", " }\n", " name {\n", " field_value {\n", " string_value: \"penguin-tfdv.Trainer\"\n", " }\n", " }\n", " }\n", " artifact_query {\n", " type {\n", " name: \"Model\"\n", " base_type: MODEL\n", " }\n", " }\n", " output_key: \"model\"\n", " }\n", " }\n", " }\n", "}\n", "outputs {\n", " outputs {\n", " key: \"pushed_model\"\n", " value {\n", " artifact_spec {\n", " type {\n", " name: \"PushedModel\"\n", " base_type: MODEL\n", " }\n", " }\n", " }\n", " }\n", "}\n", "parameters {\n", " parameters {\n", " key: \"custom_config\"\n", " value {\n", " field_value {\n", " string_value: \"null\"\n", " }\n", " }\n", " }\n", " parameters {\n", " key: \"push_destination\"\n", " value {\n", " field_value {\n", " string_value: \"{\\n \\\"filesystem\\\": {\\n \\\"base_directory\\\": \\\"serving_model/penguin-tfdv\\\"\\n }\\n}\"\n", " }\n", " }\n", " }\n", "}\n", "upstream_nodes: \"Trainer\"\n", "execution_options {\n", " caching_options {\n", " }\n", "}\n", ", pipeline_info=id: \"penguin-tfdv\"\n", ", pipeline_run_id='2024-05-08T09:36:15.816321', top_level_pipeline_run_id=None, frontend_url=None)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:Pusher is going to push the model without validation. Consider using Evaluator or InfraValidator in your pipeline.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Model version: 1715160988\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Model written to serving path serving_model/penguin-tfdv/1715160988.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Model pushed to pipelines/penguin-tfdv/Pusher/pushed_model/6.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateless execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Execution 6 succeeded.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Cleaning up stateful execution info.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Deleted stateful_working_dir pipelines/penguin-tfdv/Pusher/.system/stateful_working_dir/5146eed9-4c4d-4bef-b849-ce87e44956ad\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Publishing output artifacts defaultdict(, {'pushed_model': [Artifact(artifact: uri: \"pipelines/penguin-tfdv/Pusher/pushed_model/6\"\n", ", artifact_type: name: \"PushedModel\"\n", "base_type: MODEL\n", ")]}) for execution 6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:Component Pusher is finished.\n" ] } ], "source": [ "tfx.orchestration.LocalDagRunner().run(\n", " _create_pipeline(\n", " pipeline_name=PIPELINE_NAME,\n", " pipeline_root=PIPELINE_ROOT,\n", " data_root=DATA_ROOT,\n", " schema_path=SCHEMA_PATH,\n", " module_file=_trainer_module_file,\n", " serving_model_dir=SERVING_MODEL_DIR,\n", " metadata_path=METADATA_PATH))" ] }, { "cell_type": "markdown", "metadata": { "id": "AZ3nTzG8uAzn" }, "source": [ "You should see \"INFO:absl:Component Pusher is finished.\" if the pipeline\n", "finished successfully." 
] }, { "cell_type": "markdown", "metadata": { "id": "uuD5FRPAcOn8" }, "source": [ "### Examine outputs of the pipeline\n", "\n", "We have trained the classification model for penguins, and we also have\n", "validated the input examples in the ExampleValidator component. We can analyze\n", "the output from ExampleValidator as we did with the previous pipeline." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:28.326959Z", "iopub.status.busy": "2024-05-08T09:36:28.326700Z", "iopub.status.idle": "2024-05-08T09:36:28.337014Z", "shell.execute_reply": "2024-05-08T09:36:28.336343Z" }, "id": "TtsrZEUB1-J4" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:absl:MetadataStore with DB connection initialized\n" ] } ], "source": [ "metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(\n", " METADATA_PATH)\n", "\n", "with Metadata(metadata_connection_config) as metadata_handler:\n", " ev_output = get_latest_artifacts(metadata_handler, PIPELINE_NAME,\n", " 'ExampleValidator')\n", " anomalies_artifacts = ev_output[standard_component_specs.ANOMALIES_KEY]" ] }, { "cell_type": "markdown", "metadata": { "id": "3U5MNAUIdBtN" }, "source": [ "ExampleAnomalies from the ExampleValidator can be visualized as well." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-05-08T09:36:28.340616Z", "iopub.status.busy": "2024-05-08T09:36:28.339970Z", "iopub.status.idle": "2024-05-08T09:36:28.353607Z", "shell.execute_reply": "2024-05-08T09:36:28.352999Z" }, "id": "F-4oAjGR-IR0", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
'train' split:

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

No anomalies found.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
'eval' split:

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

No anomalies found.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualize_artifacts(anomalies_artifacts)" ] }, { "cell_type": "markdown", "metadata": { "id": "t026ZzbU0961" }, "source": [ "You should see \"No anomalies found\" for each split of examples. Because we\n", "used the same data which was used for the schema generation in this pipeline,\n", "no anomaly is expected here. If you run this pipeline repeatedly with new\n", "incoming data, ExampleValidator should be able to find any discrepancies\n", "between the new data and the existing schema.\n", "\n", "If any anomalies were found, you may review your data to check to see if any\n", "examples do not follow your assumptions. Outputs from other components like\n", "StatisticsGen might be useful. However, any anomalies which are found will\n", "NOT block further pipeline executions." ] }, { "cell_type": "markdown", "metadata": { "id": "08R8qvweThRf" }, "source": [ "## Next steps\n", "\n", "You can find more resources on https://www.tensorflow.org/tfx/tutorials.\n", "\n", "Please see\n", "[Understanding TFX Pipelines](https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines)\n", "to learn more about various concepts in TFX.\n", "\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "DjUA6S30k52h" ], "name": "penguin_tfdv.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 0 }