{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "pL--_KGdYoBz" }, "source": [ "##### Copyright 2019 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-08-16T07:04:00.217731Z", "iopub.status.busy": "2024-08-16T07:04:00.217508Z", "iopub.status.idle": "2024-08-16T07:04:00.221206Z", "shell.execute_reply": "2024-08-16T07:04:00.220638Z" }, "id": "uBDvXpYzYnGj" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "HQzaEQuJiW_d" }, "source": [ "# TFRecord and tf.train.Example\n", "\n", "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "3pkUd_9IZCFO" }, "source": [ "The TFRecord format is a simple format for storing a sequence of binary records.\n", "\n", "[Protocol buffers](https://developers.google.com/protocol-buffers/) are a cross-platform, cross-language library for efficient serialization of structured data.\n", "\n", "Protocol messages are defined by `.proto` files, these are often the easiest way to understand a message type.\n", "\n", "The `tf.train.Example` message (or protobuf) is a flexible message type that represents a `{\"string\": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as [TFX](https://www.tensorflow.org/tfx/)." ] }, { "cell_type": "markdown", "metadata": { "id": "Ac83J0QxjhFt" }, "source": [ "This notebook demonstrates how to create, parse, and use the `tf.train.Example` message, and then serialize, write, and read `tf.train.Example` messages to and from `.tfrecord` files.\n", "\n", "Note: While useful, these structures are optional. There is no need to convert existing code to use TFRecords, unless you are [using tf.data](https://www.tensorflow.org/guide/data) and reading data is still the bottleneck to training. You can refer to [Better performance with the tf.data API](https://www.tensorflow.org/guide/data_performance) for dataset performance tips.\n", "\n", "Note: In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10 MB+ and ideally 100 MB+) so that you can benefit from I/O prefetching. For example, say you have `X` GB of data and you plan to train on up to `N` hosts. Ideally, you should shard the data to ~`10*N` files, as long as ~`X/(10*N)` is 10 MB+ (and ideally 100 MB+). 
If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits." ] }, { "cell_type": "markdown", "metadata": { "id": "WkRreBf1eDVc" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:00.224623Z", "iopub.status.busy": "2024-08-16T07:04:00.224413Z", "iopub.status.idle": "2024-08-16T07:04:02.595414Z", "shell.execute_reply": "2024-08-16T07:04:02.594619Z" }, "id": "Ja7sezsmnXph" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-08-16 07:04:00.479365: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-08-16 07:04:00.500427: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-08-16 07:04:00.506819: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import tensorflow as tf\n", "\n", "import numpy as np\n", "import IPython.display as display" ] }, { "cell_type": "markdown", "metadata": { "id": "e5Kq88ccUWQV" }, "source": [ "## `tf.train.Example`" ] }, { "cell_type": "markdown", "metadata": { "id": "VrdQHgvNijTi" }, "source": [ "### Data types for `tf.train.Example`" ] }, { "cell_type": "markdown", "metadata": { "id": "lZw57Qrn4CTE" }, "source": [ "Fundamentally, a `tf.train.Example` is a `{\"string\": tf.train.Feature}` mapping.\n", "\n", "The `tf.train.Feature` message type can accept one of the following three types (See the [`.proto` file](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for reference). 
Most other generic types can be coerced into one of these:\n", "\n", "1. `tf.train.BytesList` (the following types can be coerced)\n", "\n", "   - `string`\n", "   - `byte`\n", "\n", "1. `tf.train.FloatList` (the following types can be coerced)\n", "\n", "   - `float` (`float32`)\n", "   - `double` (`float64`)\n", "\n", "1. `tf.train.Int64List` (the following types can be coerced)\n", "\n", "   - `bool`\n", "   - `enum`\n", "   - `int32`\n", "   - `uint32`\n", "   - `int64`\n", "   - `uint64`" ] }, { "cell_type": "markdown", "metadata": { "id": "_e3g9ExathXP" }, "source": [ "To convert a standard TensorFlow type to a `tf.train.Example`-compatible `tf.train.Feature`, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:02.600655Z", "iopub.status.busy": "2024-08-16T07:04:02.599778Z", "iopub.status.idle": "2024-08-16T07:04:02.605633Z", "shell.execute_reply": "2024-08-16T07:04:02.605045Z" }, "id": "mbsPOUpVtYxA" }, "outputs": [], "source": [ "# The following functions can be used to convert a value to a type compatible\n", "# with tf.train.Example.\n", "\n", "def _bytes_feature(value):\n", "  \"\"\"Returns a bytes_list from a string / byte.\"\"\"\n", "  if isinstance(value, type(tf.constant(0))):\n", "    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.\n", "  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))\n", "\n", "def _float_feature(value):\n", "  \"\"\"Returns a float_list from a float / double.\"\"\"\n", "  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))\n", "\n", "def _int64_feature(value):\n", "  \"\"\"Returns an int64_list from a bool / enum / int / uint.\"\"\"\n", "  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))" ] }, { "cell_type": "markdown", 
"metadata": { "id": "Wst0v9O8hgzy" }, "source": [ "Note: To stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use `tf.io.serialize_tensor` to convert tensors to binary-strings. Strings are scalars in TensorFlow. Use `tf.io.parse_tensor` to convert the binary-string back to a tensor." ] }, { "cell_type": "markdown", "metadata": { "id": "vsMbkkC8xxtB" }, "source": [ "Below are some examples of how these functions work. Note the varying input types and the standardized output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception (e.g. `_int64_feature(1.0)` will error out because `1.0` is a float—therefore, it should be used with the `_float_feature` function instead):" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:02.609258Z", "iopub.status.busy": "2024-08-16T07:04:02.608633Z", "iopub.status.idle": "2024-08-16T07:04:04.775786Z", "shell.execute_reply": "2024-08-16T07:04:04.774814Z" }, "id": "hZzyLGr0u73y" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bytes_list {\n", " value: \"test_string\"\n", "}\n", "\n", "bytes_list {\n", " value: \"test_bytes\"\n", "}\n", "\n", "float_list {\n", " value: 2.7182817459106445\n", "}\n", "\n", "int64_list {\n", " value: 1\n", "}\n", "\n", "int64_list {\n", " value: 1\n", "}\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1723791843.092175 194894 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. 
See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n" ] } ], "source": [ "print(_bytes_feature(b'test_string'))\n", "print(_bytes_feature(u'test_bytes'.encode('utf-8')))\n", "\n", "print(_float_feature(np.exp(1)))\n", "\n", "print(_int64_feature(True))\n", "print(_int64_feature(1))" ] }, { "cell_type": "markdown", "metadata": { "id": "nj1qpfQU5qmi" }, "source": [ "All proto messages can be serialized to a binary-string using the `.SerializeToString` method:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.779941Z", "iopub.status.busy": "2024-08-16T07:04:04.779689Z", "iopub.status.idle": "2024-08-16T07:04:04.786325Z", "shell.execute_reply": "2024-08-16T07:04:04.785742Z" }, "id": "5afZkORT5pjm" }, "outputs": [ { "data": { "text/plain": [ "b'\\x12\\x06\\n\\x04T\\xf8-@'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature = _float_feature(np.exp(1))\n", "\n", "feature.SerializeToString()" ] }, { "cell_type": "markdown", "metadata": { "id": "laKnw9F3hL-W" }, "source": [ "### Creating a `tf.train.Example` message" ] }, { "cell_type": "markdown", "metadata": { "id": "b_MEnhxchQPC" }, "source": [ "Suppose you want to create a `tf.train.Example` message from existing data. In practice, the dataset may come from anywhere, but the procedure of creating the `tf.train.Example` message from a single observation will be the same:\n", "\n", "1. Within each observation, each value needs to be converted to a `tf.train.Feature` containing one of the 3 compatible types, using one of the functions above.\n", "\n", "1. You create a map (dictionary) from the feature name string to the encoded feature value produced in #1.\n", "\n", "1. The map produced in step 2 is converted to a [`Features` message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto#L85)." 
] }, { "cell_type": "markdown", "metadata": { "id": "4EgFQ2uHtchc" }, "source": [ "In this notebook, you will create a dataset using NumPy.\n", "\n", "This dataset will have 4 features:\n", "\n", "* a boolean feature, `False` or `True` with equal probability\n", "* an integer feature chosen uniformly at random from `[0, 4]`\n", "* a string feature generated from a string table by using the integer feature as an index\n", "* a float feature from a standard normal distribution\n", "\n", "Consider a sample consisting of 10,000 independently and identically distributed observations from each of the above distributions:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.790057Z", "iopub.status.busy": "2024-08-16T07:04:04.789504Z", "iopub.status.idle": "2024-08-16T07:04:04.794367Z", "shell.execute_reply": "2024-08-16T07:04:04.793697Z" }, "id": "CnrguFAy3YQv" }, "outputs": [], "source": [ "# The number of observations in the dataset.\n", "n_observations = int(1e4)\n", "\n", "# Boolean feature, encoded as False or True.\n", "feature0 = np.random.choice([False, True], n_observations)\n", "\n", "# Integer feature, random from 0 to 4.\n", "feature1 = np.random.randint(0, 5, n_observations)\n", "\n", "# String feature.\n", "strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])\n", "feature2 = strings[feature1]\n", "\n", "# Float feature, from a standard normal distribution.\n", "feature3 = np.random.randn(n_observations)" ] }, { "cell_type": "markdown", "metadata": { "id": "aGrscehJr7Jd" }, "source": [ "Each of these features can be coerced into a `tf.train.Example`-compatible type using one of `_bytes_feature`, `_float_feature`, `_int64_feature`. 
You can then create a `tf.train.Example` message from these encoded features:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.797598Z", "iopub.status.busy": "2024-08-16T07:04:04.797134Z", "iopub.status.idle": "2024-08-16T07:04:04.801455Z", "shell.execute_reply": "2024-08-16T07:04:04.800851Z" }, "id": "RTCS49Ij_kUw" }, "outputs": [], "source": [ "@tf.py_function(Tout=tf.string)\n", "def serialize_example(feature0, feature1, feature2, feature3):\n", "  \"\"\"\n", "  Creates a tf.train.Example message ready to be written to a file.\n", "  \"\"\"\n", "  # Create a dictionary mapping the feature name to the tf.train.Example-compatible\n", "  # data type.\n", "  feature = {\n", "      'feature0': _int64_feature(feature0),\n", "      'feature1': _int64_feature(feature1),\n", "      'feature2': _bytes_feature(feature2),\n", "      'feature3': _float_feature(feature3),\n", "  }\n", "\n", "  # Create a Features message using tf.train.Example.\n", "\n", "  example_proto = tf.train.Example(features=tf.train.Features(feature=feature))\n", "  return example_proto.SerializeToString()" ] }, { "cell_type": "markdown", "metadata": { "id": "XftzX9CN_uGT" }, "source": [ "For example, suppose you have a single observation from the dataset, `[False, 4, b'goat', 0.9876]`. You can create and print the `tf.train.Example` message for this observation using `serialize_example()`. Each single observation will be written as a `Features` message as per the above. 
Note that the `tf.train.Example` [message](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto#L88) is just a wrapper around the `Features` message:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.804904Z", "iopub.status.busy": "2024-08-16T07:04:04.804373Z", "iopub.status.idle": "2024-08-16T07:04:04.814299Z", "shell.execute_reply": "2024-08-16T07:04:04.813627Z" }, "id": "N8BtSx2RjYcb" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is an example observation from the dataset.\n", "\n", "example_observation = [False, 4, b'goat', 0.9876]\n", "serialized_example = serialize_example(*example_observation)\n", "serialized_example" ] }, { "cell_type": "markdown", "metadata": { "id": "_pbGATlG6u-4" }, "source": [ "To decode the message use the `tf.train.Example.FromString` method." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.817654Z", "iopub.status.busy": "2024-08-16T07:04:04.817091Z", "iopub.status.idle": "2024-08-16T07:04:04.821380Z", "shell.execute_reply": "2024-08-16T07:04:04.820811Z" }, "id": "dGim-mEm6vit" }, "outputs": [ { "data": { "text/plain": [ "features {\n", " feature {\n", " key: \"feature0\"\n", " value {\n", " int64_list {\n", " value: 0\n", " }\n", " }\n", " }\n", " feature {\n", " key: \"feature1\"\n", " value {\n", " int64_list {\n", " value: 4\n", " }\n", " }\n", " }\n", " feature {\n", " key: \"feature2\"\n", " value {\n", " bytes_list {\n", " value: \"goat\"\n", " }\n", " }\n", " }\n", " feature {\n", " key: \"feature3\"\n", " value {\n", " float_list {\n", " value: 0.9876000285148621\n", " }\n", " }\n", " }\n", "}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_proto = 
tf.train.Example.FromString(serialized_example.numpy())\n", "example_proto" ] }, { "cell_type": "markdown", "metadata": { "id": "o6qxofy89obI" }, "source": [ "## TFRecords format details\n", "\n", "A TFRecord file contains a sequence of records. The file can only be read sequentially.\n", "\n", "Each record contains a byte-string for the data payload, plus the data length, and CRC-32C ([32-bit CRC](https://en.wikipedia.org/wiki/Cyclic_redundancy_check#CRC-32_algorithm) using the [Castagnoli polynomial](https://en.wikipedia.org/wiki/Cyclic_redundancy_check#Standards_and_common_use)) hashes for integrity checking.\n", "\n", "Each record is stored in the following format:\n", "\n", "    uint64 length\n", "    uint32 masked_crc32_of_length\n", "    byte   data[length]\n", "    uint32 masked_crc32_of_data\n", "\n", "The records are concatenated together to produce the file. CRCs are\n", "[described here](https://en.wikipedia.org/wiki/Cyclic_redundancy_check), and\n", "the mask of a CRC is:\n", "\n", "    masked_crc = ((crc >> 15) | (crc << 17)) + 0xa282ead8ul\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-0iHagLQCJv6" }, "source": [ "Note: There is no requirement to use `tf.train.Example` in TFRecord files. `tf.train.Example` is just a method of serializing dictionaries to byte-strings. Any byte-string that can be decoded in TensorFlow could be stored in a TFRecord file. Examples include: lines of text, JSON (using `tf.io.decode_json_example`), encoded image data, or serialized `tf.Tensor`s (using `tf.io.serialize_tensor`/`tf.io.parse_tensor`). See the `tf.io` module for more options." ] }, { "cell_type": "markdown", "metadata": { "id": "jyg1g3gU7DNn" }, "source": [ "## Reading and writing TFRecord files" ] }, { "cell_type": "markdown", "metadata": { "id": "3FXG3miA7Kf1" }, "source": [ "The `tf.io` module also contains pure-Python functions for reading and writing TFRecord files." 
] }, { "cell_type": "markdown", "metadata": { "id": "CKn5uql2lAaN" }, "source": [ "### Writing a TFRecord file" ] }, { "cell_type": "markdown", "metadata": { "id": "LNW_FA-GQWXs" }, "source": [ "Next, write the 10,000 observations to the file `test.tfrecord`. Each observation is converted to a `tf.train.Example` message, then written to the file. You can then verify that the file `test.tfrecord` has been created:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.825436Z", "iopub.status.busy": "2024-08-16T07:04:04.824850Z", "iopub.status.idle": "2024-08-16T07:04:04.827825Z", "shell.execute_reply": "2024-08-16T07:04:04.827282Z" }, "id": "gxB_cwlN0DLy" }, "outputs": [], "source": [ "filename = 'test.tfrecord'" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:04.831247Z", "iopub.status.busy": "2024-08-16T07:04:04.830692Z", "iopub.status.idle": "2024-08-16T07:04:19.387486Z", "shell.execute_reply": "2024-08-16T07:04:19.386737Z" }, "id": "MKPHzoGv7q44" }, "outputs": [], "source": [ "# Write the `tf.train.Example` observations to the file.\n", "with tf.io.TFRecordWriter(filename) as writer:\n", "  for i in range(n_observations):\n", "    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])\n", "    writer.write(example.numpy())" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.391633Z", "iopub.status.busy": "2024-08-16T07:04:19.391057Z", "iopub.status.idle": "2024-08-16T07:04:19.554669Z", "shell.execute_reply": "2024-08-16T07:04:19.553813Z" }, "id": "EjdFHHJMpUUo" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "984K\ttest.tfrecord\r\n" ] } ], "source": [ "!du -sh {filename}" ] }, { "cell_type": "markdown", "metadata": { "id": "2osVRnYNni-E" }, "source": [ "### Reading a TFRecord file in Python\n", "\n", "These 
serialized `tf.train.Example` messages can be parsed using `tf.train.Example.ParseFromString`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.558846Z", "iopub.status.busy": "2024-08-16T07:04:19.558139Z", "iopub.status.idle": "2024-08-16T07:04:19.583137Z", "shell.execute_reply": "2024-08-16T07:04:19.582411Z" }, "id": "U3tnd3LerOtV" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filenames = [filename]\n", "raw_dataset = tf.data.TFRecordDataset(filenames)\n", "raw_dataset" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.586334Z", "iopub.status.busy": "2024-08-16T07:04:19.585912Z", "iopub.status.idle": "2024-08-16T07:04:19.626499Z", "shell.execute_reply": "2024-08-16T07:04:19.625742Z" }, "id": "nsEAACHcnm3f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "features {\n", "  feature {\n", "    key: \"feature0\"\n", "    value {\n", "      int64_list {\n", "        value: 0\n", "      }\n", "    }\n", "  }\n", "  feature {\n", "    key: \"feature1\"\n", "    value {\n", "      int64_list {\n", "        value: 3\n", "      }\n", "    }\n", "  }\n", "  feature {\n", "    key: \"feature2\"\n", "    value {\n", "      bytes_list {\n", "        value: \"horse\"\n", "      }\n", "    }\n", "  }\n", "  feature {\n", "    key: \"feature3\"\n", "    value {\n", "      float_list {\n", "        value: -0.6452258229255676\n", "      }\n", "    }\n", "  }\n", "}\n", "\n" ] } ], "source": [ "for raw_record in raw_dataset.take(1):\n", "  example = tf.train.Example()\n", "  example.ParseFromString(raw_record.numpy())\n", "  print(example)" ] }, { "cell_type": "markdown", "metadata": { "id": "yhnZZmhm1miG" }, "source": [ "That returns a `tf.train.Example` proto, which is difficult to use as is, but it is fundamentally a representation of a:\n", "\n", "```\n", "Dict[str,\n", "     Union[List[float],\n", "           List[int],\n", "           List[str]]]\n", "```\n", "\n", 
"The following code manually converts the `Example` to a dictionary of NumPy arrays, without using TensorFlow Ops. Refer to [the PROTO file](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) for details." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.629956Z", "iopub.status.busy": "2024-08-16T07:04:19.629406Z", "iopub.status.idle": "2024-08-16T07:04:19.635720Z", "shell.execute_reply": "2024-08-16T07:04:19.635015Z" }, "id": "Ziv9tiNE1l6J" }, "outputs": [ { "data": { "text/plain": [ "{'feature2': array([b'horse'], dtype='|S5'),\n", " 'feature1': array([3]),\n", " 'feature0': array([0]),\n", " 'feature3': array([-0.64522582])}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = {}\n", "# example.features.feature is the dictionary\n", "for key, feature in example.features.feature.items():\n", " # The values are the Feature objects which contain a `kind` which contains:\n", " # one of three fields: bytes_list, float_list, int64_list\n", "\n", " kind = feature.WhichOneof('kind')\n", " result[key] = np.array(getattr(feature, kind).value)\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": { "id": "6aV0GQhV8tmp" }, "source": [ "### Reading a TFRecord file Using tf.data" ] }, { "cell_type": "markdown", "metadata": { "id": "o3J5D4gcSy8N" }, "source": [ "You can also read the TFRecord file using the `tf.data.TFRecordDataset` class.\n", "\n", "More information on consuming TFRecord files using `tf.data` can be found in the [tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data#consuming_tfrecord_data) guide.\n", "\n", "Using `TFRecordDataset`s can be useful for standardizing input data and optimizing performance." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.639247Z", "iopub.status.busy": "2024-08-16T07:04:19.638682Z", "iopub.status.idle": "2024-08-16T07:04:19.651574Z", "shell.execute_reply": "2024-08-16T07:04:19.650886Z" }, "id": "6OjX6UZl-bHC" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filenames = [filename]\n", "raw_dataset = tf.data.TFRecordDataset(filenames)\n", "raw_dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "6_EQ9i2E_-Fz" }, "source": [ "At this point the dataset contains serialized `tf.train.Example` messages. When iterated over it returns these as scalar string tensors.\n", "\n", "Use the `.take` method to only show the first 10 records.\n", "\n", "Note: iterating over a `tf.data.Dataset` only works with eager execution enabled." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.655097Z", "iopub.status.busy": "2024-08-16T07:04:19.654415Z", "iopub.status.idle": "2024-08-16T07:04:19.680442Z", "shell.execute_reply": "2024-08-16T07:04:19.679796Z" }, "id": "hxVXpLz_AJlm" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\\n\\x11\\n\\x08feature0\\x12\\x05\\x1a\\x03\\n\\x01\\x01'>\n", "\n", "\n", "\n" ] } ], "source": [ "for raw_record in raw_dataset.take(10):\n", " print(repr(raw_record))" ] }, { "cell_type": "markdown", "metadata": { "id": "W-6oNzM4luFQ" }, "source": [ "These tensors can be parsed using the function below. 
Note that the `feature_description` is necessary here because `tf.data.Dataset`s use graph execution and need this description to build their shape and type signature:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.683921Z", "iopub.status.busy": "2024-08-16T07:04:19.683294Z", "iopub.status.idle": "2024-08-16T07:04:19.688739Z", "shell.execute_reply": "2024-08-16T07:04:19.688112Z" }, "id": "zQjbIR1nleiy" }, "outputs": [], "source": [ "# Create a description of the features.\n", "feature_description = {\n", "    'feature0': tf.io.FixedLenFeature([], tf.int64, default_value=0),\n", "    'feature1': tf.io.FixedLenFeature([], tf.int64, default_value=0),\n", "    'feature2': tf.io.FixedLenFeature([], tf.string, default_value=''),\n", "    'feature3': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),\n", "}\n", "\n", "def _parse_function(example_proto):\n", "  # Parse the input `tf.train.Example` proto using the dictionary above.\n", "  return tf.io.parse_single_example(example_proto, feature_description)" ] }, { "cell_type": "markdown", "metadata": { "id": "gWETjUqhEQZf" }, "source": [ "Alternatively, use `tf.io.parse_example` to parse the whole batch at once. 
Apply this function to each item in the dataset using the `tf.data.Dataset.map` method:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.692209Z", "iopub.status.busy": "2024-08-16T07:04:19.691624Z", "iopub.status.idle": "2024-08-16T07:04:19.741393Z", "shell.execute_reply": "2024-08-16T07:04:19.740836Z" }, "id": "6Ob7D-zmBm1w" }, "outputs": [ { "data": { "text/plain": [ "<_MapDataset element_spec={'feature0': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature1': TensorSpec(shape=(), dtype=tf.int64, name=None), 'feature2': TensorSpec(shape=(), dtype=tf.string, name=None), 'feature3': TensorSpec(shape=(), dtype=tf.float32, name=None)}>" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parsed_dataset = raw_dataset.map(_parse_function)\n", "parsed_dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "sNV-XclGnOvn" }, "source": [ "Use eager execution to display the observations in the dataset. There are 10,000 observations in this dataset, but you will only display the first 10. The data is displayed as a dictionary of features. 
Each item is a `tf.Tensor`, and its `numpy()` method returns the value of the feature:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.744882Z", "iopub.status.busy": "2024-08-16T07:04:19.744249Z", "iopub.status.idle": "2024-08-16T07:04:19.788685Z", "shell.execute_reply": "2024-08-16T07:04:19.788065Z" }, "id": "x2LT2JCqhoD_" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n", "{'feature0': , 'feature1': , 'feature2': , 'feature3': }\n" ] } ], "source": [ "for parsed_record in parsed_dataset.take(10):\n", "  print(repr(parsed_record))" ] }, { "cell_type": "markdown", "metadata": { "id": "Cig9EodTlDmg" }, "source": [ "Here, the `tf.io.parse_single_example` function unpacks the `tf.train.Example` fields into standard tensors." ] }, { "cell_type": "markdown", "metadata": { "id": "S0tFDrwdoj3q" }, "source": [ "## Walkthrough: Reading and writing image data" ] }, { "cell_type": "markdown", "metadata": { "id": "rjN2LFxFpcR9" }, "source": [ "This is an end-to-end example of how to read and write image data using TFRecords. Using an image as input data, you will write the data as a TFRecord file, then read the file back and display the image.\n", "\n", "This can be useful if, for example, you want to use several models on the same input dataset. 
Instead of storing the image data raw, it can be preprocessed into the TFRecords format, and that can be used in all further processing and modelling.\n", "\n", "First, let's download [this image](https://commons.wikimedia.org/wiki/File:Felis_catus-cat_on_snow.jpg) of a cat in the snow and [this photo](https://upload.wikimedia.org/wikipedia/commons/f/fe/New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg) of the Williamsburg Bridge, NYC under construction." ] }, { "cell_type": "markdown", "metadata": { "id": "5Lk2qrKvN0yu" }, "source": [ "### Fetch the images" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.792533Z", "iopub.status.busy": "2024-08-16T07:04:19.791966Z", "iopub.status.idle": "2024-08-16T07:04:19.948892Z", "shell.execute_reply": "2024-08-16T07:04:19.948318Z" }, "id": "3a0fmwg8lHdF" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[1m 0/17858\u001b[0m \u001b[37m━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[1m0s\u001b[0m 0s/step" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "\u001b[1m17858/17858\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 0us/step\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[1m 0/15477\u001b[0m \u001b[37m━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[1m0s\u001b[0m 0s/step" ] }, { "name": "stdout", "output_type": "stream", "text": [ 
"\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\r", "\u001b[1m15477/15477\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 0us/step\n" ] } ], "source": [ "cat_in_snow = tf.keras.utils.get_file(\n", " '320px-Felis_catus-cat_on_snow.jpg',\n", " 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')\n", "\n", "williamsburg_bridge = tf.keras.utils.get_file(\n", " '194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg',\n", " 'https://storage.googleapis.com/download.tensorflow.org/example_images/194px-New_East_River_Bridge_from_Brooklyn_det.4a09796u.jpg')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.951959Z", "iopub.status.busy": "2024-08-16T07:04:19.951575Z", "iopub.status.idle": "2024-08-16T07:04:19.957655Z", "shell.execute_reply": "2024-08-16T07:04:19.957100Z" }, "id": "7aJJh7vENeE4" }, "outputs": [ { "data": { "image/jpeg": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Image cc-by: Von.grzanka" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display.display(display.Image(filename=cat_in_snow))\n", "display.display(display.HTML('Image cc-by: Von.grzanka'))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.960698Z", "iopub.status.busy": "2024-08-16T07:04:19.960278Z", "iopub.status.idle": "2024-08-16T07:04:19.965690Z", "shell.execute_reply": "2024-08-16T07:04:19.965093Z" }, "id": "KkW0uuhcXZqA" }, "outputs": [ { "data": { "image/jpeg": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "From Wikimedia" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ 
"display.display(display.Image(filename=williamsburg_bridge))\n", "display.display(display.HTML('From Wikimedia'))" ] }, { "cell_type": "markdown", "metadata": { "id": "VSOgJSwoN5TQ" }, "source": [ "### Write the TFRecord file" ] }, { "cell_type": "markdown", "metadata": { "id": "Azx83ryQEU6T" }, "source": [ "As before, encode the features as types compatible with `tf.train.Example`. This stores the raw image string feature, as well as the height, width, depth, and arbitrary `label` feature. The latter is used when you write the file to distinguish between the cat image and the bridge image. Use `0` for the cat image, and `1` for the bridge image:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.969093Z", "iopub.status.busy": "2024-08-16T07:04:19.968648Z", "iopub.status.idle": "2024-08-16T07:04:19.971627Z", "shell.execute_reply": "2024-08-16T07:04:19.971067Z" }, "id": "kC4TS1ZEONHr" }, "outputs": [], "source": [ "image_labels = {\n", " cat_in_snow : 0,\n", " williamsburg_bridge : 1,\n", "}" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.974685Z", "iopub.status.busy": "2024-08-16T07:04:19.974095Z", "iopub.status.idle": "2024-08-16T07:04:19.981893Z", "shell.execute_reply": "2024-08-16T07:04:19.981290Z" }, "id": "c5njMSYNEhNZ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "features {\n", " feature {\n", " key: \"depth\"\n", " value {\n", " int64_list {\n", " value: 3\n", " }\n", " }\n", " }\n", " feature {\n", " key: \"height\"\n", " value {\n", " int64_list {\n", " value: 213\n", " }\n", "...\n" ] } ], "source": [ "# This is an example, just using the cat image.\n", "image_string = open(cat_in_snow, 'rb').read()\n", "\n", "label = image_labels[cat_in_snow]\n", "\n", "# Create a dictionary with features that may be relevant.\n", "def image_example(image_string, label):\n", " image_shape = 
tf.io.decode_jpeg(image_string).shape\n", "\n", " feature = {\n", " 'height': _int64_feature(image_shape[0]),\n", " 'width': _int64_feature(image_shape[1]),\n", " 'depth': _int64_feature(image_shape[2]),\n", " 'label': _int64_feature(label),\n", " 'image_raw': _bytes_feature(image_string),\n", " }\n", "\n", " return tf.train.Example(features=tf.train.Features(feature=feature))\n", "\n", "for line in str(image_example(image_string, label)).split('\\n')[:15]:\n", " print(line)\n", "print('...')" ] }, { "cell_type": "markdown", "metadata": { "id": "2G_o3O9MN0Qx" }, "source": [ "Notice that all of the features are now stored in the `tf.train.Example` message. Next, functionalize the code above and write the example messages to a file named `images.tfrecords`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.985306Z", "iopub.status.busy": "2024-08-16T07:04:19.984638Z", "iopub.status.idle": "2024-08-16T07:04:19.990518Z", "shell.execute_reply": "2024-08-16T07:04:19.989930Z" }, "id": "qcw06lQCOCZU" }, "outputs": [], "source": [ "# Write the raw image files to `images.tfrecords`.\n", "# First, process the two images into `tf.train.Example` messages.\n", "# Then, write to a `.tfrecords` file.\n", "record_file = 'images.tfrecords'\n", "with tf.io.TFRecordWriter(record_file) as writer:\n", " for filename, label in image_labels.items():\n", " image_string = open(filename, 'rb').read()\n", " tf_example = image_example(image_string, label)\n", " writer.write(tf_example.SerializeToString())" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:19.993383Z", "iopub.status.busy": "2024-08-16T07:04:19.993164Z", "iopub.status.idle": "2024-08-16T07:04:20.148799Z", "shell.execute_reply": "2024-08-16T07:04:20.148010Z" }, "id": "yJrTe6tHPCfs" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "36K\timages.tfrecords\r\n" ] } ], 
"source": [ "!du -sh {record_file}" ] }, { "cell_type": "markdown", "metadata": { "id": "jJSsCkZLPH6K" }, "source": [ "### Read the TFRecord file\n", "\n", "You now have the file—`images.tfrecords`—and can now iterate over the records in it to read back what you wrote. Given that in this example you will only reproduce the image, the only feature you will need is the raw image string. Extract it using the getters described above, namely `example.features.feature['image_raw'].bytes_list.value[0]`. You can also use the labels to determine which record is the cat and which one is the bridge:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:20.153130Z", "iopub.status.busy": "2024-08-16T07:04:20.152475Z", "iopub.status.idle": "2024-08-16T07:04:20.210631Z", "shell.execute_reply": "2024-08-16T07:04:20.209992Z" }, "id": "M6Cnfd3cTKHN" }, "outputs": [ { "data": { "text/plain": [ "<_MapDataset element_spec={'depth': TensorSpec(shape=(), dtype=tf.int64, name=None), 'height': TensorSpec(shape=(), dtype=tf.int64, name=None), 'image_raw': TensorSpec(shape=(), dtype=tf.string, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'width': TensorSpec(shape=(), dtype=tf.int64, name=None)}>" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_image_dataset = tf.data.TFRecordDataset('images.tfrecords')\n", "\n", "# Create a dictionary describing the features.\n", "image_feature_description = {\n", " 'height': tf.io.FixedLenFeature([], tf.int64),\n", " 'width': tf.io.FixedLenFeature([], tf.int64),\n", " 'depth': tf.io.FixedLenFeature([], tf.int64),\n", " 'label': tf.io.FixedLenFeature([], tf.int64),\n", " 'image_raw': tf.io.FixedLenFeature([], tf.string),\n", "}\n", "\n", "def _parse_image_function(example_proto):\n", " # Parse the input tf.train.Example proto using the dictionary above.\n", " return tf.io.parse_single_example(example_proto, 
image_feature_description)\n", "\n", "parsed_image_dataset = raw_image_dataset.map(_parse_image_function)\n", "parsed_image_dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "0PEEFPk4NEg1" }, "source": [ "Recover the images from the TFRecord file:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-08-16T07:04:20.214163Z", "iopub.status.busy": "2024-08-16T07:04:20.213606Z", "iopub.status.idle": "2024-08-16T07:04:20.257182Z", "shell.execute_reply": "2024-08-16T07:04:20.256579Z" }, "id": "yZf8jOyEIjSF" }, "outputs": [ { "data": { "image/jpeg": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/jpeg": "", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for image_features in parsed_image_dataset:\n", " image_raw = image_features['image_raw'].numpy()\n", " display.display(display.Image(data=image_raw))" ] } ], "metadata": { "colab": { "collapsed_sections": [ "pL--_KGdYoBz" ], "name": "tfrecord.ipynb", "private_outputs": true, "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 0 }