{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "oIMvgrGMe7ZF" }, "source": [ "##### Copyright 2022 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "n25wrPRbfCGc" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "ZyGUj_q7IdfQ" }, "source": [ "# 数据集集合" ] }, { "cell_type": "markdown", "metadata": { "id": "LpO0um1nez_q" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码\n", " 下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "p8AFT7CpSzBG" }, "source": [ "## 概述\n", "\n", "数据集集合提供了一种简单的方法,可以将任意数量的现有 TFDS 数据集组合在一起,并对它们执行简单的运算。\n", "\n", "它们可能会很有用,例如,将与同一任务相关的不同数据集组合在一起,或者便于对固定数量的不同任务中的模型进行[基准测试](https://ruder.io/nlp-benchmarking/)。" ] }, { "cell_type": "markdown", "metadata": { "id": "WZjxBV9E79Fl" }, "source": [ "## 安装\n", "\n", "首先,安装几个软件包:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1AnxnW65I_FC" }, "outputs": [], "source": [ "# Use tfds-nightly to ensure access to the latest features.\n", "!pip install -q tfds-nightly tensorflow\n", "!pip install -U conllu" ] }, { "cell_type": "markdown", "metadata": { "id": "81CCGS5R8GeV" }, "source": [ "将 TensorFlow 和 TensorFlow Datasets 软件包导入您的开发环境:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-hxMPT0wIu3f" }, "outputs": [], "source": [ "import pprint\n", "\n", "import tensorflow as tf\n", "import tensorflow_datasets as tfds" ] }, { "cell_type": "markdown", "metadata": { "id": "at0bMS_jIdjt" }, "source": [ "数据集集合提供了一种简单的方式,可以将 Tensorflow Datasets 中任意数量的现有数据集组合在一起,并对它们执行简单的运算。\n", "\n", "它们可能会很有用,例如,将与同一任务相关的不同数据集组合在一起,或者便于对固定数量的不同任务中的模型进行[基准测试](https://ruder.io/nlp-benchmarking/)。" ] }, { "cell_type": "markdown", "metadata": { "id": "aLvkZBKwIdmL" }, "source": [ "## 查找可用的数据集集合\n", "\n", "所有数据集集合构建工具都是 `tfds.core.dataset_collection_builder.DatasetCollection` 的子类。\n", "\n", "要获取可用构建工具的列表,请使用 `tfds.list_dataset_collections()`。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R14uGGzKItDz" }, "outputs": [], "source": [ "tfds.list_dataset_collections()" ] }, { "cell_type": "markdown", "metadata": { "id": "Jpcq2AMvI5K1" }, "source": [ "## 加载并检查数据集集合\n", "\n", "加载数据集集合的最简单方式是使用 [`tfds.dataset_collection`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/dataset_collection) 命令实例化 `DatasetCollectionLoader` 对象。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "leIwyl9aI3WA" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme')" ] }, { "cell_type": "markdown", "metadata": { "id": "KgjomybjY7qI" }, "source": [ "可以按照与加载TFDS 数据集相同的语法加载特定的数据集集合版本:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pyILkuYJY6ts" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme:1.0.0')" ] }, { "cell_type": "markdown", "metadata": { "id": "uKOJ6CNQKG9S" }, "source": [ "数据集集合加载程序可以显示有关集合的信息:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kwk4PVDoKEAC" }, "outputs": [], "source": [ "collection_loader.print_info()" ] }, { "cell_type": "markdown", "metadata": { "id": "2FlLLbwuLLTu" }, "source": [ "数据集加载程序还可以显示有关集合中包含的数据集的信息:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IxNJEie6K55T" }, "outputs": [], "source": [ "collection_loader.print_datasets()" ] }, { "cell_type": "markdown", "metadata": { "id": "oGxorc3kLwRj" }, "source": [ "### 从数据集集合加载数据集\n", "\n", "从集合中加载一个数据集的最简单方式是使用 `DatasetCollectionLoader` 对象的 `load_dataset` 方法,此方法通过调用 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 加载所需的数据集。\n", "\n", "此调用返回拆分名称的字典和相应的 `tf.data.Dataset`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UP1nRj4ILwb6" }, "outputs": [], "source": [ "splits = collection_loader.load_dataset(\"ner\")\n", "\n", "pprint.pprint(splits)" ] }, { "cell_type": "markdown", "metadata": { "id": "2spLEgN1Lwmm" }, "source": [ "`load_dataset` 接受以下可选形参:\n", "\n", "- `split`:要加载的拆分。它接受单个拆分 (`split=\"test\"`) 或拆分列表:(`split=[\"train\", \"test\"]`)。如果未指定,它将加载给定数据集的所有拆分。\n", "- `loader_kwargs`:要传递给 `tfds.load` 函数的关键字实参。有关不同加载选项的全面概览,请参阅 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 文档。" ] }, { "cell_type": "markdown", "metadata": { "id": "aClLU4eAh2oC" }, "source": [ "### 从数据集集合加载多个数据集\n", "\n", "从集合中加载多个数据集的最简单方式是使用 `DatasetCollectionLoader` 对象的 `load_dataset` 方法,此方法通过调用 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 加载所需的数据集。\n", "\n", "它返回一个数据集名称字典,每个字典都与一个拆分名称字典和相应的 `tf.data.Dataset` 相关联,如下例所示:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sEG5744Oh2vQ" }, "outputs": [], "source": [ "datasets = collection_loader.load_datasets(['xnli', 'bucc'])\n", "\n", "pprint.pprint(datasets)" ] }, { "cell_type": "markdown", "metadata": { "id": "WF0kNqwsiN1Y" }, "source": [ "`load_all_datasets` 方法加载给定集合的*所有*可用数据集:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QX-M3xcjiM35" }, "outputs": [], "source": [ "all_datasets = collection_loader.load_all_datasets()\n", "\n", "pprint.pprint(all_datasets)" ] }, { "cell_type": "markdown", "metadata": { "id": "GXxVztK5kAHh" }, "source": [ "`load_datasets` 方法接受以下可选形参:\n", "\n", "- `split`:要加载的拆分。它接受单个拆分 (`split=\"test\"`) 或拆分列表:(`split=[\"train\", \"test\"]`)。如果未指定,它将加载给定数据集的所有拆分。\n", "- `loader_kwargs`:要传递给 `tfds.load` 函数的关键字实参。有关不同加载选项的全面概览,请参阅 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 文档。" ] }, { "cell_type": "markdown", "metadata": { "id": "d4JoreSHfcKZ" }, "source": [ "### 制定 `loader_kwargs`\n", "\n", "`loader_kwargs` 是要传递给 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 函数的可选关键字实参。可以通过三种方式指定这些实参:\n", "\n", "1. 初始化 `DatasetCollectionLoader` 类时:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TjgZSIlvfcSP" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "uJcEZl97Xj6Y" }, "source": [ "1. 使用 `DatasetCollectioLoader` 的 `set_loader_kwargs` 方法:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zrysflp-k1d3" }, "outputs": [], "source": [ "collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "Ra-ZonhfXkLD" }, "source": [ "1. 作为 `load_dataset`、`load_datasets` 和 `load_all_datasets` 方法的可选形参。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rHSu-8GnlGTk" }, "outputs": [], "source": [ "dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "BJDGoeAqmJAQ" }, "source": [ "### 反馈\n", "\n", "我们一直在致力于改进数据集创建工作流,但只有在充分了解存在哪些问题后才能有效地实施改进。您在创建数据集时遇到了哪些问题和错误?是否存在令您困惑的部分、样板问题,以及首次创建失败的经历?请在 [GitHub](https://github.com/tensorflow/datasets/issues) 上分享您的反馈。" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "dataset_collections.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }