{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oIMvgrGMe7ZF"
      },
      "source": [
        "##### Copyright 2022 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "n25wrPRbfCGc"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZyGUj_q7IdfQ"
      },
      "source": [
        "# 数据集集合"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LpO0um1nez_q"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>     <a target=\"_blank\" href=\"https://tensorflow.google.cn/datasets/dataset_collections\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">在 TensorFlow.org 上查看</a> </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/datasets/dataset_collections.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 中运行</a> </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/datasets/dataset_collections.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 GitHub 上查看源代码</a>\n",
        "</td>\n",
        "  <td>     <a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/datasets/dataset_collections.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\">下载笔记本</a> </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "p8AFT7CpSzBG"
      },
      "source": [
        "## 概述\n",
        "\n",
        "数据集集合提供了一种简单的方法，可以将任意数量的现有 TFDS 数据集组合在一起，并对它们执行简单的运算。\n",
        "\n",
        "它们可能会很有用，例如，将与同一任务相关的不同数据集组合在一起，或者便于对固定数量的不同任务中的模型进行[基准测试](https://ruder.io/nlp-benchmarking/)。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WZjxBV9E79Fl"
      },
      "source": [
        "## 安装\n",
        "\n",
        "首先，安装几个软件包："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1AnxnW65I_FC"
      },
      "outputs": [],
      "source": [
        "# Use tfds-nightly to ensure access to the latest features.\n",
        "!pip install -q tfds-nightly tensorflow\n",
        "!pip install -U conllu"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "81CCGS5R8GeV"
      },
      "source": [
        "将 TensorFlow 和 TensorFlow Datasets 软件包导入您的开发环境："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-hxMPT0wIu3f"
      },
      "outputs": [],
      "source": [
        "import pprint\n",
        "\n",
        "import tensorflow as tf\n",
        "import tensorflow_datasets as tfds"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "at0bMS_jIdjt"
      },
      "source": [
        "数据集集合提供了一种简单的方式，可以将 Tensorflow Datasets 中任意数量的现有数据集组合在一起，并对它们执行简单的运算。\n",
        "\n",
        "它们可能会很有用，例如，将与同一任务相关的不同数据集组合在一起，或者便于对固定数量的不同任务中的模型进行[基准测试](https://ruder.io/nlp-benchmarking/)。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aLvkZBKwIdmL"
      },
      "source": [
        "## 查找可用的数据集集合\n",
        "\n",
        "所有数据集集合构建工具都是 `tfds.core.dataset_collection_builder.DatasetCollection` 的子类。\n",
        "\n",
        "要获取可用构建工具的列表，请使用 `tfds.list_dataset_collections()`。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "R14uGGzKItDz"
      },
      "outputs": [],
      "source": [
        "tfds.list_dataset_collections()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Jpcq2AMvI5K1"
      },
      "source": [
        "## 加载并检查数据集集合\n",
        "\n",
        "加载数据集集合的最简单方式是使用 [`tfds.dataset_collection`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/dataset_collection) 命令实例化 `DatasetCollectionLoader` 对象。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "leIwyl9aI3WA"
      },
      "outputs": [],
      "source": [
        "collection_loader = tfds.dataset_collection('xtreme')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KgjomybjY7qI"
      },
      "source": [
        "可以按照与加载TFDS 数据集相同的语法加载特定的数据集集合版本："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pyILkuYJY6ts"
      },
      "outputs": [],
      "source": [
        "collection_loader = tfds.dataset_collection('xtreme:1.0.0')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uKOJ6CNQKG9S"
      },
      "source": [
        "数据集集合加载程序可以显示有关集合的信息："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kwk4PVDoKEAC"
      },
      "outputs": [],
      "source": [
        "collection_loader.print_info()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2FlLLbwuLLTu"
      },
      "source": [
        "数据集加载程序还可以显示有关集合中包含的数据集的信息："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IxNJEie6K55T"
      },
      "outputs": [],
      "source": [
        "collection_loader.print_datasets()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oGxorc3kLwRj"
      },
      "source": [
        "### 从数据集集合加载数据集\n",
        "\n",
        "从集合中加载一个数据集的最简单方式是使用 `DatasetCollectionLoader` 对象的 `load_dataset` 方法，此方法通过调用 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 加载所需的数据集。\n",
        "\n",
        "此调用返回拆分名称的字典和相应的 `tf.data.Dataset`："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UP1nRj4ILwb6"
      },
      "outputs": [],
      "source": [
        "splits = collection_loader.load_dataset(\"ner\")\n",
        "\n",
        "pprint.pprint(splits)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2spLEgN1Lwmm"
      },
      "source": [
        "`load_dataset` 接受以下可选形参：\n",
        "\n",
        "- `split`：要加载的拆分。它接受单个拆分 (`split=\"test\"`) 或拆分列表：(`split=[\"train\", \"test\"]`)。如果未指定，它将加载给定数据集的所有拆分。\n",
        "- `loader_kwargs`：要传递给 `tfds.load` 函数的关键字实参。有关不同加载选项的全面概览，请参阅 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 文档。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aClLU4eAh2oC"
      },
      "source": [
        "### 从数据集集合加载多个数据集\n",
        "\n",
        "从集合中加载多个数据集的最简单方式是使用 `DatasetCollectionLoader` 对象的 `load_dataset` 方法，此方法通过调用 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 加载所需的数据集。\n",
        "\n",
        "它返回一个数据集名称字典，每个字典都与一个拆分名称字典和相应的 `tf.data.Dataset` 相关联，如下例所示："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sEG5744Oh2vQ"
      },
      "outputs": [],
      "source": [
        "datasets = collection_loader.load_datasets(['xnli', 'bucc'])\n",
        "\n",
        "pprint.pprint(datasets)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WF0kNqwsiN1Y"
      },
      "source": [
        "`load_all_datasets` 方法加载给定集合的*所有*可用数据集："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QX-M3xcjiM35"
      },
      "outputs": [],
      "source": [
        "all_datasets = collection_loader.load_all_datasets()\n",
        "\n",
        "pprint.pprint(all_datasets)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GXxVztK5kAHh"
      },
      "source": [
        "`load_datasets` 方法接受以下可选形参：\n",
        "\n",
        "- `split`：要加载的拆分。它接受单个拆分 (`split=\"test\"`) 或拆分列表：(`split=[\"train\", \"test\"]`)。如果未指定，它将加载给定数据集的所有拆分。\n",
        "- `loader_kwargs`：要传递给 `tfds.load` 函数的关键字实参。有关不同加载选项的全面概览，请参阅 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 文档。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d4JoreSHfcKZ"
      },
      "source": [
        "### 制定 `loader_kwargs`\n",
        "\n",
        "`loader_kwargs` 是要传递给 [`tfds.load`](https://tensorflow.google.cn/datasets/api_docs/python/tfds/load) 函数的可选关键字实参。可以通过三种方式指定这些实参：\n",
        "\n",
        "1. 初始化 `DatasetCollectionLoader` 类时："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TjgZSIlvfcSP"
      },
      "outputs": [],
      "source": [
        "collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uJcEZl97Xj6Y"
      },
      "source": [
        "1. 使用 `DatasetCollectioLoader` 的 `set_loader_kwargs` 方法："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zrysflp-k1d3"
      },
      "outputs": [],
      "source": [
        "collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ra-ZonhfXkLD"
      },
      "source": [
        "1. 作为 `load_dataset`、`load_datasets` 和 `load_all_datasets` 方法的可选形参。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rHSu-8GnlGTk"
      },
      "outputs": [],
      "source": [
        "dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BJDGoeAqmJAQ"
      },
      "source": [
        "### 反馈\n",
        "\n",
        "我们一直在致力于改进数据集创建工作流，但只有在充分了解存在哪些问题后才能有效地实施改进。您在创建数据集时遇到了哪些问题和错误？是否存在令您困惑的部分、样板问题，以及首次创建失败的经历？请在 [GitHub](https://github.com/tensorflow/datasets/issues) 上分享您的反馈。"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "dataset_collections.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}