{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "oIMvgrGMe7ZF" }, "source": [ "##### Copyright 2022 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2023-10-03T09:24:50.698432Z", "iopub.status.busy": "2023-10-03T09:24:50.697991Z", "iopub.status.idle": "2023-10-03T09:24:50.702102Z", "shell.execute_reply": "2023-10-03T09:24:50.701481Z" }, "id": "n25wrPRbfCGc" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "ZyGUj_q7IdfQ" }, "source": [ "# Dataset Collections" ] }, { "cell_type": "markdown", "metadata": { "id": "LpO0um1nez_q" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " View on TensorFlow.org\n", " \n", " Run in Google Colab\n", " \n", " View on GitHub\n", " \n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "p8AFT7CpSzBG" }, "source": [ "## Overview\n", "\n", "Dataset collections provide a simple way to group together an arbitrary number\n", "of existing TFDS datasets, and to perform simple operations over them.\n", "\n", "They can be useful, for example, to group together different datasets related to the same task, or for easy [benchmarking](https://ruder.io/nlp-benchmarking/) of models over a fixed number of different tasks." ] }, { "cell_type": "markdown", "metadata": { "id": "WZjxBV9E79Fl" }, "source": [ "## Setup\n", "\n", "To get started, install a few packages:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:24:50.705811Z", "iopub.status.busy": "2023-10-03T09:24:50.705316Z", "iopub.status.idle": "2023-10-03T09:24:58.665914Z", "shell.execute_reply": "2023-10-03T09:24:58.664978Z" }, "id": "1AnxnW65I_FC" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting conllu\r\n", " Obtaining dependency information for conllu from https://files.pythonhosted.org/packages/ce/3f/70a1dc5bc536755ec082b806594598a10cfffaf0de978f51d4e0e4fdfa47/conllu-4.5.3-py2.py3-none-any.whl.metadata\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading conllu-4.5.3-py2.py3-none-any.whl.metadata (19 kB)\r\n", "Downloading conllu-4.5.3-py2.py3-none-any.whl (16 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: conllu\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed conllu-4.5.3\r\n" ] } ], "source": [ "# Use tfds-nightly to ensure access to the latest features.\n", "!pip install -q tfds-nightly tensorflow\n", "!pip install -U conllu" ] }, { "cell_type": "markdown", "metadata": { "id": "81CCGS5R8GeV" }, "source": [ "Import TensorFlow and the Tensorflow Datasets package into your development environment:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:24:58.670564Z", "iopub.status.busy": "2023-10-03T09:24:58.669828Z", "iopub.status.idle": "2023-10-03T09:25:01.536271Z", "shell.execute_reply": "2023-10-03T09:25:01.535460Z" }, "id": "-hxMPT0wIu3f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-10-03 09:24:58.961730: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2023-10-03 09:24:58.961781: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2023-10-03 09:24:58.961817: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import pprint\n", "\n", "import tensorflow as tf\n", "import tensorflow_datasets as tfds" ] }, { "cell_type": "markdown", "metadata": { "id": "at0bMS_jIdjt" }, "source": [ "Dataset collections provide a simple way to group together an arbitrary number\n", "of existing datasets from Tensorflow Datasets (TFDS), and to perform simple operations over them.\n", "\n", "They can be useful, for example, to group together different datasets related to the same task, or for easy [benchmarking](https://ruder.io/nlp-benchmarking/) of 
{ "cell_type": "markdown", "metadata": { "id": "aLvkZBKwIdmL" }, "source": [ "## Find available dataset collections\n", "\n", "All dataset collection builders are subclasses of\n", "`tfds.core.dataset_collection_builder.DatasetCollection`.\n", "\n", "To get the list of available builders, use `tfds.list_dataset_collections()`.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.540839Z", "iopub.status.busy": "2023-10-03T09:25:01.540147Z", "iopub.status.idle": "2023-10-03T09:25:01.547653Z", "shell.execute_reply": "2023-10-03T09:25:01.546964Z" }, "id": "R14uGGzKItDz" }, "outputs": [ { "data": { "text/plain": [ "['longt5', 'xtreme']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfds.list_dataset_collections()" ] }, { "cell_type": "markdown", "metadata": { "id": "Jpcq2AMvI5K1" }, "source": [ "## Load and inspect a dataset collection\n", "\n", "The easiest way to load a dataset collection is to instantiate a `DatasetCollectionLoader` object using the [`tfds.dataset_collection`](https://www.tensorflow.org/datasets/api_docs/python/tfds/dataset_collection) function.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.551248Z", "iopub.status.busy": "2023-10-03T09:25:01.550667Z", "iopub.status.idle": "2023-10-03T09:25:01.555093Z", "shell.execute_reply": "2023-10-03T09:25:01.554456Z" }, "id": "leIwyl9aI3WA" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme')" ] }, { "cell_type": "markdown", "metadata": { "id": "KgjomybjY7qI" }, "source": [ "Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.558423Z", "iopub.status.busy": "2023-10-03T09:25:01.557889Z", "iopub.status.idle": "2023-10-03T09:25:01.562100Z", "shell.execute_reply": "2023-10-03T09:25:01.561474Z" }, "id": "pyILkuYJY6ts" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme:1.0.0')" ] }, { "cell_type": "markdown", "metadata": { "id": "uKOJ6CNQKG9S" }, "source": [ "A dataset collection loader can display information about the collection:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.565420Z", "iopub.status.busy": "2023-10-03T09:25:01.564863Z", "iopub.status.idle": "2023-10-03T09:25:01.569461Z", "shell.execute_reply": "2023-10-03T09:25:01.568817Z" }, "id": "kwk4PVDoKEAC" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset collection: xtreme\n", "Version: 1.0.0\n", "Description: # Xtreme Benchmark\n", "\n", "The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME)\n", "benchmark is a benchmark for the evaluation of the cross-lingual generalization\n", "ability of pre-trained multilingual models. It covers 40 typologically diverse\n", "languages (spanning 12 language families) and includes nine tasks that\n", "collectively require reasoning about different levels of syntax and semantics.\n", "The languages in XTREME are selected to maximize language diversity, coverage\n", "in existing tasks, and availability of training data. 
Among these are many\n", "under-studied languages, such as the Dravidian languages Tamil (spoken in\n", "southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken\n", "mainly in southern India), and the Niger-Congo languages Swahili and Yoruba,\n", "spoken in Africa.\n", "\n", "For a full description of the benchmark,\n", "see the [paper](https://arxiv.org/abs/2003.11080).\n", "\n", "Citation:\n", "@article{hu2020xtreme,\n", " author = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham\n", " Neubig and Orhan Firat and Melvin Johnson},\n", " title = {XTREME: A Massively Multilingual Multi-task Benchmark for\n", " Evaluating Cross-lingual Generalization},\n", " journal = {CoRR},\n", " volume = {abs/2003.11080},\n", " year = {2020},\n", " archivePrefix = {arXiv},\n", " eprint = {2003.11080}\n", "}\n", "\n" ] } ], "source": [ "collection_loader.print_info()" ] }, { "cell_type": "markdown", "metadata": { "id": "2FlLLbwuLLTu" }, "source": [ "The dataset loader can also display information about the datasets contained in the collection:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.572660Z", "iopub.status.busy": "2023-10-03T09:25:01.572265Z", "iopub.status.idle": "2023-10-03T09:25:01.577230Z", "shell.execute_reply": "2023-10-03T09:25:01.576595Z" }, "id": "IxNJEie6K55T" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The dataset collection xtreme (version: 1.0.0) contains the datasets:\n", " - xnli: DatasetReference(dataset_name='xtreme_xnli', namespace=None, config=None, version='1.1.0', data_dir=None, split_mapping=None)\n", " - pawsx: DatasetReference(dataset_name='xtreme_pawsx', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", " - pos: DatasetReference(dataset_name='xtreme_pos', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", " - ner: DatasetReference(dataset_name='wikiann', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", " - xquad: DatasetReference(dataset_name='xquad', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)\n", " - mlqa: DatasetReference(dataset_name='mlqa', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", " - tydiqa: DatasetReference(dataset_name='tydi_qa', namespace=None, config=None, version='3.0.0', data_dir=None, split_mapping=None)\n", " - bucc: DatasetReference(dataset_name='bucc', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", " - tatoeba: DatasetReference(dataset_name='tatoeba', namespace=None, config=None, version='1.0.0', data_dir=None, split_mapping=None)\n", "\n" ] } ], "source": [ "collection_loader.print_datasets()" ] }, { "cell_type": "markdown", "metadata": { "id": "oGxorc3kLwRj" }, "source": [ "### Loading datasets from a dataset collection\n", "\n", "The easiest way to load one dataset from a collection is to use a `DatasetCollectionLoader` object's `load_dataset` method, which loads the required dataset by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).\n", "\n", "This call returns a dictionary of split names and the corresponding `tf.data.Dataset`s:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:01.580533Z", "iopub.status.busy": "2023-10-03T09:25:01.580015Z", "iopub.status.idle": "2023-10-03T09:25:02.941649Z", 
"shell.execute_reply": "2023-10-03T09:25:02.940819Z" }, "id": "UP1nRj4ILwb6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,\n", " 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>}\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2023-10-03 09:25:02.792501: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:268] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n" ] } ], "source": [ "splits = collection_loader.load_dataset(\"ner\")\n", "\n", "pprint.pprint(splits)" ] }, { "cell_type": "markdown", "metadata": { "id": "2spLEgN1Lwmm" }, "source": [ "`load_dataset` accepts the following optional parameters:\n", "\n", "* `split`: which split(s) to load. It accepts a single split (`split=\"test\"`) or a list of splits: (`split=[\"train\", \"test\"]`). If not specified, it will load all splits for the given dataset.\n", "* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options." 
] }, { "cell_type": "markdown", "metadata": { "id": "aClLU4eAh2oC" }, "source": [ "### Loading multiple datasets from a dataset collection\n", "\n", "The easiest way to load multiple datasets from a collection is to use the `DatasetCollectionLoader` object's `load_datasets` method, which loads the required datasets by calling [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).\n", "\n", "It returns a dictionary of dataset names, each one of which is associated with a dictionary of split names and the corresponding `tf.data.Dataset`s, as in the following example:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:02.946725Z", "iopub.status.busy": "2023-10-03T09:25:02.946155Z", "iopub.status.idle": "2023-10-03T09:25:05.077290Z", "shell.execute_reply": "2023-10-03T09:25:05.076516Z" }, "id": "sEG5744Oh2vQ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)}}>}}\n" ] } ], "source": [ "datasets = collection_loader.load_datasets(['xnli', 'bucc'])\n", "\n", "pprint.pprint(datasets)" ] }, { "cell_type": "markdown", "metadata": { "id": "WF0kNqwsiN1Y" }, "source": [ "The `load_all_datasets` method loads *all* available datasets for a given collection:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:05.081226Z", "iopub.status.busy": "2023-10-03T09:25:05.080695Z", "iopub.status.idle": "2023-10-03T09:25:14.340165Z", "shell.execute_reply": "2023-10-03T09:25:14.339415Z" }, "id": "QX-M3xcjiM35" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'bucc': {'test': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': 
TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'source_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'mlqa': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'ner': {'test': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,\n", " 'train': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'langs': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'spans': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'tags': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None)}>},\n", " 'pawsx': {'train': <_PrefetchDataset element_spec={'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'sentence1': TensorSpec(shape=(), dtype=tf.string, name=None), 'sentence2': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'pos': {'dev': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,\n", " 'test': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>,\n", " 'train': <_PrefetchDataset element_spec={'tokens': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'upos': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}>},\n", " 'tatoeba': {'train': <_PrefetchDataset element_spec={'source_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'source_sentence': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_language': TensorSpec(shape=(), dtype=tf.string, name=None), 'target_sentence': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'tydiqa': {'train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, 
name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), 
dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-ar': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-bn': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-en': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-fi': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-id': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-ko': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-ru': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 
'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-sw': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'validation-te': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>},\n", " 'xnli': {'train': <_PrefetchDataset element_spec={'hypothesis': {'language': TensorSpec(shape=(None,), dtype=tf.string, name=None), 'translation': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': {'ar': TensorSpec(shape=(), dtype=tf.string, name=None), 'bg': TensorSpec(shape=(), dtype=tf.string, name=None), 'de': TensorSpec(shape=(), dtype=tf.string, name=None), 'el': TensorSpec(shape=(), dtype=tf.string, name=None), 'en': TensorSpec(shape=(), dtype=tf.string, name=None), 'es': TensorSpec(shape=(), dtype=tf.string, name=None), 'fr': TensorSpec(shape=(), dtype=tf.string, name=None), 'hi': TensorSpec(shape=(), dtype=tf.string, name=None), 'ru': TensorSpec(shape=(), dtype=tf.string, name=None), 'sw': TensorSpec(shape=(), dtype=tf.string, name=None), 'th': TensorSpec(shape=(), dtype=tf.string, name=None), 'tr': TensorSpec(shape=(), dtype=tf.string, name=None), 'ur': TensorSpec(shape=(), dtype=tf.string, name=None), 'vi': TensorSpec(shape=(), dtype=tf.string, name=None), 'zh': TensorSpec(shape=(), dtype=tf.string, name=None)}}>},\n", " 'xquad': {'test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-dev': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-test': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), 
dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>,\n", " 'translate-train': <_PrefetchDataset element_spec={'answers': {'answer_start': TensorSpec(shape=(None,), dtype=tf.int32, name=None), 'text': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, 'context': TensorSpec(shape=(), dtype=tf.string, name=None), 'id': TensorSpec(shape=(), dtype=tf.string, name=None), 'question': TensorSpec(shape=(), dtype=tf.string, name=None), 'title': TensorSpec(shape=(), dtype=tf.string, name=None)}>}}\n" ] } ], "source": [ "all_datasets = collection_loader.load_all_datasets()\n", "\n", "pprint.pprint(all_datasets)" ] }, { "cell_type": "markdown", "metadata": { "id": "GXxVztK5kAHh" }, "source": [ "The `load_datasets` method accepts the following optional parameters:\n", "\n", "* `split`: which split(s) to load. It accepts a single split (`split=\"test\"`) or a list of splits (`split=[\"train\", \"test\"]`). If not specified, it will load all splits for each requested dataset.\n", "* `loader_kwargs`: keyword arguments to be passed to the `tfds.load` function. Refer to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) documentation for a comprehensive overview of the different loading options." ] }, { "cell_type": "markdown", "metadata": { "id": "d4JoreSHfcKZ" }, "source": [ "### Specifying `loader_kwargs`\n", "\n", "The `loader_kwargs` are optional keyword arguments to be passed to the [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) function.\n", "They can be specified in three ways:\n", "\n", "1. When initializing the `DatasetCollectionLoader` class:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:14.344228Z", "iopub.status.busy": "2023-10-03T09:25:14.343534Z", "iopub.status.idle": "2023-10-03T09:25:14.348428Z", "shell.execute_reply": "2023-10-03T09:25:14.347754Z" }, "id": "TjgZSIlvfcSP" }, "outputs": [], "source": [ "collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "uJcEZl97Xj6Y" }, "source": [ "2. Using `DatasetCollectionLoader`'s `set_loader_kwargs` method:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:14.351676Z", "iopub.status.busy": "2023-10-03T09:25:14.351144Z", "iopub.status.idle": "2023-10-03T09:25:14.354600Z", "shell.execute_reply": "2023-10-03T09:25:14.353943Z" }, "id": "zrysflp-k1d3" }, "outputs": [], "source": [ "collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "Ra-ZonhfXkLD" }, "source": [ "3. As optional parameters to the `load_dataset`, `load_datasets` and `load_all_datasets` methods."
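, "\n", "\n", "For instance, `loader_kwargs` can be passed directly to a single `load_datasets` call (a minimal sketch; the dataset names are those listed in this collection):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "per-call-loader-kwargs-sketch" }, "outputs": [], "source": [ "# Pass loader_kwargs directly in the call (option 3 above).\n", "datasets = collection_loader.load_datasets(\n", "    ['xnli', 'bucc'], loader_kwargs=dict(batch_size=10, try_gcs=False)\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "per-call-loader-kwargs-note" }, "source": [ "The same pattern works with `load_dataset` for a single dataset:"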
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-10-03T09:25:14.357847Z", "iopub.status.busy": "2023-10-03T09:25:14.357352Z", "iopub.status.idle": "2023-10-03T09:25:15.233696Z", "shell.execute_reply": "2023-10-03T09:25:15.232940Z" }, "id": "rHSu-8GnlGTk" }, "outputs": [], "source": [ "dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))" ] }, { "cell_type": "markdown", "metadata": { "id": "BJDGoeAqmJAQ" }, "source": [ "### Feedback\n", "\n", "We are continuously trying to improve the dataset creation workflow, but can\n", "only do so if we are aware of the issues. Which issues or errors did you\n", "encounter while creating the dataset collection? Was there a part that was confusing\n", "or boilerplate, or didn't work the first time? Please share your feedback on\n", "[GitHub](https://github.com/tensorflow/datasets/issues)." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "dataset_collections.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 0 }