{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-02-13T02:56:30.463113Z", "iopub.status.busy": "2021-02-13T02:56:30.462414Z", "iopub.status.idle": "2021-02-13T02:56:30.464578Z", "shell.execute_reply": "2021-02-13T02:56:30.464977Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## 概要\n", "\n", "このチュートリアルでは、一般的に使用されるゲノミクス IO 機能を提供するtfio.genomeパッケージについて解説します。これは、いくつかのゲノミクスファイル形式を読み取り、データを準備するための一般的な演算を提供します (例: One-Hot エンコーディングまたは Phred クオリティスコアを確率に解析します)。\n", "\n", "このパッケージは、[Google Nucleus](https://github.com/google/nucleus) ライブラリを使用して、主な機能の一部を提供します。 " ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## セットアップ" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:30.475377Z", "iopub.status.busy": "2021-02-13T02:56:30.474581Z", "iopub.status.idle": "2021-02-13T02:56:33.173158Z", "shell.execute_reply": "2021-02-13T02:56:33.172450Z" }, "id": "IqR2PQG4ZaZ0" }, "outputs": [], "source": [ "try:\n", " %tensorflow_version 2.x\n", "except Exception:\n", " pass\n", "!pip install -q tensorflow-io" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:33.177901Z", "iopub.status.busy": "2021-02-13T02:56:33.177233Z", "iopub.status.idle": "2021-02-13T02:56:39.675692Z", "shell.execute_reply": "2021-02-13T02:56:39.674975Z" }, "id": "bkF2WtCMaJ-3" }, "outputs": [], "source": [ "import tensorflow_io as tfio\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "id": "6wkjlql3cOy0" }, "source": [ "## FASTQ データ\n", "\n", "FASTQ は、基本的な品質情報に加えて両方の配列情報を保存する一般的なゲノミクスファイル形式です。\n", "\n", "まず、サンプルの`fastq`ファイルをダウンロードします。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:39.687620Z", "iopub.status.busy": "2021-02-13T02:56:39.686821Z", "iopub.status.idle": "2021-02-13T02:56:40.140737Z", "shell.execute_reply": "2021-02-13T02:56:40.139940Z" }, "id": "yASvppCxceBu" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent 
Left Speed\r\n", "\r", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "100 407 100 407 0 0 1229 0 --:--:-- --:--:-- --:--:-- 1229\r\n" ] } ], "source": [ "# Download some sample data:\n", "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq" ] }, { "cell_type": "markdown", "metadata": { "id": "3zekWXlVdprb" }, "source": [ "### Read FASTQ Data\n", "\n", "Use `tfio.genome.read_fastq` to read this file (a `tf.data` API is planned for a future release)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:40.518897Z", "iopub.status.busy": "2021-02-13T02:56:40.518132Z", "iopub.status.idle": "2021-02-13T02:56:42.064215Z", "shell.execute_reply": "2021-02-13T02:56:42.063646Z" }, "id": "vl761cHTc7N1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[b'GATTACA'\n", " b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'\n", " b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)\n", "tf.Tensor(\n", "[b'BB>B@FA'\n", " b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'\n", " b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' 
b'FAD'], shape=(4,), dtype=string)\n" ] } ], "source": [ "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n", "print(fastq_data.sequences)\n", "print(fastq_data.raw_quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "qxHjVKXzdx5W" }, "source": [ "As you can see, the returned `fastq_data` has two fields. `fastq_data.sequences` is a string tensor of all the sequences in the fastq file (each of which can be a different length). `fastq_data.raw_quality` holds the Phred-encoded quality information for each base read in those sequences.\n", "\n", "### Quality\n", "\n", "If you are interested, you can use a helper op to convert this quality information into probabilities." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:42.075580Z", "iopub.status.busy": "2021-02-13T02:56:42.068847Z", "iopub.status.idle": "2021-02-13T02:56:42.730660Z", "shell.execute_reply": "2021-02-13T02:56:42.729985Z" }, "id": "6IYxfFI4eQTM" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use fn_output_signature instead\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(4, None, 1)\n", "[ 7 68 46 3]\n", "\n" ] } ], "source": [ "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n", "print(quality.shape)\n", "print(quality.row_lengths().numpy())\n", "print(quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "bg3wzTFzhcfS" }, "source": [ "### One-Hot Encoding\n", "\n", "You may also want to encode the genome sequence data (which consists of the bases `A`, `T`, `C`, and `G`) using a one-hot encoder. There is a built-in operation that can help with this.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:42.736054Z", "iopub.status.busy": "2021-02-13T02:56:42.735349Z", "iopub.status.idle": "2021-02-13T02:56:42.738416Z", "shell.execute_reply": 
"2021-02-13T02:56:42.737766Z" }, "id": "oAiepmy8h32a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Convert DNA sequences into a one hot nucleotide encoding.\n", "\n", " Each nucleotide in each sequence is mapped as follows:\n", " A -> [1, 0, 0, 0]\n", " C -> [0, 1, 0, 0]\n", " G -> [0 ,0 ,1, 0]\n", " T -> [0, 0, 0, 1]\n", "\n", " If for some reason a non (A, T, C, G) character exists in the string, it is\n", " currently mapped to a error one hot encoding [1, 1, 1, 1].\n", "\n", " Args:\n", " sequences: A tf.string tensor where each string represents a DNA sequence\n", "\n", " Returns:\n", " tf.RaggedTensor: The output sequences with nucleotides one hot encoded.\n", " \n" ] } ], "source": [ "print(tfio.genome.sequences_to_onehot.__doc__)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2021-02-13T02:56:42.742823Z", "iopub.status.busy": "2021-02-13T02:56:42.742094Z", "iopub.status.idle": "2021-02-13T02:56:42.745069Z", "shell.execute_reply": "2021-02-13T02:56:42.745533Z" }, "id": "oAiepmy8h32a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Convert DNA sequences into a one hot nucleotide encoding.\n", "\n", " Each nucleotide in each sequence is mapped as follows:\n", " A -> [1, 0, 0, 0]\n", " C -> [0, 1, 0, 0]\n", " G -> [0 ,0 ,1, 0]\n", " T -> [0, 0, 0, 1]\n", "\n", " If for some reason a non (A, T, C, G) character exists in the string, it is\n", " currently mapped to a error one hot encoding [1, 1, 1, 1].\n", "\n", " Args:\n", " sequences: A tf.string tensor where each string represents a DNA sequence\n", "\n", " Returns:\n", " tf.RaggedTensor: The output sequences with nucleotides one hot encoded.\n", " \n" ] } ], "source": [ "print(tfio.genome.sequences_to_onehot.__doc__)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "Tce3stUlHN0L" ], "name": "genome.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 0 }