{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2020-10-27T16:23:36.369863Z", "iopub.status.busy": "2020-10-27T16:23:36.369204Z", "iopub.status.idle": "2020-10-27T16:23:36.371588Z", "shell.execute_reply": "2020-10-27T16:23:36.371092Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## Overview\n", "\n", "This tutorial demonstrates the `tfio.genome` package, which provides commonly used genomics IO functionality: reading several genomics file formats and performing common data-preparation operations (for example, one-hot encoding sequences or converting Phred quality scores into probabilities). \n", "\n", "This package uses the [Google Nucleus](https://github.com/google/nucleus) library to provide some of the core functionality. " ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:36.379794Z", "iopub.status.busy": "2020-10-27T16:23:36.376808Z", "iopub.status.idle": "2020-10-27T16:23:38.755863Z", "shell.execute_reply": "2020-10-27T16:23:38.756317Z" }, "id": "IqR2PQG4ZaZ0" }, "outputs": [], "source": [ "try:\n", " %tensorflow_version 2.x\n", "except Exception:\n", " pass\n", "!pip install -q tensorflow-io" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:38.760878Z", "iopub.status.busy": "2020-10-27T16:23:38.760213Z", "iopub.status.idle": "2020-10-27T16:23:45.683755Z", "shell.execute_reply": "2020-10-27T16:23:45.683162Z" }, "id": "bkF2WtCMaJ-3" }, "outputs": [], "source": [ "import tensorflow_io as tfio\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "id": "6wkjlql3cOy0" }, "source": [ "## FASTQ Data\n", "FASTQ is a common genomics file format that stores both sequence information and base quality information.\n", "\n", "First, let's download a sample `fastq` file." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:45.695050Z", "iopub.status.busy": "2020-10-27T16:23:45.694289Z", "iopub.status.idle": "2020-10-27T16:23:46.017771Z", "shell.execute_reply": "2020-10-27T16:23:46.017107Z" }, "id": "yASvppCxceBu" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent Left Speed\r\n", "\r", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "100 407 100 407 0 0 2035 0 --:--:-- --:--:-- --:--:-- 2035\r\n" ] } ], "source": [ "# Download some sample data:\n", "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq" ] }, { "cell_type": "markdown", "metadata": { "id": "3zekWXlVdprb" }, "source": [ "### Read FASTQ Data\n", "Now, let's use `tfio.genome.read_fastq` to read this file (note that a `tf.data` API is coming soon)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:46.177995Z", "iopub.status.busy": "2020-10-27T16:23:46.177197Z", "iopub.status.idle": "2020-10-27T16:23:46.182018Z", "shell.execute_reply": "2020-10-27T16:23:46.181520Z" }, "id": "vl761cHTc7N1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[b'GATTACA'\n", " b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'\n", " b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)\n", "tf.Tensor(\n", "[b'BB>B@FA'\n", " b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'\n", " b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' 
b'FAD'], shape=(4,), dtype=string)\n" ] } ], "source": [ "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n", "print(fastq_data.sequences)\n", "print(fastq_data.raw_quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "qxHjVKXzdx5W" }, "source": [ "As you can see, the returned `fastq_data` has two fields: `fastq_data.sequences`, a string tensor of all sequences in the FASTQ file (each of which can be a different length), and `fastq_data.raw_quality`, which holds the Phred-encoded quality of each base read in the corresponding sequence.\n", "\n", "### Quality\n", "If needed, you can use a helper op to convert this quality information into probabilities." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:46.186813Z", "iopub.status.busy": "2020-10-27T16:23:46.186186Z", "iopub.status.idle": "2020-10-27T16:23:46.445815Z", "shell.execute_reply": "2020-10-27T16:23:46.445308Z" }, "id": "6IYxfFI4eQTM" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use fn_output_signature instead\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(4, None, 1)\n", "[ 7 68 46 3]\n", "\n" ] } ], "source": [ "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n", "print(quality.shape)\n", "print(quality.row_lengths().numpy())\n", "print(quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "bg3wzTFzhcfS" }, "source": [ "### One hot encodings\n", "You may also want to encode the genome sequence data (which consists of `A`, `T`, `C`, and `G` bases) using a one-hot encoder. 
There's a built-in operation that can help with this.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:46.454019Z", "iopub.status.busy": "2020-10-27T16:23:46.453342Z", "iopub.status.idle": "2020-10-27T16:23:47.051270Z", "shell.execute_reply": "2020-10-27T16:23:47.050657Z" }, "id": "oAiepmy8h32a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "(4, None, 4)\n" ] } ], "source": [ "one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)\n", "print(one_hot)\n", "print(one_hot.shape)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2020-10-27T16:23:47.055452Z", "iopub.status.busy": "2020-10-27T16:23:47.054797Z", "iopub.status.idle": "2020-10-27T16:23:47.057483Z", "shell.execute_reply": "2020-10-27T16:23:47.056863Z" }, "id": "oAiepmy8h32a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Convert DNA sequences into a one hot nucleotide encoding.\n", "\n", " Each nucleotide in each sequence is mapped as follows:\n", " A -> [1, 0, 0, 0]\n", " C -> [0, 1, 0, 0]\n", " G -> [0 ,0 ,1, 0]\n", " T -> [0, 0, 0, 1]\n", "\n", " If for some reason a non (A, T, C, G) character exists in the string, it is\n", " currently mapped to a error one hot encoding [1, 1, 1, 1].\n", "\n", " Args:\n", " sequences: A tf.string tensor where each string represents a DNA sequence\n", "\n", " Returns:\n", " tf.RaggedTensor: The output sequences with nucleotides one hot encoded.\n", " \n" ] } ], "source": [ "print(tfio.genome.sequences_to_onehot.__doc__)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "Tce3stUlHN0L" ], "name": "genome.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 0 }