{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Tce3stUlHN0L"
   },
   "source": [
    "##### Copyright 2020 The TensorFlow Authors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "cellView": "form",
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:36.369863Z",
     "iopub.status.busy": "2020-10-27T16:23:36.369204Z",
     "iopub.status.idle": "2020-10-27T16:23:36.371588Z",
     "shell.execute_reply": "2020-10-27T16:23:36.371092Z"
    },
    "id": "tuOe1ymfHZPu"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "# https://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MfBg1C5NB3X0"
   },
   "source": [
    "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/genome\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
    "  </td>\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
    "  </td>\n",
    "      <td>\n",
    "    <a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/genome.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xHxb-dlhMIzW"
   },
   "source": [
    "## Overview\n",
    "\n",
    "This tutorial demonstrates the `tfio.genome` package, which provides commonly used genomics IO functionality: reading several genomics file formats and performing common data preparation operations (for example, one-hot encoding sequences or converting Phred quality scores into probabilities).\n",
    "\n",
    "This package uses the [Google Nucleus](https://github.com/google/nucleus) library to provide some of the core functionality."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MUXex9ctTuDB"
   },
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:36.379794Z",
     "iopub.status.busy": "2020-10-27T16:23:36.376808Z",
     "iopub.status.idle": "2020-10-27T16:23:38.755863Z",
     "shell.execute_reply": "2020-10-27T16:23:38.756317Z"
    },
    "id": "IqR2PQG4ZaZ0"
   },
   "outputs": [],
   "source": [
    "try:\n",
    "  %tensorflow_version 2.x\n",
    "except Exception:\n",
    "  pass\n",
    "!pip install -q tensorflow-io"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:38.760878Z",
     "iopub.status.busy": "2020-10-27T16:23:38.760213Z",
     "iopub.status.idle": "2020-10-27T16:23:45.683755Z",
     "shell.execute_reply": "2020-10-27T16:23:45.683162Z"
    },
    "id": "bkF2WtCMaJ-3"
   },
   "outputs": [],
   "source": [
    "import tensorflow_io as tfio\n",
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6wkjlql3cOy0"
   },
   "source": [
    "## FASTQ Data\n",
    "FASTQ is a common genomics file format that stores both sequence information and base quality information.\n",
    "\n",
    "First, let's download a sample `fastq` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:45.695050Z",
     "iopub.status.busy": "2020-10-27T16:23:45.694289Z",
     "iopub.status.idle": "2020-10-27T16:23:46.017771Z",
     "shell.execute_reply": "2020-10-27T16:23:46.017107Z"
    },
    "id": "yASvppCxceBu"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\r\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\r\n",
      "\r",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "100   407  100   407    0     0   2035      0 --:--:-- --:--:-- --:--:--  2035\r\n"
     ]
    }
   ],
   "source": [
    "# Download some sample data:\n",
    "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq"
   ]
  },
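  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In its common unwrapped form, each FASTQ record spans four lines: an identifier line starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below groups such lines into records in plain Python to illustrate that layout. It is not how `tfio.genome` reads files, and `parse_fastq`, `fastq_lines`, and the read names are hypothetical.\n",
    "\n",
    "```python\n",
    "# Hypothetical FASTQ lines for illustration; the read names are made up.\n",
    "fastq_lines = [\n",
    "    '@read1', 'GATTACA', '+', 'BB>B@FA',\n",
    "    '@read2', 'CGG', '+', 'FAD',\n",
    "]\n",
    "\n",
    "def parse_fastq(lines):\n",
    "    # Group every four lines into an (identifier, sequence, quality) record.\n",
    "    records = []\n",
    "    for start in range(0, len(lines), 4):\n",
    "        header, sequence, separator, quality = lines[start:start + 4]\n",
    "        if not header.startswith('@') or not separator.startswith('+'):\n",
    "            raise ValueError('Malformed FASTQ record at line %d' % (start + 1))\n",
    "        if len(sequence) != len(quality):\n",
    "            raise ValueError('Sequence and quality lengths differ')\n",
    "        records.append((header[1:], sequence, quality))\n",
    "    return records\n",
    "\n",
    "records = parse_fastq(fastq_lines)\n",
    "for name, sequence, quality in records:\n",
    "    print(name, sequence, quality)\n",
    "```\n",
    "\n",
    "The sketch assumes that sequences are not wrapped across multiple lines; wrapped files need a more careful parser."
   ]
  },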
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3zekWXlVdprb"
   },
   "source": [
    "### Read FASTQ Data\n",
    "Now, let's use `tfio.genome.read_fastq` to read this file (note that a `tf.data` API is coming soon)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:46.177995Z",
     "iopub.status.busy": "2020-10-27T16:23:46.177197Z",
     "iopub.status.idle": "2020-10-27T16:23:46.182018Z",
     "shell.execute_reply": "2020-10-27T16:23:46.181520Z"
    },
    "id": "vl761cHTc7N1"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tf.Tensor(\n",
      "[b'GATTACA'\n",
      " b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'\n",
      " b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)\n",
      "tf.Tensor(\n",
      "[b'BB>B@FA'\n",
      " b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'\n",
      " b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' b'FAD'], shape=(4,), dtype=string)\n"
     ]
    }
   ],
   "source": [
    "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n",
    "print(fastq_data.sequences)\n",
    "print(fastq_data.raw_quality)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qxHjVKXzdx5W"
   },
   "source": [
    "As you can see, the returned `fastq_data` has two fields: `fastq_data.sequences`, a string tensor of all sequences in the FASTQ file (each of which can be a different length), and `fastq_data.raw_quality`, a string tensor of Phred-encoded quality information for each base read in the sequence.\n",
    "\n",
    "### Quality\n",
    "If you need probabilities rather than encoded characters, you can use a helper op, `tfio.genome.phred_sequences_to_probability`, to convert this quality information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:46.186813Z",
     "iopub.status.busy": "2020-10-27T16:23:46.186186Z",
     "iopub.status.idle": "2020-10-27T16:23:46.445815Z",
     "shell.execute_reply": "2020-10-27T16:23:46.445308Z"
    },
    "id": "6IYxfFI4eQTM"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Use fn_output_signature instead\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4, None, 1)\n",
      "[ 7 68 46  3]\n",
      "<tf.RaggedTensor [[[0.0005011872854083776], [0.0005011872854083776], [0.0012589251855388284], [0.0005011872854083776], [0.0007943279924802482], [0.00019952621369156986], [0.0006309572490863502]], [[0.0006309572490863502], [0.0006309572490863502], [0.0006309572490863502], [0.0006309572490863502], [0.0006309572490863502], [0.0005011872854083776], [0.00019952621369156986], [0.0007943279924802482], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.0003162277571391314], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952621369156986], [0.00019952621369156986], [0.0001584893325343728], [0.00019952621369156986], [0.0001584893325343728], [0.00012589251855388284], [0.0005011872854083776], [0.00019952621369156986], [0.0005011872854083776], [0.00019952621369156986], [0.0005011872854083776], [0.00019952621369156986], [0.0006309572490863502], [0.0005011872854083776], [0.0005011872854083776], [0.0005011872854083776], [0.00012589251855388284], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00019952621369156986], [0.00012589251855388284], [0.00012589251855388284], [0.0003981070767622441], [0.0002511885541025549], [0.00019952621369156986], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.0010000000474974513], [0.00019952621369156986], [0.0001584893325343728], [0.00019952621369156986], [0.00019952621369156986], [0.00012589251855388284], [0.0002511885541025549], [0.0003162277571391314], [0.0001584893325343728], [0.015848929062485695], [0.0002511885541025549], [0.00019952621369156986], [0.0001584893325343728], [0.0001584893325343728], [0.0001584893325343728], [0.00012589251855388284], [0.0002511885541025549], [0.0001584893325343728], [0.00012589251855388284], [0.0001584893325343728]], [[0.00019952621369156986], [0.0006309572490863502], [0.00019952621369156986], [0.0006309572490863502], [0.00019952621369156986], [0.002511885715648532], [0.00019952621369156986], [0.03981072083115578], [0.003981071058660746], [0.002511885715648532], [0.050118714570999146], [0.003162277629598975], [0.03981072083115578], [0.002511885715648532], [0.003981071058660746], [0.003981071058660746], [0.003981071058660746], [0.0005011872854083776], [0.03981072083115578], [0.003981071058660746], [0.0006309572490863502], [0.050118714570999146], [0.0003162277571391314], [0.00019952621369156986], [0.00019952621369156986], [0.00019952621369156986], [0.002511885715648532], [0.06309572607278824], [0.06309572607278824], [0.0012589251855388284], [0.050118714570999146], [0.0006309572490863502], [0.0006309572490863502], [0.0005011872854083776], [0.03981072083115578], [0.00019952621369156986], [0.0003981070767622441], [0.002511885715648532], [0.003981071058660746], [0.06309572607278824], [0.0007943279924802482], [0.06309572607278824], [0.001584893325343728], [0.002511885715648532], [0.001584893325343728], [0.050118714570999146]], [[0.00019952621369156986], [0.0006309572490863502], [0.0003162277571391314]]]>\n"
     ]
    }
   ],
   "source": [
    "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n",
    "print(quality.shape)\n",
    "print(quality.row_lengths().numpy())\n",
    "print(quality)"
   ]
  },
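  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The printed probabilities are consistent with the standard Phred relation $P = 10^{-Q/10}$, where $Q$ is the ASCII code of the quality character minus 33 (for example, `B` is ASCII 66, so $Q = 33$ and $P \\approx 0.0005$). A minimal pure-Python sketch of that arithmetic follows, assuming Phred+33 encoding. It is not the library's implementation, and `phred_to_probability` and `PHRED_OFFSET` are hypothetical names.\n",
    "\n",
    "```python\n",
    "# Assumption: Phred+33 encoding, so Q = ord(character) - 33.\n",
    "PHRED_OFFSET = 33\n",
    "\n",
    "def phred_to_probability(quality_string):\n",
    "    # Per-base error probability: P = 10 ** (-Q / 10).\n",
    "    return [10 ** (-(ord(ch) - PHRED_OFFSET) / 10) for ch in quality_string]\n",
    "\n",
    "# Quality string of the first read in the sample file.\n",
    "probabilities = phred_to_probability('BB>B@FA')\n",
    "print(probabilities)\n",
    "```\n",
    "\n",
    "Up to float32 rounding, these values match the first row of the ragged tensor printed above."
   ]
  },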
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bg3wzTFzhcfS"
   },
   "source": [
    "### One-hot encodings\n",
    "You may also want to encode the genome sequence data (which consists of `A`, `T`, `C`, and `G` bases) using a one-hot encoder. There is a built-in operation that can help with this.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:46.454019Z",
     "iopub.status.busy": "2020-10-27T16:23:46.453342Z",
     "iopub.status.idle": "2020-10-27T16:23:47.051270Z",
     "shell.execute_reply": "2020-10-27T16:23:47.050657Z"
    },
    "id": "oAiepmy8h32a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<tf.RaggedTensor [[[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]]>\n",
      "(4, None, 4)\n"
     ]
    }
   ],
   "source": [
    "one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)\n",
    "print(one_hot)\n",
    "print(one_hot.shape)"
   ]
  },
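  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mapping applied by `tfio.genome.sequences_to_onehot` is listed in its docstring, which the next cell prints. As a rough pure-Python sketch of that same mapping (not the op's actual implementation; `sequence_to_onehot`, `ONE_HOT`, and `ERROR_ENCODING` are hypothetical names):\n",
    "\n",
    "```python\n",
    "# Nucleotide mapping as documented in the op's docstring.\n",
    "ONE_HOT = {\n",
    "    'A': [1, 0, 0, 0],\n",
    "    'C': [0, 1, 0, 0],\n",
    "    'G': [0, 0, 1, 0],\n",
    "    'T': [0, 0, 0, 1],\n",
    "}\n",
    "# Per the docstring, any other character maps to an error encoding.\n",
    "ERROR_ENCODING = [1, 1, 1, 1]\n",
    "\n",
    "def sequence_to_onehot(sequence):\n",
    "    # Encode one DNA string as a list of four-element rows, one per base.\n",
    "    return [ONE_HOT.get(base, ERROR_ENCODING) for base in sequence]\n",
    "\n",
    "encoded = sequence_to_onehot('GATTACA')\n",
    "print(encoded)\n",
    "```\n",
    "\n",
    "For `GATTACA` this yields the same rows as the first entry of the ragged tensor printed above."
   ]
  },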
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-10-27T16:23:47.055452Z",
     "iopub.status.busy": "2020-10-27T16:23:47.054797Z",
     "iopub.status.idle": "2020-10-27T16:23:47.057483Z",
     "shell.execute_reply": "2020-10-27T16:23:47.056863Z"
    },
    "id": "oAiepmy8h32a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Convert DNA sequences into a one hot nucleotide encoding.\n",
      "\n",
      "    Each nucleotide in each sequence is mapped as follows:\n",
      "    A -> [1, 0, 0, 0]\n",
      "    C -> [0, 1, 0, 0]\n",
      "    G -> [0 ,0 ,1, 0]\n",
      "    T -> [0, 0, 0, 1]\n",
      "\n",
      "    If for some reason a non (A, T, C, G) character exists in the string, it is\n",
      "    currently mapped to a error one hot encoding [1, 1, 1, 1].\n",
      "\n",
      "    Args:\n",
      "        sequences: A tf.string tensor where each string represents a DNA sequence\n",
      "\n",
      "    Returns:\n",
      "        tf.RaggedTensor: The output sequences with nucleotides one hot encoded.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(tfio.genome.sequences_to_onehot.__doc__)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "Tce3stUlHN0L"
   ],
   "name": "genome.ipynb",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}