{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2020-11-12T02:21:41.840037Z", "iopub.status.busy": "2020-11-12T02:21:41.839346Z", "iopub.status.idle": "2020-11-12T02:21:41.841868Z", "shell.execute_reply": "2020-11-12T02:21:41.841315Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## Overview\n", "\n", "This tutorial demonstrates the `tfio.genome` package, which provides commonly used genomics IO functionality: it reads several genomics file formats and provides some common operations for preparing the data (for example, one-hot encoding or parsing Phred quality scores into probabilities).\n", "\n", "This package uses the [Google Nucleus](https://github.com/google/nucleus) library to provide some of the core functionality. " ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:41.847511Z", "iopub.status.busy": "2020-11-12T02:21:41.846857Z", "iopub.status.idle": "2020-11-12T02:21:44.222747Z", "shell.execute_reply": "2020-11-12T02:21:44.223228Z" }, "id": "IqR2PQG4ZaZ0" }, "outputs": [], "source": [ "try:\n", " %tensorflow_version 2.x\n", "except Exception:\n", " pass\n", "!pip install -q tensorflow-io" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:44.229099Z", "iopub.status.busy": "2020-11-12T02:21:44.227856Z", "iopub.status.idle": "2020-11-12T02:21:50.848935Z", "shell.execute_reply": "2020-11-12T02:21:50.848300Z" }, "id": "bkF2WtCMaJ-3" }, "outputs": [], "source": [ "import tensorflow_io as tfio\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "id": "6wkjlql3cOy0" }, "source": [ "## FASTQ Data\n", "\n", "FASTQ is a common genomics file format that stores both sequence information and base quality information for each read.\n", "\n", "First, let's download a sample `fastq` file." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:50.859206Z", "iopub.status.busy": "2020-11-12T02:21:50.853524Z", "iopub.status.idle": "2020-11-12T02:21:51.388696Z", "shell.execute_reply": "2020-11-12T02:21:51.388048Z" }, "id": "yASvppCxceBu" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\r\n", " Dload Upload Total Spent Left Speed\r\n", "\r", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "100 407 100 407 0 0 1000 0 --:--:-- --:--:-- --:--:-- 1000\r\n" ] } ], "source": [ "# Download some sample data:\n", "!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq" ] }, { "cell_type": "markdown", "metadata": { "id": "3zekWXlVdprb" }, "source": [ "### Read FASTQ Data\n", "\n", "Now, let's use `tfio.genome.read_fastq` to read this file (a `tf.data` API is coming soon)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:52.791479Z", "iopub.status.busy": "2020-11-12T02:21:52.790660Z", "iopub.status.idle": "2020-11-12T02:21:52.794558Z", "shell.execute_reply": "2020-11-12T02:21:52.794012Z" }, "id": "vl761cHTc7N1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[b'GATTACA'\n", " b'CGTTAGCGCAGGGGGCATCTTCACACTGGTGACAGGTAACCGCCGTAGTAAAGGTTCCGCCTTTCACT'\n", " b'CGGCTGGTCAGGCTGACATCGCCGCCGGCCTGCAGCGAGCCGCTGC' b'CGG'], shape=(4,), dtype=string)\n", "tf.Tensor(\n", "[b'BB>B@FA'\n", " b'AAAAABF@BBBDGGGG?FFGFGHBFBFBFABBBHGGGFHHCEFGGGGG?FGFFHEDG3EFGGGHEGHG'\n", " b'FAFAF;F/9;.:/;999B/9A.DFFF;-->.AAB/FC;9-@-=;=.' 
b'FAD'], shape=(4,), dtype=string)\n" ] } ], "source": [ "fastq_data = tfio.genome.read_fastq(filename=\"test.fastq\")\n", "print(fastq_data.sequences)\n", "print(fastq_data.raw_quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "qxHjVKXzdx5W" }, "source": [ "As you can see, the returned `fastq_data` has `fastq_data.sequences`, a string tensor of all sequences in the fastq file (each of which can be a different length), along with `fastq_data.raw_quality`, which holds the Phred-encoded quality of each base read in each sequence.\n", "\n", "### Quality\n", "\n", "If you are interested, you can use a helper op to convert this quality information into probabilities. By the standard Phred definition, a quality score Q corresponds to a base-calling error probability of 10^(-Q/10)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:52.799467Z", "iopub.status.busy": "2020-11-12T02:21:52.798765Z", "iopub.status.idle": "2020-11-12T02:21:53.389496Z", "shell.execute_reply": "2020-11-12T02:21:53.388811Z" }, "id": "6IYxfFI4eQTM" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py:574: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use fn_output_signature instead\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(4, None, 1)\n", "[ 7 68 46 3]\n", "\n" ] } ], "source": [ "quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)\n", "print(quality.shape)\n", "print(quality.row_lengths().numpy())\n", "print(quality)" ] }, { "cell_type": "markdown", "metadata": { "id": "bg3wzTFzhcfS" }, "source": [ "### One-Hot Encoding\n", "\n", "You may also want to encode the genome sequence data (which consists of the bases `A`, `T`, `C`, and `G`) using a one-hot encoder. 
There is a built-in operation that can help with this.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2020-11-12T02:21:53.394412Z", "iopub.status.busy": "2020-11-12T02:21:53.393482Z", "iopub.status.idle": "2020-11-12T02:21:53.396443Z", "shell.execute_reply": "2020-11-12T02:21:53.396914Z" }, "id": "oAiepmy8h32a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Convert DNA sequences into a one hot nucleotide encoding.\n", "\n", " Each nucleotide in each sequence is mapped as follows:\n", " A -> [1, 0, 0, 0]\n", " C -> [0, 1, 0, 0]\n", " G -> [0 ,0 ,1, 0]\n", " T -> [0, 0, 0, 1]\n", "\n", " If for some reason a non (A, T, C, G) character exists in the string, it is\n", " currently mapped to a error one hot encoding [1, 1, 1, 1].\n", "\n", " Args:\n", " sequences: A tf.string tensor where each string represents a DNA sequence\n", "\n", " Returns:\n", " tf.RaggedTensor: The output sequences with nucleotides one hot encoded.\n", " \n" ] } ], "source": [ "print(tfio.genome.sequences_to_onehot.__doc__)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "Tce3stUlHN0L" ], "name": "genome.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 0 }