{ "cells": [ { "cell_type": "markdown", "id": "68e3537d", "metadata": {}, "source": [ " # GEO data processing\n", "\n", "![](./images/Module1/Data_Processing.jpg)\n", "\n", "## Learning Objectives:\n", "1. Understand the type of data that is accessible from GEO.\n", "2. Demonstrate how to navigate the GEO website, search for dataset using accession number, and select samples.\n", "3. Use command-line to download data from GEO.\n", "4. Perform data pre-processing, normalization, loading and saving expression matrix." ] }, { "cell_type": "markdown", "id": "5ae77e2e", "metadata": {}, "source": [ "## Accessing public GEO sequencing data\n", "The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes comprehensive sets of microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to data storage, the website provides set of web-based tools and applications to help users query and download the studies and gene expression patterns stored in GEO.\n", "### Searching and accessing data on GEO\n", "Searching for GEO is relatively straight forward. First, users need to navigate to\n", " https://www.ncbi.nlm.nih.gov/geo/.\n", "There are multiple ways to search datasets but the simplest way is to provide the accession number in the search box. We will use an example dataset with the accession number\n", "GSE48350 for demonstration.\n", "The GEO website interface with searching procedure of the example dataset is shown in the figure below:\n", "![](./images/Module1/GEO_Website.png)\n", "\n", "When the searching process is done, a webpage with detailed record of the example dataset such as published date, title, organism, experiment type, dataset summary etc. will be shown in the figure below:\n", "\n", "![](./images/Module1/GEO_Dataset_Page.png)\n", "\n", "We can see that the dataset GSE48350 was published in Apr 21, 2014 which focuses on Human Alzheimer's\n", "Disease using microarray sequencing technology. Samples were primary collected from 4 brain regions:\n", "hippocampus (HC), entorhinal cortex (EC), superior frontal cortex (SCG), post-central gyrus (PCG)\n", "of normal and disease patients.\n", "\n", "To display the quiz in all the learning sub-modules, it is necessary to install the `IRdisplay` package.\n", "This package allow quizzes written in `html` format to show up in the notebook. User can install the `IRdisplay` using the following\n", "command\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [] }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "suppressWarnings(if (!require(\"IRdisplay\")) install.packages(\"IRdisplay\"))\n", "suppressWarnings(library(IRdisplay))" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 2, "id": "4e7c1f11", "metadata": { "tags": [ "remove-input" ], "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Run the following command to take the quiz\n", "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### Downloading data using web-interface\n", "At the bottom of the dataset page, users will find additional information about the dataset such as sequencing platform, number of samples, project ID and links to download the expression data.\n", "The exact places to find this information are shown in the figure below:\n", "\n", "![](./images/Module1/Manual_Download.png)\n", "\n", "The dataset GSE48350 was sequenced using Affymetrix Human Genome U133 Plus 2.0 Array and the whole dataset contains 253 samples. The summary table at the bottom of the page show the all the data generated from the experiment. Users can click to \"(http)\" hyperlink to download all the samples or click \"(custom)\" to select and download the samples of interest. Note that, expression data download from this step is raw data and additional data processing needs to be done locally for further analysis." ] }, { "cell_type": "markdown", "id": "13964320", "metadata": {}, "source": [ "## Accessing GEO data using R command line\n", "### Downloading and pre-processing the data\n", "\n", "If users have a programming background, they can use R's command line to automate the download. Getting data from GEO is really quite easy using GEOquery R package available in Bioconductor. Before starting, user will need to install the GEOquery package using the following command." ] }, { "cell_type": "code", "execution_count": 4, "id": "77bf03b8", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning message:\n", "\"package 'BiocManager' was built under R version 4.2.2\"\n" ] } ], "source": [ "suppressMessages({if (!require(\"BiocManager\", quietly = TRUE))\n", " suppressWarnings(install.packages(\"BiocManager\"))\n", " suppressWarnings(BiocManager::install(\"GEOquery\", update = F))\n", "})" ] }, { "cell_type": "markdown", "id": "1624fe5a", "metadata": {}, "source": [ "We can use the `getGEO` function from the `GEOquery` package to download GEO dataset. First, users have to specify the accession ID of the dataset. For this demonstration, we will use the same dataset `GSE48350`." ] }, { "cell_type": "code", "execution_count": 5, "id": "6e56fd57-3a91-4715-962b-07f43324c519", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "suppressMessages(library(\"GEOquery\"))\n", "## change my_id to be the dataset that you want.\n", "accession_ID <- \"GSE48350\"\n", "suppressMessages({gse <- getGEO(accession_ID,GSEMatrix =TRUE, AnnotGPL=TRUE)})" ] }, { "cell_type": "markdown", "id": "968c77a6", "metadata": {}, "source": [ "Some datasets on GEO may be derived from different microarray platforms. Therefore the object `gse` can be a list of different datasets.\n", "You can find out how many platforms were used by checking the length of the `gse` object." ] }, { "cell_type": "code", "execution_count": 6, "id": "1d028647", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Number of platforms: 1\"\n" ] } ], "source": [ "## check how many platforms used\n", "if (length(gse) > 1) idx <- grep(\"GPL570\", attr(gse, \"names\")) else idx <- 1\n", "data <- gse[[idx]]\n", "print(paste0(\"Number of platforms: \",length(data)))" ] }, { "cell_type": "markdown", "id": "b459eea8", "metadata": {}, "source": [ "The result shows that we have only one dataset that belongs to microarray platform mentioned GEO dataset page.\n", "Next, we can access the samples information and genes using the command below:" ] }, { "cell_type": "code", "execution_count": 7, "id": "2c31ce74", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"The dataset contains 253 samples and 54675 genes\"\n" ] } ], "source": [ "# Get the samples information\n", "samples <- pData(data)\n", "# Get the genes information\n", "genes <- fData(data)\n", "# Check the number of samples and genes\n", "print(paste0(\"The dataset contains \", dim(samples)[1] ,\" samples and \",dim(genes)[1],\" genes\"))" ] }, { "cell_type": "code", "execution_count": 8, "id": "a058c4f6", "metadata": { "tags": [ "remove-input" ], "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Quiz_Submodule1-1\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", " \r\n", "\r\n", "\r\n", "\r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "
\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "
\r\n", "
\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "id": "56de1fc1", "metadata": {}, "source": [ "The ```samples``` contain the metadata of each sample such as title, status, GEO accession, submission data etc. while the ```genes```\n", "dataframe show us the probeID, title, gene symbol, gene ID, and many more. We can inspect the detail of the first several rows and columns\n", "of ```samples``` and ```genes``` dataframe using the following command:" ] }, { "cell_type": "code", "execution_count": 9, "id": "1fbcf2d2", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 5 × 4
titlegeo_accessionstatussubmission_date
<chr><chr><chr><chr>
GSM300166PostcentralGyrus_female_91yrs_indiv10 GSM300166Public on Oct 09 2008Jun 25 2008
GSM300167SuperiorFrontalGyrus_female_91yrs_indiv10GSM300167Public on Oct 09 2008Jun 25 2008
GSM300168Hippocampus_female_96yrs_indiv105 GSM300168Public on Oct 09 2008Jun 25 2008
GSM300169Hippocampus_male_82yrs_indiv106 GSM300169Public on Oct 09 2008Jun 25 2008
GSM300170Hippocampus_male_84yrs_indiv108 GSM300170Public on Oct 09 2008Jun 25 2008
\n" ], "text/latex": [ "A data.frame: 5 × 4\n", "\\begin{tabular}{r|llll}\n", " & title & geo\\_accession & status & submission\\_date\\\\\n", " & & & & \\\\\n", "\\hline\n", "\tGSM300166 & PostcentralGyrus\\_female\\_91yrs\\_indiv10 & GSM300166 & Public on Oct 09 2008 & Jun 25 2008\\\\\n", "\tGSM300167 & SuperiorFrontalGyrus\\_female\\_91yrs\\_indiv10 & GSM300167 & Public on Oct 09 2008 & Jun 25 2008\\\\\n", "\tGSM300168 & Hippocampus\\_female\\_96yrs\\_indiv105 & GSM300168 & Public on Oct 09 2008 & Jun 25 2008\\\\\n", "\tGSM300169 & Hippocampus\\_male\\_82yrs\\_indiv106 & GSM300169 & Public on Oct 09 2008 & Jun 25 2008\\\\\n", "\tGSM300170 & Hippocampus\\_male\\_84yrs\\_indiv108 & GSM300170 & Public on Oct 09 2008 & Jun 25 2008\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 5 × 4\n", "\n", "| | title <chr> | geo_accession <chr> | status <chr> | submission_date <chr> |\n", "|---|---|---|---|---|\n", "| GSM300166 | PostcentralGyrus_female_91yrs_indiv10 | GSM300166 | Public on Oct 09 2008 | Jun 25 2008 |\n", "| GSM300167 | SuperiorFrontalGyrus_female_91yrs_indiv10 | GSM300167 | Public on Oct 09 2008 | Jun 25 2008 |\n", "| GSM300168 | Hippocampus_female_96yrs_indiv105 | GSM300168 | Public on Oct 09 2008 | Jun 25 2008 |\n", "| GSM300169 | Hippocampus_male_82yrs_indiv106 | GSM300169 | Public on Oct 09 2008 | Jun 25 2008 |\n", "| GSM300170 | Hippocampus_male_84yrs_indiv108 | GSM300170 | Public on Oct 09 2008 | Jun 25 2008 |\n", "\n" ], "text/plain": [ " title geo_accession\n", "GSM300166 PostcentralGyrus_female_91yrs_indiv10 GSM300166 \n", "GSM300167 SuperiorFrontalGyrus_female_91yrs_indiv10 GSM300167 \n", "GSM300168 Hippocampus_female_96yrs_indiv105 GSM300168 \n", "GSM300169 Hippocampus_male_82yrs_indiv106 GSM300169 \n", "GSM300170 Hippocampus_male_84yrs_indiv108 GSM300170 \n", " status submission_date\n", "GSM300166 Public on Oct 09 2008 Jun 25 2008 \n", "GSM300167 Public on Oct 09 2008 Jun 25 2008 \n", "GSM300168 Public on Oct 09 2008 Jun 25 2008 \n", "GSM300169 Public on Oct 09 2008 Jun 25 2008 \n", "GSM300170 Public on Oct 09 2008 Jun 25 2008 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 5 × 5
IDGene titleGene symbolGene IDUniGene title
<chr><chr><chr><chr><chr>
1007_s_at1007_s_atmicroRNA 4640///discoidin domain receptor tyrosine kinase 1MIR4640///DDR1100616237///780
1053_at1053_at replication factor C subunit 2 RFC2 5982
117_at117_at heat shock protein family A (Hsp70) member 6 HSPA6 3310
121_at121_at paired box 8 PAX8 7849
1255_g_at1255_g_atguanylate cyclase activator 1A GUCA1A 2978
\n" ], "text/latex": [ "A data.frame: 5 × 5\n", "\\begin{tabular}{r|lllll}\n", " & ID & Gene title & Gene symbol & Gene ID & UniGene title\\\\\n", " & & & & & \\\\\n", "\\hline\n", "\t1007\\_s\\_at & 1007\\_s\\_at & microRNA 4640///discoidin domain receptor tyrosine kinase 1 & MIR4640///DDR1 & 100616237///780 & \\\\\n", "\t1053\\_at & 1053\\_at & replication factor C subunit 2 & RFC2 & 5982 & \\\\\n", "\t117\\_at & 117\\_at & heat shock protein family A (Hsp70) member 6 & HSPA6 & 3310 & \\\\\n", "\t121\\_at & 121\\_at & paired box 8 & PAX8 & 7849 & \\\\\n", "\t1255\\_g\\_at & 1255\\_g\\_at & guanylate cyclase activator 1A & GUCA1A & 2978 & \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 5 × 5\n", "\n", "| | ID <chr> | Gene title <chr> | Gene symbol <chr> | Gene ID <chr> | UniGene title <chr> |\n", "|---|---|---|---|---|---|\n", "| 1007_s_at | 1007_s_at | microRNA 4640///discoidin domain receptor tyrosine kinase 1 | MIR4640///DDR1 | 100616237///780 | |\n", "| 1053_at | 1053_at | replication factor C subunit 2 | RFC2 | 5982 | |\n", "| 117_at | 117_at | heat shock protein family A (Hsp70) member 6 | HSPA6 | 3310 | |\n", "| 121_at | 121_at | paired box 8 | PAX8 | 7849 | |\n", "| 1255_g_at | 1255_g_at | guanylate cyclase activator 1A | GUCA1A | 2978 | |\n", "\n" ], "text/plain": [ " ID Gene title \n", "1007_s_at 1007_s_at microRNA 4640///discoidin domain receptor tyrosine kinase 1\n", "1053_at 1053_at replication factor C subunit 2 \n", "117_at 117_at heat shock protein family A (Hsp70) member 6 \n", "121_at 121_at paired box 8 \n", "1255_g_at 1255_g_at guanylate cyclase activator 1A \n", " Gene symbol Gene ID UniGene title\n", "1007_s_at MIR4640///DDR1 100616237///780 \n", "1053_at RFC2 5982 \n", "117_at HSPA6 3310 \n", "121_at PAX8 7849 \n", "1255_g_at GUCA1A 2978 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "samples[1:5,1:4]\n", "genes[1:5,1:5]" ] }, { "cell_type": "markdown", "id": "0983f7f4", "metadata": {}, "source": [ "To inspect the expression data of the first few rows and samples, we can use this command" ] }, { "cell_type": "code", "execution_count": 10, "id": "b86fd83b", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A matrix: 6 × 5 of type dbl
GSM300166GSM300167GSM300168GSM300169GSM300170
1007_s_at0.88801621.43551851.60960151.7549601.820730
1053_at0.66646040.88585211.85907771.0366661.421393
117_at0.85963841.06202983.09739452.2436555.060301
121_at0.97514951.05074480.98228381.1982371.039529
1255_g_at0.49125470.53752543.17964461.5142902.185801
1294_at0.83442680.92918901.51936941.5570891.641263
\n" ], "text/latex": [ "A matrix: 6 × 5 of type dbl\n", "\\begin{tabular}{r|lllll}\n", " & GSM300166 & GSM300167 & GSM300168 & GSM300169 & GSM300170\\\\\n", "\\hline\n", "\t1007\\_s\\_at & 0.8880162 & 1.4355185 & 1.6096015 & 1.754960 & 1.820730\\\\\n", "\t1053\\_at & 0.6664604 & 0.8858521 & 1.8590777 & 1.036666 & 1.421393\\\\\n", "\t117\\_at & 0.8596384 & 1.0620298 & 3.0973945 & 2.243655 & 5.060301\\\\\n", "\t121\\_at & 0.9751495 & 1.0507448 & 0.9822838 & 1.198237 & 1.039529\\\\\n", "\t1255\\_g\\_at & 0.4912547 & 0.5375254 & 3.1796446 & 1.514290 & 2.185801\\\\\n", "\t1294\\_at & 0.8344268 & 0.9291890 & 1.5193694 & 1.557089 & 1.641263\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A matrix: 6 × 5 of type dbl\n", "\n", "| | GSM300166 | GSM300167 | GSM300168 | GSM300169 | GSM300170 |\n", "|---|---|---|---|---|---|\n", "| 1007_s_at | 0.8880162 | 1.4355185 | 1.6096015 | 1.754960 | 1.820730 |\n", "| 1053_at | 0.6664604 | 0.8858521 | 1.8590777 | 1.036666 | 1.421393 |\n", "| 117_at | 0.8596384 | 1.0620298 | 3.0973945 | 2.243655 | 5.060301 |\n", "| 121_at | 0.9751495 | 1.0507448 | 0.9822838 | 1.198237 | 1.039529 |\n", "| 1255_g_at | 0.4912547 | 0.5375254 | 3.1796446 | 1.514290 | 2.185801 |\n", "| 1294_at | 0.8344268 | 0.9291890 | 1.5193694 | 1.557089 | 1.641263 |\n", "\n" ], "text/plain": [ " GSM300166 GSM300167 GSM300168 GSM300169 GSM300170\n", "1007_s_at 0.8880162 1.4355185 1.6096015 1.754960 1.820730 \n", "1053_at 0.6664604 0.8858521 1.8590777 1.036666 1.421393 \n", "117_at 0.8596384 1.0620298 3.0973945 2.243655 5.060301 \n", "121_at 0.9751495 1.0507448 0.9822838 1.198237 1.039529 \n", "1255_g_at 0.4912547 0.5375254 3.1796446 1.514290 2.185801 \n", "1294_at 0.8344268 0.9291890 1.5193694 1.557089 1.641263 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(exprs(data)[,1:5])" ] }, { "cell_type": "markdown", "id": "dcc74770", "metadata": {}, "source": [ "The `summary` function can then be used to print the distributions of each sample and\n", "the `range` function can help to check for the expression value range." ] }, { "cell_type": "code", "execution_count": 11, "id": "27bce30c", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/plain": [ " GSM300166 GSM300167 GSM300168 GSM300169 \n", " Min. : 0.0100 Min. : 0.0100 Min. : 0.0100 Min. : 0.01942 \n", " 1st Qu.: 0.6366 1st Qu.: 0.6519 1st Qu.: 0.3342 1st Qu.: 0.32045 \n", " Median : 0.8622 Median : 0.8828 Median : 0.7756 Median : 0.75063 \n", " Mean : 0.9193 Mean : 0.9492 Mean : 0.9274 Mean : 0.87928 \n", " 3rd Qu.: 1.0903 3rd Qu.: 1.0880 3rd Qu.: 1.1929 3rd Qu.: 1.15987 \n", " Max. :335.6746 Max. :531.7116 Max. :99.4657 Max. :26.46015 \n", " GSM300170 \n", " Min. : 0.01098 \n", " 1st Qu.: 0.32324 \n", " Median : 0.77234 \n", " Mean : 0.93768 \n", " 3rd Qu.: 1.21978 \n", " Max. :83.41583 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "
  1. 0.01
  2. 1438.7626
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 0.01\n", "\\item 1438.7626\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 0.01\n", "2. 1438.7626\n", "\n", "\n" ], "text/plain": [ "[1] 0.010 1438.763" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summary(exprs(data)[,1:5])\n", "range(exprs(data))\n" ] }, { "cell_type": "markdown", "id": "96aafb5f", "metadata": {}, "source": [ "From the summary of the data above, we clearly see that the maximum expression values can be in the scale of thousands while the average expression values in each sample are below one. Therefore, pre-processing and normalization are required for quality assurance. One common step is to perform quartile filtering to remove the outlier and missing expression values. Also, we will need to perform a $log_2$ transformation to ensure the distributions of all samples are similar. Then, a `boxplot` can also be generated to see if the data have been correctly normalized. We can use the sample code below to perform all of those steps." ] }, { "cell_type": "code", "execution_count": 12, "id": "29bda34b", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA0gAAANICAMAAADKOT/pAAAAM1BMVEUAAABNTU1oaGh8fHyMjIyampqnp6eysrK9vb3Hx8fQ0NDT09PZ2dnh4eHp6enw8PD///8uNL8wAAAACXBIWXMAABJ0AAASdAHeZh94AAAgAElEQVR4nO2d7WLaPLNFzUcIyQmB+7/aA4akyvM6dZG2tTXWWn/q1tkZeTwLsCHNcAGAYgb3AgDWACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAQAWRBoBgZEy5XhxDCQAliAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhApPMr/IQpyQaRVQeNcINKqoHEuEGlV0DgXiLQqchrHNZYCRIIEGp8LIkECjc8FkSCBxueCSKuitHE0PhdEWhU0zgUirQoa5wKRVgWNc4FIq4JrJBeIBAk0PhdEggQanwsiQQKNzwWRimnps2pcI7lAJCnuhbvr9wsiSXEv3F2/XxBJinvh7vr9gkhS3AvnGskFIkECjc8FkSCBxueCSJBA43NBJCnu/zPBcY3U0vtoPhBJivti3904d30fiCTFLYK7ce76PhBJilsEd+Pc9X0gkhS3CNHzcUEkSKDxuSASJND4XBAJEmh8LogkxX2NET0fF0SS4h5Ed+Pc9X0gkhS3CO7Guev7QCQpbhHcjXPX94FIUtwiRM/HBZEggcbngkiQQONzQSRIoPG5IJIU9zVG9HxcEEmKexDdjXPX94FIUtwiuBvnru8DkaS4RXA3zl3fByJJcYsQPR8XRIIEGp8LIkECjc8FkSCBxueCSFLc1xjR83FBJCnuQXQ3zl3fByJJcYvgbpy7vg9EkuIWwd04d30fiCTFLUL0fFwQCRJofC6IBAk0PpeqIn287sdf97E/fCxVAoqg8blUFOm8TX51zm6REnbc1xjR83GpKNJh2Lydxq3P981wWKKEHfcguhvnru+jokib4fS9fRo2S5Sw4xbB3Th3fR8VRfrxqxD//nsRw54Ptwjuxrnr++AZSYpbhOj5uNS9Rnr/HLfWe40UHRqfS83b37vkrt32vEgJKIPG51L3faTD+D7SZv/K+0htQuNz4ZMNUtzXGNHzcUEkKe5BdDfOXd8HIklxi+BunLu+D5dIvI/UZL4Ud30f7Yg0pChKOHCLED0fF17aQQKNzwWRIIHG54JIkEDjc6kp0vllGHbvj2/CzQbya6LmD/Zt7j8ee/8miNRivhR3fR9VP7R6vNp03Iw/HItITeZLcdf3UfXHKMY/PjfbT0RqNF+Ku74Pww/2nXc7RCK/MiqKtB2+fnRiu1urSNGh8blUFOk4vDy2PocdIjUJjc+l5u3vw7c97zOfAuJ8mqDxuVR9Q/a0/9r6fFmnSO5rjOj5uPDJBinuQXQ3zl3fByJJcYvgbpy7vg9EkuIWwd04d30fiCTFLUL0fFwQCRJofC6IBAk0PhdEggQanwsiSXFfY0TPxwWRpLgH0d04d30fiCTFLYK7ce76PhBJilsEd+Pc9X0gkhS3CNHzcUEkSKDxuSASJND4XBAJEmh8LogkxX2NET0fF0SS4h5Ed+Pc9X0gkhS3CO7Guev7QCQpbhHcjXPX94FIUtwiRM/HBZEggcbngkiQQONzQSRIoPG5IJIU9zVG9HxcEEmKexDdjXPX94FIUtwiuBvnru8DkaS4RXA3zl3fByJJcYsQPR8XRIIEGp8LIkECjc8FkSCBxueCSFLc1xjR83FBJCnuQXQ3zl3fByJJcYvgbpy7vg9EkuIWwd04d30fiCTFLUL0fFwQCRJofC6IBAk0PhdEggQanwsiSXFfY0TPxwWRpLgH0d04d30fiCTFLYK7ce76PhBJilsEd+Pc9X0gkhS3CNHzcUEkSKDxuSASJND4XBAJEmh8LogkxX2NET0fF0SS4h5Ed+Pc9X0gkhS3CO7Guev7QCQpbhHcjXPX94FIUtwiRM/HBZEggcbngkiQQONzQSRIoPG5IJIU9zVG9HxcEEmKexDdjXPX94FIUtwiuBvnru8DkaS4RXA3zl3fByJJcYsQPR8XRIIEGp8LIkECjc8FkSCBxueCSFLc1xjR83FBJCnuQXQ3zl3fByJJcYvgbpy7vg9EkuIWwd04d30fiCTFLUL0fFwQCRJofC6IBAk0PhdEggQanwsiSXFfY0TPxwWRpLgH0d04d30fiCTFLYK7ce76PhBJilsEd+Pc9X0gkhS3CNHzcUEkSKDxuSASJND4XBAJEmh8LogkxX2NET3vYPiVp75LRuHnIw2WWAb3ILob565fSv76EUmKWwR349z1S0GkRnCL4G6cu34piNQIbhGi590gEkig8bkgEiTQ+FwQCRJofC6IJMV9jRE974ZrpEZwD6K7ce76pSBSI7hFcDfOXb8URGoEtwjuxrnrl9K1SJrPSmlwixA976ZrkSRBGKF/uSASJDj619IrinwQCRLc/XPXzweRpLivMXrPl8I1UiO4B8nduH7Xj0jS1+juQYo7iJp8KYjUSOHo+VLc9UtBpNIg+SbybhCpNCjKR6f348+nvkjH7TDs3xcqgUhluI8/p34b70NVFOl+YLv7MR4WKWEfhOi4+xf3gbC2SIfhcL5cPg/DcYkS3Q8CeVe+tkib4XzbPg/bJUrYiTsIGtzr70ekrxeuf38By+1vU74U9/r7EenlS6TNEiWKiXsiNflS3PVLCSLS/vX4PrxdN8+Hv99tCHvXjnxsgoj0fVNyGDbnJUp0Pwil9H78+dR8H+l0Oh73+/GWw+GvHiGSC/fxxz1/fLIBEtz9QyQtYUVyDwJ5V35lIrmJOwga3OvvTyTeR2oyX4p7/Yh0udg+bvhjEZ3nS3HXLyWeSAuVcA8i+dggUmlQlI9O78efDyJJ89FxH3/c84dIkODuHyJpCSuSexDIu/IrE8lN3EHQ4F5/FyI98QP13P425Utxr78LkY6I1Hy+FHf9UkKIdDltdkuXcA8i+djEEOlymvnPg8pL9D4IpfR+/PnUvdlwHE7LlkCkMtzHH/f8reyunXsQouPuHyJpCSuSexDIu/IrE8lN3EHQ4F4/IlUvsUzh6PlS3OtHpOollikcPV+Ku34piFQaJN9E3g0ilQZF+ej0fvz5IJI0Hx338cc9f4gECe7+IZKWsCK5B4G8K78ykdzEHQQN7vUjUvUSyxSOni/FvX5Eql5imcLR86W465eCSKVB8k3k3SBSaVCUj07vx58PIknz0XEff9zzh0iQ4O4fImkJK5J7EMi78isTyU3cQdDgXj8iVS+xTOHo+VLc60ek6iWWKRw9X4q7fimIVBok30TeDSKVBkX56PR+/PkgkjQfHffxxz1/iAQJ7v4hkpawIrkHgbwrvzKR3MQdBA3u9SNS9RLLFI6eL8W9fkSqXmKZwtHzpbjrl4JIpUHyTeTdIFJpUJSPTu/Hnw8iSfPRcR9/3POHSJDg7h8iaQkrknsQyLvyKxPJTdxB0OBePyJVL7FM4ej5UtzrR6TqJZYpHD1firt+KYhUGiTfRN4NIpUGRfno9H78+SCSNB8d9/HHPX+IBAnu/iGSlrAiuQeBvCu/MpHcxB0EDe71I1L1EssUjp4vxb1+RKpeYpnC0fOluOuXgkilQfJN5N0gUmlQlI9O78efDyJJ89FxH3/c84dIkODuHyJpCSuSexDIu/IrE8lN3EHQ4F4/IlUvsUzh6PlS3OtHpOollikcPV+Ku34piFQaJN9E3g0ilQZF+ej0fvz5IJI0Hx338cc9f4gECe7+IZKWsCK5B4G8K78ykdzEHQQN7vUjUvUSyxSOni/FvX5Eql5imcLR86W465eCSKVB8k3k3SBSaVCUj07vx58PIknz0XEff9zzh0iQ4O4fImkJK5J7EMi78isTyU3cQdDgXj8iVS+xTOHo+VLc60ek6iWWKRw9X4q7fimIVBok30TeDSKVBkX56PR+/PkgkjQfHffxxz1/iAQJ7v4hkpawIrkHgbwrvzKR3MQdBA3u9SNS9RLLFI6eL8W9fkSqXmKZwtHzpbjrl4JIpcFY+SEDZf3F8m4QqTQoyldi+L+nqSNSvyCSNF+J1YoU9/whUkQQaaF81cqItFz9f71GKhRptddY3LWrXqII9yAWi1SYL13/CvOIlFPYPYhukdwPJA3mESmncPciFeabBZFKg8/lEaks3yyIVBp8Lu8eJLcI7uNvEETKybuvMdwiuJ+RF8tXrbxmkf61jHkQo+d/b2zmCVHlq1ZGJPsgRs//3tjME+LPr0ykSrgHMXr+98ZmnhB/HpFyCrsHMXq+tP+l+QXeB0OknMLuQYyed7PA+lcmUpS7dp3n3SDSYsHn8u5BDJ/Xv7R6CkRaLPhc3j6Inefd5y+/cmGkWgluf3eR//3E1Dl/+ZULI9VKIFIX+d9PTJ3zl1+5MNJgiSLcg9R7/vcTU+f85VcujKhLLHaxyjVSiLz7/OVXLoyoS7jvGrkHqfd8KYj0tb+wEeSD580PhFMjObNfE3nw8bofj2l/+CgrYT+R5LvOT43kzH5NZOS8TR4fdkUl3I0k33d+aiRn9msiI4dh83Yatz7fN8OhpIS7keT7zk+N5Mx+TWRkM5y+t0/DpqSEu5Hk+85PjeTMfk3knht++8vTJdyNJN93fmokZ/ZrIiM8I5FfSX5qJGf2ayIj12uk989xi2sk8qHzUyM5s18TubNL7tptzyUl3I0k33d+aiRn9msiDz4O4/tIm/0r7yORD5yfGsmZ/ZqIuoS7keT7zk+N5Mx+TURdwt1I8n3np0ZyZr8moi7hbiT5vvNTIzmzXxOZ+Ca8j0Q+bH5qJGf2ayIT3+R/vsuvH9WdSpsbSb7v/NRIzuzXRNQl3I0k33d+aiRn9msi6hLuRpLvOz81kjP7NRF1CXcjyfednxrJmf2aiLqEu5Hk+85PjeTMfk1EXcLdSPJ956dGcma/JqIu4W4k+b7zUyM5s18Tued+/88oni3hbiT5vvNTIzmzXxMZOSIS+XXkp0ZyZr8mcue0+ft/efLvJdyNJN93fmokZ/ZrIg9Of/9xvn8v4W4k+b7zUyM5s18T+eKY/LR5SQl3I8n3nZ8ayZn9moi6hLuR5PvOT43kzH5NRF3C3UjyhXnzfzmMSF/7zY0kX5h3i1RYf2okZ/ZrIuoS9kEgX5ZHpLyIuoR9EMjHziPSY7/7RJDvOj81kjP7NRF1CXcjyfednxrJmf2aiLqEu5HkzfnCl2aI9LW/9ESQj51HJE0J+4kk780jkqZE9BNBPnZ+aiRn9msi6hKIRN6ZnxrJmf2aiLoEIpF35qdGcma/JqIu4W4k+b7zUyM5s18TUZfgGYW8Mz81kjP7NRF1CUQi78xPjeTMfk1EXQKRyDvzUyM5s18TUZdwN5J83/mpkZzZr4moS7gbSb7v/NRIzuzXRNQl3I0k33d+aiRn9msi6hLuRpLvOz81kjP7NRF1CXcjyfednxrJmf2aiLqEu5Hk+85PjeTMfk1EXcLdSPJ956dGcma/JqIu4W4k+b7zUyM5s18TUZdwN5J83/mpkZzZr4moS7gbSb7v/NRIzuzXRNQl3I0k33d+aiRn9msi6hLuRpLvOz81kjP7NRF1CXcjyfednxrJmf2aiLqEu5Hk+85PjeTMfk1EXcLdSH6Mo+/81EjO7NdE1CXcjbSL5K7feX5qJGf2ayLqEu5G2gfZXb/z/NRIzuzXRNQl3I205xHJmp8ayZn9moi6hLuR5PvOT43kzH5NRF3C3UieUfrOT43kzH5NRF3C3UhE6js/NZIz+zURdQl3IxGp7/zUSM7s10TUJdyN7D7f+QPB1EjO7NdE1CXcjew+j0j/M5Iz+zURdQl3I8PnS0VApP8ZyZn9moi6hLuR4fOdi4BIX/vNjQyfR6Si/NRIzuzXRNQl3I0k33d+aiRn9msi6hLuRpLvOz81kjP7NRF1CXcjyfednxrJmf2aiLqEu5Hk+85PjeTMfk1EXcLdSPJ956dGcma/JqIu4W4k+b7zUyM5s18TUZdwN5J83/mpkZzZr4moS7gbSb4wH/x9rKmRnNmviahLuBtJvixfSoPrRyTy9fOlNLh+RCJfP5994ttdPyKRr5/PPvGq9Rdeo+WvvDCiLuEeBPJl+ewTv3T9fBCJfP189olvpL4miUjkC/PZJ76R+prkGkQK/j5G9HwpC1zjPL5x/pKqRNQl7CeysH7v+WZBpLon0j2I0fPN0p1ISz21/+v6zIMYPf97YzNPiCpftXIDIrkLuwcxer60/4vlq1ZuWaRKJ8I9iNHzpf1fLF+1MiLZBzF6vrT/DeYRKSfvHsTo+dL+N5hHpJy8exCj55sFkUqDT5YxD2L0fLMgUl3cgxg93yyIVBf3IEbP/97YzBOiyletvGaRuEaqki/t/2L5qpVbFombDSHypf1fLF+1MiLZBzF6vrT/DeYRKSfvHsTo+dL+N5hHpJy8exCj55sFkUqDT5YxD2L0fLMgUl3cgxg93yyIVBf3IEbP/97YzBOiyletvGaRuEaqki/t/2L5qpVbFombDSHypf1fLF+1MiLZBzF6vrT/DeYRKSfvHsTo+dL+N5hHpJy8exCj55sFkUqDT5YxD2L0fLMgUl3cgxg93yyIVBf3IEbP/97YzBOiyletvGaRuEaqki/t/2L5qpVbFombDSHypf1fLF+1MiLZBzF6vrT/DeYRKSfvHsTo+dL+N5hHpJy8exCj55sFkUqDT5YxD2L0fLMgUl3cgxg93yyIVBf3IEbP/97YzBOiyletvGaRuEaqki/t/2L5qpVbFombDSHypf1fLF+1MiLZBzF6vrT/DeYRKSfvHsTo+dL+N5hHpJy8exCj55sFkUqDT5YxD2L0fLMgUl3cgxg93yyIVBf3IEbP/97YzBOiyletXLjY43YY9u+LlsiHa6Qq+dL+L5avWjl3scMY3A0jh0VKcLMhRr60/4vlq1YuEukwHM6Xy+dhOC5RApFi5Ev732C+tkib4XzbPg/bJUogUox8af8bzNcWaRiSv8hLIFKMfLPEEenlS6TNEiV4HylGvlmCiLR/Pb4Pb9fN8+Hvdxta77p7EKPnmyWISHfGzc15iRK1cA9i9Pzvjc08Iap81crZiz2djsf9frzlcPirR7yPtPJ8af8Xy1et3PInG7jZECJf2v/F8lUrI5J9EKPnS/vfYB6RcvLuQYyeL+1/g3mXSLyP1HO+WdYg0pCS/W3LVvXPZcyDGD3fLPFEspcowj2I0fPNgkh1cQ9i9Pzvjc08Iap81cprFolrpCr50v4vlq9aOX+xH6/78Qpof/hYqAQ3G0LkS/u/WL5q5dzFnrfJ3YTdIiUQKUa+tP8N5iuKdBg2b6dx6/N9s9CHVhEpRL60/w3mK4q0GU7f26dGf4wCkarkmyWESD/eHWr0Ddl/LWMexOj5ZgkhUo1npEq4BzF6vllCiHS9Rnr/HLeWu0aqhHsQo+d/b2zmCVHlq1bOXuwuuWu3bfMH+7hGqpIv7f9i+aqVC95HOozvI232r7yP1HW+tP+L5atWbvmTDYgUIl/a/wbziJSTdw9i9Hxp/xvMI1JO3j2I0fPNgkilwSfLmAcxer5ZEKku7kGMnm8WRKqLexDt+Qz+rbGZJ0SVr1p5zSJxjfRP+dL+NZuvWrllkbjZUCW/GIikpXmRCl/auEVAJHkekaT5fy1jFgGR5HlEcuTdIjR7jeQGkUqDImqJZH5pWXr8zYJIjVDpGqu4PiJNg0iN4F54dJHc+aqV1yxS3BP5VH33M2Kz+aqVWxYp7oloo76buOcPkaT5Utz13bjPH9dIpUFRvhR3fTfu84dIpUHyTeTdIFJpUIR7EKPn3SBSI7gX7hbBffylIFIjuBfuFiF6vmrlNYsU90Rq6veer1q5ZZHinog26ruJe/4QSZovxV3fjfv8cY1UGhTlS3HXd+M+f4hUGiTfRN4NIpUGRbgHMXreDSI1gnvhbhHcx18KIjWCe+FuEaLnq1Zes0hxT6Smfu/5qpVbFinuiWijvpu45w+RpPlS3PXduM8f10ilQVG+FHd9N+7zh0ilQfJN5N0gUmlQhHsQo+fdIFIjuBfuFsF9/KUgUiO4F+4WIXq+auU1ixT3RGrq956vWrllkeKeiDbqu4l7/hBJmi/FXd+N+/xxjVQaFOVLcdd34z5/iFQaJN9E3g0ilQZFuAcxet4NIjWCe+FuEdzHXwoiNYJ74W4RouerVl6zSHFPpKZ+7/mqlVsWKe6JaKO+m7jnD5Gk+VLc9d24zx/XSKVBUb4Ud30Hhb9t8+f3Kl1LzSQikV8q7waRSoMi3IMYPe8GkRrBvXC3CO7jLwWRGsG9cLcI0Y+/auU1i9T7ILnzpSCSqASD0DeIJCqBSH3D7W9RCUTqjzbeh0KkRk4E+RZApNKgCPcgRs+7QaRGcC/cLYL7+HPQvCJBJCnuhbtFcB+/D0SS4l64WwT38ftYmUjRT2T09fcLIjVF9PX3CyI1RfT19wsiSXFfY0TPx2UFIinfUC3FPYjR83FZgUgt4V64WwT38ftAJCnuhbtFcB+/D0SS4l64WwT38ftApKbo9sDDg0hN0e2BhweRmqLbAw8PIklxX2NEz8cFkaS4BzF6Pi6IJMW9cLcI7uP3gUhS3At3i+A+fh+IJMW9cLcI7uP3gUhN0e2BhweRmqLbAw8PIjVFtwceHkSS4r7GiJ6PCyJJcQ9i9HxcEEmKe+FuEdzH7wORpLgX7hbBffw+EEmKe+FuEdzH7wORmqLbAw8PIjVFtwceHkRqim4PPDyIJMV9jRE9HxdEkuIexOj5uCCSFPfC3SK4j98HIklxL9wtgvv4fSCSFPfC3SK4j98HIjVFtwceHkRqim4PPDyI1BTdHnh4EEmK+xojej4uiCTFPYjR83FBJCnuhbtFcB+/j6oifbzux9+ktz98LFXCjHvhbhHcx++jokjnbfJbKXeLlLDjXrhbBPfx+6go0mHYvJ3Grc/3zXBYokR4uj3w8FQUaTOcvrdPw2aJEuHp9sDDU1GkH79l/O+/crzbeer2wMPDM5IU9zVG9Hxc6l4jvX+OW+u9RspZ+PArdeq3lI9LzZO1S2Zke16khBv3wt0iuI/fR933kQ7j+0ib/SvvI7VZ352PC59sKEb50sxR351fB4gEIACRAAS4ROJ9JFgV7YjU7atrWAO8tAMQgEgAAhAJQAA/2AcggB/sAxDAD/YBCODHKAAE8IN9AAJ4RgIQwA/2AQjgB/sABPCDfQAC+GQDgABEAhCASAACEAlAACIBCEAkAAGNigQQjIwp14sjw/0fHJLvO99usSdxN5J83/l2iz2Ju5Hk+863W+xJ3I0k33e+3WJP4m4k+b7z7RZ7Encjyfedb7fYk7gbSb7vfLvFnsTdSPJ959st9iTuRpLvO99usSdxN5J83/l2iz2Ju5Hk+863W+xJ3I0k33e+3WIAawWRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAG1RDofb79xdn/8+uthOwy7+982++Pj96Mf95tx58swvJzuX3jYDJvD+b+bl8vxa+Gn2xd//ncz5PqvX7B7X2b5S63/x/86f9wmXxBv/dn/f/79O2XmnuR981jmZjzm8+Ovm9txXf98Gb/o5XEY951jJ+6/O337n83r+H0d8Puf75Nshlz//Qtel1j+Yuv/msPbAB8C9j9Z/ymCSNdxebn97vOP/djy6yHvrg353A2H2xKG7fiPl812PIzDrS+HYX/7+mFzupw2w8ePzcvtz8fCN9d/Pe/H75NsRlz/cdidbw+mp1Drf3z7j9sgvpxvR/ESdv37kiXWEWkzfL1keRluz8fDMD5unR9PqYdxek7XP4fxi++PM5dbO265t9vDdLJ5m7lHI97GVp5v7U02Q65/N57jz0UeCBZb/8h5cxvB/ddjfND1H8teDFQR6e2P7J+H27j86PYwvI/dOQ5vyb+PQuyH2zP5+FiRbF6uw/b4yuQBfJnH8nrr/3qxsYu0/pH9cE6/W9D1H4fjpYAqIu3vT6h/OKQ3Ba4PL8P9Ie3zz+EdHo88j6/4sXk5ff91O1xeN+NrinQz5PrTLwiz/hun9En0vMQDQZX174f3l2GT/Xqgikj/OxzXp9bt4eN773Z8it58f+H1oeWQBP/biHTP/utaN9kMuf7t+JD5sYRIy63/xo8npOOwwH3HKuvf3+815D4O1BQpvS3yfrvFsnl/7D1cH3I+rteIX4d33G/GV6z/MIi3i/WX2xcnmyHX/zrsz5fT7n+HpuX1X8abDH8qfW6Krtid6x+Gt9tt9dwXeC6Rrny83u+hXP/ldg34ej2QZIrGq8p/GMTbNcbn7a5mshly/fe7tvtKIonWf/m6ir9z3izwwq7i+m8vTTPnp4pI6WvcdFBO46qv//J5fUbdXV/aJDvHO1mbP0e/+WUQv//4T6fCrf/2xLR5Dbb+y/eOkd0Cj2KXiuu/5M9PFZHekmf/ZOYfG/fDHA88PYrb9v1Wy+efuy6fX/dvHl+Z3HNd8PZrlfXfOS3xjLrc+n+8/fK53S3zuZJK6/+540mqiJS8D3C+H9/x8Zf7XYLbE/H4Ntp4FPf3AcaXO3c3ZZkAAAIwSURBVK9j7v125ZhsXi7fx3v/19sDUroZcv2bx03YJS4yFlt/etf4fZHWV1v/Vyyz/3VEer+++B/fmT6Md6g+huF4XfXH7vse5dswXuuNhze25Dw267d3pr8bcW3X+HGAtx+bIdc/xj62sdZ/G+rHO2HLPITVW//h5tj5kHvXsY5Il4+vz0rdn6QPQ3Kv8XZMn9e/fH4d3ubPvu3k5uVPI17//Ovrjy8It/7H58eWeEJacP3Xf37c/H75qhB0/Y/+576RVEmk62PG/rrQ3evjRfTp5fa3+4Pv4/n461n6Mn5Od/t48h4/svvfzUvSiPfd978mmxHX/3kdxf1Sn/5ebP3JxoIiVVj/+AXb7E83VBMJYM0gEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAABiAQgAJEABCASgABEAhCASAACEAlAACIBCEAkAAGIBCAAkQAEIBKAAEQCEIBIAAIQCUAAIgEIQCQAAYgEIACRAAQgEoAARAIQgEgAAhAJQAAiAQhAJAAB/w9hPff29ZvkWgAAAABJRU5ErkJggg==", "text/plain": [ "plot without title" ] }, "metadata": { "image/png": { "height": 420, "width": 420 } }, "output_type": "display_data" } ], "source": [ "# Get expression matrix\n", "expression_data <- exprs(data)\n", "# Calculate data quantile and remove the NA value\n", "qx <- as.numeric(quantile(expression_data, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))\n", "LogC <- (qx[5] > 100) ||\n", " (qx[6]-qx[1] > 50 && qx[2] > 0)\n", "\n", "# Replace negative values by NA and perform log transformation\n", "if (LogC) {\n", " expression_data[which(expression_data <= 0)] <- NaN #\n", " norm_expression_data <- log2(expression_data+1)\n", "}\n", "# Plot the boxplot of 10 samples\n", "boxplot(norm_expression_data[,1:10],outline=FALSE)" ] }, { "cell_type": "code", "execution_count": 13, "id": "321dff60", "metadata": { "tags": [ "remove-input" ], "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Quiz_Submodule1-2\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", " \r\n", "\r\n", "\r\n", "\r\n", "\r\n", " \r\n", " \r\n", " \r\n", " \r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "
\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "
\r\n", "
\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "
\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Run the following command to take the quiz\n", "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "id": "74400910", "metadata": {}, "source": [ "### Samples and groups selection\n", "In this learning course, we focus on analyzing two groups: *disease* and *control* from the *entorhinal cortex* region of the ```GSE48350``` dataset . Therefore, we need to select the samples\n", "that belong to *entorhinal cortex* region and form a group that has two classes. To select the samples that belong to the *entorhinal cortex*, we can use the following command:" ] }, { "cell_type": "code", "execution_count": 14, "id": "b805a146", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Select sample from Entorhinal Cortex region\n", "idx <- grep(\"entorhinal\", samples$`brain region:ch1`)\n", "samples = samples[idx,]\n", "# Get expression and normalized expression data for samples collected from entorhinal cortex region\n", "expression_data = expression_data[,idx]\n", "norm_expression_data = norm_expression_data[,idx]" ] }, { "cell_type": "markdown", "id": "33963c0d", "metadata": {}, "source": [ "To check how many samples belong to the *entorhinal cortex* region, we can use the following command:" ] }, { "cell_type": "code", "execution_count": 15, "id": "5640f260", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"#Genes: 54675 - #Samples: 54\"\n" ] } ], "source": [ "# Print out new number genes and samples\n", "print(paste0(\"#Genes: \",dim(norm_expression_data)[1],\" - #Samples: \",dim(norm_expression_data)[2]))" ] }, { "cell_type": "markdown", "id": "66b15c60", "metadata": {}, "source": [ "In order to perform Differential Expression analysis in the later submodule, we need to group patients into two groups: “disease” and “control”. From the samples source name, patients diagnosed with Alzheimer’s Disease are annotated with ‘AD’.\n", "Therefore, any samples that contain string ‘AD’ are labelled by ‘d’ (condition/disease) and the remaining samples are assigned to ‘c’ (control/normal).\n", "The code to group the samples is presented below:" ] }, { "cell_type": "code", "execution_count": 16, "id": "9ccaca6f", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Select disease samples\n", "disease_idx <- grep(\"AD\", samples$source_name_ch1)\n", "# Create a vector to store label\n", "groups <- rep(\"X\", nrow(samples))\n", "# Annotate disease samples as \"d\"\n", "groups[disease_idx] <- \"d\"\n", "# Control samples are labeled as \"c\"\n", "groups[which(groups!=\"d\")] <- \"c\"\n", "groups <- factor(groups)" ] }, { "cell_type": "markdown", "id": "a9bfe875", "metadata": {}, "source": [ "To see the number of *control* and *disease*, we can use the following command:" ] }, { "cell_type": "code", "execution_count": 17, "id": "bff2353e", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/plain": [ "groups\n", " c d \n", "39 15 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "table(groups)" ] }, { "cell_type": "markdown", "id": "b0209857-88b4-422e-b64b-fbea32ebf1d1", "metadata": {}, "source": [ "## Exporting the data\n", "When we have successfully retrieved expression data from GEO, we can export the expression data to a `.csv` file format for inspection in other software such as Excel using the `write_csv` function from `readr` package.\n", "In the code below, we will save the raw expression matrix, normalized expression matrix, and grouping information to `.csv` files." ] }, { "cell_type": "code", "execution_count": 18, "id": "0f39641e", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Convert raw and normalized expression matrix to datafames and save them to the csv file\n", "expression_data <- as.data.frame(expression_data)\n", "norm_expression_data <- as.data.frame(norm_expression_data)\n", "# Convert group to a dataframe\n", "groups <- as.data.frame(groups)\n", "# Create a sub-directory data folder to save the expression matrix if it is not available\n", "dir <- getwd()\n", "subDir <- \"/data\"\n", "path <- paste0(dir,subDir)\n", "if (!file.exists(path)){\n", " dir.create(file.path(path))\n", "}\n", "# Save expresion values and group to the csv files format in the local folder\n", "write.csv(expression_data, file=\"./data/raw_GSE48350.csv\")\n", "write.csv(norm_expression_data, file=\"./data/normalized_GSE48350.csv\")\n", "write.csv(groups, file=\"./data/groups_GSE48350.csv\")" ] }, { "cell_type": "markdown", "id": "23a731f6", "metadata": {}, "source": [ "The `.csv` format is a very simple format that might not suitable to store big datasets.\n", "We can export the expression data to `.rds` format, which is more memory efficient for loading and saving the data.\n", "We can save all the relevant data in a `list` and write to the disk using the built in `saveRDS` function." ] }, { "cell_type": "code", "execution_count": 19, "id": "665869cc", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Putting raw data, normalized data, and groups into a list\n", "dat <- list(expression_data = expression_data, norm_expression_data = norm_expression_data, groups = groups)\n", "# Save the data to the local disk using rds format\n", "saveRDS(dat, file=\"./data/GSE48350.rds\")\n" ] }, { "cell_type": "markdown", "id": "9ade9695", "metadata": {}, "source": [ "The user can copy the data from the local disk to the Bucket using the `gsutil cp` command on the Vertex AI Workbench terminal.\n", "To do this, the user has to click on the `+` symbol beside the name of the current tab in the vertex AI workbench interface to open a new Launcher tab.\n", "The next step is to click on `$_Terminal` open a Terminal and then type out the `gsutil cp` command below. The command to copy data to the Google Cloud Bucket is as follows:\n", "\n", "`gsutil cp PATH_TO_LOCAL_FOLDER gs://PATH_IN_GOOGLE_BUCKET`\n", "\n", "The steps to create a Google Bucket storage is shown in project GIT repository `https://github.com/tinnlab/NOSI-Google-Cloud-Training.git`. For example, if we want to copy the file `./data/GSE48350.rds` to our Google Bucket named `cpa-output`, we can use the following command." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "```{admonition} Tip: Saving data to the Google Cloud Bucket\n", "gsutil cp ./data/GSE48350.rds gs://cpa-output\n", "```" ] }, { "cell_type": "markdown", "id": "1ef6c72b", "metadata": {}, "source": [ "## Gene ID mapping\n", "\n", "![](./images/Module1/Gene_ID_Conversion.jpg)\n", "## Learning Objectives:\n", "1. Understand different probe set ID.\n", "2. Map probe IDs into gene identifiers and symbols." ] }, { "cell_type": "markdown", "id": "1e8f26b4", "metadata": {}, "source": [ "## Understanding different probe set ID\n", "\n", "Gene set or pathway analysis requires that gene sets and expression data use the same type of gene ID\n", "(Entrez Gene ID, Gene symbol or probe set ID etc). However, this is frequently not true for the data we have.\n", "For example, our gene sets mostly use Entrez Gene ID, but microarray datasets are frequently labeled by probe\n", "set ID (or RefSeq transcript ID, etc.). Therefore, we need to map the probe set IDs to Entrez gene ID.\n", "Here, we will use `GSE48350` dataset that we have used in the previous section for demonstration of gene ID mapping.\n", "In order to know what kind of probe set ID, we need to navigate to GEO record page of `GSE48350`. Under the `Platform`\n", "tab, we can find the probe ID information.\n", "![probID](./images/Module1/ProbeID.png)" ] }, { "cell_type": "markdown", "id": "3f2e5579", "metadata": {}, "source": [ "From the record page, we know that the dataset was generated from 1 platform using the Affymetrix Human Genome U133 Plus 2.0 Array.\n", "To convert or map the probe set IDs to Entrez gene ID, we need to find the corresponding annotation package from\n", "Bioconductor. For analyzed data, we need to use the `hgu133plus2.db` and `AnnotationDbi` databases.\n", "We can install the package using following R command:" ] }, { "cell_type": "code", "execution_count": 22, "id": "5fb6d5b0", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "suppressMessages({if (!require(\"BiocManager\", quietly = TRUE))\n", " suppressWarnings(install.packages(\"BiocManager\"))\n", " suppressWarnings(BiocManager::install(\"hgu133plus2.db\", update = F))\n", " suppressWarnings(BiocManager::install(\"AnnotationDbi\", update = F))\n", "})" ] }, { "cell_type": "markdown", "id": "8bee6afb", "metadata": {}, "source": [ "In the data processing section, we successfully downloaded the dataset `GSE48350` and saved it to the `data` sub-directory.\n", "Now we can load the dataset and check for the Probe Set ID names by using the following command." ] }, { "cell_type": "code", "execution_count": 23, "id": "bae8d0ed", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "
  1. '1007_s_at'
  2. '1053_at'
  3. '117_at'
  4. '121_at'
  5. '1255_g_at'
  6. '1294_at'
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item '1007\\_s\\_at'\n", "\\item '1053\\_at'\n", "\\item '117\\_at'\n", "\\item '121\\_at'\n", "\\item '1255\\_g\\_at'\n", "\\item '1294\\_at'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. '1007_s_at'\n", "2. '1053_at'\n", "3. '117_at'\n", "4. '121_at'\n", "5. '1255_g_at'\n", "6. '1294_at'\n", "\n", "\n" ], "text/plain": [ "[1] \"1007_s_at\" \"1053_at\" \"117_at\" \"121_at\" \"1255_g_at\" \"1294_at\" " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data = readRDS(\"./data/GSE48350.rds\")\n", "expression_data = data$expression_data\n", "probeIDs = rownames(expression_data)\n", "head(probeIDs)" ] }, { "cell_type": "markdown", "id": "7d6c37a9", "metadata": {}, "source": [ "## Mapping probe IDs into gene identifiers and symbols.\n", "For pathway analysis, databases like Gene Ontology and KEGG use gene symbol annotation. Therefore, Probe set IDs should be mapped to gene symbol to be used in the later submodules. We can use the pre-annotated databases which are available from Bioconductor to map the probe IDs to the gene symbol. `AnnotationDbi` is an R package that provides an interface for connecting and querying various annotation databases using SQLite data storage while `hgu133plus2.db` is database to perform Probe IDs to gene symbol mapping for human data. We can use the following command to map the Probe set IDs:" ] }, { "cell_type": "code", "execution_count": 24, "id": "e75ae09a", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "suppressMessages({\n", " library(hgu133plus2.db)\n", " library(AnnotationDbi)\n", "})" ] }, { "cell_type": "code", "execution_count": 25, "id": "eb5a9160", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "suppressMessages({\n", "annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID','GENENAME','SYMBOL'))\n", "})" ] }, { "cell_type": "code", "execution_count": 26, "id": "ecea1cd0", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A data.frame: 6 × 3
PROBEIDGENENAMESYMBOL
<chr><chr><chr>
11007_s_atdiscoidin domain receptor tyrosine kinase 1 DDR1
21053_at replication factor C subunit 2 RFC2
3117_at heat shock protein family A (Hsp70) member 6HSPA6
4121_at paired box 8 PAX8
51255_g_atguanylate cyclase activator 1A GUCA1A
61294_at ubiquitin like modifier activating enzyme 7 UBA7
\n" ], "text/latex": [ "A data.frame: 6 × 3\n", "\\begin{tabular}{r|lll}\n", " & PROBEID & GENENAME & SYMBOL\\\\\n", " & & & \\\\\n", "\\hline\n", "\t1 & 1007\\_s\\_at & discoidin domain receptor tyrosine kinase 1 & DDR1 \\\\\n", "\t2 & 1053\\_at & replication factor C subunit 2 & RFC2 \\\\\n", "\t3 & 117\\_at & heat shock protein family A (Hsp70) member 6 & HSPA6 \\\\\n", "\t4 & 121\\_at & paired box 8 & PAX8 \\\\\n", "\t5 & 1255\\_g\\_at & guanylate cyclase activator 1A & GUCA1A\\\\\n", "\t6 & 1294\\_at & ubiquitin like modifier activating enzyme 7 & UBA7 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A data.frame: 6 × 3\n", "\n", "| | PROBEID <chr> | GENENAME <chr> | SYMBOL <chr> |\n", "|---|---|---|---|\n", "| 1 | 1007_s_at | discoidin domain receptor tyrosine kinase 1 | DDR1 |\n", "| 2 | 1053_at | replication factor C subunit 2 | RFC2 |\n", "| 3 | 117_at | heat shock protein family A (Hsp70) member 6 | HSPA6 |\n", "| 4 | 121_at | paired box 8 | PAX8 |\n", "| 5 | 1255_g_at | guanylate cyclase activator 1A | GUCA1A |\n", "| 6 | 1294_at | ubiquitin like modifier activating enzyme 7 | UBA7 |\n", "\n" ], "text/plain": [ " PROBEID GENENAME SYMBOL\n", "1 1007_s_at discoidin domain receptor tyrosine kinase 1 DDR1 \n", "2 1053_at replication factor C subunit 2 RFC2 \n", "3 117_at heat shock protein family A (Hsp70) member 6 HSPA6 \n", "4 121_at paired box 8 PAX8 \n", "5 1255_g_at guanylate cyclase activator 1A GUCA1A\n", "6 1294_at ubiquitin like modifier activating enzyme 7 UBA7 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(annotLookup)" ] }, { "cell_type": "code", "execution_count": 27, "id": "6aae161e", "metadata": { "tags": [ "remove-input" ], "vscode": { "languageId": "r" } }, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Run the following command to take the quiz\n", "IRdisplay::display_html('')" ] }, { "cell_type": "markdown", "id": "c3f0cef5", "metadata": {}, "source": [ "We can check the number of probe IDs that mapped to the unique gene symbols using the following command:" ] }, { "cell_type": "code", "execution_count": 28, "id": "0379ec35", "metadata": { "vscode": { "languageId": "r" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"There are 54675 probe IDs that mapped to 22243 gene symbols\"\n" ] } ], "source": [ "# Check number of the probe IDs\n", "print(paste0(\"There are \",length(unique(annotLookup$PROBEID)),\" probe IDs that mapped to \",length(unique(annotLookup$SYMBOL)),\" gene symbols\"))" ] }, { "cell_type": "markdown", "id": "6247efff", "metadata": {}, "source": [ "From the lookup table we can spot that a single gene symbol can be mapped to multiple probe IDs. In the next submodule, we will discuss how these mapped gene symbols can be used to identify which genes are differentially expressed. \n" ] } ], "metadata": { "celltoolbar": "Tags", "environment": { "kernel": "ir", "name": "r-cpu.4-1.m93", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/r-cpu.4-1:m93" }, "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.2.1" }, "toc-showcode": true }, "nbformat": 4, "nbformat_minor": 5 }