GEO data processing#

Learning Objectives:#

  1. Understand the type of data that is accessible from GEO.

  2. Demonstrate how to navigate the GEO website, search for a dataset using its accession number, and select samples.

  3. Use the R command line to download data from GEO.

  4. Perform data pre-processing and normalization, and load and save the expression matrix.

Accessing public GEO sequencing data#

The Gene Expression Omnibus (GEO) is a public repository that archives and freely distributes comprehensive sets of microarray, next-generation sequencing, and other forms of high-throughput functional genomic data submitted by the scientific community. In addition to data storage, the website provides a set of web-based tools and applications to help users query and download the studies and gene expression patterns stored in GEO.

Searching and accessing data on GEO#

Searching GEO is relatively straightforward. First, navigate to https://www.ncbi.nlm.nih.gov/geo/. There are multiple ways to search for datasets, but the simplest is to enter the accession number in the search box. We will use an example dataset with the accession number GSE48350 for demonstration. The GEO website interface and the search procedure for the example dataset are shown in the figure below:

When the search completes, a webpage with the detailed record of the example dataset, including the publication date, title, organism, experiment type, and dataset summary, is displayed, as shown in the figure below:

We can see that the dataset GSE48350, published on Apr 21, 2014, focuses on human Alzheimer’s Disease and uses microarray technology. Samples were primarily collected from four brain regions: hippocampus (HC), entorhinal cortex (EC), superior frontal gyrus (SFG), and post-central gyrus (PCG) of normal subjects and Alzheimer’s disease patients.

To display the quizzes in all the learning sub-modules, it is necessary to install the IRdisplay package. This package allows quizzes written in HTML format to show up in the notebook. Users can install IRdisplay using the following command:

suppressWarnings(if (!require("IRdisplay")) install.packages("IRdisplay"))
suppressWarnings(library(IRdisplay))
Loading required package: IRdisplay
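As an illustration of how the package is used (the quiz file name below is a hypothetical placeholder), an HTML quiz can then be rendered inside the notebook with the display_html function:

# Render an HTML-formatted quiz inside the notebook; "quiz_geo.html" is a placeholder file name
IRdisplay::display_html(file = "quiz_geo.html")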

Downloading data using web-interface#

At the bottom of the dataset page, users will find additional information about the dataset, such as the sequencing platform, number of samples, project ID, and links to download the expression data. The exact places to find this information are shown in the figure below:

The dataset GSE48350 was profiled on the Affymetrix Human Genome U133 Plus 2.0 Array and contains 253 samples in total. The summary table at the bottom of the page shows all the data generated from the experiment. Users can click the “(http)” hyperlink to download all the samples or click “(custom)” to select and download only the samples of interest. Note that the expression data downloaded in this step is raw data, and additional processing needs to be done locally before further analysis.
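For users who prefer to script this step instead of using the web interface, the GEOquery package (installed in the next section) provides the getGEOSuppFiles function, which downloads the supplementary raw files for an accession. A minimal sketch (note that the raw files for GSE48350 can be large):

# Download the supplementary (raw) files for GSE48350 into ./GSE48350/
# Requires the GEOquery package, which is installed in the next section
library(GEOquery)
getGEOSuppFiles("GSE48350", makeDirectory = TRUE, baseDir = ".")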

Accessing GEO data using R command line#

Downloading and pre-processing the data#

If users have a programming background, they can use the R command line to automate the download. Getting data from GEO is straightforward with the GEOquery R package, available from Bioconductor. Before starting, users will need to install the GEOquery package using the following command.

suppressMessages({if (!require("BiocManager", quietly = TRUE))
    suppressWarnings(install.packages("BiocManager"))
    suppressWarnings(BiocManager::install("GEOquery", update = F))
})

We can use the getGEO function from the GEOquery package to download a GEO dataset. First, users have to specify the accession ID of the dataset. For this demonstration, we will use the same dataset, GSE48350.

suppressMessages(library("GEOquery"))
## change my_id to be the dataset that you want.
accession_ID <- "GSE48350"
suppressMessages({gse <- getGEO(accession_ID,GSEMatrix =TRUE, AnnotGPL=TRUE)})

Some datasets on GEO are derived from more than one microarray platform, in which case the object gse is a list with one element per platform. You can find out how many platforms were used by checking the length of the gse object.

## check how many platforms were used
if (length(gse) > 1) idx <- grep("GPL570", attr(gse, "names")) else idx <- 1
data <- gse[[idx]]
print(paste0("Number of platforms: ", length(gse)))
[1] "Number of platforms: 1"

The result shows that we have only one dataset, which belongs to the microarray platform mentioned on the GEO dataset page. Next, we can access the sample and gene information using the commands below:

# Get the samples information
samples <- pData(data)
# Get the genes information
genes <- fData(data)
# Check the number of samples and genes
print(paste0("The dataset contains ", dim(samples)[1] ," samples and ",dim(genes)[1]," genes"))
[1] "The dataset contains 253 samples and 54675 genes"

The samples dataframe contains the metadata of each sample, such as title, status, GEO accession, and submission date, while the genes dataframe shows the probe ID, title, gene symbol, gene ID, and more. We can inspect the first several rows and columns of the samples and genes dataframes using the following commands:

samples[1:5,1:4]
genes[1:5,1:5]
A data.frame: 5 × 4
           title                                       geo_accession  status                 submission_date
GSM300166  PostcentralGyrus_female_91yrs_indiv10       GSM300166      Public on Oct 09 2008  Jun 25 2008
GSM300167  SuperiorFrontalGyrus_female_91yrs_indiv10   GSM300167      Public on Oct 09 2008  Jun 25 2008
GSM300168  Hippocampus_female_96yrs_indiv105           GSM300168      Public on Oct 09 2008  Jun 25 2008
GSM300169  Hippocampus_male_82yrs_indiv106             GSM300169      Public on Oct 09 2008  Jun 25 2008
GSM300170  Hippocampus_male_84yrs_indiv108             GSM300170      Public on Oct 09 2008  Jun 25 2008

A data.frame: 5 × 5
           ID         Gene title                                                    Gene symbol     Gene ID          UniGene title
1007_s_at  1007_s_at  microRNA 4640///discoidin domain receptor tyrosine kinase 1  MIR4640///DDR1  100616237///780
1053_at    1053_at    replication factor C subunit 2                                RFC2            5982
117_at     117_at     heat shock protein family A (Hsp70) member 6                  HSPA6           3310
121_at     121_at     paired box 8                                                  PAX8            7849
1255_g_at  1255_g_at  guanylate cyclase activator 1A                                GUCA1A          2978

To inspect the expression data of the first few rows (probes) and samples, we can use this command:

head(exprs(data)[,1:5])
A matrix: 6 × 5 of type dbl
           GSM300166  GSM300167  GSM300168  GSM300169  GSM300170
1007_s_at  0.8880162  1.4355185  1.6096015  1.754960   1.820730
1053_at    0.6664604  0.8858521  1.8590777  1.036666   1.421393
117_at     0.8596384  1.0620298  3.0973945  2.243655   5.060301
121_at     0.9751495  1.0507448  0.9822838  1.198237   1.039529
1255_g_at  0.4912547  0.5375254  3.1796446  1.514290   2.185801
1294_at    0.8344268  0.9291890  1.5193694  1.557089   1.641263

The summary function can then be used to print the distribution of each sample, and the range function reports the overall range of the expression values.

summary(exprs(data)[,1:5])
range(exprs(data))
   GSM300166          GSM300167          GSM300168         GSM300169       
 Min.   :  0.0100   Min.   :  0.0100   Min.   : 0.0100   Min.   : 0.01942  
 1st Qu.:  0.6366   1st Qu.:  0.6519   1st Qu.: 0.3342   1st Qu.: 0.32045  
 Median :  0.8622   Median :  0.8828   Median : 0.7756   Median : 0.75063  
 Mean   :  0.9193   Mean   :  0.9492   Mean   : 0.9274   Mean   : 0.87928  
 3rd Qu.:  1.0903   3rd Qu.:  1.0880   3rd Qu.: 1.1929   3rd Qu.: 1.15987  
 Max.   :335.6746   Max.   :531.7116   Max.   :99.4657   Max.   :26.46015  
   GSM300170       
 Min.   : 0.01098  
 1st Qu.: 0.32324  
 Median : 0.77234  
 Mean   : 0.93768  
 3rd Qu.: 1.21978  
 Max.   :83.41583  
[1]    0.0100 1438.7626

From the summary of the data above, we clearly see that the maximum expression values can reach the scale of thousands while the average expression value in each sample is below one. Therefore, pre-processing and normalization are required for quality assurance. A common first step is to examine the quantiles of the expression values to decide whether a \(\log_2\) transformation is needed; non-positive values are then replaced by NA and the \(\log_2\) transformation is applied so that the distributions of all samples become comparable. A boxplot can then be generated to check whether the data have been normalized correctly. We can use the sample code below to perform all of these steps.

# Get the expression matrix
expression_data <- exprs(data)
# Compute the quantiles of the expression values (ignoring NA values)
qx <- as.numeric(quantile(expression_data, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
# Decide whether a log2 transformation is needed based on the quantiles
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0)

# Replace non-positive values by NA and perform the log2 transformation
if (LogC) {
    expression_data[which(expression_data <= 0)] <- NaN
    norm_expression_data <- log2(expression_data+1)
} else {
    # Data already appear to be on a log-like scale; keep them unchanged
    norm_expression_data <- expression_data
}
# Plot the boxplot of the first 10 samples
boxplot(norm_expression_data[,1:10],outline=FALSE)
Figure: boxplot of the normalized expression values for the first 10 samples.
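As an additional, optional check on the normalization, density curves of the normalized values can be overlaid for the same samples; a minimal base-R sketch:

# Optional check: overlay density curves of the normalized values for the first 10 samples
plot(density(norm_expression_data[, 1], na.rm = TRUE),
     main = "Normalized expression densities", xlab = "log2 expression")
for (i in 2:10) lines(density(norm_expression_data[, i], na.rm = TRUE))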

Sample and group selection#

In this learning course, we focus on analyzing two groups, disease and control, from the entorhinal cortex region of the GSE48350 dataset. Therefore, we need to select the samples that belong to the entorhinal cortex region and form a grouping with two classes. To select the samples that belong to the entorhinal cortex, we can use the following command:

# Select sample from Entorhinal Cortex region
idx <- grep("entorhinal", samples$`brain region:ch1`)
samples = samples[idx,]
# Get expression and normalized expression data for samples collected from entorhinal cortex region
expression_data = expression_data[,idx]
norm_expression_data = norm_expression_data[,idx]

To check how many samples belong to the entorhinal cortex region, we can use the following command:

# Print out the new numbers of genes and samples
print(paste0("#Genes: ",dim(norm_expression_data)[1]," - #Samples: ",dim(norm_expression_data)[2]))
[1] "#Genes: 54675 - #Samples: 54"

In order to perform differential expression analysis in a later submodule, we need to divide the patients into two groups: “disease” and “control”. In the sample source names, patients diagnosed with Alzheimer’s Disease are annotated with ‘AD’. Therefore, any sample whose source name contains the string ‘AD’ is labelled ‘d’ (condition/disease) and the remaining samples are labelled ‘c’ (control/normal). The code to group the samples is presented below:

# Select disease samples
disease_idx <- grep("AD", samples$source_name_ch1)
# Create a vector to store label
groups <- rep("X", nrow(samples))
# Annotate disease samples as "d"
groups[disease_idx] <- "d"
# Control samples are labeled as "c"
groups[which(groups!="d")] <- "c"
groups <- factor(groups)

To see the numbers of control and disease samples, we can use the following command:

table(groups)
groups
 c  d 
39 15 
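As an illustrative sanity check, we can cross-tabulate the assigned groups against whether ‘AD’ appears in the sample source names; every ‘d’ sample should fall in the TRUE column:

# Sanity check: the "d" group should coincide with source names containing "AD"
table(groups, grepl("AD", samples$source_name_ch1))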

Exporting the data#

Once we have successfully retrieved the expression data from GEO, we can export it to .csv files for inspection in other software such as Excel using the write.csv function. In the code below, we save the raw expression matrix, the normalized expression matrix, and the grouping information to .csv files.

# Convert the raw and normalized expression matrices to dataframes before saving them to csv files
expression_data <- as.data.frame(expression_data)
norm_expression_data <- as.data.frame(norm_expression_data)
# Convert group to a dataframe
groups <- as.data.frame(groups)
# Create a data sub-directory to save the expression matrices if it does not already exist
dir <- getwd()
subDir <- "/data"
path <- paste0(dir,subDir)
if (!file.exists(path)){
    dir.create(file.path(path))
}
# Save the expression values and groups to csv files in the local data folder
write.csv(expression_data, file="./data/raw_GSE48350.csv")
write.csv(norm_expression_data, file="./data/normalized_GSE48350.csv")
write.csv(groups, file="./data/groups_GSE48350.csv")
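As a quick, optional check that the export worked, the normalized matrix can be read back and its dimensions compared with the values printed earlier:

# Optional check: read the normalized matrix back and confirm its dimensions
check <- read.csv("./data/normalized_GSE48350.csv", row.names = 1)
dim(check)   # expected: 54675 probes and 54 samples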

The .csv format is very simple and might not be suitable for storing big datasets. Alternatively, we can export the expression data in .rds format, which is more compact and faster to load and save. We can save all the relevant data in a list and write it to disk using the built-in saveRDS function.

# Putting raw data, normalized data, and groups into a list
dat <- list(expression_data = expression_data, norm_expression_data = norm_expression_data, groups = groups)
# Save the data to the local disk using rds format
saveRDS(dat, file="./data/GSE48350.rds")
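To see the difference in practice, the on-disk sizes of the two exports can be compared (the exact numbers will vary):

# Compare the on-disk sizes (in bytes) of the csv and rds exports
file.info(c("./data/normalized_GSE48350.csv", "./data/GSE48350.rds"))$size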

Users can copy the data from the local disk to a Google Cloud Storage bucket using the gsutil cp command in the Vertex AI Workbench terminal. To do this, click on the + symbol beside the name of the current tab in the Vertex AI Workbench interface to open a new Launcher tab, then click on the Terminal icon ($_) to open a terminal and type the gsutil cp command below. The command to copy data to a Google Cloud bucket is as follows:

gsutil cp PATH_TO_LOCAL_FOLDER gs://PATH_IN_GOOGLE_BUCKET

The steps to create a Google Cloud Storage bucket are shown in the project’s GitHub repository: https://github.com/tinnlab/NOSI-Google-Cloud-Training.git. For example, if we want to copy the file ./data/GSE48350.rds to a Google bucket named cpa-output, we can use the following command.

Tip: Saving data to the Google Cloud Bucket

gsutil cp ./data/GSE48350.rds gs://cpa-output
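To copy the entire data folder, including the .csv exports, in one step, the recursive flag of gsutil cp can be used (the bucket name cpa-output is the example from above):

gsutil cp -r ./data gs://cpa-output/data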

Gene ID mapping#

Learning Objectives:#

  1. Understand different probe set ID.

  2. Map probe IDs into gene identifiers and symbols.

Understanding different probe set ID#

Gene set or pathway analysis requires that the gene sets and the expression data use the same type of gene ID (Entrez Gene ID, gene symbol, probe set ID, etc.). However, this is frequently not the case for the data we have. For example, our gene sets mostly use Entrez Gene IDs, but microarray datasets are frequently labelled with probe set IDs (or RefSeq transcript IDs, etc.). Therefore, we need to map the probe set IDs to Entrez gene IDs. Here, we will use the GSE48350 dataset from the previous section to demonstrate gene ID mapping. To find out what kind of probe set ID the dataset uses, we navigate to the GEO record page of GSE48350; the probe ID information can be found under the Platform section, as shown in the figure below.
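If the ExpressionSet object data created in the previous section (data <- gse[[idx]]) is still in memory, the platform can also be checked programmatically; a minimal sketch:

# Print the platform (GPL accession) attached to the ExpressionSet
# Assumes the object `data` from the previous section is still loaded
suppressMessages(library(Biobase))
annotation(data)   # expected to return the GPL accession, e.g. "GPL570"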

From the record page, we know that the dataset was generated on a single platform, the Affymetrix Human Genome U133 Plus 2.0 Array. To convert or map the probe set IDs to gene IDs and symbols, we need the corresponding annotation package from Bioconductor. For this array, we use the hgu133plus2.db annotation package together with the AnnotationDbi package. We can install them using the following R command:

suppressMessages({if (!require("BiocManager", quietly = TRUE))
    suppressWarnings(install.packages("BiocManager"))
    suppressWarnings(BiocManager::install("hgu133plus2.db", update = F))
    suppressWarnings(BiocManager::install("AnnotationDbi", update = F))
})

In the data processing section, we downloaded the dataset GSE48350 and saved it to the data sub-directory. Now we can load the dataset and check the probe set IDs using the following commands.

data = readRDS("./data/GSE48350.rds")
expression_data = data$expression_data
probeIDs = rownames(expression_data)
head(probeIDs)
[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"

Mapping probe IDs to gene identifiers and symbols#

For pathway analysis, gene sets from databases like Gene Ontology and KEGG are commonly annotated with gene symbols. Therefore, probe set IDs should be mapped to gene symbols before they can be used in the later submodules. We can use the pre-built annotation databases available from Bioconductor to map the probe IDs to gene symbols. AnnotationDbi is an R package that provides an interface for connecting to and querying various annotation databases backed by SQLite storage, while hgu133plus2.db is the annotation database that maps probe IDs of the Affymetrix Human Genome U133 Plus 2.0 Array (human) to gene symbols and other identifiers. We can use the following command to map the probe set IDs:

suppressMessages({
  library(hgu133plus2.db)
  library(AnnotationDbi)
})
suppressMessages({
annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID','GENENAME','SYMBOL'))
})
head(annotLookup)
A data.frame: 6 × 3
  PROBEID    GENENAME                                      SYMBOL
1 1007_s_at  discoidin domain receptor tyrosine kinase 1   DDR1
2 1053_at    replication factor C subunit 2                RFC2
3 117_at     heat shock protein family A (Hsp70) member 6  HSPA6
4 121_at     paired box 8                                   PAX8
5 1255_g_at  guanylate cyclase activator 1A                GUCA1A
6 1294_at    ubiquitin like modifier activating enzyme 7   UBA7

We can check how many probe IDs were mapped and how many unique gene symbols they map to using the following command:

# Check number of the probe IDs
print(paste0("There are ",length(unique(annotLookup$PROBEID))," probe IDs that mapped to ",length(unique(annotLookup$SYMBOL))," gene symbols"))
[1] "There are 54675 probe IDs that mapped to 22189 gene symbols"

From the lookup table, we can see that a single gene symbol can map to multiple probe IDs. In the next submodule, we will discuss how these mapped gene symbols can be used to identify which genes are differentially expressed.
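Because several probes can map to the same gene, downstream analyses often need a matrix with one row per gene. A minimal sketch of one common strategy, keeping, for each symbol, the probe with the highest average expression (later submodules may use a different approach), is shown below using the expression_data and annotLookup objects created above:

# Collapse probes to gene symbols by keeping, for each symbol,
# the probe with the highest average expression (one common heuristic)
probe_means <- rowMeans(expression_data, na.rm = TRUE)
annot <- annotLookup[!is.na(annotLookup$SYMBOL), ]
annot$mean_expr <- probe_means[annot$PROBEID]
annot <- annot[order(annot$SYMBOL, -annot$mean_expr), ]
best_probes <- annot[!duplicated(annot$SYMBOL), ]
gene_expression <- expression_data[best_probes$PROBEID, ]
rownames(gene_expression) <- best_probes$SYMBOL
dim(gene_expression)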