Differential gene expression analysis#

The next step in the pathway analysis workflow is differential expression (DE) analysis. The goal of DE testing is to determine which genes are expressed at different levels between two or more biological conditions. These genes can offer biological insight into the processes affected by the condition(s) of interest. DE analysis means taking the normalized read count data and performing statistical analysis to discover quantitative changes in expression levels between experimental groups. For example, we use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant: that is, whether it is greater than what would be expected just due to natural random variation. This learning submodule demonstrates a computational workflow for the detection of DE genes from RNA-Seq data.

Learning Objectives:#

  1. Assign samples into groups and set up design matrix.

  2. Perform differential expression (DE) analysis using limma, t-test, edgeR and DEseq R packages.

  3. Filter and export the results table.

  4. Further visualization

Loading the dataset.#

In this section we will perform differential expression (DE) analysis to analyze the GSE48350 dataset that we have used in the previous submodule. Recall from the GEO Data processing section, GSE48350 contains sequenced data of Human Alzheimer’s Disease using microarray sequencing technology collected from 4 brain regions. We have already created two groups: disease and control from the Entorhinal Cortex region. First, we need to load the data that has been saved from submodule 1.

# Loading data from the rds file
data <- readRDS("./data/GSE48350.rds")
expression_data <- data$expression_data
norm_expression_data <- data$norm_expression_data
groups <- data$groups

Now we have all the data needed to perform DE analysis. We will be using 4 different methods to test differential expression: limma, t-tests, edgeR, and DESeq2.

DE analysis and visualization using limma.#

By far, the most-popular package for performing differential expression is limma. The package provides a flexible framework for the analysis of gene expression data, with a focus on detecting differentially expressed genes between two or more conditions.

The package offers a variety of statistical methods and models to handle different experimental designs, including single and multiple groups, paired samples, and time-series data. It also includes various normalization and quality control procedures to preprocess raw data and improve the accuracy and reproducibility of downstream analysis.

The empirical Bayesian (EB) statistical methods in the limma R package refer to a set of techniques used to improve the estimation of gene-specific variances and fold changes in the analysis of gene expression data.

The traditional approach to the analysis of gene expression data is to treat the variances of each gene as independent and estimate them based on the limited number of samples available. However, this can lead to inaccurate estimates due to the small sample size and high variability of gene expression data.

The EB methods in limma package instead borrow information from other genes in the same dataset to provide a more accurate estimate of the gene-specific variances. This is accomplished by first estimating the overall distribution of variances across all genes and then using this distribution to estimate the variances of individual genes.

One of the key advantages of the EB methods in limma is that they can improve the detection of differentially expressed genes, particularly for genes with low expression levels or small sample sizes. The EB methods also provide more stable estimates of variances across multiple experiments, making it easier to compare results across different datasets.

suppressMessages({if (!require("BiocManager", quietly = TRUE))
    suppressWarnings(install.packages("BiocManager"))
  suppressWarnings(BiocManager::install("limma", update = T))
})
suppressPackageStartupMessages({
  library("limma")
})

The first step of DE analysis using limma is to separate the samples in our dataset to the sample groups of interest. A useful function is model.matrix, which will create a design matrix of 0 and 1s; one row for each sample and one column for each sample group. A 1 in a particular row and column indicates that a given sample (the row) belongs to a given group (column).

groups <- factor(groups$groups)
# Create design matrix with no intercept
design <- model.matrix(~0 + groups)
colnames(design) <- levels(groups)
head(design)
A matrix: 6 × 2 of type dbl
cd
110
210
310
410
510
610

The lmFit function is used to fit the model to the data. The result of which is to estimate the expression level of each gene in each of the groups that we specified.

fit <- lmFit(norm_expression_data, design)  # fit linear model

In order to perform the differential analysis, we have to define the contrast that we are interested in. In our case we only have two groups and one contrast of interest. Multiple contrasts can be defined in the makeContrasts function.

# set up contrasts of interest and recalculate model coefficients
cts <- paste("c", "d", sep="-")
cont.matrix <- makeContrasts(contrasts=cts, levels=design)
cont.matrix
A matrix: 2 × 1 of type dbl
c-d
c 1
d-1

Fit the contrast matrix:

fit2 <- contrasts.fit(fit, cont.matrix)

Finally, apply the empirical Bayes’ step to get our differential expression statistics and p-values. This step uses Bayesian statistics to infer the probability distribution using the observed expression levels. It can then calculate the log-fold changes in expression between experimental groups and the p-value:

# compute statistics and table of top significant genes
fit2 <- eBayes(fit2, 0.01)

To see the number of significantly up- and down-regulated genes, we can use the code below to generate the summary table.

dt <- decideTests(fit2,p.value=0.05)
summary(dt)
         c-d
Down     277
NotSig 54015
Up       383

Here, significance is defined using an adjusted p-value cutoff that is set at 5% by default. For the comparison between expression levels in “disease” (d) and “control” (c), 293 genes are found to be down-regulated and 489 genes are up-regulated. We also can extract a table of the top-ranked genes from a linear model fit using topTable function. By default, topTable arranges genes from smallest to largest adjusted p-value with associated gene information, log-FC, average log-CPM, moderated t-statistic, raw and adjusted p-value for each gene. The number of top genes displayed can be specified, where number=Inf includes all genes.

top_genes <- topTable(fit2, adjust="fdr", sort.by="B", number=Inf)
head(top_genes)
A data.frame: 6 × 6
logFCAveExprtP.Valueadj.P.ValB
<dbl><dbl><dbl><dbl><dbl><dbl>
222178_s_at-0.44737800.3653633-73.011064.019112e-552.197450e-5081.05538
224687_at-4.14316312.2572347-48.920365.489424e-461.500671e-4174.18354
207488_at-0.40175300.6243144-39.299334.735684e-418.630785e-3768.83493
239226_at 0.44833021.3245501 28.364997.626558e-341.042455e-2958.84908
234109_x_at-0.22897260.7508256-27.670102.635337e-332.881741e-2958.00438
212833_at-2.59765921.7308944-24.011722.914912e-302.656213e-2652.99567

In order to perform pathway and enrichment analyses, we only focus on the genes that are statistically significant and the probe IDs that have gene symbols. We can select those genes using the following command:

DE_Gene <- top_genes[which(top_genes$adj.P.Val<0.05),]
head(DE_Gene)
A data.frame: 6 × 6
logFCAveExprtP.Valueadj.P.ValB
<dbl><dbl><dbl><dbl><dbl><dbl>
222178_s_at-0.44737800.3653633-73.011064.019112e-552.197450e-5081.05538
224687_at-4.14316312.2572347-48.920365.489424e-461.500671e-4174.18354
207488_at-0.40175300.6243144-39.299334.735684e-418.630785e-3768.83493
239226_at 0.44833021.3245501 28.364997.626558e-341.042455e-2958.84908
234109_x_at-0.22897260.7508256-27.670102.635337e-332.881741e-2958.00438
212833_at-2.59765921.7308944-24.011722.914912e-302.656213e-2652.99567

Now, we can use the following script to save the DE analysis result to use in the later submodules.

# Add one column to store the probe IDs
top_genes <- tibble::rownames_to_column(top_genes,"PROBEID")
saveRDS(top_genes, file="./data/DE_genes.rds")
gsutil cp ./data/DE_genes.rds gs://cpa-output

DE analysis using t-test#

A t-test is a common statistical method used to identify differentially expressed (DE) genes in gene expression data. In the context of gene expression analysis, a t-test compares the mean expression level of a gene between two groups, typically a treatment group and a control group.

The basic idea behind a t-test is to compare the difference in means between the two groups to the variability within each group. A larger difference in means relative to the variability within each group indicates a higher likelihood that the difference is statistically significant.

In the context of gene expression data, a t-test is typically applied to each gene separately. For each gene, the mean expression level in the treatment group is compared to the mean expression level in the control group. The t-test produces a p-value that indicates the probability of observing the observed difference in means by chance.

In order to identify DE genes, a significance threshold (or “alpha value”) is chosen. Any gene with a p-value below this threshold is considered to be differentially expressed. The threshold is typically set at 0.05, meaning that there is a 5% chance of falsely identifying a gene as DE. We can use row_t_equalvar available in the matrixTests package to perform t-test between the two groups for each gene. Below is the code to install the necessary packages:

suppressMessages({
  suppressWarnings(install.packages("matrixTests",quiet= T))
})
# Load matrixTests package
suppressPackageStartupMessages({library("matrixTests")})

First we need to create two different expression matrices belong to two groups.

# Assign expression matrix to the count
count <- expression_data
# Divide the count matrix into two separate matrices X and Y. X is control group and Y is the disease group
X <- count[,groups=="c"]
Y <- count[,groups=="d"]
# Perform t-test
res <- row_t_equalvar(X,Y,alternative = "two.sided", mu = 0, conf.level = 0.95)

The output is a table where rows are genes and columns are statistical information of the test. In DE analysis, we only care for the genes that are significantly different. Therefore, we will rank the genes based on p-value and remove insignificant genes.

# Order the results based on the p-value
res <- res[order(res$pvalue),]
# Remove insignificant genes
res <- res[res$pvalue<0.05,]
# Show the result of top genes
head(res)
A data.frame: 6 × 18
obs.xobs.yobs.totmean.xmean.ymean.diffvar.xvar.yvar.pooledstderrdfstatisticpvalueconf.lowconf.highalternativemean.nullconf.level
<int><int><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><dbl><dbl>
222178_s_at3915540.18192680.6117714-0.42984469.737616e-050.00068207120.00025479410.00484968652-88.633502.040469e-58-0.4395762-0.4201130two.sided00.95
207488_at3915540.42702980.8854517-0.45842199.008167e-040.00238306950.00129988480.01095396652-41.849851.012965e-41-0.4804026-0.4364411two.sided00.95
234109_x_at3915540.61038780.8874354-0.27704766.857523e-040.00121716350.00082882450.00874682152-31.674091.151538e-35-0.2945994-0.2594958two.sided00.95
239226_at3915541.73249481.0013784 0.73111641.243088e-020.00141877320.00946608000.02955997352 24.733322.009417e-30 0.6718000 0.7904328two.sided00.95
216530_at3915540.83614700.5620650 0.27408211.299243e-030.00211993360.00152019780.01184592652 23.137244.856399e-29 0.2503115 0.2978526two.sided00.95
216490_x_at3915541.02438120.4847202 0.53966108.683469e-030.00257261760.00703824000.02548889452 21.172403.173331e-27 0.4885138 0.5908082two.sided00.95

To see how many genes left, we can use the following command

dim(res)
  1. 9286
  2. 18
# Saving the result to local folder
write.csv(res, file="./data/t-Test_Results.csv")
saveRDS(res, file="./data/t-Test_Results.rds")

Tip: Saving data to the Google Cloud Bucket

gsutil cp ./data/t-Test_Results.csv gs://cpa-output

gsutil cp ./data/t-Test_Results.rds gs://cpa-output

DE analysis using edgeR#

edgeR is a popular R package for differential expression (DE) analysis of RNA sequencing (RNA-seq) data. The package uses empirical Bayesian methods to account for biological variability and gene-specific sequencing depth, making it a robust and powerful tool for identifying DE genes.

The basic workflow for DE analysis using edgeR involves several steps:

  1. Data pre-processing: Raw sequencing reads are first pre-processed to filter out low-quality reads and align them to a reference genome. This step typically includes quality control, read trimming, and alignment.

  2. Count matrix generation: A count matrix is generated by counting the number of reads that map to each gene in each sample. This matrix represents the raw gene expression data and is used as input for DE analysis.

  3. Normalization: edgeR applies a normalization method called trimmed mean of M-values (TMM) to adjust for differences in sequencing depth between samples. This method calculates scaling factors for each sample based on the mean and variance of the log-ratios of gene expression levels between pairs of samples.

  4. Dispersion estimation: edgeR uses a negative binomial model to account for biological variability in gene expression data. This model estimates the dispersion of counts within and between samples, which reflects the amount of biological variability in gene expression data.

  5. Differential expression analysis: edgeR uses a statistical framework called generalized linear models (GLMs) to test for differential expression between two or more groups. This method models the relationship between gene expression and experimental factors (e.g., treatment, time, condition) to identify genes that are significantly differentially expressed.

  6. Multiple testing correction: To control for the high false positive rate associated with testing many genes simultaneously, edgeR applies a multiple testing correction method called the false discovery rate (FDR). This method adjusts p-values to control the expected proportion of false positives among all significant tests.

Overall, edgeR is a powerful and widely used tool for DE analysis of RNA-seq data, offering a flexible framework for the detection of DE genes that accounts for biological variability and gene-specific sequencing depth. It also provides several visualization tools to help users explore and interpret the results.

# Install edgeR package from Bioconductor
suppressMessages({ if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
  suppressWarnings(BiocManager::install("edgeR", update = F))
})
# Load edgeR package
suppressPackageStartupMessages({library("edgeR")})
# Assigning expression matrix to count
count <- expression_data
# Perform DE analysis using DGEList function
dge <- DGEList(counts = count, group = factor(groups))
# Calculating the normalize factor
dge <- calcNormFactors(object = dge)
dge <- estimateDisp(y = dge)
# Perform Fisher-exact test
et <- exactTest(object = dge)
# Getting top DE genes
top_degs = topTags(object = et, n = "Inf")
Using classic mode.
head(top_degs$table)
A data.frame: 6 × 4
logFClogCPMPValueFDR
<dbl><dbl><dbl><dbl>
224687_at 4.9049768.0190884.463490e-1582.440413e-153
212833_at 3.3635796.860991 1.652705e-51 4.518082e-47
1558620_at 2.8413106.619830 2.269510e-37 4.136182e-33
211318_s_at-2.7881047.033454 6.724743e-22 9.191883e-18
206552_s_at 3.2363857.042412 1.493408e-15 1.633042e-11
1553191_at 1.1586006.798425 1.489697e-06 1.357487e-02
# Saving the result to local folder
write.csv(et$table, file="./data/edgeR_Results.csv")
saveRDS(et$table, file="./data/edgeR_Results.rds")

Terminal Command to save to Google Cloud Bucket

gsutil cp ./data/edgeR_Results.csv gs://cpa-output


gsutil cp ./data/edgeR_Results.rds gs://cpa-output

DE analysis using DESeq2#

DESeq2 is a popular R package for differential expression (DE) analysis of RNA sequencing (RNA-seq) data. The package uses a negative binomial distribution to model the gene expression counts and applies shrinkage estimation to improve the accuracy of differential expression analysis.

The basic workflow for DE analysis using DESeq2 involves several steps:

  1. Data pre-processing: Raw sequencing reads are first pre-processed to filter out low-quality reads and align them to a reference genome. This step typically includes quality control, read trimming, and alignment.

  2. Count matrix generation: A count matrix is generated by counting the number of reads that map to each gene in each sample. This matrix represents the raw gene expression data and is used as input for DE analysis.

  3. Normalization: DESeq2 applies a normalization method called size factors to adjust for differences in sequencing depth between samples. This method calculates scaling factors for each sample based on the total number of reads in each sample.

  4. Dispersion estimation: DESeq2 estimates the dispersion of counts within and between samples using a negative binomial distribution. This model accounts for biological variability in gene expression data and can handle low counts and overdispersion.

  5. Differential expression analysis: DESeq2 uses a statistical framework called the Wald test to test for differential expression between two or more groups. This method models the relationship between gene expression and experimental factors (e.g., treatment, time, condition) to identify genes that are significantly differentially expressed.

  6. Multiple testing correction: To control for the high false positive rate associated with testing many genes simultaneously, DESeq2 applies a multiple testing correction method called the Benjamini-Hochberg (BH) procedure. This method adjusts p-values to control the expected proportion of false positives among all significant tests.

DESeq2 also includes several visualization tools, such as principal component analysis (PCA) and heatmaps, to help users explore and interpret the results.

Overall, DESeq2 is a powerful and widely used tool for DE analysis of RNA-seq data, offering a flexible framework for the detection of DE genes that accounts for biological variability and gene-specific sequencing depth. It is particularly useful for analyzing low-count genes and handling complex experimental designs.

# Installing and loading the library
suppressMessages({ if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
  suppressWarnings(BiocManager::install("DESeq2", update = F))
})
suppressPackageStartupMessages({library("DESeq2")})
# Contructing a group table that has information of group members for all patients.
coldata <- data.frame(
  sample = colnames(expression_data),
  condition = as.factor(groups),
  row.names = "sample" )
# Checking the group information of few first patients
head(coldata)
A data.frame: 6 × 1
condition
<fct>
GSM300173c
GSM300177c
GSM300181c
GSM300186c
GSM300189c
GSM300192c
# Perform DE analysis using DESeqDataSetFromMatrix function
suppressMessages({
  dds <- DESeqDataSetFromMatrix(countData = round(count), colData = coldata,
                              design = ~ condition)
  dds <- dds[rowSums(counts(dds)) >= 10,]
  dds$condition <- relevel(dds$condition, ref = "c")
  dds <- DESeq(dds)
  resultsNames(dds)
  res <- results(dds)
  res <- as.data.frame(res[order(res$padj),])
})
head(res)
A data.frame: 6 × 6
baseMeanlog2FoldChangelfcSEstatpvaluepadj
<dbl><dbl><dbl><dbl><dbl><dbl>
224687_at11.590865 5.0892890.264716619.2254252.267736e-821.154096e-77
212833_at 4.138384 3.4300130.284203612.0688601.542556e-333.925188e-29
1558620_at 3.162742 2.9792290.289263710.2993527.093785e-251.203390e-20
206552_s_at 4.890227 3.3536130.4815026 6.9648913.286586e-124.181523e-08
211318_s_at 4.936976-2.6811790.3982190-6.7329271.662830e-111.692495e-07
1553191_at 3.876015 1.1765700.2596787 4.5308665.874237e-064.982528e-02
# Saving the result to local storage
write.csv(res, file="./data/DESeq2_Results.csv")
saveRDS(res, file="./data/DESeq2_Results.rds")

{admonition} Tip: Saving data to the Google Cloud Bucket

gsutil cp ./data/DESeq2_Results.csv gs://cpa-output

gsutil cp ./data/DESeq2_Results.rds gs://cpa-output

Visualization of of differential expression results.#

To visually summarize results for all genes, mean-difference plots, which display log-fold changes from the linear model fit against the average log-CPM values can be generated using the plotMD function in limma package, with the differentially expressed genes highlighted. The function plotMD uses fit2 object generated from DE analysis using limma in the above section.

plotMD(fit2, column=1, status=dt[,1], main=colnames(fit2)[1],
       xlim=c(-8,13),pch=20, cex=1)
abline(h=0)
_images/5b9717091480ac75b9a1f77459a8466ceefda0bd61de0f007b37b8de9e4ef400.png

Another way to visualize the number of regulated and unregulated genes is to use Venn diagram. The command to generate the diagram is presented below

# summarize test results as "up", "down" or "not expressed"
dT <- decideTests(fit2, adjust.method="fdr", p.value=0.05)
# Venn diagram of results
vennDiagram(dT, circle.col=palette())
_images/28f06ca2926f44f373840d8d5b7d10e8799f1f57004abc23a6ba5b59092bbaa0.png

We can visualize the adjusted p-value distribution for all gene using a histogram plot:

# Visualize and quality control test results.
# Build histogram of P-values for all genes. Normal test
# assumption is that most genes are not differentially expressed.
tT2 <- topTable(fit2, adjust="fdr", sort.by="B", number=Inf)
hist(tT2$adj.P.Val, col = "grey", border = "white", xlab = "P-adj",
     ylab = "Number of genes", main = "P-adj value distribution")
_images/d4775eba905c4dbd5e8cc64d7121e24a87de9547fbcd828b10b9c721bc139621.png

Q-Q plot allows researchers to visually assess whether the distribution of the moderated t-statistic follows the normal distribution. In a Q-Q plot for moderated t-statistic, the expected quantiles of a normal distribution are plotted against the observed quantiles of the moderated t-statistic. If the distribution of moderated t-statistic follows the normal distribution, the points on the Q-Q plot will fall along a straight line. Deviations from a straight line may indicate departures from normality.

# create Q-Q plot for t-statistic
t.good <- which(!is.na(fit2$F)) # filter out bad probes
qqt(fit2$t[t.good], fit2$df.total[t.good], main="Moderated t statistic")
_images/9df1e059aac21c6d5117ed96dd159764b31cefe508acd58187e0dc0c93120184.png

The Volcano Plot function is a common way of visualising the results of a DE analysis. The x axis shows the log-fold change and the y axis is some measure of statistical significance, which in this case is the log-odds, or “B” statistic. A characteristic “volcano” shape should be seen.

# volcano plot (log P-value vs log fold change)
colnames(fit2) # list contrast names
ct <- 1        # choose contrast of interest
volcanoplot(fit2, coef=ct, main=colnames(fit2)[ct], pch=20,
            highlight=length(which(dT[,ct]!=0)), names=rep('+', nrow(fit2)))
'c-d'
_images/a6c4a24b7def881be0041f112d4955555ba3040e298367de68a855bc55eab225.png

The next submodule will show how to identify the pathways that these genes are involved in.