Enrichment Analysis

Enrichment Analysis#

Enrichment analysis (EA) is a technique used to derive biological insight from lists of significantly altered genes. The list of genes can be obtained from Differential Expression (DE) analysis or users’ interest. The EA methods rely on the knowledge databases (e.g. KEGG and GO) to identify biological pathways or terms that are enriched in a gene list more than would be expected by chance. The outcome of the EA would be the in-depth and contextualized findings to help understand the mechanisms of disease, genes and proteins associated with the etiology of a specific disease or drug target.

Over more than a decade, there are over 50 methods have been developed for EA. In this module, we will focus on pathway analysis using three popular methods including Over Representation Analysis (ORA), Fast Gene Set Enrichment Analysis (FGSEA), and Gene Set Analysis (GSA).

Learning Objectives:#

Data preparation
Perform enrichment analysis using ORA, FGSEA and GSA
Visualize and interpret the outputs

Gene mapping#

In this module, we will use the DE genes with statistic generated in the submodule 02 using limma R package. We can use the following command to load DE analysis result.

DE.df <- readRDS("./data/DE_genes.rds")
rownames(DE.df) <- DE.df$PROBEID
head(DE.df)

A data.frame: 6 × 7
	PROBEID	logFC	AveExpr	t	P.Value	adj.P.Val	B
	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
222178_s_at	222178_s_at	-0.4473780	0.3653633	-73.01106	4.019112e-55	2.197450e-50	81.05538
224687_at	224687_at	-4.1431631	2.2572347	-48.92036	5.489424e-46	1.500671e-41	74.18354
207488_at	207488_at	-0.4017530	0.6243144	-39.29933	4.735684e-41	8.630785e-37	68.83493
239226_at	239226_at	0.4483302	1.3245501	28.36499	7.626558e-34	1.042455e-29	58.84908
234109_x_at	234109_x_at	-0.2289726	0.7508256	-27.67010	2.635337e-33	2.881741e-29	58.00438
212833_at	212833_at	-2.5976592	1.7308944	-24.01172	2.914912e-30	2.656213e-26	52.99567

We can see that the genes are saved with probe IDs, we need to convert them into gene symbols so that they can be analyzed using enrichment analysis method later in this module. We will use the same approach presented in the submodule 01 with hgu133plus2.db and AnnotationDbi databases.

# Load the databases
suppressMessages({
  library(hgu133plus2.db)
  library(AnnotationDbi)
})

Then, we can retrieve vector of probe IDs to perform gene symbol mapping using the following command:

probeIDs = DE.df$PROBEID
suppressMessages({
annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID','GENENAME','SYMBOL'))
})
# View the first few genes in the mapping table
head(annotLookup)

A data.frame: 6 × 3
	PROBEID	GENENAME	SYMBOL
	<chr>	<chr>	<chr>
1	222178_s_at	NA	NA
2	224687_at	ankyrin repeat and IBR domain containing 1	ANKIB1
3	207488_at	NA	NA
4	239226_at	NA	NA
5	234109_x_at	one cut homeobox 3	ONECUT3
6	212833_at	solute carrier family 25 member 46	SLC25A46

Now, we can merge DE data frame with annotation table by the PROBEID and remove NA gene symbols using the following command:

DE.df = merge(annotLookup, DE.df, by="PROBEID")

# Remove NA gene symbol
DE.df = DE.df[!is.na(DE.df$SYMBOL),]
# Remove duplicated genes
DE.df = DE.df[!duplicated(DE.df$SYMBOL,fromLast=FALSE),]
head(DE.df)

A data.frame: 6 × 9
	PROBEID	GENENAME	SYMBOL	logFC	AveExpr	t	P.Value	adj.P.Val	B
	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1007_s_at	discoidin domain receptor tyrosine kinase 1	DDR1	-0.23858465	1.0967772	-2.08257585	0.04210211	0.2531265	-4.284228
2	1053_at	replication factor C subunit 2	RFC2	0.05830367	0.9657758	1.42311356	0.16052793	0.4674726	-5.359907
3	117_at	heat shock protein family A (Hsp70) member 6	HSPA6	-0.01138494	1.1501582	-0.08100885	0.93573821	0.9861536	-6.337785
4	121_at	paired box 8	PAX8	0.02301084	1.0581847	0.87166663	0.38729816	0.7236528	-5.968512
5	1255_g_at	guanylate cyclase activator 1A	GUCA1A	0.36999950	1.4080351	1.78615022	0.07976123	0.3351480	-4.812134
6	1294_at	ubiquitin like modifier activating enzyme 7	UBA7	-0.11221434	1.0859983	-1.61593670	0.11200995	0.3925252	-5.082923

At the result, we obtain the new DE table with two more columns containing gene names and gene symbols. The use of this DE table can be varied based on the selected enrichment analysis tools.

Enrichment analysis using Over-representation Analysis#

Over-representation analysis (ORA) is a statistical method that determines whether genes from pre-defined gene set of a specific GO term or KEGG pathway are presented more than would be expected (over-represented) in a subset of your data. In our learning module, this subset refers to the list of DE gene generated from limma method. For each gene set, an enrichment p-value is calculated using the Binomial distribution, Hypergeometric distribution, the Fisher exact test, or the Chi-square test. Hypergeometric distribution is a popular approach used to calculate enrichment p-value. The formula can be presented as follows:

\[ P(X\geq x) = 1 - P(X \leq x-1) = 1 - \sum\limits_{i=0}^{x-1}\frac{\hphantom{}{M \choose i }{N - M \choose n-i}}{N \choose n} \]

where N is the number of background genes (all genes presented in the expression matrix), n is the number of “interesting” genes (DE genes), M is the number of genes that are annotated to a particular gene set S (list of genes in a specific KEGG pathway or GO term), and x is the number of “interesting” genes that are annotated to S (genes presented in DE genes list and a specific KEGG pathway or GO term).

For example, suppose we have an expression matrix with 20000 genes, of which 500 are differently expressed. Also suppose that 100 of the 20000 genes are annotated to a particular gene set S. Of these 100 genes, 20 are members of DE genes list. The probability that 20 or more (up to 100) genes annotated to S are in DE genes list by chance is given by

\[ P(X\geq 20) = 1 - P(X \leq 19) = 1-\sum \limits_{i=0}^{19}\frac{\hphantom{}{100 \choose i}{20000 - 100 \choose 500-i}}{20000 \choose 500} = 5.26 \times 10^{-13} \]

The p-value indicates it is very rare to observe 20 of the 100 genes from this set are in the DE genes list by chance.

Data preparation#

To conduct enrichment analysis using ORA, there are several input data that we need to prepare. First, we need to select a set of genes that are significantly altered (p-value < 0.05) in the DE genes generated from limma method.

# Selecting a list of significant DE genes
DEGenes <- DE.df[DE.df$adj.P.Val <= 0.05,]
# Select genes with symbol
DEGenes <- DEGenes$SYMBOL

Next, we need to define a list of background genes. In this analysis, they are all the genes generated from the DE analysis.

#Defining background genes
backgroundSet <- DE.df$SYMBOL

Then, we need to obtain a list of geneset from knowledge databases such as GO and KEGG. In this learning module, the geneset will be retrieved from the .gmt files that were processed from the submodule 03. To load the geneset, we will use gmt2geneset below:

# install tidyverse
suppressWarnings(if (!require("tidyverse")) install.packages("tidyverse"))

Loading required package: tidyverse

Installing package into ‘/home/runner/.rpackages’
(as ‘lib’ is unspecified)

also installing the dependencies ‘sass’, ‘rappdirs’, ‘rematch’, ‘highr’, ‘yaml’, ‘xfun’, ‘bslib’, ‘fontawesome’, ‘jquerylib’, ‘tinytex’, ‘backports’, ‘gargle’, ‘cellranger’, ‘ids’, ‘timechange’, ‘systemfonts’, ‘textshaping’, ‘knitr’, ‘rmarkdown’, ‘selectr’, ‘broom’, ‘conflicted’, ‘dbplyr’, ‘dtplyr’, ‘forcats’, ‘googledrive’, ‘googlesheets4’, ‘haven’, ‘jsonlite’, ‘lubridate’, ‘modelr’, ‘ragg’, ‘readxl’, ‘reprex’, ‘rstudioapi’, ‘rvest’

# Loading tidyverse library that provides a function to read the .gmt file
suppressMessages({
  library(tidyverse)
})

Error in library(tidyverse): there is no package called ‘tidyverse’
Traceback:

1. suppressMessages({
 .     library(tidyverse)
 . })
2. withCallingHandlers(expr, message = function(c) if (inherits(c, 
 .     classes)) tryInvokeRestart("muffleMessage"))
3. library(tidyverse)   # at line 3 of file <text>

Here, we also write a function to perform over representation analysis based on the hyper-geometric testing formula presented above. The ORA method will perform hyper-geometric testing for each geneset obtained from GO or KEGG using the function phyper available for stats R base package. The output of the ORA function is the table that contains a column of terms or pathway names and a column of p-value.

# A customized function to prase a gmt file to a list of genesets
gmt2geneset <- function(path){
  genesets <- read_tsv(path, col_names = F) %>% apply(MARGIN = 1, function(r){
    genes = unique(r[-(1:2)])
    list(
      id = r[1],
      description = r[2],
      genes = genes[!is.na(genes)]
    )
  })

  gs <- lapply(genesets, function(g) g$genes %>% as.character())
  names(gs) <- lapply(genesets, function(g) g$description)
  gs
}

ORA <- function(geneset,DEGenes,backgroundSet,DE.df){
    res <- sapply(geneset, function(gs){
    wBallDraw <- intersect(gs, DEGenes) %>%  length() - 1
    if (wBallDraw < 0) return(1)
    wBall <- length(DEGenes)
    bBall <- nrow(DE.df) - length(DEGenes)
    ballDraw <- length(intersect(gs, backgroundSet))
    1 - phyper(wBallDraw, wBall, bBall, ballDraw)
  })
  res
}

Enrichment analysis using ORA and GO terms#

In this section, we will perform ORA using genesets obtained from the GO database and the function ORA defined above. The detailed code is presented below:

# Loading geneset from GO database
suppressWarnings({suppressMessages({geneset <- gmt2geneset("./data/GO_terms.gmt")})})
# Perform ORA
res.ORA <- ORA(geneset,DEGenes,backgroundSet,DE.df)
# Save the result to a table where the first column is GO term name and the second column is the p-value
res.df <- data.frame(
  GOterms = names(res.ORA),
  pvalue = res.ORA
)
# Order the table based on the p-value
res.df <- res.df[order(res.df$pvalue),]
# View the most significant GO terms
head(res.df)

A data.frame: 6 × 2
	GOterms	pvalue
	<chr>	<dbl>
positive regulation of B cell activation	positive regulation of B cell activation	0.0005237933
inner dynein arm assembly	inner dynein arm assembly	0.0010578236
cell activation involved in immune response	cell activation involved in immune response	0.0062357297
leukocyte activation involved in immune response	leukocyte activation involved in immune response	0.0062357297
immunoglobulin production involved in immunoglobulin-mediated immune response	immunoglobulin production involved in immunoglobulin-mediated immune response	0.0062357297
positive regulation of leukocyte activation	positive regulation of leukocyte activation	0.0062357297

Fromt the result table we might conclude that the cause of the disease may be related to positive regulation of B cell activation.

Enrichment analysis using ORA and KEGG pathways#

# Loading the KEGG pathways
suppressWarnings({suppressMessages({geneset <- gmt2geneset("./data/KEGG_pathways.gmt")})})

# Perform Hyper-geometric testing
res.ORA <- ORA(geneset,DEGenes,backgroundSet,DE.df)
# Save the results to a data frame
res.df <- data.frame(
  pathways = names(res.ORA ),
  pvalue = res.ORA
)

res.df <- res.df[order(res.df$pvalue),]
head(res.df)

A data.frame: 6 × 2
	pathways	pvalue
	<chr>	<dbl>
Purine metabolism - Homo sapiens (human)	Purine metabolism - Homo sapiens (human)	0.03829845
AMPK signaling pathway - Homo sapiens (human)	AMPK signaling pathway - Homo sapiens (human)	0.03979454
Complement and coagulation cascades - Homo sapiens (human)	Complement and coagulation cascades - Homo sapiens (human)	0.03979454
Primary immunodeficiency - Homo sapiens (human)	Primary immunodeficiency - Homo sapiens (human)	0.06663166
Proteasome - Homo sapiens (human)	Proteasome - Homo sapiens (human)	0.09342446
Virion - Herpesvirus - Homo sapiens (human)	Virion - Herpesvirus - Homo sapiens (human)	0.09822352

Enrichment Analysis using FGSEA#

This submodule describes FGSEA, one of the methods for evaluating pathway enrichment in transcriptional data and it stands for Fast preranked Gene Set Enrichment Analysis (GSEA). FGSEA quickly calculates arbitrarily low GSEA P-values for a collection of gene sets and is based on an algorithm that allows it to make more permutations and get accurate p-values. It extends the GSEA algorithm which calculates the enrichment score (the degree to which a set S is over-represented at the top or bottom of the ranked list L), estimates the p-value of the Enrichment Score using a permutation test and adjusts the estimated significance level to account for multiple hypotheses in addition to calculating the q-values for FDR control. Typically, GSEA requires inputs of a list of gene sets (GO term or pathway with a set of genes), a vector DE genes with statistic. In addition, user can limit the number of terms or pathways of interest by adjusting minSize and maxSize parameters. The code to install and load FGSEA R package is shown below:

# To perform enrichment analysis using FGSEA, we first install the fgsea package
suppressMessages({if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
  suppressWarnings(BiocManager::install("fgsea", update = F))
})

# Loading the package
suppressPackageStartupMessages({
  library("fgsea")
})

Warning message:
"package 'BiocManager' was built under R version 4.2.2"

Next, we need to prepare a vector of DE genes with its statistic.

# Get the gene list and their statistic from DE results
stats = DE.df$adj.P.Val
names(stats) = DE.df$SYMBOL
head(stats)

DDR1: 0.25312654919025
RFC2: 0.467472580609517
HSPA6: 0.986153555292704
PAX8: 0.723652752171903
GUCA1A: 0.335147977832318
UBA7: 0.392525231374342

Enrichment analysis using FGSEA and GO terms#

The package fgsea has its own function named gmtPathways to load the genesets from the gmt file. It is recommended to use this function for compatibility. We can load GO terms geneset and view the first five GO terms with associated genes using the following command:

# Load the pathways into a named list
GO_term_hallmark <- gmtPathways("./data/GO_terms.gmt")
# Show the first few GO terms, and within those, show only the first few genes.
tmp = lapply(GO_term_hallmark,head)
tmp[1:5]

$`GO:0000002`

'AKT3'
'DNA2'
'DNAJA3'
'ENDOG'
'FLCN'
'LIG3'

$`GO:0000003`

'A1CF'
'A2M'
'AAAS'
'ABAT'
'ABCC8'
'ABHD2'

$`GO:0000012`

'APLF'
'APTX'
'ERCC6'
'ERCC8'
'LIG4'
'PARP1'

$`GO:0000018`

'ACTB'
'ACTL6A'
'ACTR2'
'ALYREF'
'ANKLE1'
'APLF'

$`GO:0000027`

'BOP1'
'BRIX1'
'DDX28'
'DHX30'
'FASTKD2'
'MDN1'

Running the FGSEA can be done by calling the fgsea with two required inputs.

# Set seed for reproducibility
set.seed(1)
# Running fgsea
suppressWarnings(fgseaRes <- fgsea(pathways = GO_term_hallmark,
                  stats    = stats
                  ))

The output of the FGSEA method is a table where row are GO terms IDs or pathway IDs. The columns are p-values, adjusted p-values, Enrichment Score, etc.

head(fgseaRes[order(pval), ][,-8])

A data.table: 6 × 7
pathway	pval	padj	log2err	ES	NES	size
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
GO:0050907	1.160144e-05	0.07960911	0.5933255	0.3786406	1.577494	132
GO:0050911	5.922388e-05	0.15094599	0.5573322	0.3926074	1.597791	94
GO:0009593	6.599213e-05	0.15094599	0.5384341	0.3526733	1.488034	167
GO:0030850	1.133702e-04	0.19448654	0.5384341	0.4585205	1.746790	51
GO:0050906	1.643773e-04	0.21558062	0.5188481	0.3274274	1.399464	202
GO:0006959	2.051876e-04	0.21558062	0.5188481	0.3160059	1.370283	271

From the result table, we can select top five up regulated GO terms and top five down regulated GO terms. Then we can plot them using the built-in function plotGseaTable

topGOUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topGODown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
topGO <- c(topGOUp, rev(topGODown))
plotGseaTable(GO_term_hallmark[topGO], stats, fgseaRes,
              gseaParam=0.5)

_images/fd5a8ee8efc49c3617c8240b58d3963edc33b2cbb7ff17f041e161af8dc379ab.png

Enrichment analysis using FGSEA and KEGG pathways#

We can perform enrichment analysis using FGSEA with KEGG pathway using the same procedure mentioned above. The only thing we need to change is the list of geneset that available in KEGG database.

# Load the pathways into a named list
KEGG_hallmark <- gmtPathways("./data/KEGG_pathways.gmt")
# Show the first few GO terms, and within those, show only the first few genes.
tmp = lapply(KEGG_hallmark,head)
tmp[1:5]

$hsa00010

'HK3'
'HK1'
'HK2'
'HKDC1'
'GCK'
'GPI'

$hsa00020

'CS'
'ACLY'
'ACO2'
'ACO1'
'IDH1'
'IDH2'

$hsa00030

'GPI'
'G6PD'
'PGLS'
'H6PD'
'PGD'
'RPE'

$hsa00040

'GUSB'
'KL'
'UGT2A1'
'UGT2A3'
'UGT2B17'
'UGT2B11'

$hsa00051

'MPI'
'PMM2'
'PMM1'
'GMPPB'
'GMPPA'
'GMDS'

# Set seed for reproducibility
set.seed(1)
# Running fgsea
suppressWarnings(fgseaRes <- fgsea(pathways = KEGG_hallmark,
                  stats    = stats))

head(fgseaRes[order(pval), ][,-8])

A data.table: 6 × 7
pathway	pval	padj	log2err	ES	NES	size
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<int>
hsa04060	2.356019e-06	0.0007916224	0.6272567	0.3384951	1.455766	291
hsa04740	2.349394e-03	0.3211443869	0.4317077	0.3405858	1.396453	132
hsa04061	2.867361e-03	0.3211443869	0.4317077	0.3525643	1.408411	98
hsa00500	6.600792e-03	0.5544665187	0.4070179	0.4181609	1.499026	36
hsa04080	9.789528e-03	0.6578563020	0.3807304	0.2775366	1.206519	358
hsa04350	1.607110e-02	0.8999817543	0.3524879	0.3290370	1.312952	97

topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
#Viewing the 5 most significantly up-regulated and down-regulated pathways each with the FGSEA internal plot function
plotGseaTable(KEGG_hallmark[topPathways], stats, fgseaRes,
              gseaParam=0.5)

_images/63113f43d853dd23b29255b4501b0f6eeef20f2299bb5d8c7d2ca8967ab8b3e0.png

Gene Set Enrichment Analysis using GSA#

Gene Set Analysis (GSA), an Enrichment Analysis, is a method that is commonly used to summarize high-dimensional gene expression data sets into sets according to its biological relevance. GSA takes the ranked gene lists from the initial stage of a gene expression analysis and aggregates the genes into sets based on shared biological or functional properties as specified by a reference knowledge base. Such databases often contain phenotype associations, molecular interactions and regulation and are referenced in the analysis of the resultant gene sets to find the relevance of the gene properties to the phenotype of interest.

Quiz_Submodule4-4

Data preparation#

The GSA method is freely available as standalone package in CRAN repository. We can use the following code to install the package.

# Install GSA from CRAN
suppressMessages({if (!require("GSA"))
        suppressWarnings(install.packages("GSA"))
})
suppressMessages({
  library(GSA)
})

The GSA method require an expression matrix, a numeric vector containing class of each sample and a vector of genes the inputs. We can easily get those inputs by load the data that we processed in the Module 01 .

# Loading expression data with groups
data <- readRDS("./data/GSE48350.rds")
expression_data <- data$expression_data
norm_expression_data <- data$norm_expression_data
groups <- data$groups

We can also use the sample approach available in the Module 01 to map the probe IDs to gene symbols. The step-by-step coding instruction is shown below:

# Get the probe IDs
expression_data$PROBEID <- rownames(expression_data)
probeIDs <- rownames(expression_data)
# Perform gene mapping
suppressMessages({
  annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID', 'GENENAME', 'SYMBOL'))
})
# Merge DE result  data frame with annotation table
new_expression_data = merge(annotLookup, expression_data, by="PROBEID")
# Remove NA value
new_expression_data <- new_expression_data[!is.na(new_expression_data$SYMBOL),]
# Remove duplicated genes symbol
new_expression_data <-  new_expression_data[!duplicated(new_expression_data$SYMBOL,fromLast=FALSE),]
rownames(new_expression_data) <- new_expression_data$SYMBOL
# Drop PROBEID, GENENAME, and SYMBOL columns
new_expression_data <- new_expression_data[,-c(1:3)]
genenames= rownames(new_expression_data)

GSA Enrichment analysis using GO terms#

Using data obtaiend from the previous step, we can run the GSA method by callinf the function GSA. We can reuse GO_term_hallmark and KEGG_hallmark loaded in FGSEA to perform analysis. The code details are shown below:

# Getting
genesets = GO_term_hallmark
GSA.obj<-GSA(as.matrix(new_expression_data),as.numeric(groups$groups), genenames=genenames, genesets=genesets, resp.type="Two class unpaired",nperms=100,random.seed=1)

perm= 10 / 200 
perm= 20 / 200 
perm= 30 / 200 
perm= 40 / 200 
perm= 50 / 200 
perm= 60 / 200 
perm= 70 / 200 
perm= 80 / 200 
perm= 90 / 200 
perm= 100 / 200 
perm= 110 / 200 
perm= 120 / 200 
perm= 130 / 200 
perm= 140 / 200 
perm= 150 / 200 
perm= 160 / 200 
perm= 170 / 200 
perm= 180 / 200 
perm= 190 / 200 
perm= 200 / 200 

# List the results from a GSA analysis
res <- GSA.listsets(GSA.obj, geneset.names=names(genesets),FDRcut=.5)

A table of the negative gene sets. “Negative” means that lower expression of most genes in the gene set correlates with higher values of the phenotype y. Eg for two classes coded 1,2, lower expression correlates with class 2.

Please note that the output generated from GSA would be varied depending on R environment and software version. Especially, when the users run the scripts using User-managed Notebooks instances which have a preinstalled suite of packages.

neg.table <-res$negative
head(neg.table)

A matrix: 6 × 5 of type chr
Gene_set	Gene_set_name	Score
9	GO:0000045	-0.3745
55	GO:0000422	-0.5436
620	GO:0006091	-0.4743
631	GO:0006120	-1.1936
729	GO:0006457	-0.4546
785	GO:0006605	-0.4504

A table of the positive gene sets. “Positive” means that higher expression of most genes in the gene set correlates with higher values of the phenotype y.

pos.table <-res$positive
head(pos.table)

A matrix: 6 × 5 of type chr
Gene_set	Gene_set_name	Score
1115	GO:0007422	0.361
1215	GO:0008360	0.3628
1688	GO:0014037	0.6887
1689	GO:0014044	0.95
1885	GO:0017145	0.5955
2100	GO:0022011	1.0103

# Individual gene scores from a gene set analysis
# look at 10th gene set
GSA.genescores(10, genesets, GSA.obj, genenames)

A matrix: 12 × 2 of type chr
Gene	Score
CEBPA	1.553
NAGS	0.686
OTC	0.396
ORC1	0.2
CPS1	-0.097
ARG1	-0.632
SLC25A2	-0.782
ORC2	-0.872
ASL	-0.893
ASS1	-1.177
SLC25A15	-1.925
ARG2	-2.268

# Plot the result, this function makes a plot of the significant gene sets, based on a call to the GSA (Gene set analysis) function.
suppressWarnings(GSA.plot(GSA.obj, fac=1, FDRcut = 0.5))

_images/50242688bb14fbc3715f7f775c508ae52b37fc4cb876d317d48b0699fc15cb2d.png

GSA Enrichment analysis using KEGG pathways#

We can use the same procedure to per enrichment analyis with KEGG pathway. All the codes are similar but genesets are assigned from KEGG_hallmark. The code is shown below.

genesets = KEGG_hallmark
GSA.obj<-GSA(as.matrix(new_expression_data),as.numeric(groups$groups), genenames=genenames, genesets=genesets, resp.type="Two class unpaired", nperms=100, random.seed=1)

perm= 10 / 100 
perm= 20 / 100 
perm= 30 / 100 
perm= 40 / 100 
perm= 50 / 100 
perm= 60 / 100 
perm= 70 / 100 
perm= 80 / 100 
perm= 90 / 100 
perm= 100 / 100 

# List the results from a GSA analysis
res <- GSA.listsets(GSA.obj, geneset.names=names(genesets),FDRcut=.5)

A table of the negative gene sets. “Negative” means that lower expression of most genes in the gene set correlates with higher values of the phenotype y. Eg for two classes coded 1,2, lower expression correlates with class 2.

neg.table <-res$negative
head(neg.table)

A matrix: 6 × 5 of type chr
Gene_set	Gene_set_name	Score	p-value	FDR
218	hsa04260	-0.6738	0	0
328	hsa05022	-0.5187	0	0
256	hsa04714	-0.5235	0.01	0.4543
323	hsa05012	-0.738	0.01	0.4543
325	hsa05016	-0.7177	0.01	0.4543
327	hsa05020	-0.6895	0.01	0.4543

A table of the positive gene sets. “Positive” means that higher expression of most genes in the gene set correlates with higher values of the phenotype y. See “negative” above for more info.

pos.table <-res$positive
head(pos.table)

A matrix: 2 × 5 of type chr
Gene_set	Gene_set_name	Score	p-value	FDR
169	hsa04520	0.5032	0	0
173	hsa04810	0.2173	0	0

# Individual gene scores from a gene set analysis
# look at 10th gene set
GSA.genescores(10, genesets, GSA.obj, genenames)

A matrix: 30 × 2 of type chr
Gene	Score
OGDH	1.545
IDH2	1.427
SUCLG2	0.717
PCK1	-0.022
PCK2	-0.16
PC	-0.213
DLST	-0.235
PDHA2	-0.239
ACO1	-0.6
SDHC	-0.676
IDH1	-0.851
IDH3B	-1.241
PDHB	-1.449
DLD	-1.453
IDH3G	-1.621
SDHB	-1.751
SDHD	-1.842
DLAT	-1.904
MDH1	-2.045
SDHA	-2.068
ACLY	-2.094
SUCLG1	-2.193
IDH3A	-2.202
CS	-2.314
OGDHL	-2.378
FH	-2.453
MDH2	-2.547
SUCLA2	-2.584
PDHA1	-3.078
ACO2	-3.559

# Plot the result, this function makes a plot of the significant gene sets, based on a call to the GSA (Gene set analysis) function.
suppressWarnings(GSA.plot(GSA.obj, fac=1, FDRcut = 0.5))

_images/18efb78629812fcc13b83b9224d7972fd88bcb0ade8e385321e4d4d752c77513.png

Quiz_Submodule4

Enrichment Analysis

Contents

Enrichment Analysis#

Learning Objectives:#

Gene mapping#

Enrichment analysis using Over-representation Analysis#

Data preparation#

Enrichment analysis using ORA and GO terms#

Enrichment analysis using ORA and KEGG pathways#

Enrichment Analysis using FGSEA#

Enrichment analysis using FGSEA and GO terms#

Enrichment analysis using FGSEA and KEGG pathways#

Gene Set Enrichment Analysis using GSA#

Data preparation#

GSA Enrichment analysis using GO terms#

GSA Enrichment analysis using KEGG pathways#