Enrichment Analysis#

Enrichment analysis (EA) is a technique used to derive biological insight from lists of significantly altered genes. The list of genes can be obtained from Differential Expression (DE) analysis or users’ interest. The EA methods rely on the knowledge databases (e.g. KEGG and GO) to identify biological pathways or terms that are enriched in a gene list more than would be expected by chance. The outcome of the EA would be the in-depth and contextualized findings to help understand the mechanisms of disease, genes and proteins associated with the etiology of a specific disease or drug target.

Over more than a decade, there are over 50 methods have been developed for EA. In this module, we will focus on pathway analysis using three popular methods including Over Representation Analysis (ORA), Fast Gene Set Enrichment Analysis (FGSEA), and Gene Set Analysis (GSA).

Learning Objectives:#

  1. Data preparation

  2. Perform enrichment analysis using ORA, FGSEA and GSA

  3. Visualize and interpret the outputs

Gene mapping#

In this module, we will use the DE genes with statistic generated in the submodule 02 using limma R package. We can use the following command to load DE analysis result.

DE.df <- readRDS("./data/DE_genes.rds")
rownames(DE.df) <- DE.df$PROBEID
head(DE.df)
A data.frame: 6 × 7
PROBEIDlogFCAveExprtP.Valueadj.P.ValB
<chr><dbl><dbl><dbl><dbl><dbl><dbl>
222178_s_at222178_s_at-0.44737800.3653633-73.011064.019112e-552.197450e-5081.05538
224687_at224687_at -4.14316312.2572347-48.920365.489424e-461.500671e-4174.18354
207488_at207488_at -0.40175300.6243144-39.299334.735684e-418.630785e-3768.83493
239226_at239226_at 0.44833021.3245501 28.364997.626558e-341.042455e-2958.84908
234109_x_at234109_x_at-0.22897260.7508256-27.670102.635337e-332.881741e-2958.00438
212833_at212833_at -2.59765921.7308944-24.011722.914912e-302.656213e-2652.99567

We can see that the genes are saved with probe IDs, we need to convert them into gene symbols so that they can be analyzed using enrichment analysis method later in this module. We will use the same approach presented in the submodule 01 with hgu133plus2.db and AnnotationDbi databases.

# Load the databases
suppressMessages({
  library(hgu133plus2.db)
  library(AnnotationDbi)
})

Then, we can retrieve vector of probe IDs to perform gene symbol mapping using the following command:

probeIDs = DE.df$PROBEID
suppressMessages({
annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID','GENENAME','SYMBOL'))
})
# View the first few genes in the mapping table
head(annotLookup)
A data.frame: 6 × 3
PROBEIDGENENAMESYMBOL
<chr><chr><chr>
1222178_s_atNA NA
2224687_at ankyrin repeat and IBR domain containing 1ANKIB1
3207488_at NA NA
4239226_at NA NA
5234109_x_atone cut homeobox 3 ONECUT3
6212833_at solute carrier family 25 member 46 SLC25A46

Now, we can merge DE data frame with annotation table by the PROBEID and remove NA gene symbols using the following command:

DE.df = merge(annotLookup, DE.df, by="PROBEID")
# Remove NA gene symbol
DE.df = DE.df[!is.na(DE.df$SYMBOL),]
# Remove duplicated genes
DE.df = DE.df[!duplicated(DE.df$SYMBOL,fromLast=FALSE),]
head(DE.df)
A data.frame: 6 × 9
PROBEIDGENENAMESYMBOLlogFCAveExprtP.Valueadj.P.ValB
<chr><chr><chr><dbl><dbl><dbl><dbl><dbl><dbl>
11007_s_atdiscoidin domain receptor tyrosine kinase 1 DDR1 -0.238584651.0967772-2.082575850.042102110.2531265-4.284228
21053_at replication factor C subunit 2 RFC2 0.058303670.9657758 1.423113560.160527930.4674726-5.359907
3117_at heat shock protein family A (Hsp70) member 6HSPA6 -0.011384941.1501582-0.081008850.935738210.9861536-6.337785
4121_at paired box 8 PAX8 0.023010841.0581847 0.871666630.387298160.7236528-5.968512
51255_g_atguanylate cyclase activator 1A GUCA1A 0.369999501.4080351 1.786150220.079761230.3351480-4.812134
61294_at ubiquitin like modifier activating enzyme 7 UBA7 -0.112214341.0859983-1.615936700.112009950.3925252-5.082923

At the result, we obtain the new DE table with two more columns containing gene names and gene symbols. The use of this DE table can be varied based on the selected enrichment analysis tools.

Enrichment analysis using Over-representation Analysis#

Over-representation analysis (ORA) is a statistical method that determines whether genes from pre-defined gene set of a specific GO term or KEGG pathway are presented more than would be expected (over-represented) in a subset of your data. In our learning module, this subset refers to the list of DE gene generated from limma method. For each gene set, an enrichment p-value is calculated using the Binomial distribution, Hypergeometric distribution, the Fisher exact test, or the Chi-square test. Hypergeometric distribution is a popular approach used to calculate enrichment p-value. The formula can be presented as follows:

\[ P(X\geq x) = 1 - P(X \leq x-1) = 1 - \sum\limits_{i=0}^{x-1}\frac{\hphantom{}{M \choose i }{N - M \choose n-i}}{N \choose n} \]

where N is the number of background genes (all genes presented in the expression matrix), n is the number of “interesting” genes (DE genes), M is the number of genes that are annotated to a particular gene set S (list of genes in a specific KEGG pathway or GO term), and x is the number of “interesting” genes that are annotated to S (genes presented in DE genes list and a specific KEGG pathway or GO term).

For example, suppose we have an expression matrix with 20000 genes, of which 500 are differently expressed. Also suppose that 100 of the 20000 genes are annotated to a particular gene set S. Of these 100 genes, 20 are members of DE genes list. The probability that 20 or more (up to 100) genes annotated to S are in DE genes list by chance is given by

\[ P(X\geq 20) = 1 - P(X \leq 19) = 1-\sum \limits_{i=0}^{19}\frac{\hphantom{}{100 \choose i}{20000 - 100 \choose 500-i}}{20000 \choose 500} = 5.26 \times 10^{-13} \]

The p-value indicates it is very rare to observe 20 of the 100 genes from this set are in the DE genes list by chance.

Data preparation#

To conduct enrichment analysis using ORA, there are several input data that we need to prepare. First, we need to select a set of genes that are significantly altered (p-value < 0.05) in the DE genes generated from limma method.

# Selecting a list of significant DE genes
DEGenes <- DE.df[DE.df$adj.P.Val <= 0.05,]
# Select genes with symbol
DEGenes <- DEGenes$SYMBOL

Next, we need to define a list of background genes. In this analysis, they are all the genes generated from the DE analysis.

#Defining background genes
backgroundSet <- DE.df$SYMBOL

Then, we need to obtain a list of geneset from knowledge databases such as GO and KEGG. In this learning module, the geneset will be retrieved from the .gmt files that were processed from the submodule 03. To load the geneset, we will use gmt2geneset below:

# install tidyverse
suppressWarnings(if (!require("tidyverse")) install.packages("tidyverse"))
Loading required package: tidyverse
Installing package into ‘/home/runner/.rpackages’
(as ‘lib’ is unspecified)
also installing the dependencies ‘sass’, ‘rappdirs’, ‘rematch’, ‘highr’, ‘yaml’, ‘xfun’, ‘bslib’, ‘fontawesome’, ‘jquerylib’, ‘tinytex’, ‘backports’, ‘gargle’, ‘cellranger’, ‘ids’, ‘timechange’, ‘systemfonts’, ‘textshaping’, ‘knitr’, ‘rmarkdown’, ‘selectr’, ‘broom’, ‘conflicted’, ‘dbplyr’, ‘dtplyr’, ‘forcats’, ‘googledrive’, ‘googlesheets4’, ‘haven’, ‘jsonlite’, ‘lubridate’, ‘modelr’, ‘ragg’, ‘readxl’, ‘reprex’, ‘rstudioapi’, ‘rvest’
# Loading tidyverse library that provides a function to read the .gmt file
suppressMessages({
  library(tidyverse)
})
Error in library(tidyverse): there is no package called ‘tidyverse’
Traceback:

1. suppressMessages({
 .     library(tidyverse)
 . })
2. withCallingHandlers(expr, message = function(c) if (inherits(c, 
 .     classes)) tryInvokeRestart("muffleMessage"))
3. library(tidyverse)   # at line 3 of file <text>

Here, we also write a function to perform over representation analysis based on the hyper-geometric testing formula presented above. The ORA method will perform hyper-geometric testing for each geneset obtained from GO or KEGG using the function phyper available for stats R base package. The output of the ORA function is the table that contains a column of terms or pathway names and a column of p-value.

# A customized function to prase a gmt file to a list of genesets
gmt2geneset <- function(path){
  genesets <- read_tsv(path, col_names = F) %>% apply(MARGIN = 1, function(r){
    genes = unique(r[-(1:2)])
    list(
      id = r[1],
      description = r[2],
      genes = genes[!is.na(genes)]
    )
  })

  gs <- lapply(genesets, function(g) g$genes %>% as.character())
  names(gs) <- lapply(genesets, function(g) g$description)
  gs
}
ORA <- function(geneset,DEGenes,backgroundSet,DE.df){
    res <- sapply(geneset, function(gs){
    wBallDraw <- intersect(gs, DEGenes) %>%  length() - 1
    if (wBallDraw < 0) return(1)
    wBall <- length(DEGenes)
    bBall <- nrow(DE.df) - length(DEGenes)
    ballDraw <- length(intersect(gs, backgroundSet))
    1 - phyper(wBallDraw, wBall, bBall, ballDraw)
  })
  res
}

Enrichment analysis using ORA and GO terms#

In this section, we will perform ORA using genesets obtained from the GO database and the function ORA defined above. The detailed code is presented below:

# Loading geneset from GO database
suppressWarnings({suppressMessages({geneset <- gmt2geneset("./data/GO_terms.gmt")})})
# Perform ORA
res.ORA <- ORA(geneset,DEGenes,backgroundSet,DE.df)
# Save the result to a table where the first column is GO term name and the second column is the p-value
res.df <- data.frame(
  GOterms = names(res.ORA),
  pvalue = res.ORA
)
# Order the table based on the p-value
res.df <- res.df[order(res.df$pvalue),]
# View the most significant GO terms
head(res.df)
A data.frame: 6 × 2
GOtermspvalue
<chr><dbl>
positive regulation of B cell activationpositive regulation of B cell activation 0.0005237933
inner dynein arm assemblyinner dynein arm assembly 0.0010578236
cell activation involved in immune responsecell activation involved in immune response 0.0062357297
leukocyte activation involved in immune responseleukocyte activation involved in immune response 0.0062357297
immunoglobulin production involved in immunoglobulin-mediated immune responseimmunoglobulin production involved in immunoglobulin-mediated immune response0.0062357297
positive regulation of leukocyte activationpositive regulation of leukocyte activation 0.0062357297

Fromt the result table we might conclude that the cause of the disease may be related to positive regulation of B cell activation.

Enrichment analysis using ORA and KEGG pathways#

# Loading the KEGG pathways
suppressWarnings({suppressMessages({geneset <- gmt2geneset("./data/KEGG_pathways.gmt")})})
# Perform Hyper-geometric testing
res.ORA <- ORA(geneset,DEGenes,backgroundSet,DE.df)
# Save the results to a data frame
res.df <- data.frame(
  pathways = names(res.ORA ),
  pvalue = res.ORA
)

res.df <- res.df[order(res.df$pvalue),]
head(res.df)
A data.frame: 6 × 2
pathwayspvalue
<chr><dbl>
Purine metabolism - Homo sapiens (human)Purine metabolism - Homo sapiens (human) 0.03829845
AMPK signaling pathway - Homo sapiens (human)AMPK signaling pathway - Homo sapiens (human) 0.03979454
Complement and coagulation cascades - Homo sapiens (human)Complement and coagulation cascades - Homo sapiens (human)0.03979454
Primary immunodeficiency - Homo sapiens (human)Primary immunodeficiency - Homo sapiens (human) 0.06663166
Proteasome - Homo sapiens (human)Proteasome - Homo sapiens (human) 0.09342446
Virion - Herpesvirus - Homo sapiens (human)Virion - Herpesvirus - Homo sapiens (human) 0.09822352

Enrichment Analysis using FGSEA#

This submodule describes FGSEA, one of the methods for evaluating pathway enrichment in transcriptional data and it stands for Fast preranked Gene Set Enrichment Analysis (GSEA). FGSEA quickly calculates arbitrarily low GSEA P-values for a collection of gene sets and is based on an algorithm that allows it to make more permutations and get accurate p-values. It extends the GSEA algorithm which calculates the enrichment score (the degree to which a set S is over-represented at the top or bottom of the ranked list L), estimates the p-value of the Enrichment Score using a permutation test and adjusts the estimated significance level to account for multiple hypotheses in addition to calculating the q-values for FDR control. Typically, GSEA requires inputs of a list of gene sets (GO term or pathway with a set of genes), a vector DE genes with statistic. In addition, user can limit the number of terms or pathways of interest by adjusting minSize and maxSize parameters. The code to install and load FGSEA R package is shown below:

# To perform enrichment analysis using FGSEA, we first install the fgsea package
suppressMessages({if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
  suppressWarnings(BiocManager::install("fgsea", update = F))
})

# Loading the package
suppressPackageStartupMessages({
  library("fgsea")
})
Warning message:
"package 'BiocManager' was built under R version 4.2.2"

Next, we need to prepare a vector of DE genes with its statistic.

# Get the gene list and their statistic from DE results
stats = DE.df$adj.P.Val
names(stats) = DE.df$SYMBOL
head(stats)
DDR1
0.25312654919025
RFC2
0.467472580609517
HSPA6
0.986153555292704
PAX8
0.723652752171903
GUCA1A
0.335147977832318
UBA7
0.392525231374342

Enrichment analysis using FGSEA and GO terms#

The package fgsea has its own function named gmtPathways to load the genesets from the gmt file. It is recommended to use this function for compatibility. We can load GO terms geneset and view the first five GO terms with associated genes using the following command:

# Load the pathways into a named list
GO_term_hallmark <- gmtPathways("./data/GO_terms.gmt")
# Show the first few GO terms, and within those, show only the first few genes.
tmp = lapply(GO_term_hallmark,head)
tmp[1:5]
$`GO:0000002`
  1. 'AKT3'
  2. 'DNA2'
  3. 'DNAJA3'
  4. 'ENDOG'
  5. 'FLCN'
  6. 'LIG3'
$`GO:0000003`
  1. 'A1CF'
  2. 'A2M'
  3. 'AAAS'
  4. 'ABAT'
  5. 'ABCC8'
  6. 'ABHD2'
$`GO:0000012`
  1. 'APLF'
  2. 'APTX'
  3. 'ERCC6'
  4. 'ERCC8'
  5. 'LIG4'
  6. 'PARP1'
$`GO:0000018`
  1. 'ACTB'
  2. 'ACTL6A'
  3. 'ACTR2'
  4. 'ALYREF'
  5. 'ANKLE1'
  6. 'APLF'
$`GO:0000027`
  1. 'BOP1'
  2. 'BRIX1'
  3. 'DDX28'
  4. 'DHX30'
  5. 'FASTKD2'
  6. 'MDN1'

Running the FGSEA can be done by calling the fgsea with two required inputs.

# Set seed for reproducibility
set.seed(1)
# Running fgsea
suppressWarnings(fgseaRes <- fgsea(pathways = GO_term_hallmark,
                  stats    = stats
                  ))

The output of the FGSEA method is a table where row are GO terms IDs or pathway IDs. The columns are p-values, adjusted p-values, Enrichment Score, etc.

head(fgseaRes[order(pval), ][,-8])
A data.table: 6 × 7
pathwaypvalpadjlog2errESNESsize
<chr><dbl><dbl><dbl><dbl><dbl><int>
GO:00509071.160144e-050.079609110.59332550.37864061.577494132
GO:00509115.922388e-050.150945990.55733220.39260741.597791 94
GO:00095936.599213e-050.150945990.53843410.35267331.488034167
GO:00308501.133702e-040.194486540.53843410.45852051.746790 51
GO:00509061.643773e-040.215580620.51884810.32742741.399464202
GO:00069592.051876e-040.215580620.51884810.31600591.370283271

From the result table, we can select top five up regulated GO terms and top five down regulated GO terms. Then we can plot them using the built-in function plotGseaTable

topGOUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topGODown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
topGO <- c(topGOUp, rev(topGODown))
plotGseaTable(GO_term_hallmark[topGO], stats, fgseaRes,
              gseaParam=0.5)
_images/fd5a8ee8efc49c3617c8240b58d3963edc33b2cbb7ff17f041e161af8dc379ab.png

Enrichment analysis using FGSEA and KEGG pathways#

We can perform enrichment analysis using FGSEA with KEGG pathway using the same procedure mentioned above. The only thing we need to change is the list of geneset that available in KEGG database.

# Load the pathways into a named list
KEGG_hallmark <- gmtPathways("./data/KEGG_pathways.gmt")
# Show the first few GO terms, and within those, show only the first few genes.
tmp = lapply(KEGG_hallmark,head)
tmp[1:5]
$hsa00010
  1. 'HK3'
  2. 'HK1'
  3. 'HK2'
  4. 'HKDC1'
  5. 'GCK'
  6. 'GPI'
$hsa00020
  1. 'CS'
  2. 'ACLY'
  3. 'ACO2'
  4. 'ACO1'
  5. 'IDH1'
  6. 'IDH2'
$hsa00030
  1. 'GPI'
  2. 'G6PD'
  3. 'PGLS'
  4. 'H6PD'
  5. 'PGD'
  6. 'RPE'
$hsa00040
  1. 'GUSB'
  2. 'KL'
  3. 'UGT2A1'
  4. 'UGT2A3'
  5. 'UGT2B17'
  6. 'UGT2B11'
$hsa00051
  1. 'MPI'
  2. 'PMM2'
  3. 'PMM1'
  4. 'GMPPB'
  5. 'GMPPA'
  6. 'GMDS'
# Set seed for reproducibility
set.seed(1)
# Running fgsea
suppressWarnings(fgseaRes <- fgsea(pathways = KEGG_hallmark,
                  stats    = stats))
head(fgseaRes[order(pval), ][,-8])
A data.table: 6 × 7
pathwaypvalpadjlog2errESNESsize
<chr><dbl><dbl><dbl><dbl><dbl><int>
hsa040602.356019e-060.00079162240.62725670.33849511.455766291
hsa047402.349394e-030.32114438690.43170770.34058581.396453132
hsa040612.867361e-030.32114438690.43170770.35256431.408411 98
hsa005006.600792e-030.55446651870.40701790.41816091.499026 36
hsa040809.789528e-030.65785630200.38073040.27753661.206519358
hsa043501.607110e-020.89998175430.35248790.32903701.312952 97
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=5), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=5), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
#Viewing the 5 most significantly up-regulated and down-regulated pathways each with the FGSEA internal plot function
plotGseaTable(KEGG_hallmark[topPathways], stats, fgseaRes,
              gseaParam=0.5)
_images/63113f43d853dd23b29255b4501b0f6eeef20f2299bb5d8c7d2ca8967ab8b3e0.png

Gene Set Enrichment Analysis using GSA#

Gene Set Analysis (GSA), an Enrichment Analysis, is a method that is commonly used to summarize high-dimensional gene expression data sets into sets according to its biological relevance. GSA takes the ranked gene lists from the initial stage of a gene expression analysis and aggregates the genes into sets based on shared biological or functional properties as specified by a reference knowledge base. Such databases often contain phenotype associations, molecular interactions and regulation and are referenced in the analysis of the resultant gene sets to find the relevance of the gene properties to the phenotype of interest.

Quiz_Submodule4-4

Data preparation#

The GSA method is freely available as standalone package in CRAN repository. We can use the following code to install the package.

# Install GSA from CRAN
suppressMessages({if (!require("GSA"))
        suppressWarnings(install.packages("GSA"))
})
suppressMessages({
  library(GSA)
})

The GSA method require an expression matrix, a numeric vector containing class of each sample and a vector of genes the inputs. We can easily get those inputs by load the data that we processed in the Module 01 .

# Loading expression data with groups
data <- readRDS("./data/GSE48350.rds")
expression_data <- data$expression_data
norm_expression_data <- data$norm_expression_data
groups <- data$groups

We can also use the sample approach available in the Module 01 to map the probe IDs to gene symbols. The step-by-step coding instruction is shown below:

# Get the probe IDs
expression_data$PROBEID <- rownames(expression_data)
probeIDs <- rownames(expression_data)
# Perform gene mapping
suppressMessages({
  annotLookup <- AnnotationDbi::select(hgu133plus2.db, keys = probeIDs, columns = c('PROBEID', 'GENENAME', 'SYMBOL'))
})
# Merge DE result  data frame with annotation table
new_expression_data = merge(annotLookup, expression_data, by="PROBEID")
# Remove NA value
new_expression_data <- new_expression_data[!is.na(new_expression_data$SYMBOL),]
# Remove duplicated genes symbol
new_expression_data <-  new_expression_data[!duplicated(new_expression_data$SYMBOL,fromLast=FALSE),]
rownames(new_expression_data) <- new_expression_data$SYMBOL
# Drop PROBEID, GENENAME, and SYMBOL columns
new_expression_data <- new_expression_data[,-c(1:3)]
genenames= rownames(new_expression_data)

GSA Enrichment analysis using GO terms#

Using data obtaiend from the previous step, we can run the GSA method by callinf the function GSA. We can reuse GO_term_hallmark and KEGG_hallmark loaded in FGSEA to perform analysis. The code details are shown below:

# Getting
genesets = GO_term_hallmark
GSA.obj<-GSA(as.matrix(new_expression_data),as.numeric(groups$groups), genenames=genenames, genesets=genesets, resp.type="Two class unpaired",nperms=100,random.seed=1)
perm= 10 / 200 
perm= 20 / 200 
perm= 30 / 200 
perm= 40 / 200 
perm= 50 / 200 
perm= 60 / 200 
perm= 70 / 200 
perm= 80 / 200 
perm= 90 / 200 
perm= 100 / 200 
perm= 110 / 200 
perm= 120 / 200 
perm= 130 / 200 
perm= 140 / 200 
perm= 150 / 200 
perm= 160 / 200 
perm= 170 / 200 
perm= 180 / 200 
perm= 190 / 200 
perm= 200 / 200 
# List the results from a GSA analysis
res <- GSA.listsets(GSA.obj, geneset.names=names(genesets),FDRcut=.5)

A table of the negative gene sets. “Negative” means that lower expression of most genes in the gene set correlates with higher values of the phenotype y. Eg for two classes coded 1,2, lower expression correlates with class 2.

Please note that the output generated from GSA would be varied depending on R environment and software version. Especially, when the users run the scripts using User-managed Notebooks instances which have a preinstalled suite of packages.
neg.table <-res$negative
head(neg.table)
A matrix: 6 × 5 of type chr
Gene_setGene_set_nameScorep-valueFDR
9 GO:0000045-0.374500
55 GO:0000422-0.543600
620GO:0006091-0.474300
631GO:0006120-1.193600
729GO:0006457-0.454600
785GO:0006605-0.450400

A table of the positive gene sets. “Positive” means that higher expression of most genes in the gene set correlates with higher values of the phenotype y.

pos.table <-res$positive
head(pos.table)
A matrix: 6 × 5 of type chr
Gene_setGene_set_nameScorep-valueFDR
1115GO:00074220.361 00
1215GO:00083600.362800
1688GO:00140370.688700
1689GO:00140440.95 00
1885GO:00171450.595500
2100GO:00220111.010300
# Individual gene scores from a gene set analysis
# look at 10th gene set
GSA.genescores(10, genesets, GSA.obj, genenames)
A matrix: 12 × 2 of type chr
GeneScore
CEBPA 1.553
NAGS 0.686
OTC 0.396
ORC1 0.2
CPS1 -0.097
ARG1 -0.632
SLC25A2 -0.782
ORC2 -0.872
ASL -0.893
ASS1 -1.177
SLC25A15-1.925
ARG2 -2.268
# Plot the result, this function makes a plot of the significant gene sets, based on a call to the GSA (Gene set analysis) function.
suppressWarnings(GSA.plot(GSA.obj, fac=1, FDRcut = 0.5))
_images/50242688bb14fbc3715f7f775c508ae52b37fc4cb876d317d48b0699fc15cb2d.png

GSA Enrichment analysis using KEGG pathways#

We can use the same procedure to per enrichment analyis with KEGG pathway. All the codes are similar but genesets are assigned from KEGG_hallmark. The code is shown below.

genesets = KEGG_hallmark
GSA.obj<-GSA(as.matrix(new_expression_data),as.numeric(groups$groups), genenames=genenames, genesets=genesets, resp.type="Two class unpaired", nperms=100, random.seed=1)
perm= 10 / 100 
perm= 20 / 100 
perm= 30 / 100 
perm= 40 / 100 
perm= 50 / 100 
perm= 60 / 100 
perm= 70 / 100 
perm= 80 / 100 
perm= 90 / 100 
perm= 100 / 100 
# List the results from a GSA analysis
res <- GSA.listsets(GSA.obj, geneset.names=names(genesets),FDRcut=.5)

A table of the negative gene sets. “Negative” means that lower expression of most genes in the gene set correlates with higher values of the phenotype y. Eg for two classes coded 1,2, lower expression correlates with class 2.

neg.table <-res$negative
head(neg.table)
A matrix: 6 × 5 of type chr
Gene_setGene_set_nameScorep-valueFDR
218hsa04260-0.67380 0
328hsa05022-0.51870 0
256hsa04714-0.52350.010.4543
323hsa05012-0.738 0.010.4543
325hsa05016-0.71770.010.4543
327hsa05020-0.68950.010.4543

A table of the positive gene sets. “Positive” means that higher expression of most genes in the gene set correlates with higher values of the phenotype y. See “negative” above for more info.

pos.table <-res$positive
head(pos.table)
A matrix: 2 × 5 of type chr
Gene_setGene_set_nameScorep-valueFDR
169hsa045200.503200
173hsa048100.217300
# Individual gene scores from a gene set analysis
# look at 10th gene set
GSA.genescores(10, genesets, GSA.obj, genenames)
A matrix: 30 × 2 of type chr
GeneScore
OGDH 1.545
IDH2 1.427
SUCLG20.717
PCK1 -0.022
PCK2 -0.16
PC -0.213
DLST -0.235
PDHA2 -0.239
ACO1 -0.6
SDHC -0.676
IDH1 -0.851
IDH3B -1.241
PDHB -1.449
DLD -1.453
IDH3G -1.621
SDHB -1.751
SDHD -1.842
DLAT -1.904
MDH1 -2.045
SDHA -2.068
ACLY -2.094
SUCLG1-2.193
IDH3A -2.202
CS -2.314
OGDHL -2.378
FH -2.453
MDH2 -2.547
SUCLA2-2.584
PDHA1 -3.078
ACO2 -3.559
# Plot the result, this function makes a plot of the significant gene sets, based on a call to the GSA (Gene set analysis) function.
suppressWarnings(GSA.plot(GSA.obj, fac=1, FDRcut = 0.5))
_images/18efb78629812fcc13b83b9224d7972fd88bcb0ade8e385321e4d4d752c77513.png
Quiz_Submodule4