The authors have declared that no competing interests exist.

Present knowledge indicates that a multilayered hierarchical gene regulatory network (ML-hGRN) often operates above a biological pathway. Although ML-hGRNs are very important for understanding how a pathway is regulated, there are almost no computational algorithms for directly constructing them.

A backward elimination random forest (BWERF) algorithm was developed for constructing the ML-hGRN operating above a biological pathway. For each pathway gene, BWERF used a random forest model to calculate the importance values of all transcription factors (TFs) to this pathway gene recursively, with a portion (e.g. 1/10) of the least important TFs being excluded in each round of modeling. In each round, the importance values of the remaining TFs to the pathway gene were updated and ranked, until only one TF remained in the list. After this procedure, the importance values of a TF to all pathway genes were aggregated and fitted to a Gaussian mixture model to determine which TFs were retained for the regulatory layer immediately above the pathway layer. The TFs acquired at this second layer were then set as the new bottom layer to infer the next upper layer, and this process was repeated until an ML-hGRN with the expected number of layers was obtained.

BWERF improved the accuracy of ML-hGRN construction because it used backward elimination to exclude noise genes and aggregated the individual importance values to determine TF retention. We validated BWERF by using it to construct ML-hGRNs operating above the mouse pluripotency maintenance pathway and

There are at least a few hundred metabolic pathways and a few thousand biological processes known to be present in plants and animals, but unfortunately, our knowledge on how these pathways or biological processes are regulated is very limited. For example,

Although ML-hGRNs are important, there is a lack of statistical or computational methods for directly constructing ML-hGRNs from high-throughput gene expression data [

GENIE3 is a gene network inference algorithm based on the random forest method, and it is the champion of DREAM4

To show the effect of backward elimination, we tested it with a simulated “toys data set”. This data set simulates the expression values of one pathway gene (Y) and 1006 TFs {X_{1}, X_{2},…,X_{1006}}. Among these 1006 TFs, the first 6 TFs {X_{1}, X_{2},…,X_{6}} are true TFs that regulate the pathway gene and the other 1000 TFs are noise. The setting is as follows:

X_{i}∼N(1, 1) for i = 1,2,3

X_{i}∼N(3, 1) for i = 4,5,6

X_{i}∼N(0, 1) for i = 7,…,1006

z∼N(0, 0.1)

Y = X_{1} + X_{2} + X_{3} + X_{4} + X_{5} + X_{6} + z

In this setting, the 6 true TFs can be divided into two groups, {X_{1},X_{2},X_{3}} and {X_{4},X_{5},X_{6}}. The TFs in the first group have a weaker signal than those in the second group and are more likely to be swamped by the noise TFs. The sample size is set to 100.
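The simulation setting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the pathway gene is the sum of the six true TFs plus the noise term z, and that the second parameter of each normal distribution is a variance (so z has standard deviation √0.1). The function name `simulate_toy_data` and the seed are hypothetical.

```python
import math
import random

def simulate_toy_data(n_samples=100, n_noise=1000, seed=0):
    """Simulate 6 true TFs, n_noise noise TFs and one pathway gene."""
    rng = random.Random(seed)
    # Means of the TF distributions: weak group, strong group, then noise TFs.
    means = [1.0] * 3 + [3.0] * 3 + [0.0] * n_noise
    noise_sd = math.sqrt(0.1)  # assumption: 0.1 in N(0, 0.1) is a variance
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.gauss(mu, 1.0) for mu in means]
        # Assumed model: pathway gene = sum of the six true TFs + z.
        y.append(sum(row[:6]) + rng.gauss(0.0, noise_sd))
        X.append(row)
    return X, y

X, y = simulate_toy_data()
print(len(X), len(X[0]), len(y))
```

Each row of `X` is one sample with 1006 TF expression values, and `y` holds the matching pathway gene values.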

As shown in the boxplots, X_{5} and X_{6} were correctly identified as the two most important TFs by both methods. However, for GENIE3, the noise TF X_{536} appeared in third place, with an importance value that surpassed those of the other 4 true TFs. In addition, the true TF X_{3} was outranked by three other noise TFs. When BWERF was applied to this toys data set, the strong regulatory TFs {X_{4},X_{5},X_{6}} and the weak regulatory TF X_{1} had larger importance values than any noise variable, and the true TF X_{3} was surpassed only by the noise TF X_{536}, which clearly demonstrated the role of backward elimination in elevating the positions of true positive TFs. We also noticed that the importance values and their range implicitly increased as backward elimination advanced, making the true regulatory variables more distinguishable from the noise variables.

Red: true regulators, blue: noise variables. The boxplots were based on 30 runs of GENIE3 and BWERF.

To evaluate the performance of BWERF in recognizing regulatory relationships, we downloaded mouse time-course gene expression data and ChIP-seq data from the Embryonic Stem Cells Atlas of Pluripotency Evidence (ESCAPE). The mouse pluripotency maintenance pathway was selected for demonstrating the effect of BWERF. The 24 genes involved in pluripotency maintenance renewal were chosen for this

The 24 stem cell pluripotency maintenance genes regulated by the 35 transcription factors (TFs) were obtained from mouse ChIP-seq data. A microarray data set of these pathway genes and TFs, generated from a time course in which pluripotent cells were subjected to undirected differentiation, was downloaded from the ESCAPE web portal (

| Metric | BWERF, Dataset 1 | BWERF, Dataset 2 | BWERF, Dataset 3 | GENIE3, Dataset 1 | GENIE3, Dataset 2 | GENIE3, Dataset 3 |
|---|---|---|---|---|---|---|
| PR | 0.2405 | 0.1457 | 0.1958 | 0.1312 | 0.0705 | 0.1958 |
| ROC | 0.7778 | 0.8150 | 0.8134 | 0.6499 | 0.6868 | 0.6820 |

We built a four-layered hGRN as shown in

A. The four-layered hGRN constructed with the BWERF algorithm. B. The four-layered hGRN built with GENIE3. The input files for BWERF and GENIE3 included the expression profiles of 1602 transcription factors and 22 lignocellulosic pathway genes (green nodes at the bottom layers). The nodes highlighted in red in both networks are TFs known from the existing knowledge base to regulate the lignocellulosic pathway. The data, edge list and gene IDs represented by each symbol can be found in

In contrast, GENIE3 identified 10 positive TFs that include SND1, 2, MYB58, NST2, MYB63, MYB69, SND3, MYB46 and MYB85 [

At present, little is known about the regulatory layers above the majority of the more than 500 metabolic and canonical pathways. With terabytes of gene expression data now available, it is imperative to develop novel methods to construct hGRNs via reverse-engineering approaches, because such methods provide the means to recognize novel TFs in layered hGRNs and can greatly facilitate our understanding of how biological pathways are regulated. In our earlier work, a bottom-up GGM algorithm was developed to accomplish this goal by applying a hypothesis test (a Wald test-based approach) to each gene triple, consisting of one TF from the TF pool and two genes from the current bottom layer, leading to the identification of significant triple regulatory blocks for building an ML-hGRN. The bottom-up GGM algorithm evaluates each triple by comparing the correlation coefficient of the two bottom-layer genes with their partial correlation coefficient given a TF at the immediately upper layer. When the difference is statistically significant, the TF is defined to be the regulator of the two bottom-layer genes [

BWERF is based on the random forest model and thus inherits its advantages in determining the important regulatory variables for each pathway gene. Firstly, the random forest model is applicable to data sets where the number of variables is much larger than the number of samples. It uses a randomly selected subset of variables at each splitting node, and thus important variables can be separated from a large number of unimportant regulatory variables. Secondly, the random forest model can detect non-linear regulatory relationships among variables. Thirdly, if only the parameter of

We used GENIE3 as a comparison method because it also uses the random forest model to construct gene regulatory networks. However, GENIE3 is not tailored for constructing ML-hGRNs. GENIE3 creates the random forest model for each pathway gene only once, whereas BWERF creates it multiple times because it conducts backward elimination recursively. BWERF also exploits the high correlation among a group of pathway genes by aggregating the importance values of TFs across pathway genes before ranking the TFs. Gene regulatory relationships are intricate, and the number of TFs is much larger than the number of samples. TFs that do not regulate the pathway of interest can severely distort the importance values of regulatory TFs if they are not appropriately modeled. The backward elimination step is valuable in two respects: first, it helps regulatory genes with medium-level regulatory strength to emerge from the noise variables, as shown with the “toys data”; second, it can enlarge the range of importance values of the variables, which makes true regulatory variables more distinguishable from noise variables. Although backward elimination takes more computational time, we found the running time acceptable with moderate computational power. For example, when we used our Linux server with 22 cores to construct an ML-hGRN from a compendium

Compared to GENIE3, the best performer in DREAM4

We certainly commend GENIE3 for its great value in constructing well-stretched and well-connected “single-layered” GRNs comprising a large number of genes. For constructing a multiple-layered hGRN operating over a given pathway, BWERF is tuned to achieve higher accuracy and reliability in gene selection at the lower layers. Small deviations in gene selection at the lower layers can lead to significant discrepancies at the higher layers during construction of an ML-hGRN. We anticipate experimental validation of our method in the next decade, though we have begun to receive some positive feedback, for example, as shown in our recent publication[

A random forest algorithm with backward elimination was developed for constructing a ML-hGRN that operates above a given metabolic or canonical pathway using microarray or RNA-seq data sets. The algorithm was evaluated with both a synthetic “toys data set” and two real gene expression data sets from

The

The mouse microarray data sets were downloaded from the Embryonic Stem Cells Atlas of Pluripotency Evidence (ESCAPE) website (

To construct an ML-hGRN that governs a biological pathway, the algorithm first placed the pathway genes at the bottom (first) layer, identified the most significant regulatory genes (TFs) associated with the pathway genes through causal relationships, and used them to build the second layer of the network. After that, the regulatory genes in the second layer were removed from the pool of input genes and used as the new bottom layer, and the remaining input genes were used as input to construct the third layer. This process was repeated until the designated number of layers was reached or no further layer could be built owing to a lack of TFs with causal relationships to the current bottom layer.
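The bottom-up layering described above can be sketched as a simple loop. This is an illustrative outline only: `build_layers`, `select_regulators`, `toy_select` and the `TOY_REGULATORS` table are hypothetical names, and the per-layer inference step (random forest with backward elimination followed by mixture-model selection) is replaced here by a fixed lookup table so that the control flow can be run on its own.

```python
def build_layers(pathway_genes, tf_pool, select_regulators, max_layers=4):
    """Stack regulatory layers bottom-up until max_layers or no TF is selected."""
    layers = [list(pathway_genes)]   # layer 1: the pathway genes
    pool = set(tf_pool)              # TFs not yet placed in any layer
    while len(layers) < max_layers and pool:
        # select_regulators stands in for the per-layer inference step.
        candidates = select_regulators(layers[-1], sorted(pool))
        chosen = [tf for tf in candidates if tf in pool]
        if not chosen:
            break                    # no significant TFs left: stop early
        layers.append(chosen)
        pool -= set(chosen)          # selected TFs leave the input pool
    return layers

# Toy stand-in for the inference step: regulators come from a fixed table.
TOY_REGULATORS = {"G1": ["TF1"], "G2": ["TF1", "TF2"], "TF1": ["TF3"]}

def toy_select(bottom_layer, pool):
    found = {tf for target in bottom_layer for tf in TOY_REGULATORS.get(target, [])}
    return sorted(found & set(pool))

layers = build_layers(["G1", "G2"], ["TF1", "TF2", "TF3", "TF4"], toy_select)
print(layers)
```

In this toy run the loop stops after the third layer because no TF in the remaining pool regulates the current bottom layer, which mirrors the early-termination condition in the text.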

Assume we have

A. Input for BWERF included a pathway gene expression matrix and a TF expression matrix. B. For each pathway gene, random forest models were constructed recursively with backward elimination. C. The importance values of a TF to all pathway genes were aggregated to produce a unified importance value of this TF to the pathway. D. The expectation-maximization (EM) algorithm was implemented to fit a Gaussian mixture model to the importance values. E. The most important TFs were identified and used as a layer. F. Using the new TF layer as the bottom layer, we repeated the above procedure to obtain the next layer until the designated number of layers was reached or the program terminated due to a lack of significant TFs as input for upper layers.
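Step B can be illustrated with a minimal backward elimination loop. The sketch below is not the BWERF implementation: to stay dependency-free it swaps the random forest for a simple stand-in importance (absolute Pearson correlation, normalized over the TFs still in the model), and it assumes that an eliminated TF keeps the importance value from the last round in which it was scored. All function names are hypothetical.

```python
import math

def abs_pearson(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy))

def toy_importance(tf_expr, gene_expr, names):
    """Stand-in for random forest importance: shares of |r| among remaining TFs."""
    raw = {t: abs_pearson(tf_expr[t], gene_expr) for t in names}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}

def backward_eliminate(tf_expr, gene_expr, importance_fn, drop_frac=0.1):
    """Re-score the remaining TFs each round and drop the least important ones."""
    remaining = list(tf_expr)
    latest = {}                          # last importance recorded for each TF
    while remaining:
        scores = importance_fn(tf_expr, gene_expr, remaining)
        latest.update(scores)
        if len(remaining) == 1:
            break                        # only one TF left: stop
        ranked = sorted(remaining, key=scores.get, reverse=True)
        n_drop = max(1, int(len(ranked) * drop_frac))
        remaining = ranked[:-n_drop]     # exclude the least important portion
    return latest

gene = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
tfs = {
    "TF_strong": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1],
    "TF_weak": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "TF_noise1": [5.0, 1.0, 4.0, 8.0, 2.0, 7.0, 3.0, 6.0],
    "TF_noise2": [3.0, 7.0, 1.0, 6.0, 8.0, 2.0, 5.0, 4.0],
}
final = backward_eliminate(tfs, gene, toy_importance)
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)
```

Because the stand-in importance is a share among the TFs still in the model, the values of the surviving TFs grow as noise TFs are removed, which loosely mimics the widening range of importance values reported for backward elimination.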

Random forest is a machine learning technique developed by Leo Breiman [

To rank TFs associated with the genes in a specific pathway, we need to learn the importance value of each TF to the pathway genes. Random forest returns the importance value of each independent variable in a natural manner. For a node of a decision tree, the importance of the splitting variable is defined as the decrease in variance, calculated using the formula nV_{p} − n_{1}V_{c1} − n_{2}V_{c2}, where V_{p} represents the variance of all n samples at the parent node, and V_{c1} and V_{c2} represent the variance of the n_{1} samples split to child node C1 and the variance of the n_{2} samples split to child node C2, respectively. The importance of the other variables is defined as zero for this node. For a decision tree, the importance of each independent variable is the sum of its importance over all nodes of the tree. For a random forest, the importance of each independent variable is the average of its importance over all trees.
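The variance-reduction importance can be worked through on a small example. The sketch below assumes the node importance is the sample-count-weighted variance decrease, n·Var(parent) − n1·Var(child 1) − n2·Var(child 2), with population variance; the function names and the representation of a tree as a list of `(variable, parent, left, right)` tuples are hypothetical conveniences, not part of any library.

```python
from statistics import pvariance

def node_importance(parent, left, right):
    """Weighted variance decrease produced by one split."""
    return (len(parent) * pvariance(parent)
            - len(left) * pvariance(left)
            - len(right) * pvariance(right))

def tree_importance(nodes):
    """Sum node importances per splitting variable within one tree."""
    totals = {}
    for var, parent, left, right in nodes:
        totals[var] = totals.get(var, 0.0) + node_importance(parent, left, right)
    return totals

def forest_importance(trees, variables):
    """Average each variable's tree-level importance over all trees."""
    per_tree = [tree_importance(nodes) for nodes in trees]
    return {v: sum(t.get(v, 0.0) for t in per_tree) / len(per_tree)
            for v in variables}

parent = [1.0, 1.0, 3.0, 3.0]
# A split on TF_a separates the samples perfectly; a split on TF_b does not help.
tree1 = [("TF_a", parent, [1.0, 1.0], [3.0, 3.0])]
tree2 = [("TF_b", parent, [1.0, 3.0], [1.0, 3.0])]
importance = forest_importance([tree1, tree2], ["TF_a", "TF_b"])
print(importance)
```

A variable that is never chosen for a split in a tree contributes zero for that tree, so here `TF_a` averages its single good split over two trees while `TF_b` stays at zero.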

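Suppose x_{1},x_{2},…,x_{N} are the importance values of TFs that were obtained from BWERF. We assumed they are independent and identically distributed (i.i.d.) samples from a finite mixture of K > 1 Gaussian distributions. The density of each x_{i} can be written as

p(x_{i} | Θ) = Σ_{k=1}^{K} α_{k} p_{k}(x_{i} | μ_{k}, σ_{k}^{2})

where α_{k} is the mixing weight of component k (with Σ_{k} α_{k} = 1) and p_{k} is the Gaussian density function with mean μ_{k} and variance σ_{k}^{2}. Let

w_{ik} = α_{k} p_{k}(x_{i} | μ_{k}, σ_{k}^{2}) / Σ_{m=1}^{K} α_{m} p_{m}(x_{i} | μ_{m}, σ_{m}^{2})

be the membership weight, that is, the probability that x_{i} belongs to component k. In the E-step, w_{ik} is computed for all data points x_{i}, 1 ≤ i ≤ N and all mixture components 1 ≤ k ≤ K. This yields an N × K matrix of membership weights, in which each row sums to 1.

In the M-step, let N_{k} = Σ_{i=1}^{N} w_{ik} be the effective number of data points assigned to the k^{th} component. Then,

α_{k} = N_{k} / N,  μ_{k} = (1 / N_{k}) Σ_{i=1}^{N} w_{ik} x_{i},  σ_{k}^{2} = (1 / N_{k}) Σ_{i=1}^{N} w_{ik} (x_{i} − μ_{k})^{2}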

Once the parameters were obtained, the TFs whose highest membership probability was for the component with the largest mean in the Gaussian mixture model were identified as the putative regulatory TFs.
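The selection rule can be sketched with a small one-dimensional EM fit. This is a minimal illustration under several assumptions that are not taken from the paper: the standard EM updates for a Gaussian mixture are used, component means are initialized evenly between the smallest and largest importance value, the number of iterations is fixed, and variances are floored at `min_var` to avoid collapse. The function names and the example importance values are hypothetical.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def membership(xs, alphas, mus, variances):
    """E-step: N x K matrix of membership weights, each row summing to 1."""
    k = len(alphas)
    rows = []
    for x in xs:
        dens = [alphas[j] * gaussian_pdf(x, mus[j], variances[j]) for j in range(k)]
        total = sum(dens)
        rows.append([d / total for d in dens] if total > 0 else [1.0 / k] * k)
    return rows

def fit_gmm_1d(xs, k=2, n_iter=200, min_var=1e-6):
    """Fit a k-component (k > 1) Gaussian mixture to 1-D data with EM."""
    n = len(xs)
    lo, hi = min(xs), max(xs)
    mus = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # assumed initialization
    mean = sum(xs) / n
    var0 = max(sum((x - mean) ** 2 for x in xs) / n, min_var)
    variances, alphas = [var0] * k, [1.0 / k] * k
    for _ in range(n_iter):
        w = membership(xs, alphas, mus, variances)
        for j in range(k):                                   # M-step
            nj = sum(row[j] for row in w)
            if nj < 1e-12:
                continue                                     # skip an empty component
            alphas[j] = nj / n
            mus[j] = sum(row[j] * x for row, x in zip(w, xs)) / nj
            variances[j] = max(
                sum(row[j] * (x - mus[j]) ** 2 for row, x in zip(w, xs)) / nj, min_var)
    return alphas, mus, variances

def select_top_component(names, xs, k=2):
    """Keep the TFs most likely to belong to the component with the largest mean."""
    alphas, mus, variances = fit_gmm_1d(xs, k)
    w = membership(xs, alphas, mus, variances)
    top = max(range(k), key=lambda j: mus[j])
    return [name for name, row in zip(names, w)
            if max(range(k), key=lambda j: row[j]) == top]

names = ["TF%d" % i for i in range(1, 12)]
values = [0.01, 0.02, 0.015, 0.03, 0.02, 0.025, 0.012, 0.018, 0.50, 0.55, 0.60]
selected = select_top_component(names, values)
print(selected)
```

With two well-separated groups of importance values, the three TFs in the high group are assigned to the large-mean component and retained for the layer.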

We compared BWERF with GENIE3, the latter is the best performer in DREAM4

In order to evaluate and compare the efficiency of BWERF and GENIE3, we plotted the precision recall (PR) curve and receiver operating characteristic (ROC) curves for the results of two algorithms, and calculated the statistics AUPR and AUROC. The PR curve is created by plotting the precision against the recall at various threshold settings, while the ROC curve is created by plotting the true positive rate (TPR) versus the false positive rate (FPR). The definitions of precision, recall, TPR and FPR are as follows:
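Under the standard definitions, with TP, FP, TN and FN denoting the numbers of true positives, false positives, true negatives and false negatives, precision = TP / (TP + FP), recall = TPR = TP / (TP + FN), and FPR = FP / (FP + TN). The sketch below computes these rates at a given threshold and the area under the ROC curve by the trapezoidal rule. It is an illustration of the standard definitions, not the evaluation script used in this study, and the function names, scores and labels are hypothetical; it assumes at least one positive and one negative label.

```python
def rates_at(scores, labels, threshold):
    """Precision, recall (TPR) and FPR when predicting positive for score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted
    recall = tp / (tp + fn)                          # same quantity as TPR
    fpr = fp / (fp + tn)
    return precision, recall, fpr

def auroc(scores, labels):
    """Area under the ROC curve (TPR versus FPR) by the trapezoidal rule."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):      # sweep the threshold downwards
        _, tpr, fpr = rates_at(scores, labels, t)
        points.append((fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # predicted importance of six candidate edges
labels = [1, 1, 0, 1, 0, 0]               # 1 = edge supported by the gold standard
precision, recall, fpr = rates_at(scores, labels, 0.6)
area = auroc(scores, labels)
print(precision, recall, fpr, area)
```

The area under the PR curve can be obtained in the same way by sweeping the threshold and integrating precision over recall, although conventions for interpolating that curve differ between tools.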
