Conceived and designed the experiments: VAHT PG. Performed the experiments: VAHT. Analyzed the data: VAHT AI LW PG. Wrote the paper: VAHT AI LW PG.

The authors have declared that no competing interests exist.

One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high-throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4

Genetic regulatory networks (GRNs)

The simplest models of genetic regulatory networks are based on Boolean logic. Because of their simplicity, these Boolean network models have provided high-level insights into the design principles and emerging properties of GRNs

Models based on the statistical analysis of dependencies between expression patterns have an intermediate complexity, and have already been successfully applied to the inference of large GRNs. Early models used correlation coefficients between expression patterns of all pairs of genes to infer “coexpression networks”

Probabilistic graphical models have been widely used to model GRNs

Within this context, this article presents GENIE3 (for “GEne Network Inference with Ensemble of trees”), a new GRN inference method based on variable selection with ensembles of regression trees. This method was the best performer in the DREAM4

We address the problem of recovering regulatory networks from gene expression data. The targeted networks are directed graphs with

The goal of (unsupervised) gene regulatory network inference is to recover the network solely from measurements of the expression of the genes in various conditions. Given the dynamic and combinatorial nature of genetic regulation, measurements of different kinds can be obtained, including steady-state expression profiles resulting from the systematic knockout or knockdown of genes or time series measurements resulting from random perturbations. In this paper, we focus on multifactorial perturbation data as generated for the DREAM4

In what follows, we define a (multifactorial) learning sample from which to infer the network as a sample of

From this learning sample, the goal of network inference algorithms is to predict the underlying regulatory links between genes. Most network inference algorithms first produce a ranking of the potential regulatory links, from the most to the least significant. A practical network prediction is then obtained by setting a threshold on this ranking. In this paper, we focus only on the first task, which is also the one targeted by the evaluation procedure of the DREAM4 challenge. The question of the choice of an optimal confidence threshold, although important, is left open.

A network inference algorithm is thus defined in this paper as a procedure that exploits a

The basic idea of our procedure is to decompose the problem of recovering a network involving

We first describe our procedure to solve the network inference problem using feature selection techniques and then specialize it to the case of tree-based ensemble methods.

Our method makes the assumption that the expression of each gene in a given condition is a function of the expression of the other genes in the network (plus some random noise). Denoting by

The proposed network inference procedure is illustrated in

For

Generate the learning sample of input-output pairs for gene

Use a feature selection technique on

Aggregate the
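The decomposition described in the steps above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the names `rank_links` and `abs_correlation_scores` are hypothetical, and absolute Pearson correlation is used only as a simple stand-in for the feature selection step. Any method that returns one score per candidate regulator could be plugged in instead.

```python
import math
from typing import Callable, List, Tuple

Matrix = List[List[float]]  # rows = experimental conditions, columns = genes
Scorer = Callable[[Matrix, List[float]], List[float]]


def _pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when one profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def abs_correlation_scores(inputs: Matrix, target: List[float]) -> List[float]:
    """Stand-in feature scorer (an assumption): |correlation| of each input gene."""
    n_inputs = len(inputs[0])
    return [abs(_pearson([row[k] for row in inputs], target))
            for k in range(n_inputs)]


def rank_links(expr: Matrix, scorer: Scorer) -> List[Tuple[int, int, float]]:
    """Solve one feature selection problem per target gene, then merge the
    per-gene scores into a single ranking of (regulator, target, weight)."""
    p = len(expr[0])
    links = []
    for j in range(p):
        # Learning sample for gene j: output = gene j, inputs = all other genes.
        target = [row[j] for row in expr]
        input_genes = [i for i in range(p) if i != j]
        inputs = [[row[i] for i in input_genes] for row in expr]
        weights = scorer(inputs, target)
        links.extend((i, j, w) for i, w in zip(input_genes, weights))
    # Aggregate: sort all candidate links by decreasing weight.
    links.sort(key=lambda link: link[2], reverse=True)
    return links


# Toy data: gene 1 is exactly twice gene 0, gene 2 is unrelated.
expr = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
links = rank_links(expr, abs_correlation_scores)
```

With a symmetric scorer such as correlation, the links 0 → 1 and 1 → 0 necessarily receive the same weight; a scorer that treats inputs and output differently is what allows the ranking to become asymmetric.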

For each gene

Note that depending on the interpretation of the weights

The nature of the problem and the proposed solution put some constraints on candidate feature selection techniques. The nature of the functions

We first briefly describe these methods and their built-in feature ranking mechanism and then discuss their use in the context of the network inference procedure described in the previous section.

Each subproblem, defined by a learning sample

Single trees are usually very much improved by ensemble methods, which average the predictions of several trees. In our network inference procedure, we compare two tree-based ensemble methods based on randomization, namely Random Forests

One of the most interesting characteristics of tree-based methods is that a variable importance measure can be computed from a tree, which makes it possible to rank the input features according to their relevance for predicting the output. Several variable importance measures have been proposed in the literature for tree-based methods. In our experiment, we consider a measure which at each test node
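As an illustration of a node-level importance measure, the sketch below assumes the measure is the total reduction of output variance produced by a split, which is a common choice for regression trees. The function names and the flat `test_nodes` representation of a tree are hypothetical and chosen only to keep the example self-contained.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def variance(values: Sequence[float]) -> float:
    """Population variance; 0.0 for an empty sample."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def split_score(outputs: Sequence[float], goes_left: Sequence[bool]) -> float:
    """Assumed node score: n*Var(S) - n_left*Var(S_left) - n_right*Var(S_right)."""
    left = [y for y, g in zip(outputs, goes_left) if g]
    right = [y for y, g in zip(outputs, goes_left) if not g]
    return (len(outputs) * variance(outputs)
            - len(left) * variance(left)
            - len(right) * variance(right))


def variable_importance(
    test_nodes: List[Tuple[int, Sequence[float], Sequence[bool]]]
) -> Dict[int, float]:
    """Sum the node scores over all test nodes where each variable is used.
    Each node is (variable index, outputs reaching the node, branch per sample)."""
    importance: Dict[int, float] = defaultdict(float)
    for var, outputs, goes_left in test_nodes:
        importance[var] += split_score(outputs, goes_left)
    return dict(importance)


# Toy fully grown tree: the root tests variable 0, both children test variable 1.
tree = [
    (0, [1.0, 2.0, 5.0, 7.0], [True, True, False, False]),
    (1, [1.0, 2.0], [True, False]),
    (1, [5.0, 7.0], [True, False]),
]
importance = variable_importance(tree)
```

In this toy tree every leaf holds a single sample, so the importances sum to the sample size times the output variance (22.75), with variable 0 accounting for most of it.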

Breiman

Each tree-based model yields a separate ranking of the genes as potential regulators of a target gene in the form of weights

The computational complexity of the Random Forests and Extra-Trees algorithms is on the order of

To fix ideas, with our MATLAB implementation of GENIE3, it takes 6.5 minutes to infer the five networks of the DREAM4 challenge and 7 hours to infer the

Note that, if needed, the algorithm can be easily parallelized as the
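A minimal sketch of such a parallelization is given below, assuming that each target gene defines an independent subproblem. The function names are hypothetical, and `solve_subproblem` uses absolute covariance purely as a placeholder for the tree-based scoring. A thread pool keeps the example short; in CPython, CPU-bound work such as tree growing would normally call for `ProcessPoolExecutor` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Matrix = List[List[float]]  # rows = conditions, columns = genes


def solve_subproblem(task: Tuple[Matrix, int]) -> List[float]:
    """Placeholder for one per-gene problem: returns, for target gene j,
    one weight per candidate regulator i (0.0 for i == j)."""
    expr, j = task
    n = len(expr)
    target = [row[j] for row in expr]
    mean_t = sum(target) / n
    weights = []
    for i in range(len(expr[0])):
        if i == j:
            weights.append(0.0)
            continue
        col = [row[i] for row in expr]
        mean_c = sum(col) / n
        cov = sum((c - mean_c) * (t - mean_t) for c, t in zip(col, target)) / n
        weights.append(abs(cov))
    return weights


def all_weights_serial(expr: Matrix) -> List[List[float]]:
    return [solve_subproblem((expr, j)) for j in range(len(expr[0]))]


def all_weights_parallel(expr: Matrix, max_workers: int = 4) -> List[List[float]]:
    """Dispatch the per-gene subproblems to a pool of workers.
    pool.map preserves task order, so row j holds the weights for target j."""
    tasks = [(expr, j) for j in range(len(expr[0]))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(solve_subproblem, tasks))


expr = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
parallel_weights = all_weights_parallel(expr)
```

Because the subproblems share no state, the parallel and serial versions return identical weights.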

We report below two series of experiments: first on the DREAM4

The DREAM (for “Dialogue for Reverse Engineering Assessments and Methods”) initiative organizes an annual reverse engineering competition called the DREAM challenge

The goal of the

All networks and data were generated with GeneNetWeaver (GNW) version 2.0

In addition, we carried out experiments with our method on the inference of the regulatory network of

The dataset of expression profiles we used was retrieved from the Many Microbe Microarrays (

Our algorithm provides a ranking of the regulatory links from the most confident to the least confident. To evaluate such a ranking independently of the choice of a specific threshold, we used both precision-recall (PR) and receiver operating characteristic (ROC) curves. For varying thresholds on the importance scores, a PR curve plots the proportion of true positives among all predictions (precision) versus the proportion of true positives that are retrieved (recall), whereas a ROC curve plots the true positive rate versus the false positive rate.

To summarize these curves, the DREAM organizers proposed different statistics:

AUPR: The area under the PR curve.

AUROC: The area under the ROC curve.

AUPR p-value: The probability that a given or larger AUPR is obtained by random ordering of the potential network edges.

AUROC p-value: The probability that a given or larger AUROC is obtained by random ordering of the potential network edges.
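The four statistics listed above can be illustrated with a short sketch. It rests on several assumptions: areas are computed with the trapezoidal rule without special handling of ties, the PR curve is started at recall 0 with the precision of the top-ranked link, and the p-value is estimated by Monte Carlo permutation of the ranking. The procedure actually used by the DREAM organizers may differ, and all function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def _trapezoid(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Area under a piecewise-linear curve given by points (xs[k], ys[k])."""
    return sum((xs[k] - xs[k - 1]) * (ys[k] + ys[k - 1]) / 2.0
               for k in range(1, len(xs)))


def auroc(ranked_labels: Sequence[int]) -> float:
    """ranked_labels: 1 for a true edge, 0 otherwise, in decreasing confidence.
    Requires at least one true edge and one non-edge."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for label in ranked_labels:
        if label:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return _trapezoid(fpr, tpr)


def aupr(ranked_labels: Sequence[int]) -> float:
    """Area under the precision-recall curve (trapezoidal approximation)."""
    n_pos = sum(ranked_labels)
    tp = 0
    recall: List[float] = []
    precision: List[float] = []
    for k, label in enumerate(ranked_labels, start=1):
        tp += label
        recall.append(tp / n_pos)
        precision.append(tp / k)
    return _trapezoid([0.0] + recall, [precision[0]] + precision)


def empirical_p_value(ranked_labels: Sequence[int],
                      metric: Callable[[Sequence[int]], float],
                      n_perm: int = 200, seed: int = 0) -> float:
    """Estimated probability that a random ordering of the potential edges
    reaches a metric value at least as large as the observed one."""
    rng = random.Random(seed)
    observed = metric(ranked_labels)
    shuffled = list(ranked_labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if metric(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


perfect = [1, 1, 1] + [0] * 17
p_auroc = empirical_p_value(perfect, auroc)
```

A permutation estimate cannot resolve p-values much smaller than one over the number of permutations, so it is only suitable for illustrating the definition, not for reproducing very small p-values.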

An overall score was used to evaluate the predictions for the five networks of each subchallenge:

We took part in the DREAM4

Among twelve challengers, GENIE3 achieved the best performance, with an overall score of 37.428. By comparison, the score of the first runner-up was 28.165.

| Metric | Method | NET1 | NET2 | NET3 | NET4 | NET5 |
|---|---|---|---|---|---|---|
| AUPR | GENIE3-RF-sqrt | 0.154 | 0.155 | 0.231 | 0.208 | 0.197 |
| AUPR | 2nd best | 0.108 | 0.147 | 0.185 | 0.161 | 0.111 |
| AUROC | GENIE3-RF-sqrt | 0.745 | 0.733 | 0.775 | 0.791 | 0.798 |
| AUROC | 2nd best | 0.739 | 0.694 | 0.748 | 0.736 | 0.745 |

| Metric | Method | NET1 | NET2 | NET3 | NET4 | NET5 | Overall p-value |
|---|---|---|---|---|---|---|---|
| AUPR p-value | GENIE3-RF-sqrt | 3.3e-34 | 7.9e-54 | 1.8e-54 | 5.5e-47 | 4.6e-44 | 1.0e-46 |
| AUPR p-value | 2nd best | 5.6e-23 | 9.7e-50 | 6.6e-43 | 1.5e-35 | 4.4e-23 | 7.4e-35 |
| AUROC p-value | GENIE3-RF-sqrt | 3.3e-18 | 1.1e-28 | 9.7e-34 | 6.7e-33 | 1.9e-34 | 1.4e-29 |
| AUROC p-value | 2nd best | 1.7e-17 | 5.4e-21 | 4.9e-28 | 1.9e-23 | 1.1e-24 | 6.3e-23 |
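The relation between the per-network p-values and the overall score can be checked numerically. The sketch below assumes that each overall p-value is the geometric mean of the five per-network p-values and that the overall score is minus one half of the base-10 logarithm of the product of the two overall p-values. These definitions are assumptions; they are consistent with the tabulated GENIE3-RF-sqrt values, giving overall p-values of about 1.0e-46 and 1.4e-29 and a score of about 37.43. The function names are hypothetical.

```python
import math
from typing import Sequence


def mean_log10(p_values: Sequence[float]) -> float:
    """Mean of log10(p); equals log10 of the geometric mean of the p-values."""
    return sum(math.log10(p) for p in p_values) / len(p_values)


def overall_p_value(p_values: Sequence[float]) -> float:
    """Assumed definition: geometric mean of the per-network p-values."""
    return 10.0 ** mean_log10(p_values)


def overall_score(aupr_p: Sequence[float], auroc_p: Sequence[float]) -> float:
    """Assumed definition: -0.5 * log10(overall AUPR p * overall AUROC p).
    Computed in the log domain to avoid underflow with tiny p-values."""
    return -0.5 * (mean_log10(aupr_p) + mean_log10(auroc_p))


# Per-network p-values of GENIE3-RF-sqrt, copied from the table.
aupr_p = [3.3e-34, 7.9e-54, 1.8e-54, 5.5e-47, 4.6e-44]
auroc_p = [3.3e-18, 1.1e-28, 9.7e-34, 6.7e-33, 1.9e-34]
score = overall_score(aupr_p, auroc_p)
```

Because the score is an average of log p-values, a method must do well on all five networks to score highly; a single very small p-value cannot compensate for poor results elsewhere as strongly as it would under an arithmetic mean of p-values.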

We have subsequently applied GENIE3 on these same datasets, using the Extra-Trees algorithm, and also setting

| | RF-sqrt | RF-all | ET-sqrt | ET-all |
|---|---|---|---|---|
| Overall score | 37.428 | 40.471 | 35.881 | 40.111 |

To have a more precise picture of the quality of the predictions obtained with GENIE3,

Ranking of the regulators for all genes. Each row corresponds to a gene. Dots in each row represent the positions in the Random Forests ranking of the regulators of this gene. Genes are ordered on the y-axis according to their number of regulators in the gold standard network; those having the same number of regulators are grouped inside a horizontal block (from no regulator at the top to 6 regulators at the bottom). Inside each block, genes are ordered according to the median rank of their regulators. The ranking of interactions was obtained with Random Forests and

As can be seen from this figure, GENIE3 is able to retrieve the best regulator for about two thirds of the genes that have only one regulator. For genes with two regulators, the method retrieves one of the two regulators for about the same proportion of genes but is less successful at retrieving the second regulator (for only one gene are both regulators at the top of the ranking). For genes with three or more regulators, even one regulator seems to be difficult to retrieve.

This suggests that the performance of GENIE3 at retrieving a regulator of one gene degrades as the number of regulators of this gene increases, as also observed from the analysis of the results of the DREAM3 challenge in

The in-degree of a target is its number of regulators. The dot corresponding to in-degree

One interesting feature of GENIE3 is its potential ability to predict directed networks, while methods based on mutual information or correlation are only able to predict undirected networks.

To see to what extent the networks predicted by our method are asymmetric, we show in

| | NET1 | NET2 | NET3 | NET4 | NET5 |
|---|---|---|---|---|---|
| GENIE3-RF-all | 50% | 58% | 48% | 48% | 58% |
| Gold standard | 92% | 94% | 97% | 96% | 98% |

Of course, the fact that GENIE3 predicts asymmetric networks does not ensure that the prediction of these asymmetric links is really informative; asymmetric predictions might precisely correspond to spurious predictions. To check this, we swapped the weights

To further assess the ability of our method to predict link directions, we computed the proportion of edges

| Recall | 5% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|
| Error rate | 20% | 28% | 27% | 27% | 26% |

Finally, we compared GENIE3 to three existing approaches based on the computation of mutual information (MI), namely CLR

We carried out two evaluations, the first one against the undirected gold standard (

| | GENIE3-RF-all | CLR | ARACNE | MRNET | GGM |
|---|---|---|---|---|---|
| Overall score | 36.736 | 35.838 | 32.632 | 34.124 | 26.846 |

| | GENIE3-RF-all | CLR | ARACNE | MRNET | GGM |
|---|---|---|---|---|---|
| Overall score | 40.471 | 31.57 | 28.488 | 30.435 | 23.705 |

As a first experiment on the real

Only known transcription factors were used as input genes. A. Comparison between the four different settings of the tree procedure. B. Comparison to other approaches.

As a second experiment, we simulated conditions similar to the DREAM4 challenge, where transcription factors were unknown, and tried to infer the network using as input features, in each step of our procedure, all 1471 genes except the target gene itself. For this experiment, precision never exceeded 6%, even for the smallest values of recall. This indicates that the predictions are extremely poor and only slightly better than random guessing.

With respect to the results obtained in the DREAM4 challenge, these results are disappointing. The larger number of genes in this case does not explain everything since it also comes with an increase of the number of observations. Actually in both cases, the number of observations is comparable to the number of genes. However, since the

We developed GENIE3, a procedure that aims at recovering a gene regulatory network from multifactorial expression data. This procedure decomposes the problem of inferring a network of size

GENIE3 got the best performance on the DREAM4

Our algorithm can be improved along several directions. As tree-based ensemble methods, we used the Random Forests and the Extra-Trees algorithms, which gave comparable results. However, the performance of these methods depends to some extent on their main parameter, the number

There is also room for improvement in the way variable importance scores are normalized. One apparent drawback of the measure we proposed is that it does not take into account the generalization quality of the trees. Indeed, since our trees are fully grown, importance weights satisfy equation (4) which, given our normalization, attributes equal weights to all tree models irrespective of their quality when used to predict the expression values of the target gene. We tried to correct for this bias by normalizing the variable importance scores by the effective variance reduction brought by the model, as estimated by cross-validation, but this actually degraded performance. The question of the optimal normalization thus remains open at this stage.

In this paper, we focused on providing a ranking of the regulatory interactions. In some practical applications, however, one would like to determine a threshold on this ranking to obtain a practical predicted network. To address this question, we have tried to exploit cross-validation estimates of the mean-square error as a criterion to determine such a threshold, but we have not been successful so far. As future work, we therefore would like to extend the technique developed in

Our experiments on the DREAM4 dataset show that GENIE3 is able to predict the direction of the edges to some extent, even though it only exploits steady-state measurements. This is an interesting result, as edge direction prediction is commonly acknowledged to be a difficult problem. Bayesian networks also potentially allow edge directionality to be predicted. A comparison with this family of methods would be an interesting direction for future work. Note that, in contrast to our approach, Bayesian networks do not allow for the presence of cycles in the predicted network, which could be a limiting factor for networks such as those in DREAM4 that contain cycles by construction.

Several procedures using regression trees have already been proposed to solve the regulatory network inference problem. Most of these procedures exploit other kinds of data in addition to expression data, e.g. counts of regulatory motifs that serve as binding sites for transcription factors

Finally, although we exploited tree-based ensemble methods, our framework is general and other feature selection techniques could have been used as well. Actually, several existing methods for network inference can be interpreted as special instances of this framework. In particular, mutual information as used in Relevance Networks

Our GENIE3 software is available from

PR and ROC curves for each DREAM4 Multifactorial network. Left: PR curves. Right: ROC curves. Prec: Precision. FPR: False Positive Rate. TPR: True Positive Rate. The rankings of interactions were obtained using Random Forests and

(3.04 MB TIF)

Ranking of the regulators for all genes on DREAM4 networks. Each row in a figure corresponds to a gene. Dots in each row represent the positions in the Random Forests ranking of the regulators of this gene. Genes are ordered on the y-axis according to their number of regulators in the gold standard network; those having the same number of regulators are grouped inside a horizontal block. Inside each block, genes are ordered according to the median rank of their regulators. The rankings of interactions were obtained with Random Forests and

(8.11 MB TIF)

Ranking of the regulators for all genes on the _{TF}

(3.06 MB TIF)