
Conceived and designed the experiments: KYY RPA K-KY MG. Performed the experiments: KYY. Analyzed the data: KYY. Wrote the paper: KYY.

The authors have declared that no competing interests exist.

We performed computational reconstruction of the

The expression of genes is tightly controlled by the regulatory machinery of the cell, a major part of which involves regulator proteins such as transcription factors (TFs). Transcription regulation can be modeled as a directed network, with each node representing a gene and the proteins it encodes, and an edge from one node to another if the former is a regulator of the latter. In addition to directionality, the edges are also signed: a positive sign indicates positive regulation (activation) and a negative sign indicates negative regulation (suppression).
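As a minimal illustration (not taken from the original method; gene names here are hypothetical), such a signed, directed network can be represented as a mapping from (regulator, target) pairs to signs:

```python
# Minimal sketch of a signed, directed regulatory network.
# Gene names G1-G3 are hypothetical; +1 denotes activation, -1 suppression.
network = {
    ("G1", "G2"): +1,   # G1 activates G2
    ("G1", "G3"): -1,   # G1 suppresses G3
}

def regulators_of(network, target):
    """Return all (regulator, sign) pairs with an edge into `target`."""
    return [(reg, sign) for (reg, tgt), sign in network.items() if tgt == target]
```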

Methods have been proposed for computationally reconstructing regulatory networks. One common approach is to use differential equations to model how the expression levels of genes change according to the abundance of their regulator proteins over time

In the resulting dataset, each data point measures the expression level of a gene in a specific condition at a certain time point. Each such observed value is determined by a mixture of different factors, including the previous expression level of the gene, the activity of its regulators, decay of mRNA transcripts, randomness, and measurement errors. The many entangled parameters make it difficult to reconstruct the regulatory network based on this type of data alone.

To decode this kind of complex system, one strategy is to reduce it to a series of subsystems of manageable size by keeping the values of most parameters constant and varying only a small number of them. Thanks to the creation of large-scale deletion libraries

Sophisticated computational methods have been developed in previous studies to use deletion data to infer regulatory networks. For example, Bayesian approaches have been used to model biological pathways and the effects of gene deletion

While deletion data are good for detecting simple, direct regulatory events, they may not be sufficient for decoding more complicated ones. For example, if a gene is up-regulated by two TFs in the form of an OR circuit, so that the gene is expressed as long as at least one of the TFs is active, these edges in the regulatory network cannot be uncovered from single-gene deletion data. In such a scenario, traditional time course data can supplement the deletion data in detecting the missing edges. For instance, if at a certain time point both TFs have low abundance and the expression rate of the gene is observed to be impaired, this observation could help reconstruct the OR circuit.

As another example, if a regulator is normally not expressed, deleting its gene would not cause an observable effect on the expression of other genes. Yet if in a certain perturbation the expression of the regulator is induced by external stimuli, its regulation of other genes can be detected.

Therefore, the two types of data are complementary in reconstructing regulatory networks. In this study we demonstrate how they can be used in combination to improve network reconstruction. We first propose methods for predicting regulatory edges from each type of data, and then describe a meta-method for combining their predictions. Using a set of fifteen benchmark datasets, we show the effectiveness of our approach, which won our team first place in the public challenge of the third Dialogue for Reverse Engineering Assessments and Methods (DREAM)

We first formally define our problem of reconstructing regulatory networks. The target network is a directed network with

We use two types of data features: perturbation time series data and deletion data. Deletion data are further sub-divided into homozygous deletion and heterozygous deletion.

In a perturbation time series dataset, an initial perturbation is performed at time 0, which sets the expression level of each gene to a certain value. The regulatory system is then allowed to adjust the internal state of the cell by up- and down-regulating genes according to the abundance of the TFs. The expression level of each gene is measured at subsequent time points. Thus, for each perturbation experiment, each gene is associated with a vector of real numbers corresponding to its expression levels at different time points after the initial perturbation. If there are

In a deletion dataset, a gene is deleted, and the resulting expression level of each gene at steady state is measured. By deleting each gene one by one, and adding the wild-type (no deletion) as control, each gene is associated with a vector of

We assume that both types of deletion data, as well as perturbation data, are available, although it is trivial to modify our algorithm by simply removing the corresponding subroutines if any type of data is missing.
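For illustration, assuming the deletion profiles are arranged with one column per deleted gene plus a final wild-type control column (a layout adopted here for the sketch, not one prescribed by the text), the change of each gene relative to wild type can be computed as:

```python
import numpy as np

def deletion_log_ratios(deletion):
    """Per-gene change relative to wild type in a deletion dataset (sketch).

    deletion: (n, n + 1) array; column j is the steady-state expression
    of every gene after deleting gene j, and the last column is the
    wild-type (no deletion) control.  Entry (i, j) of the result is
    log2(expression of gene i with gene j deleted / wild-type expression
    of gene i); large magnitudes hint at regulation of gene i by gene j.
    """
    wild_type = deletion[:, -1:]                  # (n, 1) control column
    return np.log2(deletion[:, :-1] / wild_type)
```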

Our basic strategy is to learn the simple regulation cases from deletion data by using noise models, and to learn the more complex ones from perturbation data using differential equation models. We first describe the two kinds of models and how we learn the parameter values from data, then discuss our way to combine the two lists of predicted edges into a final list of predictions.

We consider a simple noise model for deletion data, in which each data point is the superposition of the real signal and a reasonably small Gaussian noise term that is independent of the gene and the time point. The Gaussian noise models the random nature of the biological system and the measurement error. Based on this model, the larger the change in expression of gene
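A minimal sketch of such a test, assuming a known noise level sigma and treating both the wild-type and deletion measurements as corrupted by independent Gaussian noise (the functional form below is our assumption, not the paper's exact formula):

```python
import math

def prob_regulation(x_wt, x_del, sigma):
    """Probability that an observed expression change is not noise (sketch).

    Both the wild-type level x_wt and the level x_del measured after
    deleting a candidate regulator carry independent Gaussian noise with
    standard deviation sigma, so their difference has std sigma*sqrt(2).
    One minus the two-sided tail probability of the standardized
    difference serves as a rough probability of regulation.
    """
    z = abs(x_del - x_wt) / (sigma * math.sqrt(2.0))
    p_noise = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return 1.0 - p_noise
```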

Notice that the regulation could be direct (

Given the observed expression level

To implement the above procedure, it is necessary to estimate

We propose an iterative procedure to progressively refine our estimation of

Calculate the probability of regulation

Use the set

For each gene

After the iterations, the probability of regulation

Notice that we have chosen to use a “conservative” p-value of 0.05 in the following sense: when the number of genes in the network,

The above iterative procedure can be applied to both homozygous and heterozygous deletion data, although the regulation signals are expected to be less clear in the heterozygous case, since deleting only one copy of a regulator gene may induce only a mild effect on its targets. The final p-values computed from homozygous data are thus expected to be more reliable. Yet those learned from heterozygous data can still serve as useful references in resolving ambiguous cases, as we discuss in more detail when describing our approach to combining the predictions learned from the different types of data.

Compared with previous methods, our approach to using deletion data to infer regulatory events is relatively simple. On the one hand, this is to cope with the limited types of data provided in the DREAM challenge. For instance, direct binding data are not available, and thus cannot be used to set up prior distributions for parameter values, as in some previous studies

For time series data after an initial perturbation, we use differential equations to model the gene expression rates. The general form is as follows:

The linear model assumes a linear relationship between the expression levels of the regulators and the resulting expression rate of the target. It is a rough first approximation of the expression rate. One advantage of this model is its small number of parameters (

Our goal is to try different possible regulator sets

The objective function is not convex with respect to the parameters. We use Newton's method
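As an illustrative sketch only: for the linear model, if the expression rate is approximated by finite differences between consecutive time points, the regulator weights and a decay constant can be recovered by ordinary least squares. (The study itself fits its models with Newton's method; the sigmoidal model in particular requires such iterative optimization, so the closed-form fit below applies to the linear case only and is our simplification.)

```python
import numpy as np

def fit_linear_model(t, x_target, x_regs):
    """Least-squares fit of a linear expression-rate model (sketch).

    Assumes dx/dt = sum_j w_j * x_j - lam * x_target, with the rate
    approximated by finite differences.  t: (T,) time points;
    x_target: (T,) target expression; x_regs: (T, R) expression of the
    candidate regulator set.  Returns (w, lam, residual sum of squares).
    """
    dt = np.diff(t)
    rates = np.diff(x_target) / dt                      # finite-difference dx/dt
    # evaluate regulators and the decay term at the left end of each interval
    A = np.column_stack([x_regs[:-1], -x_target[:-1]])
    coef, rss, _, _ = np.linalg.lstsq(A, rates, rcond=None)
    w, lam = coef[:-1], coef[-1]
    residual = float(rss[0]) if rss.size else float(np.sum((A @ coef - rates) ** 2))
    return w, lam, residual
```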

We try two types of regulator sets

We also tried double regulator sets with all pairs of potential regulators. Yet the resulting models did not appear to provide much additional information on top of the single regulator set models, while requiring much longer computational time. We therefore decided to consider only the single regulator sets and guided single regulator sets.

For a regulator set

Our main idea for combining the predictions of the different models learned from deletion and perturbation data is to rank the predictions according to our confidence that they are correct. Specifically, we make predictions in batches, with the first batch containing the most confident predictions, and each subsequent batch containing the most confident predictions that have not been covered by the previous batches. Within each batch, the predictions are ordered by the confidence of the models, which corresponds to the probability of regulation

Batch 1: all predictions with a probability of regulation larger than 0.99 according to the noise model learned from homozygous deletion data

Batch 2: all predictions with an objective score two standard deviations below the average according to all types (linear AND sigmoidal) of differential equation models learned from perturbation data

Batch 3: all predictions with an objective score two standard deviations below the average according to all types of guided differential equation models learned from perturbation data, where the regulator sets contain regulators predicted in the previous batches, plus one extra potential regulator

Batch 4: as in batch 2, but requiring the predictions to be made by only one type (linear OR sigmoidal) of the differential equation models as opposed to all of them

Batch 5: as in batch 3, but requiring the predictions to be made by only one type of the differential equation models as opposed to all of them

Batch 6: all predictions with a probability of regulation larger than 0.95 according to the noise models learned from both homozygous and heterozygous deletion data, with the same edge sign predicted by both models

Batch 7: all remaining gene pairs, with their ranks within the batch determined by their probability of regulation according to the noise model learned from homozygous deletion data

In general, we put the greatest confidence in the noise model learned from homozygous deletion data as the signals from this kind of data are clearest among the three types of data. We are also more confident with predictions that are consistently made, either by the different types of differential equation models (batches 2 and 3 over batches 4 and 5) or by the noise models learned from homozygous and heterozygous deletion data (batch 6).
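The batching scheme above can be sketched as follows, assuming each batch is given as a list of (regulator, target, confidence) triples; the merging logic (earlier batches take priority, no gene pair is emitted twice, within-batch order follows confidence) mirrors the description in the text:

```python
def combine_batches(batches):
    """Merge per-batch predictions into one ranked edge list (sketch).

    batches: list of lists of (regulator, target, confidence) tuples,
    batch 1 first.  Within a batch, predictions are ordered by
    confidence; across batches, a pair already predicted by an earlier
    batch is never re-emitted.
    """
    seen, ranked = set(), []
    for batch in batches:
        for reg, tgt, conf in sorted(batch, key=lambda p: -p[2]):
            if (reg, tgt) not in seen:
                seen.add((reg, tgt))
                ranked.append((reg, tgt))
    return ranked
```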

We used the algorithm described above to take part in the third Dialogue for Reverse Engineering Assessments and Methods Challenge (DREAM3)

The predictions are compared against the actual edges in the networks by the DREAM organizer using four different metrics for evaluating the accuracy:

AUPR: The area under the precision-recall curve

AUROC: The area under the receiver-operator characteristics curve

pAUPR: The p-value of AUPR based on the distribution of AUPR values in 100,000 random network link permutations

pAUROC: The p-value of AUROC based on the distribution of AUROC values in 100,000 random network link permutations
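A generic permutation p-value of this kind can be sketched as follows (the `score_fn` callback and the edge-list representation are placeholders for illustration, not the organizer's actual evaluation code):

```python
import random

def permutation_pvalue(observed_score, score_fn, edges, n_perm=1000, seed=0):
    """Empirical p-value of a network score by edge permutation (sketch).

    score_fn maps an ordered list of predicted edges to an AUC-like
    score; the p-value is the fraction of random reorderings of the
    edge list that score at least as well as the observed prediction.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        perm = edges[:]
        rng.shuffle(perm)
        if score_fn(perm) >= observed_score:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one smoothing avoids p = 0
```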

While statistics related to the ROC curve are commonly used to evaluate prediction results, those related to the PR curve can be more sensitive when the negative set is much larger than the positive set.

These metrics are further aggregated into an overall p-value for each size using the geometric mean of the five p-values from the five networks, and finally an overall score equal
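The geometric-mean aggregation can be sketched as follows (the final transformation from overall p-value to score is omitted here, since its exact form is not reproduced in this text):

```python
import math

def overall_pvalue(pvals):
    """Geometric mean of per-network p-values (sketch of the
    aggregation described above)."""
    log_sum = sum(math.log(p) for p in pvals)
    return math.exp(log_sum / len(pvals))
```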

In the evaluation by the DREAM organizer, edge signs (activation vs. suppression) are not considered. We note that our algorithm can actually detect edge signs. In the noise model, a regulation is determined to be an activation if deleting the regulator results in an expression level lower than the estimated wild-type expression, and a suppression otherwise. For the differential equation models, a positive sign of the coefficient

The size-10 challenge attracted 29 participating teams, the size-50 challenge 27 teams, and the size-100 challenge 22 teams. The large number of participants makes the challenge currently the largest benchmark for gene network reverse engineering

Our algorithm ended up in first place on all three network sizes. The complete set of performance scores for all teams can be found at the DREAM3 web site

Size | Ecoli1 | Ecoli2 | Yeast1 | Yeast2 | Yeast3

Size 10 | 0.928 | 0.912 | 0.949 | 0.747 | 0.714 |

Size 50 | 0.930 | 0.924 | 0.917 | 0.792 | 0.805 |

Size 100 | 0.948 | 0.960 | 0.915 | 0.856 | 0.783 |

Size | Ecoli1 | Ecoli2 | Yeast1 | Yeast2 | Yeast3 | Overall AUROC

Size 10 | 9.771e-07 | 2.629e-07 | 9.941e-07 | 2.931e-04 | 1.046e-03 | 9.523e-06 |

Size 50 | 2.396e-27 | 4.328e-31 | 1.477e-25 | 1.808e-21 | 1.386e-29 | 5.210e-27 |

Size 100 | 1.226e-52 | 5.876e-42 | 4.087e-70 | 5.755e-99 | 1.722e-92 | 3.112e-71 |

We notice that in some cases our first predictions are already very close to the actual network.

(a) The actual network. (b) Our top-10 predictions.

The overall scores are 5.124, 39.828, and

Batch | Ecoli1 Predicted | Ecoli1 Correct | Ecoli2 Predicted | Ecoli2 Correct | Yeast1 Predicted | Yeast1 Correct | Yeast2 Predicted | Yeast2 Correct | Yeast3 Predicted | Yeast3 Correct

1 | 11 | 7 | 16 | 12 | 11 | 9 | 13 | 9 | 12 | 8 |

2 | 6 | 1 | 4 | 0 | 5 | 0 | 5 | 1 | 5 | 4 |

3 | 0 | 0 | 1 | 1 | 3 | 0 | 1 | 0 | 1 | 0 |

4 | 5 | 1 | 8 | 0 | 7 | 0 | 4 | 2 | 4 | 0 |

5 | 4 | 0 | 8 | 1 | 6 | 0 | 10 | 3 | 5 | 1 |

6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

7 | 63 | 1 | 53 | 1 | 58 | 1 | 57 | 10 | 63 | 9 |

Total | 90 | 11 | 90 | 15 | 90 | 10 | 90 | 25 | 90 | 22 |

Batch | Ecoli1 Predicted | Ecoli1 Correct | Ecoli2 Predicted | Ecoli2 Correct | Yeast1 Predicted | Yeast1 Correct | Yeast2 Predicted | Yeast2 Correct | Yeast3 Predicted | Yeast3 Correct

1 | 96 | 52 | 133 | 69 | 145 | 57 | 176 | 83 | 201 | 100 |

2 | 76 | 2 | 85 | 1 | 80 | 8 | 87 | 12 | 102 | 16 |

3 | 77 | 0 | 78 | 1 | 69 | 1 | 56 | 1 | 64 | 2 |

4 | 196 | 0 | 153 | 1 | 185 | 1 | 156 | 5 | 113 | 3 |

5 | 178 | 1 | 169 | 1 | 167 | 2 | 177 | 6 | 149 | 2 |

6 | 5 | 0 | 16 | 0 | 9 | 0 | 11 | 0 | 6 | 0 |

7 | 1822 | 7 | 1816 | 9 | 1795 | 8 | 1787 | 53 | 1815 | 50 |

Total | 2450 | 62 | 2450 | 82 | 2450 | 77 | 2450 | 160 | 2450 | 173 |

Batch | Ecoli1 Predicted | Ecoli1 Correct | Ecoli2 Predicted | Ecoli2 Correct | Yeast1 Predicted | Yeast1 Correct | Yeast2 Predicted | Yeast2 Correct | Yeast3 Predicted | Yeast3 Correct

1 | 410 | 101 | 377 | 108 | 483 | 118 | 656 | 257 | 710 | 302 |

2 | 387 | 11 | 319 | 1 | 317 | 20 | 282 | 22 | 311 | 31 |

3 | 162 | 0 | 198 | 0 | 129 | 0 | 145 | 3 | 135 | 3 |

4 | 650 | 0 | 685 | 1 | 575 | 2 | 604 | 12 | 638 | 13 |

5 | 683 | 1 | 656 | 2 | 746 | 3 | 739 | 10 | 667 | 24 |

6 | 53 | 0 | 72 | 0 | 82 | 2 | 67 | 0 | 59 | 2 |

7 | 7555 | 12 | 7593 | 7 | 7568 | 21 | 7407 | 85 | 7380 | 176 |

Total | 9900 | 125 | 9900 | 119 | 9900 | 166 | 9900 | 389 | 9900 | 551 |

As hypothesized, the noise models learned from homozygous deletion data made very accurate predictions. In many cases, most actual edges were already predicted correctly in batch 1. Also, if an actual edge was not predicted in batch 1, it was likely also missed by subsequent batches. For instance, of the 173 actual edges in the Yeast3-size50 network, 100 are detected in batch 1, and among the remaining 73, only 23 are detected in batches 2 to 6.

While the above results suggest the importance of the noise models learned from homozygous data, it is still not clear whether these models are indeed more effective than the other models. It could still be the case that the other models would have made the same predictions as those in batch 1, but since those predictions were already covered by batch 1, subsequent batches were not allowed to repeat them. To verify whether this was the case, we swapped the order of the first two batches for the size-10 networks, so that the first batch is composed of predictions made by the differential equation models and the second batch is composed of predictions made by the noise model learned from homozygous deletion data that are not covered by the first batch. The results are shown in

Batch | Ecoli1 Predicted | Ecoli1 Correct | Ecoli2 Predicted | Ecoli2 Correct | Yeast1 Predicted | Yeast1 Correct | Yeast2 Predicted | Yeast2 Correct | Yeast3 Predicted | Yeast3 Correct

1 | 6 | 1 | 5 | 1 | 5 | 0 | 5 | 1 | 5 | 4 |

2 | 11 | 7 | 15 | 12 | 11 | 9 | 13 | 9 | 12 | 8 |

Comparing

This analysis reveals two interesting observations. First, since the noise models learned from deletion data gave higher accuracy than the differential equation models, our decision to use the former for the first batch of predictions is justified. Second, while the differential equation models had lower accuracy, they still made small contributions to the overall prediction accuracy, producing some unique correct predictions that were missed by the noise models. As discussed, these probably correspond to indirect or more complex regulation events.

To evaluate quantitatively the importance of the differential equation models, we use the hypergeometric distribution to compute the probability of observing at least the obtained number of correctly predicted regulation events in batches 2–6 by chance, given the total number of predictions in these batches. For example, for the Ecoli1-Size10 network, we compute the probability of having 3 correct predictions (in batches 2–6) out of the 4 missed by batch 1, when making 16 predictions out of 89 node pairs (see
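The hypergeometric tail probability can be computed directly; the sketch below uses small illustrative arguments in its test rather than reproducing the paper's exact counts:

```python
from math import comb

def hypergeom_tail(N, K, n, k):
    """P(X >= k) under a hypergeometric draw (sketch).

    N node pairs in total, K of them true (so-far-missed) edges,
    n predictions made; returns the chance of hitting at least
    k true edges purely by luck.
    """
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total
```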

Size | Ecoli1 | Ecoli2 | Yeast1 | Yeast2 | Yeast3

Size-10 | 0.0247 | 0.1922 | 1 | 0.1925 | 0.0923 |

Size-50 | 0.4015 | 0.3036 | 0.0003 | 0.0273 | 0.0078 |

Size-100 | 0.0012 | 0.1670 | 0.0000 | 0.0000 | 0.0001 |

Overall, in about half of the cases, the predictions made in batches 2–6 are significantly better than random at the 0.05 level. We observe that for networks with a large portion of real edges missed by batch 1 (such as Yeast3-Size100), the predictions of batches 2–6 are more significant. Our results thus suggest that the two types of models, based on two different types of data, are potentially capable of complementing each other and making some orthogonal contributions to the overall predictions.

(a) The actual Ecoli1-size10 network. (b) The homozygous deletion profile of G7 in the Ecoli1-size10 network. (c) A perturbation time series of G7 in the Ecoli1-size10 network. (d) The actual Ecoli2-size10 network. (e) A perturbation time series of G6 in the Ecoli2-size10 network. (f) A perturbation time series of G6 in the Ecoli2-size10 network.

The second example is related to the Ecoli2-size10 network (

We also briefly studied whether the differential equation models can be improved by considering pairs of potential regulators instead of a single regulator at a time. For the five size-10 networks, we used the same algorithm as before, except that in batches 2–6 each model involved two potential regulators. The resulting AUC values for Ecoli1, Ecoli2, Yeast1, Yeast2 and Yeast3 are 0.887, 0.913, 0.943, 0.697 and 0.655, respectively. Comparing these numbers with those in

Our prediction results demonstrate the advantage of combining multiple types of data. While the perturbation data allow the learning of differential equation models that can capture complex interactions in the regulatory network, deletion data facilitate the detection of some simple interactions using only very basic noise models. As technology advances rapidly, new data types are expected to emerge from time to time. For method developers who wish to improve existing prediction methods, besides deriving more advanced algorithms using the same data, it is also rewarding to investigate what kinds of information emerging data could provide, and how such information can be extracted to supplement existing methods.

As mentioned earlier, in this study we did not attempt to address the issue of indirect regulation. Indeed we observed that indirect regulation is one of the factors that confounded our method and caused it to make some wrong predictions. We expect that in a complete network with thousands of nodes, long regulation chains are prevalent and the problem of indirect regulation would be more serious. It is therefore interesting to see if filtering indirect regulation (for example by some existing techniques

In some previous work, more sophisticated noise models allowing for gene-specific and experiment-specific errors are proposed, with the aid of extra control experiments

In this study, we adopt an unsupervised learning setting, in compliance with the setup of the DREAM3 challenge. For organisms with some known regulatory edges available as domain knowledge, these edges can be used as training examples for a supervised learner, or to transform the existing method into a semi-supervised one

One issue that we have not touched on is the computational cost. Using a high-end cluster, our predictions for networks of size 10, 50 and 100 took about 2 minutes, 13 hours, and 78 hours, respectively. While there is room for optimizing our code, fitting the differential equation models intrinsically requires a lot of computational power. Given that most correct predictions are made by the noise models, which took only a tiny portion of the computational time, when working on complete networks it is possible to trade off some accuracy for a much shorter running time. Alternatively, since many of the models are learned independently of each other, it is fairly straightforward to parallelize the computation and reduce the total running time by adding extra machines.

We thank the DREAM3 organizer for creating the interesting challenge, and Daniel Marbach, Gustavo Stolovitzky, Jiri Vohradsky, and our two anonymous reviewers for their useful comments. We acknowledge the Yale University Biomedical High Performance Computing Center.