The authors have declared that no competing interests exist.

Mapping gene expression as a quantitative trait using whole-genome sequencing and transcriptome analysis makes it possible to discover the functional consequences of genetic variation. We developed a novel method and ultra-fast software, Findr, for highly accurate causal inference between gene expression traits using cis-regulatory DNA variations as causal anchors, which improves on current methods by taking into consideration hidden confounders and weak regulations. Findr outperformed existing methods on the DREAM5 Systems Genetics challenge and on the prediction of microRNA and transcription factor targets in human lymphoblastoid cells, while being nearly a million times faster. Findr is publicly available at

Understanding how genetic variation between individuals determines variation in observable traits or disease risk is one of the core aims of genetics. It is known that genetic variation often affects gene regulatory DNA elements and directly causes variation in expression of nearby genes. This effect in turn cascades down to other genes via the complex pathways and gene interaction networks that ultimately govern how cells operate in an ever-changing environment. In theory, when genetic variation and gene expression levels are measured simultaneously in a large number of individuals, the causal effects of genes on each other can be inferred using statistical models similar to those used in randomized controlled trials. We developed a novel method and ultra-fast software, Findr, which, unlike existing methods, takes into account the complex but unknown network context when predicting causality between specific gene pairs. Findr’s predictions have a significantly higher overlap with known gene networks than those of existing methods, using both simulated and real data. Findr is also nearly a million times faster, and hence the only software in its class that can handle modern datasets where the expression levels of tens of thousands of genes are simultaneously measured in hundreds to thousands of individuals.

Genetic variation in non-coding genomic regions, including at loci associated with complex traits and diseases identified by genome-wide association studies (GWAS), predominantly plays a gene-regulatory role [

It is believed that genetic variation can be used to infer the causal directions of regulation between coexpressed genes, based on the principle that genetic variation causes variation in nearby gene expression and acts as a causal anchor for identifying downstream genes [

To investigate and address these issues, we developed Findr (Fast Inference of Networks from Directed Regulations), an ultra-fast software package that incorporates existing and novel statistical causal inference tests. The novel tests were designed to take into account the presence of unknown confounding effects, and were evaluated systematically against multiple existing methods using both simulated and real data.

Findr performs six likelihood ratio tests involving pairs of genes (or exons or transcripts) _{i}, _{i} ≤ 1,

Findr’s computational speed allowed us to systematically evaluate traditional causal inference methods for the first time. We obtained five datasets with 999 samples simulated from synthetic gene regulatory networks of 1,000 genes with known genetic architecture from the DREAM5 Systems Genetics Challenge, and subsampled each dataset to observe how performance depends on sample size (_{0}) does not incorporate genotype information and was used as a benchmark for performance evaluations in terms of areas under the receiver operating characteristic (AUROC) and precision-recall (AUPR) curves (_{2}) and independence (_{3}) tests sequentially (_{2} and _{2} _{3} separately against the correlation test. Both the secondary test alone and the traditional causal inference test combination were found to

(_{2}, _{2} _{3}) and newly proposed (_{4}, _{2} _{5}, _{0}). Every marker corresponds to the average AUROC or AUPR at specific sample sizes. Random subsampling at every sample size was performed 100 times. Half widths of the lines and shades are the standard errors and standard deviations respectively. _{i} corresponds to test _{T}), and correlation (Findr-_{0}) tests, for CIT and for the best scores on the DREAM challenge leaderboard. For individual results on all 15 datasets, see

We believe that the failure of the traditional causal inference test is due to an elevated false negative rate (FNR) coming from two sources. First, the secondary test is less powerful in identifying weak interactions than the correlation test. In a true regulation

To further support this claim, we examined the inference precision among the top predictions from the traditional test, separately for gene pairs directly unconfounded or confounded by at least one gene (

To overcome the limitations of traditional causal inference methods, Findr incorporates two additional tests (_{4}) verifies that _{5}) ensures that the correlation between _{4} performed best in terms of AUROC, and AUPR with small sample sizes, whilst the combination _{2} _{5} achieved highest AUPR for larger sample sizes (_{0}), particularly for AUPR. This demonstrates conclusively in a comparative setting that the inclusion of genotype data indeed can improve regulatory network inference. These observations are consistent across all five DREAM datasets (

We combined the advantages of _{4} and _{2} _{5} by averaging them in a composite test (_{4} and _{2} _{5} at all sample sizes (_{T}) (_{T}), correlation test (_{0}), CIT, and every participating method of the DREAM5 Systems Genetics Challenge (

Specifically, Findr’s new test was able to address the inflated FNR of the traditional method due to confounded interactions. It performed almost equally well on confounded and unconfounded gene pairs and, compared to the traditional test, significantly fewer real interactions fell outside the top 1% of predictions (55% vs. 92% for confounded and 45% vs. 86% for unconfounded interactions,

The traditional causal inference method based on the conditional independence test results in false negatives for confounded interactions, whose effect was shown to be significant for the simulated DREAM datasets. However, the traditional test surprisingly reported more confounded gene pairs than the new test in its top predictions (albeit with lower precision), and correspondingly fewer unconfounded gene pairs (

We hypothesized that this inconsistency originated from yet another source of false negatives, where measurement error can confuse the conditional independence test. Measurement error in an upstream variable (called ^{(t)} → ^{(t)} → ^{(t)}, remains unknown. When the measurement error (in ^{(t)} → ^{(t)} cannot remove all the correlation between

We verified our hypothesis with a simple simulation (^{(t)}, the conditional independence test reported false negatives (likelihood ratio p-value ≪1, i.e. rejecting the null hypothesis of conditional independence, cf.

(^{(t)} → ^{(t)} → ^{(t)}’s variance coming from ^{(t)}’s variance from other sources and ^{−2} (left, bottom) to 10^{2} (right, top) were taken for ^{−6} were discarded. Contour lines are for the log-average of smoothened tile values. Note that for the conditional independence test (

This observation goes beyond the well-known problems that arise from a large measurement error in all variables, which acts like a hidden confounder [^{(t)} than

In order to evaluate Findr on a real dataset, we performed causal inference on miRNA and mRNA sequencing data in lymphoblastoid cell lines from 360 European individuals in the Geuvadis study [_{T}) causal inference tests and the correlation test (_{0}) (

Shown are the AUROC (

We considered 3,172 genes with significant cis-eQTLs in the Geuvadis data [_{T}), new (_{0}) tests, and CIT. Groundtruths of experimentally confirmed causal gene interactions in human, and mammalian systems more generally, are of limited availability and mainly concern transcription or transcription-associated DNA-binding factors (TFs). Here we focused on a set of 25 TFs in the set of eQTL-genes for which either differential expression data following siRNA silencing (6 TFs) or TF-binding data inferred from ChIP-sequencing and/or DNase footprinting (20 TFs) in a lymphoblastoid cell line (GM12878) was available [_{T}. On the other hand, the correlation test significantly overestimated precisions because it is unable to distinguish causal, reversed causal or confounded interactions, which raises its FDR. The same results were observed when alternative groundtruth ChIP-sequencing networks were considered (

The precision (i.e. 1-FDR) of TF target predictions is shown at probability cutoffs 0.1 to 0.9 (blue to yellow) with respect to known functional targets from siRNA silencing of 6 TFs (

We used the following datasets/databases for evaluating causal inference methods:

Simulated genotype and transcriptome data of synthetic gene regulatory networks from the DREAM5 Systems Genetics challenge A (DREAM for short), generated by the SysGenSIM software [

Genotype and transcriptome sequencing data on 465 human lymphoblastoid cell line samples from the Geuvadis project [

Genotype data (ArrayExpress accession E-GEUV-1).

Gene quantification data for 23722 genes from nonredundant unique samples and after quality control and normalization (ArrayExpress accession E-GEUV-1).

Quantification data of miRNA, with the same standard as gene quantification data (ArrayExpress accession E-GEUV-2).

Best eQTLs of mRNAs and miRNAs (ArrayExpress accessions E-GEUV-1 and E-GEUV-2).

We restricted our analysis to 360 European samples which are shared by gene and miRNA quantifications. Excluding invalid eQTLs from the Geuvadis analysis, such as single-valued genotypes, 55 miRNA-eQTL pairs and 3172 gene-eQTL pairs were retained.

For validation of predicted miRNA-gene interactions, we extracted the “strong” ground-truth table from miRLAB [

For verification of predicted gene-gene interactions, we obtained differential expression data following siRNA silencing of 59 transcription-associated factors (TFs) and DNA-binding data of 201 TFs for 8872 genes in a reference lymphoblastoid cell line (GM12878) from [

Consider a set of observations sampled from a mixture distribution of a null and an alternative hypothesis. For instance, in gene regulation, every observation can correspond to the expression levels of a pair of genes which are sampled from a bivariate normal distribution with zero (null hypothesis) or non-zero (alternative hypothesis) correlation coefficient. In Findr, we predict the probability that any sample follows the alternative hypothesis with the following algorithm (based on and modified from [

For robustness against outliers, we convert every continuous variable into standard normally distributed

We propose a null and an alternative hypothesis for every likelihood ratio test (LRT) of interest where, by definition, the null hypothesis space is a subset of the alternative hypothesis. Model parameters are replaced with their maximum likelihood estimators (MLEs) to obtain the log likelihood ratio (LLR) between the alternative and null hypotheses.

We derive the analytical expression for the probability density function (PDF) of the LLR when samples follow the null hypothesis.

We convert LLRs into posterior probabilities of the hypothesis of interest with the empirical estimation of local FDR.

Implementation details can be found in Findr’s source code.
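The first two steps can be illustrated for the simplest case, the correlation test. The following is a minimal sketch and not Findr's implementation: the function names are hypothetical, the rank-to-quantile convention (rank + 0.5)/n and the tie-breaking by input order are assumptions, and the LLR formula is the standard result for a bivariate normal model in which the correlation coefficient is replaced by its MLE.

```python
import math
from statistics import NormalDist


def supernormalize(values):
    """Map values to standard normal quantiles by rank (robust to outliers).

    Assumption: the 0-based rank r is mapped to the quantile (r + 0.5) / n,
    and ties are broken by input order. Findr may use other conventions.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    normal = NormalDist()  # mean 0, standard deviation 1
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = normal.inv_cdf((rank + 0.5) / n)
    return out


def pearson(a, b):
    """Sample Pearson correlation coefficient (the MLE of rho)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_llr(a, b):
    """LLR of 'correlated' versus 'independent' for bivariate normal data."""
    r2 = pearson(a, b) ** 2
    if r2 >= 1.0:
        return math.inf  # perfectly correlated samples
    return -0.5 * len(a) * math.log(1.0 - r2)


# Made-up expression values; the outlier 100.0 becomes an ordinary quantile.
a = supernormalize([3.0, 100.0, 1.0, 2.0])
b = supernormalize([0.2, 0.9, 0.1, 0.4])
llr = correlation_llr(a, b)
```

Because the transform depends only on ranks, replacing 100.0 by any other value larger than 3.0 leaves `a`, and therefore `llr`, unchanged.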

Consider correlated genes _{i} and _{i} for the expression levels of gene _{i}, where _{a} alleles, so _{i} ∈ {0, …, _{a}}. We define the null and alternative hypotheses for a total of six tests, as shown in

Maximum likelihood estimators (MLE) for the model parameters _{A0}, and _{B0} are

_{j}, _{a}, and _{A}, as
_{j} is the sample count by genotype category,
_{xy} = 1 for _{a}), we only pick those that exist (_{j} > 0) throughout this article. Since the null hypothesis is simply that _{i} is sampled from a genotype-independent normal distribution, with MLEs of mean zero and standard deviation one due to supernormalization, the LLR for test 1 becomes
^{(1)}, we select
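As an illustration of the primary test, the sketch below computes an LLR for "the mean expression depends on the genotype category" against "it does not". It is written under assumptions not confirmed by the surviving text: one normal mean per observed genotype category, a single shared variance, and maximum likelihood (1/n) variance estimates. The function name `primary_llr` is hypothetical.

```python
import math
from collections import defaultdict


def primary_llr(genotype, expression):
    """LLR of a genotype-dependent mean versus a genotype-independent mean.

    Assumed model (a sketch, not Findr's code): normal expression levels,
    one mean per observed genotype category, one shared variance.
    """
    n = len(expression)
    groups = defaultdict(list)
    for e, a in zip(genotype, expression):
        groups[e].append(a)  # only categories with at least one sample exist

    # Null hypothesis: a single mean for all samples.
    overall_mean = sum(expression) / n
    var_null = sum((a - overall_mean) ** 2 for a in expression) / n
    if var_null == 0.0:
        return 0.0  # constant expression carries no information

    # Alternative hypothesis: one mean per genotype category.
    var_alt = 0.0
    for members in groups.values():
        mu_j = sum(members) / len(members)  # MLE of the category mean
        var_alt += sum((a - mu_j) ** 2 for a in members)
    var_alt /= n
    if var_alt == 0.0:
        return math.inf  # genotype explains expression perfectly

    return -0.5 * n * math.log(var_alt / var_null)


# Made-up data: genotype 1 shifts expression upwards.
llr = primary_llr([0, 0, 1, 1], [-1.5, -0.5, 0.5, 1.5])
```

For supernormalized expression levels the null mean is zero and the null variance is one, so the expression reduces to minus n/2 times the logarithm of the within-category variance.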

_{j}, _{j}, _{A}, _{B}, _{j}, _{A}, _{B} of

The null distribution of LLR,

_{i} with zero mean and unit variance as:
_{1}, _{2}, _{3} are independent, and
_{1}, _{2}, _{3} gives

We define distribution _{1}/2, _{2}/2), i.e.
_{1}, _{2}) = 0. Here

_{i} with zero mean and unit variance.

The expression of LLR^{(1)} then becomes:
_{j} > 0 for _{a}. Transform _{i}} are pairwise independent, and
_{i}} by defining independent random variables
_{v} ≡ ∑_{j∈{j∣nj>0}} 1 as the number of different genotype values across all samples. Then

^{(2)} follows the same null distribution as

_{i} according to ^{(3)} can be obtained with similar but more complex computations from

^{(4)} can be obtained similarly by randomizing _{i} according to Eqs

_{i} according to

We verified our analytical method of deriving null distributions by comparing, for the relevance test, the analytical null distribution with the null distribution obtained from permutation.
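The same kind of check can be sketched for the simpler correlation test, which is swapped in here for the relevance test because its null is a textbook result: for independent normal variables the squared sample correlation follows a Beta(1/2, (n − 2)/2) distribution, whose mean is 1/(n − 1). The function names, sample size, and number of permutations below are illustrative assumptions.

```python
import math
import random


def r_squared(a, b):
    """Squared sample Pearson correlation."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return min(1.0, cov * cov / (var_a * var_b))


def permutation_null_r2(a, b, n_perm=2000, seed=0):
    """Null sample of r^2 obtained by shuffling b across samples."""
    rng = random.Random(seed)
    shuffled = list(b)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real association with a
        null.append(r_squared(a, shuffled))
    return null


rng = random.Random(1)
n = 50
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [x + rng.gauss(0.0, 1.0) for x in a]  # a truly correlated pair

null_r2 = permutation_null_r2(a, b)
# Guard against log(0) in the (practically impossible) case r^2 == 1.
null_llr = [-0.5 * n * math.log(1.0 - r2) for r2 in null_r2 if r2 < 1.0]

permuted_mean = sum(null_r2) / len(null_r2)
analytical_mean = 1.0 / (n - 1)  # mean of Beta(1/2, (n - 2)/2)
```

Comparing `permuted_mean` with `analytical_mean` checks only the first moment; a histogram of `null_llr` against the analytical density would be the fuller comparison.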

After obtaining the PDFs for the LLRs from real data and the null hypotheses, we can convert LLR values into posterior probabilities

To be precise, consider an arbitrary likelihood ratio test. The fundamental assumption is that in the limit LLR → 0^{+}, all test cases come from the null hypothesis (^{+} side. This provides all the prerequisites to perform Bayesian inference and obtain any _{i} from

In practice, PDFs are approximated with histograms. This requires proper choices of histogram bin widths, ^{3} to 10^{4} candidate targets (“
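A histogram-based conversion from LLRs to posterior probabilities can be sketched as follows. This is a simplified illustration, not Findr's procedure: the fixed number of equal-width bins, the function name `posterior_from_llr`, and the exponential null used in the toy example are all assumptions. The null fraction is fixed by the assumption stated above, namely that all cases in the limit LLR → 0^{+} come from the null hypothesis.

```python
import math
import random


def posterior_from_llr(llrs, null_cdf, n_bins=20):
    """Posterior probability of the alternative hypothesis for each LLR.

    llrs     : observed non-negative LLRs (null and alternative mixed)
    null_cdf : analytical CDF of the LLR under the null hypothesis
    """
    n = len(llrs)
    width = max(llrs) / n_bins

    def bin_of(x):
        return min(int(x / width), n_bins - 1)

    counts = [0] * n_bins
    for x in llrs:
        counts[bin_of(x)] += 1

    # Histogram approximations of the real and null densities per bin.
    real = [c / (n * width) for c in counts]
    null = [(null_cdf((k + 1) * width) - null_cdf(k * width)) / width
            for k in range(n_bins)]

    # Null fraction: the lowest bin is assumed to contain only null cases.
    pi0 = min(1.0, real[0] / null[0])

    posterior = [0.0] * n_bins
    for k in range(n_bins):
        if real[k] > 0.0:
            p_alt = 1.0 - pi0 * null[k] / real[k]  # 1 - local FDR
            posterior[k] = min(1.0, max(0.0, p_alt))
    return [posterior[bin_of(x)] for x in llrs]


# Toy mixture with made-up parameters: 90% null cases, 10% alternative cases.
rng = random.Random(0)
llrs = [rng.expovariate(1.0) for _ in range(900)]
llrs += [8.0 + rng.expovariate(1.0) for _ in range(100)]
post = posterior_from_llr(llrs, lambda x: 1.0 - math.exp(-x))
```

With so few bins the estimate is noisy wherever counts are low, which is why the choice of bin widths matters in practice.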

Lastly, in a typical application of Findr, inputs of (_{1} = 1 for all considered pairs, and skip the primary test.

Based on the six tests in

The correlation test is introduced as a benchmark, against which we can compare other methods involving genotype information. Pairwise correlation is a simple measure for the probability of two genes being functionally related either through direct or indirect regulation, or through coregulation by a third factor. Bayesian inference additionally considers different gene roles. Its predicted posterior probability for regulation is P_{0}.

The traditional causal inference test, as explained in [P_{2} and P_{2} P_{3} separately, in order to assess the individual effects of secondary and independence tests. As discussed above, we expect a set of significant eQTLs and their associated genes as input, and therefore P_{1} = 1 is assured and not calculated in this paper or the package Findr. Note that P_{T} is the estimated local precision, i.e. the probability that tests 2 and 3 are both true. Correspondingly, its local FDR (the probability that at least one of them is false) is 1 − P_{T}.

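The novel test, aimed specifically at addressing the failures of the traditional causal inference test, combines the tests differently, by averaging the relevance test with the product of the secondary and controlled tests:

P = (P_{2} P_{5} + P_{4}) / 2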
_{2} _{5} (with _{1} = 1) verifies the hypothesis that

On the other hand, the relevance test in the second term of _{2} close to 0). This term still grants higher-than-null significance to weak interactions, and verifies that _{2} = 0 but _{4} ≠ 0, the novel test

The composite design of the novel test aims not to miss any genuine regulation whilst distinguishing the full spectrum of possible interactions. When the signal level is too weak for tests 2 and 5, we expect P_{4} to still provide distinguishing power better than random predictions. When the interaction is strong, P_{2} P_{5} is then able to pick up true targets regardless of the existence of hidden confounders.
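The two ways of combining test posteriors can be contrasted with a small sketch. It reads the traditional test as the product of the secondary and independence posteriors, and the novel test as the average of the relevance posterior and the product of the secondary and controlled posteriors. The function names and all numerical posteriors are made up for illustration and are not results from this paper.

```python
def traditional_test(p2, p3):
    """Probability that the secondary and independence tests both hold."""
    return p2 * p3


def novel_test(p2, p4, p5):
    """Average of the relevance test (p4) and secondary-times-controlled."""
    return 0.5 * (p2 * p5 + p4)


# Hypothetical weak regulation: the secondary test has little power,
# but the relevance test still sees some signal.
weak = {"p2": 0.05, "p3": 0.90, "p4": 0.60, "p5": 0.50}

# Hypothetical strong but confounded regulation: the independence test
# fails, while the controlled test still passes.
confounded = {"p2": 0.95, "p3": 0.10, "p4": 0.99, "p5": 0.90}

for name, p in (("weak", weak), ("confounded", confounded)):
    old = traditional_test(p["p2"], p["p3"])
    new = novel_test(p["p2"], p["p4"], p["p5"])
    print(f"{name}: traditional={old:.4f} novel={new:.4f}")
```

In both hypothetical cases the traditional combination stays below 0.1, whereas the novel combination keeps a score that still ranks the pair above random predictions.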

Given the predicted posterior probabilities for every pair (
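The two evaluation metrics used throughout, AUROC and AUPR, can be computed from a list of scores and a ground-truth table as sketched below. The quadratic-time AUROC and the use of average precision as the AUPR estimator are choices made for this illustration; the estimator used in our evaluations is not specified here, and the function names are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a true pair is scored above a false pair.

    Ties count as one half. Quadratic in the number of pairs, which is
    acceptable for a small illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean precision at the rank of each true pair (an AUPR estimator)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / k  # precision among the top k predictions
    return total / hits if hits else 0.0


# Made-up posterior probabilities and ground truth for four gene pairs.
scores = [0.9, 0.8, 0.3, 0.1]
truth = [1, 0, 1, 0]
roc = auroc(scores, truth)
aupr = average_precision(scores, truth)
```

Both functions accept any score that ranks predictions, so they apply equally to posterior probabilities and to scores from other inference methods.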

In order to assess the effect of sample size on the performance of inference methods, we performed subsampling evaluations. This was made possible by the DREAM datasets, which contain 999 samples with sufficient variance, and by the computational efficiency of Findr. With a given dataset and ground-truth table, the total number of samples

Randomly select

Infer regulations only based on the selected samples.

Compute and record the evaluation metrics of interest (e.g. AUROC and AUPR) with the inference results and ground-truths.

Evaluation metrics are recorded in every loop, and their means, standard deviations, and standard errors over the
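The subsampling loop can be sketched as below. Sampling without replacement, the fixed random seed, and the function name `subsample_evaluate` are assumptions of this sketch, and `toy_metric` is a hypothetical stand-in for the two real steps of inferring regulations on the selected samples and scoring them against the ground truth.

```python
import math
import random
import statistics


def subsample_evaluate(n_total, n_sub, n_rounds, evaluate, seed=0):
    """Repeat random subsampling and summarise one evaluation metric.

    evaluate : callable taking the selected sample indices and returning
               a metric such as AUROC or AUPR (supplied by the caller).
    Returns the mean, standard deviation and standard error of the metric.
    """
    rng = random.Random(seed)
    metrics = []
    for _ in range(n_rounds):
        chosen = rng.sample(range(n_total), n_sub)  # without replacement
        metrics.append(evaluate(chosen))
    mean = statistics.mean(metrics)
    sd = statistics.stdev(metrics)
    se = sd / math.sqrt(n_rounds)
    return mean, sd, se


def toy_metric(sample_indices):
    """Hypothetical placeholder: inference and scoring would go here."""
    return sum(sample_indices) / (len(sample_indices) * 998)


mean, sd, se = subsample_evaluate(
    n_total=999, n_sub=100, n_rounds=100, evaluate=toy_metric
)
```

Passing the evaluation step as a callable keeps the loop independent of the inference method, so the same code can be reused for each test and each sample size.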

In order to demonstrate the inferential precision among top predictions for any inference test (here the traditional and novel tests separately), we first ranked all (ordered) gene pairs (

We investigated how each statistical test tolerates measurement errors with simulations in a controlled setting. We modelled the causal relation ^{(t)} → ^{(t)} → ^{(t)} is the true expression level of gene

For simplicity, we only considered monoallelic species. Therefore the genotype ^{(t)}, ^{(t)} → ^{(t)} → _{A1}, _{A2}, and _{B} are parameters of the model. Note that _{i}} respectively.

Given the five parameters of the model (the number of samples, the minor allele frequency, _{A1}, _{A2}, and _{B}), we could simulate the observed data for

We then chose different configurations on the number of samples, the minor allele frequency, and _{B}. For each configuration, we varied _{A1} and _{A2} in a wide range to obtain a 2-dimensional heatmap plot for the p-value of each test, thereby exploring how each test was affected by measurement errors of different strengths. Only tiles with a significant
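The effect of measurement error on conditional independence can be reproduced with a minimal simulation. The structural equations below are assumptions made for this sketch, since the exact model is not recoverable here: a binary genotype, unit effect sizes, and additive Gaussian noise. The partial correlation of the genotype and the downstream gene given the measured upstream gene is used as a simple stand-in for the likelihood ratio test.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n, maf, sigma_a1, sigma_a2, sigma_b, seed=0):
    """Simulate E -> A(true) -> B with measurement error on A (assumed model)."""
    rng = random.Random(seed)
    e, a, b = [], [], []
    for _ in range(n):
        e_i = 1.0 if rng.random() < maf else 0.0        # monoallelic genotype
        a_true = e_i + sigma_a1 * rng.gauss(0.0, 1.0)   # true expression of A
        a_obs = a_true + sigma_a2 * rng.gauss(0.0, 1.0) # measured, with error
        b_obs = a_true + sigma_b * rng.gauss(0.0, 1.0)  # B is driven by true A
        e.append(e_i)
        a.append(a_obs)
        b.append(b_obs)
    return e, a, b


# Without measurement error, conditioning on A removes the E-B correlation.
e, a, b = simulate(5000, maf=0.3, sigma_a1=0.5, sigma_a2=0.0, sigma_b=0.5)
clean = partial_corr(e, b, a)

# With measurement error in A, a residual E-B correlation remains.
e, a, b = simulate(5000, maf=0.3, sigma_a1=0.5, sigma_a2=1.0, sigma_b=0.5)
noisy = partial_corr(e, b, a)
```

A conditional independence test applied to the second dataset would reject independence even though the true causal chain runs entirely through the upstream gene, which is the false negative mechanism discussed above.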

We developed a highly efficient, scalable software package, Findr (Fast Inference of Networks from Directed Regulations), implementing novel and existing causal inference tests. Applied to real and simulated genome and transcriptome variation data, our novel tests, which account for weak secondary linkage and hidden confounders at the potential cost of an increased number of false positives, predicted known gene regulatory interactions significantly better than existing methods. The improvement was most pronounced over traditional methods based on conditional independence tests, which had highly elevated false negative rates.

Causal inference using eQTLs as causal anchors relies on crucial assumptions which have been discussed in-depth elsewhere [

Although the newly proposed test avoids the elevated FNR from the conditional independence test, it is not without its own limitations. Unlike the conditional independence test, the relevance and controlled tests (

In this paper we have addressed the challenge of pairwise causal inference, but to reconstruct the actual pathways and networks that affect a phenotypic trait, two important limitations have to be considered. First, linear pathways propagate causality, and may thus appear as densely connected sets of triangles in pairwise causal networks. Secondly, most genes are regulated by multiple upstream factors, and hence some true edges may only have a small posterior probability unless they are considered in an appropriate multivariate context. The most straightforward way to address these issues would be to model the real directed interaction network as a Bayesian network with sparsity constraints. A major advantage of Findr is that it outputs probability values which can be directly incorporated as prior edge probabilities in existing network inference software.

In conclusion, Findr is a highly efficient and accurate open source software tool for causal inference from large-scale genome-transcriptome variation data. Its nonparametric nature ensures robust performance across datasets without parameter tuning, with easily interpretable output in the form of accurate precision and FDR estimates. Findr is able to predict causal interactions in the context of complex regulatory networks where unknown upstream regulators confound traditional conditional independence tests, and more generally in any context with discrete or continuous causal anchors.


Real, analytical null, and permuted null distributions are shown, together with the curve of the inferred posterior probability of the alternative hypothesis. Permutations were performed randomly on all potential target genes 100 times. The agreement between the analytical and permuted null distributions, and the consistently increasing trend of the posterior probability, support our method for deriving analytical null distributions.


Every marker corresponds to the average AUROC or AUPR at specific sample sizes. At every sample size we performed 100 random subsamplings. Half widths of the lines and shades are the standard errors and standard deviations, respectively, of AUROC or AUPR. Figures from top to bottom correspond to datasets 1, 2, 3, 5. For dataset 4, see


Every marker corresponds to the AUROC or AUPR of one dataset. CIT is an R package that includes the conditional independence test, along with tests 2 and 5, while also comparing


When


The real precision was computed according to the groundtruth, whilst the estimated precision was obtained from the estimated FDR from the respective inference method (precision = 1 − FDR). Only genes with cis-eQTLs were considered as primary targets in prediction and validation. Both the novel (


(^{(t)} →


The solid black lines correspond to expected performances from random predictions. A higher curve indicates better prediction performance.


AUROC and AUPR metrics are measured for three inference tasks. MiRNA compares miRNA target predictions based on Geuvadis miRNA and mRNA expression levels against groundtruths from miRLAB. SiRNA and TF-binding compare gene-gene interaction predictions based on Geuvadis gene expression levels against groundtruths from siRNA silencing and TF-binding measurements respectively. ENCODE compares the same gene-gene interaction predictions against TF-binding networks derived from ENCODE data. Dashed lines indicate expected performances from random predictions.


The number above each bar indicates the number of positive predictions at the corresponding threshold. The dashed line is precision from random predictions.


All cis- and trans-genes are included. The DREAM challenge capped the number of submitted regulations at 100,000, a limit that we also applied in our evaluation. Findr’s new test consistently obtained higher AUROC and AUPR than all other methods, including the leaders of the DREAM challenge.


Higher AUROC and AUPR values signify stronger predictive power. Program running times are given in seconds (s), minutes (m), hours (h), or days (d). Findr outperformed other methods in statistical power and speed, with or without genotype information.


The three gold standards could not agree on method.
