Conceived and designed the experiments: XZ DH. Performed the experiments: XZ WC SH DH. Analyzed the data: XZ WC JL DH. Contributed reagents/materials/analysis tools: CK. Wrote the paper: XZ WC WW DH.

The authors have read the journal's policy and have the following conflicts: David Heckerman, Jennifer Listgarten, and Carl Kadie are with Microsoft Research. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) the lack of complete knowledge of the regulatory relationships between regulators and their associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that is usually larger than the number of available microarray experiments. We present a sparse (L1 regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1109 human liver samples, we demonstrate that our model better predicts out-of-sample data than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software including source code is available at

Transcriptional regulatory networks govern the expression levels of thousands of genes as part of diverse biological processes. Regulatory proteins called transcription factors (TFs) are the main players in the regulatory network. TFs bind to promoter regions at the start of other genes and thereby initiate or inhibit gene expression. Determining accurate models of transcriptional regulatory interactions is an important challenge in computational biology. With the development of high-throughput DNA microarray technologies, it is possible to simultaneously monitor the expression levels of essentially all genes. Extensive research has been done to build quantitative regulatory models by associating gene expression levels (see

One challenge in this work is that not all TFs have been identified and the regulatory relationship between TFs and their associated genes may not be available (except for some well studied model organisms such as yeast). Another challenge is the potential for spurious associations between regulators and affected genes due to confounding factors such as expression heterogeneity

Various methods have been proposed to learn the regulatory relationship between TFs and their associated genes

In this paper, we propose a linear-Gaussian graphical model to address the challenges in learning regulatory relationships. Our model consists of two layers of nodes as shown in

Known and potential TFs are assumed to be mutually independent. Regulated genes are assumed to be mutually independent given the TFs.

To learn the parameters of the model from data, which is usually of high dimension and low sample size, we use L1 regularization as is done in

We apply our model to large-scale human gene expression data and show that it has better prediction accuracy than the alternatives. We examine each gene set defined by the genes in the lower layer that are connected to a single hidden variable in the upper layer. We find that some of these gene sets are strongly correlated with GO categories, suggesting that the hidden variables at least in part represent unknown TFs. The software including source code is publicly available at

Our model can be thought of as a combination of linear regression and probabilistic principal component analysis (PPCA)

Throughout the paper, we assume that all vectors are column vectors. Let

The idea of PPCA is similar to that of linear regression. The difference is that the expression level of a gene

To incorporate both known/putative TFs and unknown factors, our model combines linear regression and PPCA. We model the expression level of a gene to be a linear function of the expression levels of both known/putative TFs and hidden factors.
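The combined model can be illustrated with a minimal simulation sketch. The names `W` (weights on known/putative TFs), `V` (weights on hidden factors), and `noise_sd` are hypothetical, and the standard-normal prior on the hidden factors is an assumption borrowed from PPCA rather than a detail confirmed by the text above.

```python
import random


def simulate_expression(W, V, tf_levels, noise_sd, rng):
    """Draw one sample of regulated-gene expression levels.

    W[i][j]: weight of known/putative TF j on gene i (hypothetical name).
    V[i][k]: weight of hidden factor k on gene i (hypothetical name).
    tf_levels: observed expression levels of the known/putative TFs.
    """
    n_hidden = len(V[0])
    # Assumption: hidden factors are independent standard normals, as in PPCA.
    z = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    y = []
    for w_row, v_row in zip(W, V):
        # Linear contribution of the observed TFs (the regression part).
        mean = sum(w * x for w, x in zip(w_row, tf_levels))
        # Linear contribution of the hidden factors (the PPCA part).
        mean += sum(v * h for v, h in zip(v_row, z))
        # Independent Gaussian noise for each gene.
        y.append(mean + rng.gauss(0.0, noise_sd))
    return y, z
```

Setting every entry of `V` to zero reduces the sketch to plain linear regression on the known TFs, and setting every entry of `W` to zero reduces it to a PPCA-style factor model.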

Next, we use multivariate notation to formalize and derive the likelihood function of our model. Let

The parameter space in our model is

To optimize the likelihood function with L1 norm, we use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm described in
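To show how an L1 penalty drives weights to exactly zero, the sketch below uses proximal gradient descent (ISTA) on a single-gene least-squares loss. This is a deliberately simpler stand-in and is not the OWL-QN algorithm; the function names, the squared-error loss, and the fixed step size are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista_lasso(X, y, lam, step, n_iter=500):
    """Minimize (1/(2n)) * ||y - X w||^2 + lam * ||w||_1 by proximal gradient.

    X: list of n rows, each a list of p regulator expression levels.
    y: list of n expression levels of one regulated gene.
    step: fixed step size; assumed small enough for convergence.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        # Residuals of the current linear fit.
        residual = [
            sum(row[j] * w[j] for j in range(p)) - target
            for row, target in zip(X, y)
        ]
        # Gradient of the smooth (squared-error) part of the objective.
        grad = [sum(residual[i] * X[i][j] for i in range(n)) / n for j in range(p)]
        # Gradient step followed by soft-thresholding for the L1 term.
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(p)]
    return w
```

In this sketch a weakly supported weight is set to exactly zero rather than merely made small, which is the property that yields a sparse regulatory structure.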

Besides the loss function and the penalized parameters, the OWL-QN algorithm also needs the gradient of the loss function, which (without detailed derivation) is

We note that sparse PCA is not convex

The gene expression data is taken from 1109 human liver samples. Each RNA sample was profiled on a custom Agilent 44,000 feature microarray composed of 39,296 oligonucleotide probes targeting transcripts representing 34,266 known and predicted genes, including high-confidence, non-coding RNA sequences. The gene expression data was originally collected to characterize the genetic architecture of gene expression in human liver

We evaluated three models: (1) one with hidden variables, (2) one with no hidden variables, and (3) a reference model that assumes the non-TF genes are mutually independent (i.e., a model with no top layer in the corresponding graph). We evaluated the models by measuring out-of-sample log likelihoods via ten-fold cross validation. More specifically, we partitioned the samples into 10 subsets of equal size. In each fold, we used the samples in 9 subsets as training data and tested the learned model on the remaining subset. By measuring out-of-sample rather than in-sample predictions, we avoid rewarding models that overfit the data. Within each cross, optimal values for
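The cross-validation procedure described above can be sketched as follows. For brevity the fitted model here is a univariate Gaussian, a placeholder for the graphical models being compared; the function names are hypothetical, and in practice the samples would typically be shuffled before being split.

```python
import math


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for f in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if f < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def gaussian_loglik(values, mean, var):
    """Log likelihood of values under a Gaussian with the given mean and variance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
        for v in values
    )


def cv_out_of_sample_loglik(data, k=10):
    """Sum of held-out log likelihoods across k folds."""
    total = 0.0
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in test_idx]
        # Fit on the training folds only (placeholder model: one Gaussian).
        mean = sum(train) / len(train)
        var = sum((d - mean) ** 2 for d in train) / len(train)
        # Score on the held-out fold, so overfitting is not rewarded.
        total += gaussian_loglik(test, mean, var)
    return total
```

Because 1109 is not divisible by 10, the folds in this sketch differ in size by at most one sample.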

Hidden variables can model the effect of unknown regulators or hidden confounders. To better understand the effect of the hidden variables, we look for correlations between genes associated with a given hidden variable and sets of genes in GO categories (Biological Process Ontology)
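One common way to score the overlap between a learned gene set and a GO category is a hypergeometric tail probability. The sketch below illustrates that general idea only; it is an assumption and may differ from the enrichment procedure actually used in this study. All function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_category, n_gene_set, n_overlap):
    """P(overlap >= n_overlap) for a gene set drawn without replacement.

    n_universe: total number of genes considered.
    n_category: number of genes annotated to the GO category.
    n_gene_set: size of the gene set linked to one hidden variable.
    n_overlap: observed number of genes shared by the set and the category.
    """
    upper = min(n_category, n_gene_set)
    total = comb(n_universe, n_gene_set)
    # Sum the hypergeometric probabilities for all overlaps at least as large.
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_gene_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A raw p-value from such a test would still need correction for the number of categories and gene sets examined before it is interpreted.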

In

Gene Set Size | Raw p-value | Adjusted p-value | FDR | GO Categories |

19649 | 1.17×10^{−15} | 0 | 0 | cellular protein metabolic process |

19431 | 2.31×10^{−13} | 0 | 0 | protein metabolic process |

22301 | 1.71×10^{−10} | 0 | 0 | transport |

23608 | 2.53×10^{−9} | 0 | 0 | transport |

20500 | 9.47×10^{−9} | 0 | 0 | cellular protein metabolic process |

26332 | 1.55×10^{−8} | 0 | 0 | transport |

21264 | 2.20×10^{−5} | 0.001 | 0.003 | response to chemical stimulus |

19395 | 1.87×10^{−5} | 0.004 | 0.01 | organic acid metabolic process |

21098 | 1.51×10^{−4} | 0.01 | 0.022 | organic acid metabolic process |

29240 | 2.03×10^{−3} | 0.026 | 0.052 | synaptic transmission |

20199 | 3.76×10^{−4} | 0.03 | 0.054 | positive regulation of phosphate metabolic process |

24175 | 1.04×10^{−3} | 0.048 | 0.08 | phosphoinositide mediated signaling |

17480 | 6.73×10^{−4} | 0.064 | 0.1 | cation homeostasis |

20331 | 9.45×10^{−4} | 0.07 | 0.1 | digestion |

22477 | 1.29×10^{−3} | 0.075 | 0.1 | locomotory behavior |

22644 | 2.74×10^{−3} | 0.204 | 0.255 | organic acid transport |

18732 | 4.00×10^{−3} | 0.393 | 0.462 | positive regulation of T cell proliferation |

16294 | 7.86×10^{−3} | 0.707 | 0.786 | inorganic anion transport |

The first column of

Among the gene sets associated with known/putative regulators, there are 803 gene sets of size greater than or equal to 5. The maximum size is 8820.

Our method aims to learn the transcriptional regulatory relationship without any prior knowledge of the network topology. As discussed in the Introduction section, various methods have been proposed to learn transcription factor activity assuming that the regulatory network topology is known

The NCA algorithm requires three criteria to be met for the decomposition to be unique

To apply NCA to infer the regulatory structure, we use a random matrix as the input matrix

We apply GO enrichment analysis on the gene sets learned by NCA and our method.

Method | Average raw p-value | Number of gene sets with calibrated p-values |

NCA | 0.024 | 5 |

Our model | 0.007 | 219 |

Reconstructing gene transcriptional regulatory networks is a central problem in computational systems biology. Challenging issues include the incorporation of knowledge about TFs and the modeling of unknown TFs and confounders. We have developed a probabilistic graphical model that includes the known TFs as observed variables, uses hidden variables to model unknown TFs and confounders, and uses L1 regularization to address the high dimensionality and relatively low sample size of the data. Using human gene expression data, we have shown that the proposed model predicts significantly better than does the model without hidden variables. In addition, we have found that some of the gene sets corresponding to hidden variables have significant correlations with GO categories, suggesting that the hidden variables at least in part represent unknown TFs.