
The authors have declared that no competing interests exist.

Conceived and designed the experiments: AR FJ. Performed the experiments: MG. Analyzed the data: MG. Contributed reagents/materials/analysis tools: MG AR FJ. Wrote the paper: MG AR FJ.

Gene network inference from transcriptomic data is an important methodological challenge and a key aspect of systems biology. Although several methods have been proposed to infer networks from microarray data, there is a need for inference methods able to model RNA-seq data, which are count-based and highly variable. In this work we propose a hierarchical Poisson log-normal model with a Lasso penalty to infer gene networks from RNA-seq data; this model has the advantage of directly modelling discrete data and accounting for inter-sample variance larger than the sample mean. Using real microRNA-seq data from breast cancer tumors and simulations, we compare this method to a regularized Gaussian graphical model on log-transformed data, and a Poisson log-linear graphical model with a Lasso penalty on power-transformed data. For data simulated with large inter-sample dispersion, the proposed model performs better than the other methods in terms of sensitivity, specificity and area under the ROC curve. These results show the necessity of methods specifically designed for gene network inference from RNA-seq data.

In recent years, high-throughput sequencing technology has become an essential tool for genomic studies. In particular, it allows the transcriptome to be directly sequenced (RNA sequencing), which provides count-based measures of gene expression. Typically, the first biological question arising from these data is to identify genes differentially expressed across biological conditions. Because RNA-seq data are known to exhibit a large amount of variability among biological replicates, most methods for differential analysis are based either on overdispersed Poisson

In order to study the relationships between these large numbers of genes, several authors have worked on co-expression networks and used methods based on Pearson correlation

The simplest idea is to perform an appropriate transformation of the data, using for example a Box-Cox transformation

We first define the notation that will be used throughout this paper. Let _{ij}_{ij}_{ij}

The underlying assumption of this model is that the data are normally distributed. For untransformed RNA-seq data, this assumption is not valid, since counts are discrete and cannot take negative values. We investigated a variety of Box-Cox transformations to obtain approximately normal data
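As an illustration of how a Box-Cox exponent can be chosen, the following minimal sketch evaluates the Box-Cox profile log-likelihood on a grid, for an intercept-only model. The simulated counts, the shift of 1 applied to handle zeros, the grid, and the function names are all assumptions made for this example; they are not taken from the analysis reported here.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox transform of strictly positive values (log when lam == 0)."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def boxcox_profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    y = np.asarray(y, dtype=float)
    z = boxcox(y, lam)
    # Normal log-likelihood at the MLE of mean and variance, plus the
    # Jacobian term of the transformation.
    return -0.5 * y.size * np.log(z.var()) + (lam - 1.0) * np.log(y).sum()

# Hypothetical skewed count data (Poisson with log-normal means).
rng = np.random.default_rng(0)
counts = rng.poisson(np.exp(rng.normal(3.0, 1.0, size=200)))
shifted = counts + 1.0  # assumption: shift by 1 so that zero counts are allowed

grid = np.linspace(-1.0, 1.0, 41)
loglik = np.array([boxcox_profile_loglik(shifted, lam) for lam in grid])
best_lambda = grid[loglik.argmax()]
```

For data of this kind the maximizing exponent is expected to lie close to zero, which corresponds to a log transformation.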

Since gene expression data may contain zero counts, we usually use (

Let

A common assumption in the context of gene networks is that the matrix

Network inference using a Gaussian graphical model has been extensively studied and used over the past years. Many methods exist to compute the penalized maximum likelihood estimate of the
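The link between a Gaussian graphical model and conditional dependence can be illustrated with a small, unpenalized example. The sketch below assumes many more samples than variables, so that the sample covariance matrix can be inverted directly; no Lasso penalty is applied, and the three simulated "genes" and the function name are hypothetical.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlations from the inverse of the sample covariance matrix."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain g1 -> g2 -> g3: g1 and g3 are marginally correlated
# but conditionally independent given g2.
rng = np.random.default_rng(1)
n = 5000
g1 = rng.normal(size=n)
g2 = 0.8 * g1 + rng.normal(size=n)
g3 = 0.8 * g2 + rng.normal(size=n)
data = np.column_stack([g1, g2, g3])

marginal = np.corrcoef(data, rowvar=False)
partial = partial_correlations(data)
```

In this example the marginal correlation between g1 and g3 is clearly non-zero, whereas their partial correlation is close to zero, so only the edges g1-g2 and g2-g3 would appear in the graph. When the number of genes is large relative to the number of samples, direct inversion is no longer reliable, which is the motivation for penalized estimation.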

The choice of the regularization parameter

Note that a single parameter

A log-linear Poisson graphical model specifically designed for network inference from count data has been recently proposed

Let _{ij}

with

The notation

Similar to the previous model, we assume that the vector
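To make the idea of a Lasso-penalized Poisson regression concrete, the sketch below fits one such regression by proximal gradient descent with backtracking. It is a generic illustration under stated assumptions (unpenalized intercept, simulated Gaussian predictors, hypothetical function names and penalty value) and is not the solver used for the results reported here.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def poisson_lasso(X, y, lam, n_iter=500):
    """L1-penalized Poisson regression with log link (intercept unpenalized)."""
    n, p = X.shape
    b0, beta = np.log(y.mean()), np.zeros(p)

    def smooth_loss(b0, beta):
        eta = b0 + X @ beta
        return np.mean(np.exp(eta) - y * eta)  # negative log-likelihood / n

    step = 1.0
    for _ in range(n_iter):
        mu = np.exp(b0 + X @ beta)
        g0, g = np.mean(mu - y), X.T @ (mu - y) / n
        f_old = smooth_loss(b0, beta)
        while True:  # backtracking: shrink the step until sufficient decrease
            new_b0 = b0 - step * g0
            new_beta = soft_threshold(beta - step * g, step * lam)
            d0, d = new_b0 - b0, new_beta - beta
            bound = f_old + g0 * d0 + g @ d + (d0 ** 2 + d @ d) / (2 * step)
            if smooth_loss(new_b0, new_beta) <= bound + 1e-12:
                break
            step *= 0.5
        b0, beta = new_b0, new_beta
    return b0, beta

# Hypothetical data: only the first of five predictors affects the response.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = rng.poisson(np.exp(1.0 + 0.5 * X[:, 0]))
intercept, coef = poisson_lasso(X, y, lam=0.3)
```

In a neighborhood-selection setting, one such regression would be fitted per gene, with the other genes as predictors, and the non-zero coefficients would define the candidate edges.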

We note that the Poisson model presented above requires a transformation of the data to account for the high dispersion. Here we propose to account for this dispersion directly, using a hierarchical log-normal Poisson model. The count expression of gene
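The over-dispersion induced by a log-normal random effect on the Poisson mean can be checked numerically. The sketch below simulates counts whose log-mean is normally distributed and compares the empirical moments with the standard Poisson log-normal formulas; the parameter values and the names `mu` and `sigma` are chosen for illustration only.

```python
import numpy as np

def poisson_lognormal_moments(mu, sigma):
    """Marginal mean and variance of Y | e ~ Poisson(exp(mu + e)), e ~ N(0, sigma^2)."""
    mean = np.exp(mu + sigma ** 2 / 2)
    var = mean + mean ** 2 * (np.exp(sigma ** 2) - 1.0)
    return mean, var

rng = np.random.default_rng(2)
mu, sigma, n = 2.0, 0.8, 200_000  # illustrative values, not estimates

eps = rng.normal(0.0, sigma, size=n)
counts = rng.poisson(np.exp(mu + eps))

mean_theory, var_theory = poisson_lognormal_moments(mu, sigma)
mean_emp, var_emp = counts.mean(), counts.var()
```

With `sigma = 0` the variance reduces to the mean, as for a plain Poisson distribution; any positive `sigma` makes the variance exceed the mean.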

As before, the notation

In this model, the likelihood for gene

Similar to the previous model, we assume that the vector

Estimation of parameters

An important aspect of this method is the choice of the regularization parameter

In order to simulate multivariate Poisson data, we use a method described by Karlis

RNA-seq data are known to be overdispersed relative to a Poisson distribution with the sample variance of a gene expression vector larger than the sample mean. In our simulation study, we also consider the possibility of inflating the variance of the independent Poisson random variables used in the
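The simulation scheme described above can be sketched as follows, assuming a common-shock construction in which each observed count is the sum of one gene-specific latent count and one latent count per edge involving that gene. The means `lam_node` and `lam_edge`, the function name, and the use of negative binomial draws (via the hypothetical `size` argument) to inflate the variance are assumptions made for this example and may differ from the settings used in the simulation study.

```python
import numpy as np

def simulate_mv_poisson(adj, n, lam_node=5.0, lam_edge=2.0, size=None, rng=None):
    """Simulate n samples of correlated counts for a symmetric 0/1 adjacency matrix."""
    rng = np.random.default_rng() if rng is None else rng
    p = adj.shape[0]
    edges = np.argwhere(np.triu(adj, k=1))
    # B maps the latent variables (p gene-specific, then one per edge) to genes.
    B = np.hstack([np.eye(p), np.zeros((p, len(edges)))])
    for e, (j, k) in enumerate(edges):
        B[j, p + e] = B[k, p + e] = 1.0
    means = np.concatenate([np.full(p, lam_node), np.full(len(edges), lam_edge)])
    if size is None:
        latent = rng.poisson(means, size=(n, means.size))
    else:
        # Assumption: negative binomial latent counts with the same means
        # and variance mean + mean^2 / size.
        latent = rng.negative_binomial(size, size / (size + means), size=(n, means.size))
    return latent @ B.T

# Hypothetical 3-gene graph with a single edge between genes 0 and 1.
adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
rng = np.random.default_rng(4)
y_pois = simulate_mv_poisson(adj, 20_000, rng=rng)
y_over = simulate_mv_poisson(adj, 20_000, size=1.0, rng=rng)
cov_pois = np.cov(y_pois, rowvar=False)
```

In the Poisson case, connected genes share a latent count, so their covariance is close to `lam_edge`, while unconnected genes are uncorrelated.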

The three methods were compared on two sets of simulations: multivariate Poisson data and overdispersed multivariate Poisson data. For each type of data, we simulated 50 different adjacency matrices

To evaluate the different methods, we tried to infer the adjacency matrix

ROC curves, averaged over the 50 simulated datasets, are presented in

Results are presented for the Gaussian graphical model on log-transformed data (blue), the log-linear Poisson graphical model on power-transformed data (red) and the hierarchical log-normal Poisson model on raw data (black) on multivariate Poisson data (A) and multivariate Poisson data with inflated variance (B). The dotted black lines represent the diagonals.
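For reference, sensitivity and specificity of an inferred network can be computed from the upper triangles of the true and estimated adjacency matrices, as in the minimal sketch below. The function name and the small example graph are hypothetical; a ROC curve is obtained by repeating this computation over a range of regularization parameters.

```python
import numpy as np

def sensitivity_specificity(true_adj, est_adj):
    """Edge-wise sensitivity and specificity for symmetric 0/1 adjacency matrices."""
    upper = np.triu_indices_from(true_adj, k=1)  # each gene pair counted once
    truth = true_adj[upper].astype(bool)
    pred = est_adj[upper].astype(bool)
    tp = np.sum(truth & pred)
    fn = np.sum(truth & ~pred)
    tn = np.sum(~truth & ~pred)
    fp = np.sum(~truth & pred)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 4-gene example: true edges (0,1) and (1,2),
# estimated edges (0,1) and (2,3).
true_adj = np.zeros((4, 4), dtype=int)
est_adj = np.zeros((4, 4), dtype=int)
for j, k in [(0, 1), (1, 2)]:
    true_adj[j, k] = true_adj[k, j] = 1
for j, k in [(0, 1), (2, 3)]:
    est_adj[j, k] = est_adj[k, j] = 1

sens, spec = sensitivity_specificity(true_adj, est_adj)
```

The sketch assumes that the true graph contains at least one edge and at least one absent edge; otherwise one of the ratios is undefined.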

Sensitivity and specificity obtained by each method for the chosen regularization parameters are shown as diamonds on the ROC curves (

| Data | Measure | GGM | Log-linear Poisson | Hierarchical model |
|---|---|---|---|---|
| Multivariate Poisson Data | Sens. | 0.568 (0.069) | 0.714 (0.036) | 0.838 (0.050) |
| Multivariate Poisson Data | Spec. | 0.984 (0.003) | 0.990 (0.003) | 0.967 (0.006) |
| Over-dispersed Poisson Data | Sens. | 0.107 (0.045) | 0.046 (0.033) | 0.383 (0.064) |
| Over-dispersed Poisson Data | Spec. | 0.965 (0.003) | 0.991 (0.004) | 0.982 (0.027) |

Results are presented for the log-linear Poisson graphical model without over-dispersion (A) and with over-dispersion (B), for the proposed hierarchical log-normal Poisson graphical model without over-dispersion (C) and with over-dispersion (D). Black dotted lines represent the diagonal, and red lines represent loess curves.

To ensure that these results do not depend on the scale-free structure of the graphs, we have drawn ROC curves and performed similar model selection on data simulated with an Erdős-Rényi structure

Results are presented for the Gaussian graphical model on log-transformed data (blue), the log-linear Poisson graphical model on power-transformed data (red) and the hierarchical log-normal Poisson model on raw data (black) on multivariate Poisson data (A Erdős-Rényi) and multivariate Poisson data with inflated variance (B Erdős-Rényi). The dotted black lines represent the diagonals.

| Data | Measure | GGM | Log-linear Poisson | Hierarchical model |
|---|---|---|---|---|
| Multivariate Poisson Data | Sens. | 0.571 (0.059) | 0.691 (0.061) | 0.763 (0.093) |
| Multivariate Poisson Data | Spec. | 0.992 (0.003) | 0.990 (0.003) | 0.975 (0.005) |
| Over-dispersed Poisson Data | Sens. | 0.112 (0.065) | 0.050 (0.041) | 0.198 (0.060) |
| Over-dispersed Poisson Data | Spec. | 0.971 (0.003) | 0.990 (0.003) | 0.958 (0.009) |

The three methods were applied to a publicly available microRNA-seq data set from The Cancer Genome Atlas (TCGA) Data Portal (

Shapiro-Wilk tests on miRNA expression vectors showed that the data, even for highly expressed miRNAs, could not be directly modelled by a normal distribution

Curve obtained with the R package MASS.

For these data, the Poisson assumption does not hold either, as shown in

The Gaussian graphical model with the BIC criterion detected 48 edges, the log-linear Poisson graphical model with the StARS criterion

The representation was obtained using the software Gephi

| miRNA | reference |
|---|---|
| hsa-mir-451 | BC |
| hsa-let-7b | BC |
| hsa-mir-486 | BC |
| hsa-let-7f-2 | cancer |
| hsa-mir-150 | no reference |
| hsa-mir-145 | BC |
| hsa-mir-24-2 | BC |
| hsa-mir-200c | BC |
| hsa-mir-143 | BC |
| hsa-mir-142 | no reference |

Network inference from RNA-seq data is an important methodological challenge. This work is a pioneering study that provides some guidelines on the best methods to achieve this goal. There are two main approaches. The first and simplest is to transform the data, for example using a Box-Cox transformation, and then apply methods based on Gaussian graphical models that were previously proposed for microarray studies. The second is to apply methods specifically developed for the analysis of count data using Poisson graphical models, either with a power transformation of the data or by accounting for over-dispersion directly in the model, for example using a hierarchical log-normal Poisson graphical model as proposed here. We found in both the simulation study and the real data application that the power transformation did not work well to correct for over-dispersion. It has to be noted that the same

It should be pointed out that in high-dimensional settings (number of genes much larger than the number of biological samples), all methods were unsurprisingly found to perform very poorly, despite the

We are grateful to Gilles Celeux and to the two anonymous reviewers for their useful comments on this work.