
The authors have declared that no competing interests exist.

Conceived and designed the experiments: WML PL. Performed the experiments: WML. Analyzed the data: WML PL. Contributed reagents/materials/analysis tools: WML. Wrote the paper: WML PL.

A central goal of RNA sequencing (RNA-seq) experiments is to detect differentially expressed genes. In the ubiquitous negative binomial model for RNA-seq data, each gene is given a dispersion parameter, and correctly estimating these dispersion parameters is vital to detecting differential expression. Since the dispersions control the variances of the gene counts, underestimation may lead to false discovery, while overestimation may lower the rate of true detection. After briefly reviewing several popular dispersion estimation methods, this article describes a simulation study that compares them in terms of point estimation and the effect on the performance of tests for differential expression. The methods that maximize the test performance are the ones that use a moderate degree of dispersion shrinkage: the DSS, Tagwise wqCML, and Tagwise APL methods. In practical RNA-seq data analysis, we recommend using one of these moderate-shrinkage methods with the QLShrink test in the QuasiSeq R package.

In the last five years, groundbreaking new RNA sequencing (RNA-seq) technologies have considerably improved studies in genetics that previously relied on microarray technologies. RNA-seq technologies have several advantages over microarrays, including less noise, higher throughput, and the power to detect novel promoters, isoforms, allele-specific expression, and a wider range of expression levels. So it is not surprising that RNA-seq has become ubiquitous in experiments that investigate the regulation of gene expression across different conditions, such as levels of a treatment factor, genotypes, environmental conditions, and developmental stages.

In a typical RNA-seq experiment, reverse transcription and fragmentation convert each RNA sample into a library of complementary DNA (cDNA) fragments, or tags. Next, a sequencing platform, such as the Illumina Genome Analyzer, Applied Biosystems SOLiD, Pacific Biosciences SMRT, or Roche 454 Life Sciences, amplifies and sequences the tags. After sequencing, a subsequence within each tag, called a read, is recorded. After the resulting collection of reads, or library, is assembled, the reads are mapped to genes in the original organism’s genome. The number of reads in a library mapped to a gene represents the relative abundance of that gene in the library. The investigator typically assembles all the read counts of multiple libraries into a table with rows to indicate genes and columns to indicate libraries. Please consult references by Oshlack, Robinson, and Young

A central goal of RNA-seq experiments is to detect genes that are differentially expressed, i.e., genes for which the average number of reads differs significantly across treatment groups. Improving the detection of differentially expressed genes opens new ways to control organisms at the molecular level, advancing fields such as agricultural engineering, personalized medicine, and the treatment of cancers, and thereby contributing to social welfare.

Some of the most popular new statistical methods that detect differentially expressed genes from RNA-seq data rely on the negative binomial (NB) probability distribution. If a random variable,

Cameron and Trivedi

In an RNA-seq dataset, the number of reads,

It is common practice to include in the model the library-wise normalization factors,

With the above preliminaries taken care of, we now turn to the main issue of this article: the estimation of the dispersion parameters,

First, we briefly review several popular dispersion estimation methods (implemented in freely-available R-language packages, AMAP.Seq, DSS, edgeR, and DESeq). We also touch on some popular tests for differential expression. Next, using a simulation study that draws fundamental information from real datasets, we compare the practical effectiveness of the methods in terms of the accuracy and precision of the point estimates and the effect on the performance of tests for differential expression. In the results and discussion sections, we discuss the distinguishing features of the most successful dispersion estimation methods.

Studies by Wu, Wang, and Wu

In RNA-seq data analysis, we could apply methods based on counts for each gene separately to estimate model parameters, such as the Quasi-Likelihood (QL) method reviewed below. However, in RNA-seq data, there are typically tens of thousands of genes, but only a few counts per gene with which to estimate gene-specific parameters, a typical example of a “large

The QL method estimates a dispersion parameter independently for each gene. This method

The MLE,

The tagwise dispersion estimate,

The QL method estimates tens of thousands of dispersion parameters, but uses only a few read counts to compute each estimate. More sophisticated techniques make use of a larger number of read counts to estimate each

As explained by Robinson and Smyth

The wqCML method is implemented in the R package, edgeR, designed by Robinson, McCarthy, and Smyth and available at bioconductor.org

The wqCML method only applies to completely randomized designs with two treatment groups. In the APL method, McCarthy, Chen, and Smyth

To estimate the

Like the wqCML and APL methods, the DESeq method by Anders and Huber

Anders and Huber

As with the APL method, there are three variations on the DESeq method giving different ways to shrink the dispersions. The no-shrinkage variation transforms the estimated raw variance parameters directly into the estimated gene-wise dispersions without any shrinkage. The “Trended” variation performs a regression of the raw variance parameter estimates on the estimated means, and then computes the estimated dispersions from the fitted values on the trend. (In the DESeq R package, the implementation of the DESeq method, the user can choose between local and parametric regression to compute this trend. However, the parametric regression in the package is prone to failure and leads to poor point estimation and poor test performance in our simulation study. Hence, only the local regression results are presented in this article.) The “Maximum” variation computes the maximum of each raw variance parameter estimate and its fitted value on the trend and then computes the dispersion estimates from these maxima. This last method of dispersion shrinkage is conservative: it allows overestimation of the dispersions but guards against underestimation in order to avoid false positives in tests for differential expression. The DESeq method is implemented in the R package, DESeq, available at bioconductor.org.
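The difference between the three variations can be illustrated with a minimal sketch. It assumes that the raw per-gene estimates and their fitted values on the trend have already been computed. The function name `shrink_estimates` and its arguments are hypothetical and are not part of the DESeq package, and the subsequent conversion from variance parameters to dispersions is not shown.

```python
def shrink_estimates(raw, fitted, variation):
    """Combine raw per-gene estimates with their fitted trend values.

    raw       -- raw per-gene estimates (hypothetical input)
    fitted    -- fitted values on the trend, one per gene (hypothetical input)
    variation -- "none", "trended", or "maximum"
    """
    if len(raw) != len(fitted):
        raise ValueError("raw and fitted must have the same length")
    if variation == "none":
        # No shrinkage: use each gene's own estimate.
        return list(raw)
    if variation == "trended":
        # Full shrinkage onto the trend.
        return list(fitted)
    if variation == "maximum":
        # Conservative choice: never go below the trend.
        return [max(r, f) for r, f in zip(raw, fitted)]
    raise ValueError("unknown variation: " + variation)


# Illustrative values only.
raw = [0.10, 0.50, 0.30]
fitted = [0.20, 0.40, 0.30]
print(shrink_estimates(raw, fitted, "maximum"))
```

Under the "maximum" rule, a gene whose raw estimate falls below the trend is pulled up to the trend, while a gene above the trend keeps its own estimate. This is why the variation can overestimate but seldom underestimates.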

The Bayesian paradigm more naturally accommodates the notions of “borrowing information” and “shrinkage” than the Frequentist paradigm. Hence, the DSS method by Wu, Wang, and Wu

Here, the gamma distribution is parameterized in terms of its mean,

With a model specified and all the parameters estimated, we can test for the differential expression of genes. The simulation study in the next section uses the following five recently-proposed testing methods. The first two tests, found in the edgeR and DESeq packages, extend Fisher’s exact test to data following a negative binomial distribution. The next three tests, developed by Lund, Nettleton, McCarthy, and Smyth

This article presents a simulation study that puts the featured dispersion estimation methods to the test. We first generate pseudo-datasets for which the true negative binomial dispersion parameters and truly differentially expressed genes are known. Then, we apply the featured dispersion estimation methods and testing methods to the pseudo-data, compare the results to the truth, and measure the performance of the dispersion estimation methods in terms of point estimation and performance in testing for differential expression.
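As a rough illustration of how pseudo-counts with a known mean and dispersion can be drawn, the sketch below samples negative binomial counts through a gamma-Poisson mixture. It assumes the common parameterization in which a count with mean `mu` and dispersion `phi` has variance `mu + phi * mu**2`. The function names, the symmetric split of the log fold change, and the numeric values are illustrative assumptions, not the exact procedure of this study.

```python
import math
import random


def nb_variance(mu, phi):
    """Variance of a negative binomial count with mean mu and dispersion phi."""
    return mu + phi * mu ** 2


def draw_poisson(rng, lam):
    """Draw one Poisson count by Knuth's method (suitable for modest lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def draw_nb(rng, mu, phi):
    """Draw one negative binomial count as a gamma-Poisson mixture."""
    # Gamma with shape 1/phi and scale phi*mu has mean mu and variance phi*mu^2.
    rate = rng.gammavariate(1.0 / phi, phi * mu)
    return draw_poisson(rng, rate)


def simulate_gene(rng, mu, phi, log_fold_change, n_group1, n_group2):
    """Simulate one gene's counts in two treatment groups (hypothetical layout)."""
    # Assumption: the log fold change is split symmetrically around mu.
    mu1 = mu * math.exp(-log_fold_change / 2.0)
    mu2 = mu * math.exp(log_fold_change / 2.0)
    group1 = [draw_nb(rng, mu1, phi) for _ in range(n_group1)]
    group2 = [draw_nb(rng, mu2, phi) for _ in range(n_group2)]
    return group1, group2


rng = random.Random(2013)
print(simulate_gene(rng, mu=20.0, phi=0.1, log_fold_change=1.0, n_group1=3, n_group2=3))
```

A gene simulated with `log_fold_change=0.0` is equivalently expressed by construction, so the truth is known for every simulated gene.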

The simulation study uses two real RNA-seq datasets to generate the pseudo-datasets. The “Pickrell dataset” (Gene Expression Omnibus accession number GSE19480) comes from a study by Pickrell, Marioni, Pai, Degner, Engelhardt, et al.

The top two panels of

Hammer data and Hammer-generated pseudo-data are in blue, while Pickrell data and Pickrell-generated pseudo-data are shown in black. The top two panels show the gene-wise log geometric mean counts and log dispersion estimates, estimated with the QL method, for the Hammer and Pickrell datasets. The bottom two panels plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true dispersions used to simulate the pseudo-counts and the gene-wise log geometric mean counts are the

The top two panels show the relationship between the log QL-method-estimated dispersions and the gene-wise log geometric mean counts of the Hammer and Pickrell datasets. The bottom two plot the analogous quantities for example simulated pseudo-datasets, except that the log dispersions plotted are the true log dispersions used to simulate the pseudo-counts (i.e., the

For each one of the real datasets (Hammer or Pickrell), we first compute the two quantities,

Randomly select 10,000 genes from one of the real datasets (Hammer or Pickrell) without replacement. The corresponding 10,000 pairs of

Randomly select simulated gene

Set the log fold change across treatment levels,

Compute the true mean expression level

Randomly draw the pseudo-count of each simulated gene

Each gene in the pseudo-dataset should have at least one read to be included in the following analysis. Hence, if the pseudo-counts of simulated gene

Note that our method of choosing true mean and dispersion pairs builds an empirical dispersion-mean relationship into the simulated data. As

Package | Version | Repository |
abind | 1.4-0 | CRAN |
AMAP.Seq | 1.0 | CRAN |
Biobase | 2.18.0 | Bioconductor |
clusterGeneration | 1.3.1 | CRAN |
DESeq | 1.10.1 | Bioconductor |
DSS | 1.0.0 | Bioconductor |
edgeR | 3.0.8 | Bioconductor |
ggplot2 | 0.9.3.1 | CRAN |
hexbin | 1.26.1 | CRAN |
iterators | 1.0.6 | CRAN |
magic | 1.5-4 | CRAN |
MASS | 7.3-23 | CRAN |
multicore | 0.1-7 | CRAN |
plyr | 1.8 | CRAN |
QuasiSeq | 1.0-2 | CRAN |
pracma | 1.4.5 | CRAN |
reshape2 | 1.2.2 | CRAN |

Some pseudo-datasets were generated from the Hammer dataset, while others were generated from the Pickrell dataset. In addition, the number of libraries per treatment group varied from pseudo-dataset to pseudo-dataset. Hence, six “simulation settings”, given in

Setting | Dataset | Group 1 Libraries | Group 2 Libraries |
I | Pickrell | 3 | 3 |
II | Pickrell | 3 | 15 |
III | Pickrell | 9 | 9 |
IV | Hammer | 3 | 3 |
V | Hammer | 3 | 16 |
VI | Hammer | 9 | 9 |

Our simulation procedure was configured such that within each pseudo-dataset, the library sizes do not vary systematically. So when analyzing the simulated data, it would be reasonable to set all the library-wise normalization factors,

Anders and Huber’s method assigns each library-wise normalization factor,

The methods implemented in edgeR

With 30 pseudo-datasets generated for each of 6 simulation settings, we apply each dispersion estimation method to each pseudo-dataset, and we use the dispersion estimates to test for the differential expression of genes. We use the true dispersions, the knowledge of which genes are truly differentially expressed, the dispersion estimates, and the test results to compare the dispersion estimation methods. We assess the quality of the methods in terms of the accuracy and precision of point estimation and performance in tests for differential expression.

The overall quality of any point estimator can be measured in terms of its mean squared error. For each pseudo-dataset and each dispersion estimation method, we calculate the mean squared error of the transformed estimated dispersions,

Here, the

Wu, Wang, and Wu

Overall, the results for the Hammer-generated pseudo-datasets (simulation settings IV–VI) naturally group the dispersion estimation methods into three categories. The first category includes the Maximum DESeq and no-shrinkage DESeq methods. These methods produce the largest MSE, which is not surprising: the Maximum DESeq method is conservative and designed to yield larger dispersion estimates, and the no-shrinkage DESeq method applies a naive dispersion estimation technique to each gene independently. The second category includes the QL method alone, which performs better than the Maximum and no-shrinkage DESeq methods, but worse than the others when the number of libraries is small (simulation setting IV). The remaining methods all perform similarly and form a category of MSE-best methods. This demonstrates that shrinkage indeed improves the point estimators by borrowing information across genes. However, too much shrinkage is detrimental, as the Common methods perform slightly worse than their Trended and Tagwise counterparts.

Parameter estimation is more challenging for the Pickrell-generated pseudo-datasets (simulation settings I–III) than the Hammer-generated pseudo-datasets (simulation settings IV–VI) because the counts are lower, dispersion is larger in general (

Although useful for determining the overall quality of a point estimator, the MSE heuristic is only a single scalar computed for an entire dataset. It is also important to consider the way that estimation error varies with the magnitude of the true dispersions.
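For concreteness, the MSE summary can be sketched as follows. The transformation applied to the dispersions is left as a caller-supplied argument because it is not reproduced in the text here. The identity default, the function name, and the example values are all illustrative assumptions.

```python
import math


def dispersion_mse(estimated, true, transform=lambda phi: phi):
    """Mean squared error between transformed estimated and true dispersions.

    transform -- function applied to both estimates and truths before
                 differencing (identity by default; an assumption here).
    """
    if len(estimated) != len(true) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (transform(e) - transform(t)) ** 2 for e, t in zip(estimated, true)
    ]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only: three genes with known true dispersions.
true_phi = [0.10, 0.25, 0.50]
est_phi = [0.12, 0.20, 0.70]
print(dispersion_mse(est_phi, true_phi))
print(dispersion_mse(est_phi, true_phi, transform=math.log))
```

Because the result is a single number per pseudo-dataset, two methods with the same MSE can still distribute their errors very differently across small and large true dispersions.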


Dispersions with gene-wise log geometric mean counts below the median (log mean from −2.17 to 4.49) are shown in black, while those above the median (log mean from 4.49 to 12.3) are shown in light blue. Overlapping points are shown in dark blue. Results for simulations IV and VI are similar.

Wu, Wang, and Wu

According to

Since the detection of differentially expressed genes is the major goal of most RNA-seq experiments, it is vitally important to measure and compare the direct impact of the dispersion estimation methods on that detection. For this reason, the pseudo-datasets are generated such that each simulated gene is known to be either differentially expressed or equivalently expressed. Using this knowledge and the p-values obtained from the tests for differential expression, a receiver operating characteristic (ROC) curve is constructed for each combination of pseudo-dataset, test for differential expression, and dispersion estimation method.
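The construction of an ROC curve from p-values and known differential expression status can be sketched as below. The sketch sweeps the significance threshold over the observed p-values, records the false positive rate and true positive rate at each threshold, and integrates with the trapezoid rule. The function names and the optional `max_fpr` cutoff argument are illustrative assumptions rather than the exact computation used in this study.

```python
def roc_points(p_values, is_de):
    """Return (FPR, TPR) points obtained by thresholding the p-values.

    p_values -- one p-value per gene
    is_de    -- True if the gene is truly differentially expressed
    """
    n_pos = sum(1 for d in is_de if d)
    n_neg = len(is_de) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both DE and non-DE genes")
    pairs = sorted(zip(p_values, is_de))
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Genes with tied p-values are declared significant together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points, max_fpr=1.0):
    """Trapezoid-rule area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Interpolate linearly to stop exactly at the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative values only: two DE genes and two non-DE genes.
p = [0.01, 0.02, 0.30, 0.04]
truth = [True, True, False, False]
print(area_under_curve(roc_points(p, truth)))
```

A test that ranks every truly differentially expressed gene ahead of every equivalently expressed gene attains the maximum area, while a test whose p-values are unrelated to the truth yields a curve near the diagonal.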

An ROC curve is a graph of the true positive rate (TPR) of the detection of differentially expressed genes vs the false positive rate (FPR). In practice, we define FPR and TPR to be functions of the significance level,

In this study, we use the area under each ROC curve (AUC) as a relative measure of the quality of a test, where a high AUC indicates relatively good test performance. Here, each AUC is computed only for FPR

Boxplots of AUC calculated based on 30 pseudo-datasets are shown for each combination of dispersion estimation method and test for differential expression.


Overall, the three tests in the QuasiSeq package are less affected by dispersion estimation than the edgeR and DESeq exact tests. These tests introduce gene-wise quasi-likelihood dispersion parameters to the negative binomial model, and the new parameters absorb some of the variability that would otherwise manifest solely in the negative binomial dispersions. The practical upshot is that all three QuasiSeq tests are relatively robust under both noisy data and poor estimation of negative binomial dispersions. The QL test is an extreme case, with little change in its AUC boxplots across the dispersion estimation methods, because it does not apply any special constraints to the quasi-likelihood dispersions. (On the other hand, the QLShrink test shrinks the quasi-likelihood dispersions using a common value, and the QLSpline test shrinks them using a fitted spline.) Unfortunately, the QL test also performs the worst among the five tests overall, making it a poor choice in practice despite its otherwise useful robustness. The QLSpline test is better than the QL test, and the QLShrink test is better still.

The rankings of the dispersion estimation methods are similar among the edgeR exact test, the DESeq exact test, and the QLShrink test. Specifically, the DSS, Tagwise wqCML, and Tagwise APL methods (i.e., the moderate-shrinkage methods) are the best. Not only do these dispersion estimation methods perform well relative to the other methods, but the AUCs they yield are also extremely close to those obtained with the true dispersions.

Across all six simulation settings, when combined with moderate-shrinkage methods for dispersion, the best tests for differential expression are the edgeR exact test, the DESeq exact test, and the QLShrink test. These tests perform roughly equally well, and they are better than other combinations of tests and dispersion estimation methods. In some cases, the difference in AUC between the two tiers of tests is dramatic. With the addition of the gene-wise quasi-likelihood dispersion parameter to the negative binomial model, the QLShrink test is more robust to changes in dispersion estimation method than either of the exact tests, which is most noticeable for the Pickrell-generated pseudo-datasets (simulation settings I–III). In practice, we recommend using the QLShrink test because we expect the addition of quasi-likelihood dispersion parameters to make the QLShrink test more flexible than the edgeR and DESeq exact tests under departures from the negative binomial model.

It is challenging to estimate negative binomial dispersions from RNA-seq data due to the “small

These moderate shrinkage methods outperform the others in all five featured tests for differential expression. However, these tests do not perform equally well. The edgeR exact test, DESeq exact test, and QLShrink test outperform the other two. Furthermore, the performances of the edgeR and DESeq tests depend highly on the dispersion estimation method chosen, while the addition of a gene-wise quasi-likelihood dispersion parameter gives the QLShrink test extra robustness under the choice of dispersion estimation method. We expect this same flexibility to help the QLShrink test perform especially well under departures from the negative binomial model, so we recommend using the QLShrink method with either the DSS, Tagwise wqCML, or Tagwise APL method in practice.

Interestingly, the ranking of dispersion estimation methods according to MSE is not the same as the ranking according to AUC. A notable example is the Maximum DESeq method, which performs poorly in terms of MSE, but often performs extremely well in tests for differential expression. This behavior may result from the intentional overestimation of the dispersions, which contributes to a high MSE, but guards against false positives. In addition, methods with similar MSE may have very different AUC. For example, the Trended APL and Tagwise APL methods yield similar MSEs in simulation setting V, but the Tagwise APL method performs much better than the Trended APL method in the edgeR test (


We would like to thank Drs. Dan Nettleton, Dianne Cook, and Yaqing Si for their useful feedback. We would also like to thank Dr. Gordon Smyth of the Walter and Eliza Hall Institute in Australia for answering our questions.