
Conceived and designed the experiments: LJF BMT. Performed the experiments: LJF. Analyzed the data: LJF BMT. Contributed reagents/materials/analysis tools: LJF. Wrote the paper: LJF BMT.

The authors have declared that no competing interests exist.

The widespread use of high-throughput experimental assays designed to measure the entire complement of a cell's genes or gene products has led to vast stores of data that are extremely plentiful in terms of the number of items they can measure in a single sample, yet often sparse in the number of samples per experiment due to their high cost. This often leads to datasets where the number of treatment levels or time points sampled is limited, or where there are very small numbers of technical and/or biological replicates. Here we introduce a novel algorithm to quantify the uncertainty in the unmeasured intervals between biological measurements taken across a set of quantitative treatments. The algorithm provides a probabilistic distribution of possible gene expression values within unmeasured intervals, based on a plausible biological constraint. We show how quantification of this uncertainty can be used to guide researchers in further data collection by identifying which samples would likely add the most information to the system under study. Although the context for developing the algorithm was gene expression measurements taken over a time series, the approach can be readily applied to any set of quantitative systems biology measurements taken following quantitative (i.e. non-categorical) treatments. In principle, the method could also be applied to combinations of treatments, in which case it could greatly simplify the task of exploring the large combinatorial space of future possible measurements.

The widespread adoption in systems biology of high-throughput experimental assays designed to measure the entire complement of a cell's genes or gene products in response to some set of experimental conditions has created a paradox. On the one hand, these techniques produce such large amounts of data that researchers often struggle to find meaningful and statistically significant patterns amid the noise. On the other hand, since the cost of producing one of these comprehensive measurements is relatively high, researchers are often limited in the number of samples they can afford to assay in their experimental design, even if the cost of collecting the sample material is relatively low. This may mean limiting the number of time points in a time course experiment, the number of different treatment levels, the number of biological or technical replicates, or all of the above. If the end goal is network inference or predictive model development, the scarcity of measurements can lead to vastly under-determined systems.

In order to design the most useful possible experiment, a biologist needs information about the most dynamic regions of the system in response to each independent variable (e.g. time duration or treatment levels). Usually this is provided by the biologist's domain knowledge and/or pilot experiments. However, a more efficient approach would be possible if the most dynamic or uncertain regions of the system could be predicted quantitatively. This would enable more measurements to be concentrated within the most dynamic or uncertain regions and fewer within the less dynamic or more certain regions. Uncertainty in a system can arise from error or noise in the measurements that do exist, from a lack of measurements in certain regions of the system, or from the intrinsic dynamics of the system in certain regions. Most statistical methods focus on estimating the uncertainty due to error in the existing measurements. There are also a number of methods for dealing with uncertainty in model inference caused by noisy data, mostly using either an automated or manual cyclic refinement of model parameters through parameter perturbation or “tuning”

Many algorithms

Along with missing value imputation algorithms, algorithms for uncertainty quantification and minimization have been developed in a variety of fields. Lermusiaux

Another domain where this idea has been explored is that of robotic vision, where images of a 3D surface are taken and then certain regions of the surface are re-sampled at a higher resolution in order to decrease the level of uncertainty in those regions. Huang and Qian

Here, we describe a conceptually similar approach that may be used in conjunction with gene expression experiments, or other experiments where the cost of collecting samples is substantially less than the cost of assaying them. We introduce a novel algorithm to quantify the uncertainty in the unmeasured regions of gene expression time course experiments that is based on our (BMT's) experience as a biologist regarding the dynamics of biological systems. The algorithm enables an experimental strategy where

In any time course experiment, the region between measurements carries a certain degree of uncertainty. If a measurement is taken at some time point

After the boundaries containing the complete set of plausible interpolations have been established, the next piece of information required is the likelihood distribution of those interpolations within the boundaries. To obtain a useful approximation of the many possible interpolations within the boundaries, two randomly distributed “guide points” were considered to exist within each interval between measurement time points and between the upper and lower boundaries. New interpolations were defined that connected all measured points and guide points (

(A) The plausible bounds (dashed in black) based on an arbitrary set of measured points (blue). “+” and “−” indicate the directions of changes of gradient. Red lines, subtended by red points, contain an additional inflexion point. The thin green line, subtended by the green point, does not. (B) Extended plausible bounds resulting from measurement error. The bounds (light green) consist of the union of 16 sets of boundaries (insets) defined by all possible combinations of the upper and lower confidence limits of the four measurements (blue dots). (C) After the extended plausible bounds have been determined, guide points (small blue dots) are inserted into each segment between the measured points. The span between plausible bounds for each guide point is divided into regular sub-intervals spanning the possible values (gray hatched lines). The confidence intervals for each measured point are divided into sub-intervals of equal probability (black hatched lines). To determine the likelihood distribution of plausible interpolations, one value is randomly chosen from each set of possible measured and guide points. If the interpolation passing through all these points contains more inflexion points than the original curve (e.g. the interpolation through the small red dots), it is discarded; otherwise (e.g. the interpolation through the small green dots) it is added to the set of interpolations used to calculate the likelihood distribution.

Using this approach, the likelihood distribution of plausible interpolations at any time point could be calculated, using two methods. For clarity, the two methods are described in the following sections using the temporary assumption that the measured points have zero error. In the final two sections, a further elaboration to account for measurement error is introduced.

While the complete set of interpolations could be infinite, our algorithm uses a simple, biologically plausible assumption to place reasonable bounds on the set of interpolations, namely that no novel regulatory event occurs within any interval unless there is experimental evidence for it. Mathematically, we define this to mean that the total number of inflexion points in the actual path of the system through the measured points is assumed not to exceed the number of inflexion points implied by the original set of measurements. This assumption is an instance of Occam's razor. (The limitations of this assumption are considered in the
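The inflexion-count criterion can be sketched in code. Assuming a piecewise-linear interpolation through the points, an inflexion corresponds to a sign change in successive slope differences; the function name and representation below are illustrative, not the published implementation:

```python
def count_inflexions(x, y):
    """Number of inflexion points of the piecewise-linear curve through
    the points (x, y): a sign change in successive slope differences
    marks a switch from convex to concave or back. A sketch of the
    plausibility criterion, not the authors' published code.
    """
    slopes = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(y) - 1)]
    second = [b - a for a, b in zip(slopes, slopes[1:]) if b != a]
    return sum(1 for a, b in zip(second, second[1:]) if (a > 0) != (b > 0))

# A zig-zag through four points changes curvature once:
# count_inflexions([0, 1, 2, 3], [0, 1, 0, 1]) == 1
```

A candidate interpolation is then plausible when its count does not exceed that of the curve through the original measurements.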

The first method used a Monte Carlo approach to calculate the likelihood distributions. In this method, the two guide points were spaced evenly along the time-axis (splitting each interval into exactly three sub-intervals) and randomly along the y-axis, constrained by the interpolation boundaries. The number of inflexion points of the interpolation passing through the real measurements and the guide points was then calculated. If the number of inflexion points of the new interpolation was greater than that of the original, the interpolation was discarded. If the interpolation was accepted, then the positions of all the guide points were recorded. This process was repeated until a predetermined number of passing interpolations had been found. The distributions of the accepted guide points in the y-direction at each point in the time axis thus approximated the distribution of all possible plausible interpolations passing through each time point. Because of the geometric relationships among neighboring guide points and measured points, the distributions of plausible interpolations were markedly non-uniform and non-normal (e.g.
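The Monte Carlo rejection scheme described above can be sketched as follows, under the zero-error assumption used in this section. The function name, the dictionary representation of the plausible bounds, and the parameter names are all illustrative:

```python
import random

def sample_guide_distributions(xm, ym, lower, upper, n_accept=200, seed=0):
    """Monte Carlo sketch of the guide-point rejection scheme (illustrative,
    not the published code). xm, ym are measured time points and values
    (assumed error-free); lower/upper give the plausible interpolation
    bounds at each guide-point position, keyed by that position.
    """
    def inflexions(x, y):
        # sign changes in the second differences of the piecewise-linear curve
        s = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(y) - 1)]
        c = [b - a for a, b in zip(s, s[1:]) if b != a]
        return sum(1 for a, b in zip(c, c[1:]) if (a > 0) != (b > 0))

    rng = random.Random(seed)
    baseline = inflexions(xm, ym)
    # two guide points per interval, evenly spaced along the time axis
    gx = [xm[i] + f * (xm[i + 1] - xm[i])
          for i in range(len(xm) - 1) for f in (1 / 3, 2 / 3)]
    accepted = {g: [] for g in gx}
    while len(accepted[gx[0]]) < n_accept:
        # guide-point y-values drawn uniformly within the plausible bounds
        gy = [rng.uniform(lower[g], upper[g]) for g in gx]
        xs, ys = map(list, zip(*sorted(zip(list(xm) + gx, list(ym) + gy))))
        # reject any candidate that introduces a new inflexion point
        if inflexions(xs, ys) <= baseline:
            for g, v in zip(gx, gy):
                accepted[g].append(v)
    return accepted  # empirical likelihood distribution at each guide point
```

The histogram of accepted y-values at each guide point approximates the likelihood distribution of plausible interpolations there.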

Estimation of the likelihood distribution via the Monte Carlo method compared to the Direct Method from a pilot study used in the development of the algorithm. The Monte Carlo method generated the blue approximation to the true distribution. The results of calculating the distributions directly using numerical integration are shown in red.

This approach proved computationally expensive because the time required to discover sufficient passing interpolations to accurately calculate the likelihood distributions increased exponentially with the number of measured points.

In order to create a computationally less expensive algorithm than the Monte Carlo method, we developed a method to calculate the likelihood distribution of interpolations directly from the geometric relationships among the measured points and guide points. The Direct method is based on exactly the same premises, but calculates the likelihood that plausible curves pass through a given location directly, rather than finding the likelihood by trial-and-error.

Given a set

In order to compute the distributions via the method outlined above, using numerical integration, for each

Three probability values were defined for each interval: a probability conditional on the points to the right

To calculate the values for

For every combination of midpoints

Define

If the total number of inflexion points in

Normalize

The values for

For every combination of midpoints

Define

If the total number of inflexion points in

Normalize

Next, the values of

Finally, the algorithm calculates the values for

For each combination of midpoints

Define

If the total number of inflexion points in

Normalize

As the algorithm uses the position of the measured points to define the boundaries of plausible interpolations, the error in these measurements must also be taken into account. To do so, the algorithm uses defined confidence limits of each measurement (we have used 99%) to define a plausible range of values for that point (as illustrated by the black vertical bars in
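The subdivision of a measurement's confidence interval into sub-intervals of equal probability can be sketched as below, assuming normally distributed measurement error; the function name and the choice of k sub-intervals are illustrative, not taken from the paper:

```python
from statistics import NormalDist

def equal_probability_midpoints(mean, sd, k=4, conf=0.99):
    """Split a measurement's confidence interval into k sub-intervals of
    equal probability and return the probability-midpoint value of each
    (a sketch of the discretisation described in the text; k and the
    normal-error assumption are illustrative choices).
    """
    nd = NormalDist(mean, sd)
    p_lo, p_hi = (1 - conf) / 2, (1 + conf) / 2  # 0.005 and 0.995 for 99%
    step = (p_hi - p_lo) / k
    # midpoint of each equal-probability stratum, mapped back to a value
    return [nd.inv_cdf(p_lo + (i + 0.5) * step) for i in range(k)]
```

Each returned value is then one of the possible positions of the measured point used when enumerating candidate interpolations.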

To take the measurement error into account in the calculation of the likelihood distributions, for each measurement

Then:

Since each measured point

In order to compensate for the added computational cost of this modification, which if implemented fully would increase the computational cost by a factor of

For each set of midpoints spanning the confidence interval for a measurement

By employing Latin Hypercube Sampling, the added computational cost of the modification to account for measurement error is reduced to a factor of
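A minimal Latin Hypercube sampler over the unit hypercube can be sketched as follows (illustrative, not the published implementation). Each dimension corresponds to one measurement's confidence range; splitting every dimension into n_samples equal strata and using each stratum exactly once lets a small number of samples still cover every sub-interval of every measurement:

```python
import random

def latin_hypercube(n_samples, n_dims, seed=0):
    """Minimal Latin Hypercube sample on [0, 1]^n_dims. Each dimension is
    split into n_samples equal strata; every stratum is used exactly once
    per dimension, then the strata are shuffled across samples.
    """
    rng = random.Random(seed)
    samples = [[0.0] * n_dims for _ in range(n_samples)]
    for d in range(n_dims):
        # one uniform draw inside each stratum of this dimension
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        for s in range(n_samples):
            samples[s][d] = strata[s]
    return samples
```

Each unit-interval coordinate can then be mapped onto the corresponding measurement's confidence range.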

In order to evaluate the accuracy of the algorithm in predicting where new measurements may lie, including the calculated likelihood distributions, we looked for gene expression time course experiments with two qualities. First, we needed datasets that contained a relatively large number of measurements. This would allow us to provide the algorithm with only a fraction of the measurements and then measure how well the values of the omitted measurements corresponded to the likelihood distributions calculated by the algorithm. Second, in order to evaluate the effectiveness of the algorithm at accounting for measurement error, we needed a dataset that provided multiple replicates for each measurement so that confidence intervals could be assigned for each measurement.

Our first selected dataset comes from time series data recently published on the yeast IRMA gene network

The authors

We used RT-PCR data provided by the authors, which were obtained using the

The blue, green, and red regions indicate the 95%, 75%, and 20% likelihood regions respectively. The boundaries were created by using straight lines to connect the 20%, 75%, and 95% likelihood boundaries of all the guide points and the 20%, 75%, and 95% confidence boundaries of the measured points. The gray dashed lines represent the interpolation boundaries as defined in

The blue, green, and red lines show the boundaries of the 95%, 75% and 20% likelihood regions respectively. The boundaries were calculated using only a fraction of the points (black diamonds). The omitted points (yellow diamonds) and their 95% confidence intervals are also shown. (A)

Next we applied the algorithm to the time course expression data from a study

The nonsense-mediated mRNA decay (NMD) pathway is responsible for the rapid decay of transcripts that contain a premature stop codon

We analyzed the time course expression data for 4 genes that the authors identified as potential targets of NMD, from a yeast strain containing a mutation in

We generated expression values from the raw chip data by first correcting for optical noise using the GC-RMA

Before generating the interpolation likelihood distributions we removed every

Finally, in order to test the algorithm on a high-throughput dataset, we applied the algorithm to the entire NMD mutant array series

As before, expression values were generated from the raw chip data by first correcting for optical noise using the GC-RMA

Before generating the interpolation likelihood distributions we removed every

The percentage of the omitted measurement means that were found within each likelihood region is shown for all measurements of each gene. The Combined group shows the average across all genes analyzed in the respective system. (A) The five genes from the switch-on dataset of the IRMA gene network. (B) The four genes selected from the NMD dataset and the first principal component of the Extended NMD dataset.

The ability to quantify which gaps in a dataset contribute the most to the uncertainty in the knowledge about a particular system provides a powerful tool for planning future experimentation, especially when the cost of the future experiments is high. By utilizing a simple but powerful assumption about the plausible behaviour of biological systems, we have created an algorithm that quantifies the uncertainty created by gaps in biological datasets in a probabilistic fashion, including an intuitive graphical representation (illustrated in

We have presented two possible uses of our algorithm. First, we have shown its use to estimate the likelihood distributions of new measurements of the levels of individual genes. Second, by using a summarization method (here we used Principal Components Analysis), we have shown its use in estimating the likelihood distribution of interpolations for a genome-wide microarray dataset. Several approaches can be used to estimate the uncertainty of the whole system at each time point. For example, the range of a particular quantile (e.g. 95%) for each gene at each time point could be used, or if it was considered desirable to take the size of the gap between measured time points into account, the area of the likelihood envelope between two measured points could be used. The range (or envelope area) of a particular gene of interest could be used to target a new measurement, or the ranges for all (or selected) genes in the system could be combined (e.g. by multiplication) to calculate an aggregate uncertainty. If time points were being considered that fell between measured points or guide points, then a simple interpolation (e.g. linear or cubic spline) could be used to estimate the distribution at the point of interest. Unlike sensitivity analysis techniques that attempt to measure the sensitivity of models to changes in parameter values or initial conditions
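One of the scoring schemes described above (the central-quantile range per gene, combined across genes by multiplication) can be sketched as follows; the data layout and function names are hypothetical:

```python
def central_range(samples, lo=0.025, hi=0.975):
    """Width of the central 95% interval of a list of sampled values
    (simple order-statistic estimate)."""
    s = sorted(samples)
    n = len(s)
    return s[int(hi * (n - 1))] - s[int(lo * (n - 1))]

def aggregate_uncertainty(distributions, lo=0.025, hi=0.975):
    """Sketch of one scoring scheme from the text: distributions[g][t] is
    a list of sampled plausible interpolation values for gene g at
    candidate time point t (hypothetical layout). Per-gene quantile
    ranges are multiplied to give one aggregate score per time point;
    the largest score marks the most informative place for a new
    measurement.
    """
    n_time = len(distributions[0])
    scores = []
    for t in range(n_time):
        score = 1.0
        for gene in distributions:
            score *= central_range(gene[t], lo, hi)
        scores.append(score)
    return scores
```

Envelope area between measured points could be substituted for the quantile range with the same combination step.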

One limitation of our algorithm is the assumption that in order for an interpolation to be “biologically plausible” it must not introduce new, unmeasured regulatory events into the system, which is an instance of Occam's razor. In cases where the initial measurements have been spaced far apart, there is obviously an increased likelihood that this assumption might be invalid, and biologists should take this into account in planning future experiments.

For biologists interested in “take-home messages” from this study, two are immediately evident: (i) since the predictions of the system are affected by the reliability of the measurements, the more replicates of the measured points the better (good advice in any context); (ii) intervals that contain an inflexion point are usually the most uncertain, and thus the best place for new measurements, because of the additional freedom in plausible paths this allows.

Although the context for developing the algorithm was gene expression measurements taken over a time series, the approach can be readily applied to any set of quantitative systems biology measurements taken following quantitative (i.e. non-categorical) treatments. In principle, the method could also be applied to combinations of treatments, in which case it could greatly simplify the task of exploring the large combinatorial space of future possible measurements. This methodology should have wide applications outside of biology as well. Our approach can benefit any application that uses continuous sets of measurements (e.g. time course studies), where the system under question can be expected to be constrained in a predictable fashion, and where it is desirable to quantify the uncertainty in the intervals between measurements.

The Castor analysis software developed to calculate the likelihood distributions is available under the GNU GPL from:


We thank Pedro Mendes, Reinhard Laubenbacher, Henning Mortveit, TM Murali and George Terrell for valuable discussions.