Contributed reagents/materials/analysis tools: MW DB NSW. Wrote the paper: MW DB NSW.

The authors have declared that no competing interests exist.

Recently, a novel approach has been developed to study gene expression in single cells with high time resolution using RNA Fluorescent In Situ Hybridization (FISH). The technique allows individual mRNAs to be counted with high accuracy in wild-type cells, but requires cells to be fixed; thus, each cell provides only a “snapshot” of gene expression. Here we show how and when RNA FISH data on pairs of genes can be used to reconstruct real-time dynamics from a collection of such snapshots. Using maximum-likelihood parameter estimation on synthetically generated, noisy FISH data, we show that dynamical programs of gene expression, such as cycles (

Programs of gene expression lie at the heart of how cells regulate their internal processes. Some dynamical gene-expression programs, such as the cell cycle, are well known and studied; others, such as metabolic cycles, have only recently been recognized; and many other dynamical programs, including switches, are likely to be discovered. Traditional bulk studies typically fail to resolve such cycles or switches, because individual cells are out of phase with each other. On the other hand, standard techniques for studying single cells are limited in time resolution and scope. RNA Fluorescent In Situ Hybridization (FISH) is a single-cell technique that offers both high time resolution and precise quantification of mRNA molecules, but requires fixed cells. We have explored how, when, and with what prior information FISH snapshots of pairs of genes can be used to accurately reconstruct gene-expression dynamics. The technique can be readily implemented, and is broadly applicable from bacteria to mammals. We lay out a principled and practical approach to extracting biological information from RNA FISH data to reveal new information about the dynamics of living organisms.

Cells are well known to respond to external conditions by altering their gene expression. In recent years, many examples of altered gene expression programs have been revealed by population-level studies, including microarray studies of yeast, mammalian, and bacterial cells. But many cells are also known to alter gene expression in ways that are heterogeneous across a cell population. Examples include the acquisition of competence for DNA uptake

Since population-level studies are not well suited to reveal heterogeneous behavior, how can heterogeneous changes in gene expression be studied and quantified? Fluorescent reporter proteins have been used successfully to report on expression of a small number of genes either via FACS analysis or fluorescence microscopy. However, the use of fluorescent reporters is generally limited to highly expressed genes, with time resolution severely limited by fluorescent protein maturation and the low turnover rates of the fluorescent marker. Moreover, construction of fluorescent reporters can be laborious and impractical for studies of large-scale transcriptional responses.

A promising approach that has recently been used to study gene expression on a cell-by-cell basis is Fluorescence In Situ Hybridization (FISH)

Here, we present an approach to extracting information about the dynamics of gene expression from FISH data by considering correlations of expression between pairs of genes (

The dashed curves indicate mean transcript numbers. The actual number of mRNA transcripts will fluctuate from cell to cell and from cycle to cycle. (A) Sketch of the number of mRNA transcripts versus time for two genes in the continuous regime, where fluctuations about the mean are small. (B) Sketch of FISH observations corresponding to (A): a large number of mRNA transcripts from both genes will typically be found in each cell. Inset: schematic of corresponding distribution of pairwise FISH data. (C) Sketch of the bursty regime where typically only the most recent transcriptional burst contributes to the total mRNA number, implying large fluctuations about the mean. (D) Sketch of FISH observations corresponding to (C): many cells display either no mRNA or bursts of a single mRNA type, and coincident bursts of both mRNA types are rare. Inset: schematic of corresponding distribution of pairwise FISH data.

Specifically, we show how Maximum Likelihood Estimation (MLE)

Importantly, the method we present here for inferring intracellular dynamics from data in the form of “snapshots” is quite general, relying only on measurements of pairs of quantities in single cells, with no requirement for exact counts. The method can therefore be applied with little modification in other contexts such as quantitative immunofluorescence or single-cell sequencing studies.

We presume that production of mRNA transcripts is a stochastic process. Transcription factors bind to DNA at random times, with a probability that depends on other signals, and which can therefore also vary with time. Binding of one or more transcriptional activators, or unbinding of repressors, typically leads to production of a “burst” of mRNA transcripts. One can distinguish three regimes, two of which are illustrated in

For each regime of mRNA expression, our approach consists of defining a class of possible dynamics, and choosing the one for which the observed data is most likely. Specifically, for a given set of model parameters, we calculate the probability of the observed data, and then ask for the particular set of parameters that maximizes this probability. Since the probabilities don't sum to one over all models (

In practice the parameter optimization in MLE can be a challenge, and algorithms used to search parameter space for the maximum likelihood can get stuck in local maxima. However, the general formulation of the maximum likelihood approach is conceptually distinct from the detailed choice of algorithms used to optimize parameters, and so we have chosen to present only fully optimized results in the main text. In

It is important to recognize one absolute limitation of using FISH data to reconstruct the dynamics of gene expression. Because cells must be fixed before mRNAs are measured, only “snapshots” of individual dynamical trajectories are available. As a consequence, it is impossible from FISH data alone to determine the overall time scale of the dynamics of gene expression. Thus, while it is possible to infer from correlated FISH data that cells undergo cycles of gene expression, and even practical, as we will show, to accurately reconstruct such cycles, it is not possible, even in principle, to determine the period of these oscillations. Similarly, it is not possible, even in principle, to determine which direction around the cycle of gene expression corresponds to the forward arrow of time. In many cases, we anticipate that other methods,

At a qualitative level, the regime of continuous mRNA production allows for relatively straightforward reconstruction of cyclic gene-expression dynamics. In the absence of noise, FISH data for even a single pair of genes is sufficient. One can simply plot the ordered pairs of FISH data as in

(A) Comparison of true mRNA dynamics (red curve) with MLE reconstructions based on 4 genes (solid blue curve) and 2 genes (dashed blue curve), for

Reconstruction of gene-expression dynamics in the regime of bursty mRNA production is more challenging. In this case, the data consists of the presence or absence of bursts of mRNAs, with rare coincidences of bursts for two genes (

We first consider the continuous regime where many bursts typically contribute to the instantaneous mRNA number. To demonstrate the MLE algorithm, we reconstruct the dynamics of gene expression using synthetic FISH data for which the underlying dynamics is known. We focus on analyzing cyclic dynamics,

For each FISH observation, one therefore has:

We generate synthetic FISH data by first choosing the parameters in Eq. (4) for the oscillating mRNA levels

To test the accuracy of reconstruction of mRNA dynamics using our MLE approach, we generated a large number of parameter sets, and for each parameter set generated synthetic FISH data and then applied MLE to reconstruct the true dynamics. Specifically, for each parameter set defining a trajectory of mean mRNA levels

As shown in

To quantify the accuracy of the MLE algorithm, we computed the reconstruction error

We now consider the bursty regime where a cell will typically either have few (or no) mRNAs of a particular type, or the mRNAs present will come from a single recent burst of transcription. In this limit, the information provided by FISH is essentially binary: either mRNAs for a particular gene are present at significant levels, indicating a recent burst, or they are not. Formally, if a significant number of mRNAs for gene

We denote by

For simplicity, let us now consider only the lowest harmonic, as in the previous section. We introduce the

An important consideration in analyzing FISH data is that overall transcription rates may vary from cell to cell. Indeed, measurements of gene-expression noise in single yeast cells at the protein level reveal

We now consider a model where the expression pattern can switch stochastically among

(A) Illustration of the number of mRNA transcripts versus time, in the bursty limit, for two genes subject to regulation by a stochastic 2-state switch (solid lines); dashed lines indicate true burst probabilities

It is straightforward to see that

For either cyclic or switching dynamics, maximum likelihood parameter estimation in the regime of bursty mRNA production requires the following steps: (1) estimating the mean burst probability and covariance from the FISH data, (2) determining the uncertainty of these estimates, and (3) obtaining the parameters for which the observed data is most likely. Taking an average over FISH data provides an estimate

To estimate the uncertainty in the covariances, we first note that the true variance in

As discussed above, for bursty mRNA production the means and covariances alone cannot distinguish cyclic from switching dynamics. However, if one has prior evidence that gene expression is cyclic, maximum likelihood estimation can be usefully employed to reconstruct the dynamics. If

Inference error

A stochastic switch between 2 states implies a covariance matrix of rank 1, and therefore can be distinguished from cyclic dynamics, which leads to a minimum rank of 2 (unless all the genes are exactly in phase). Still, one piece of additional information is required to reconstruct the dynamics. For example, it is sufficient to know the expression level of a single gene in one state. Here, we instead assume that the probability

In principle, with enough FISH data it should be possible to reconstruct more than just the probability of observing a burst. For example, the entire distribution of mRNAs of each type in each switching state could be obtained using MLE,

In the regime of bursty mRNA production, all of the information from FISH is contained in the mean burst probabilities and the covariance matrix, suggesting that Principal Component Analysis (PCA) could be usefully applied. For example, for a 2-state switch the covariance matrix has rank 1. Thus, according to Eq. (15), performing PCA by diagonalizing

In practice, elements of the PCA and MLE approaches can be usefully combined. The main utility of PCA lies in diagonalizing

In recent years, McKnight and coworkers demonstrated that the yeast

Note that the expression levels cycle periodically and approach zero at some point along the cycle. Adapted from

To analyze the dynamics, we first binarized the FISH data of

We assumed that the dynamics is cyclic and considered the expansion of Eq. 6 up to the first harmonic. Such a model has 74 independent parameters for 25 genes. Moreover, the number of data points per pair of genes varies from 175 to 16032, with only 29 pairs having more than 2000 data points. Thus some of the correlations are well-characterized, but others are not. If only the 29 gene pairs with more than 2000 data points are considered, even a single-harmonic model is under-constrained. To circumvent this problem, we are guided by the observation apparent from

Next, the likelihood of all the observed FISH correlations was maximized with respect to the phases

(A) Reconstructed dynamics of the probabilities

To quantify our results statistically, we define for each cluster,

From the chemostat studies

In

In general, Maximum Likelihood Estimation (MLE) requires finding the set of model parameters for which the observed data are most likely. Finding the global maximum in the space of model parameters can be a challenging task, particularly as there may be many local maxima in which a search algorithm can get stuck. For synthetic FISH data in the regime of continuous mRNA production, we found that such local maxima occurred frequently. (In contrast, for synthetic FISH data in the bursty regime a simple steepest-descent algorithm invariably found the same maximum, independent of initial conditions.) To find the global maximum in the continuous regime, we developed a heuristic algorithm that worked very well in practice to reconstruct simple cycles.
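The problem of local maxima can be illustrated with a minimal sketch. The one-parameter log-likelihood below is a hypothetical toy function with two peaks, not the likelihood used in this study, and the function names, step size, and starting points are all assumptions chosen for illustration. A simple gradient-based climb from a single starting point can stop at the lower peak, whereas repeating the climb from several starting points and keeping the best result recovers the higher one.

```python
import math

def log_likelihood(theta):
    # Hypothetical toy log-likelihood (an assumption, not the paper's model):
    # a high peak near theta = 2 and a lower, local peak near theta = -2.
    return math.log(math.exp(-(theta - 2.0) ** 2)
                    + 0.3 * math.exp(-(theta + 2.0) ** 2))

def gradient(f, theta, h=1e-6):
    # Central finite-difference estimate of df/dtheta.
    return (f(theta + h) - f(theta - h)) / (2.0 * h)

def ascend(f, theta, step=0.05, iters=2000):
    # Plain gradient ascent with a fixed step; converges to the nearest peak.
    for _ in range(iters):
        theta += step * gradient(f, theta)
    return theta

def multi_start(f, starts):
    # Climb from every starting point and keep the solution of highest likelihood.
    candidates = [ascend(f, s) for s in starts]
    return max(candidates, key=f)

stuck = ascend(log_likelihood, -3.0)                    # ends at the local peak
best = multi_start(log_likelihood, [-3.0, 0.5, 3.0])    # ends at the global peak
print(round(stuck, 3), round(best, 3))
```

With more parameters the number of starting points needed for a blind search grows quickly, which is why good initial estimates matter.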

One approach is to consider various initial parameter values, and to use a steepest-descent algorithm to find the local maximum of the likelihood. Then the global maximum (with the highest likelihood) could be chosen among the different solutions. However, in practice this procedure can be very time-consuming if initial conditions are chosen randomly. Here we propose two approaches to first compute estimates of the parameters, and then use these estimates to initiate the optimization protocol. In these two approaches we estimated the parameters as follows: (1) For the mean expression level we took

We assign

In the second approach, we tried to approximate the relative phases rather than their absolute values. The initial parameters

Results of optimization using approaches (a) and (b) to set initial parameter values are shown in

Mean reconstruction error

The ability to count mRNA molecules in single cells by Fluorescence In Situ Hybridization (FISH)

To reconstruct the dynamics of gene expression from FISH data, our approach employs Maximum Likelihood Estimation (MLE)

In applying our approach, how should one choose among models to reconstruct gene dynamics? For example, when is it better to use multiple harmonics instead of a single harmonic to model a cycle? The answer depends on the type of data. We discuss first the regime of continuous mRNA production. For this case, a standard and reliable way to choose among models when fitting data is “leave-one-out” validation, which rewards a good fit while penalizing overfitting. In the leave-one-out approach, a model is selected and its parameters are optimized on the entire data set, but with one data point left out. The resulting parameterized model is then used to predict the left-out data point. The average fitting error, taken over all possible left-out data points, is a robust measure of the quality of the model. Among competing models, the one that minimizes this error can be selected as the better choice. In the regime of continuous mRNA production, leave-one-out validation can be applied within the MLE framework by using the log(likelihood) of the left-out data point in place of the fitting error. Among competing models, the one with the largest average log(likelihood) is the best choice.
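The leave-one-out procedure within an MLE framework can be sketched as follows. This is a minimal illustration under stated assumptions: the counts are made-up numbers, and the two candidate models (a Poisson and a geometric distribution of mRNA counts, each with a single parameter) are stand-ins chosen because their maximum-likelihood fit is simply the sample mean; they are not the models compared in this study. All function names are hypothetical.

```python
import math

def poisson_logpmf(k, mean):
    # log P(k) for a Poisson distribution with the given mean (mean > 0).
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def geometric_logpmf(k, mean):
    # log P(k) for a geometric distribution on k = 0, 1, 2, ... with the given mean.
    p = 1.0 / (1.0 + mean)
    return math.log(p) + k * math.log(1.0 - p)

def loo_mean_loglik(counts, logpmf):
    # Leave each point out in turn, fit on the rest, score the left-out point.
    total = 0.0
    for i, left_out in enumerate(counts):
        rest = counts[:i] + counts[i + 1:]
        fitted_mean = sum(rest) / len(rest)  # MLE of the mean for both models
        total += logpmf(left_out, fitted_mean)
    return total / len(counts)

# Hypothetical, strongly overdispersed counts (illustrative only).
counts = [0, 0, 0, 1, 0, 2, 0, 15, 0, 9, 1, 0, 22, 0, 3, 0]

scores = {
    "poisson": loo_mean_loglik(counts, poisson_logpmf),
    "geometric": loo_mean_loglik(counts, geometric_logpmf),
}
preferred = max(scores, key=scores.get)  # largest average log(likelihood)
print(scores, preferred)
```

The same loop applies unchanged to richer models; only the fitting step and the per-point log(likelihood) need to be replaced.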

In contrast, finding the “best” model for data in the bursty mRNA regime is generally an under-constrained problem. We showed explicitly that for many cases it is impossible in principle to distinguish among different types of models, or even to find a unique best set of parameters for a given model. Intuitively, reduction of bursty FISH data to pairwise covariances means that even as the number of FISH data points approaches infinity, the number of model constraints stays finite. So, for bursty FISH data, inference alone cannot guide one in choosing the model, and one must also use common sense. Clearly, prior knowledge of the system under study should be used in selecting a model. In addition, a simple rule is that one should use models that are sufficiently parsimonious in parameters not to have degenerate solutions. For example, in analyzing FISH data on metabolic cycles, we chose the one-harmonic model because there were not enough low-noise covariances to constrain a two-harmonic model. More generally, it is advisable to choose a model with significantly more well-constrained data than parameters. If the model is barely constrained, the peak of likelihood will generally be close to flat in some directions in parameter space and the reconstruction will be poor.
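The reduction of bursty FISH data to mean burst probabilities and a pairwise covariance can be sketched as follows. The sketch assumes that each gene is binarized at its own median count and that both genes were counted in the same cells; the counts, gene labels, and function names are hypothetical and serve only to show the bookkeeping.

```python
from statistics import median

def binarize(counts_by_gene):
    # Map each gene's per-cell counts to 1 (burst) or 0 (no burst),
    # using that gene's median count as the threshold (an assumption).
    bursts = {}
    for gene, counts in counts_by_gene.items():
        threshold = median(counts)
        bursts[gene] = [1 if c > threshold else 0 for c in counts]
    return bursts

def mean_and_covariance(x, y):
    # Mean burst probabilities of two genes and their covariance,
    # estimated as <x*y> - <x><y> over cells.
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mean_x * mean_y
    return mean_x, mean_y, cov

# Hypothetical mRNA counts for one gene pair measured in the same six cells.
fish = {
    "geneA": [0, 5, 0, 7, 1, 9],
    "geneB": [6, 0, 8, 0, 4, 1],
}
bursts = binarize(fish)
p_a, p_b, cov_ab = mean_and_covariance(bursts["geneA"], bursts["geneB"])
print(bursts, p_a, p_b, cov_ab)
```

In this toy example the two genes never burst in the same cell, so the covariance is negative; for N genes there are only N means and N(N-1)/2 such covariances, however many cells are measured.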

Reconstruction of gene-expression dynamics from FISH data presents multiple practical challenges. One important issue is noise in the measurement of mRNA levels. For the regime of continuous mRNA production, we have shown that sufficient data can compensate for both the noise inherent in gene expression and the noise arising from uncertainty in measurement. For the regime of bursty mRNA production, “binarizing” the data into the presence or absence of a significant number of mRNA molecules substantially reduces the impact of measurement noise. A practical question here is the best threshold to use for binarizing the data. In many cases, the dynamics will be best reconstructed by setting the threshold well above 1 mRNA transcript; for example, in treating the data for metabolic cycles we chose the median expression level for each gene as its threshold. A higher threshold is less sensitive to measurement noise (fewer false positives), and to occasional transcripts produced by promoter leakage (better identification of true bursts), and a higher threshold also allows finer time resolution, as a given burst will remain above threshold for a shorter time (

False-positive rates and false-negative rates are also both important considerations in analyzing FISH data. These are essentially technical issues beyond the scope of our study, but a few remarks are in order. In Ref.

A related issue, highlighted by Zenklusen

Another practical issue in reconstructing gene-expression dynamics from FISH measurements is that data may come in mixed forms,

While these and other practical issues are important to consider, our successful reconstruction of yeast metabolic cycles using the FISH data of Silverman

The many advantages of FISH – absolute quantification, high time resolution, use of wild-type cells, ability to simultaneously measure multiple mRNA types, and broad application across species from bacteria

Average cluster activities Q_{j}(t), as defined in the text, taking into account the presence of global transcriptional noise.

(0.96 MB TIF)

Rank of covariance matrix and required number of additional constraints (obtained from triplet-FISH measurements or other sources) necessary for complete parameter inference in the regime of bursty mRNA production, for both cyclic and stochastic switching dynamics.

(0.44 MB TIF)

We thank the members of the Botstein lab for very valuable discussions.