Conceived and designed the experiments: CS MS BPI MK TJP. Performed the experiments: CS HP VA TJP. Analyzed the data: CS HP MK TJP. Wrote the paper: CS MS BPI MK TJP.

The authors have declared that no competing interests exist.

High-throughput measurement of gene expression at single-cell resolution, combined with systematic perturbation of environmental or cellular variables, provides information that can be used to generate novel insight into the properties of gene regulatory networks by linking cellular responses to external parameters. In dynamical systems theory, this information is the subject of bifurcation analysis, which establishes how system-level behaviour changes as a function of parameter values within a given deterministic mathematical model. Since cellular networks are inherently noisy, we generalize the traditional bifurcation diagram of deterministic systems theory to stochastic dynamical systems. We demonstrate how statistical methods for density estimation, in particular, mixture density and conditional mixture density estimators, can be employed to establish empirical bifurcation diagrams describing the bistable genetic switch network controlling galactose utilization in yeast

Decades ago, Waddington, and later Kauffman, likened the dynamics of a differentiating cell to a marble rolling downhill on bumpy terrain—the epigenetic landscape. In this metaphor, the valleys of the landscape represent the paths that cells can follow towards a stable cell type, and the fate of the cell is determined by the constant modulation of the epigenetic landscape by internal and external signals. With new technologies for measuring single-cell gene expression, it is increasingly feasible to map out these valleys and how external variables influence cellular responses. Moreover, it is possible to quantify population level effects, such as what fraction of a population of cells arrives at one valley or another, and variability at the cellular level, such as how individual cells bounce around within, and possibly between, valleys due to the stochasticity of cellular biochemistry. In this paper, we discuss which characteristics of the epigenetic landscape can readily be extracted from single-cell gene expression data, and describe computational methods for doing so.

One of the primary goals of systems biology is to uncover the dynamics of cellular networks. Sometimes, this has meant collecting time-series data and applying tools for time-series analysis such as Fourier methods to identify periodically expressed genes

Here, we are concerned with the experimental and computational quantification of bifurcation-like behavior in stochastic genetic switches. There is considerable evidence that signalling networks in a population of genetically identical cells exhibit large cell-to-cell variability in their output, despite operating in a homogeneous external environment (see e.g.,

Two ingredients are necessary for empirical analysis of the bifurcation behavior of a cellular network. One is single-cell measurement of one or more cellular variables, such as gene expression. Technologies such as microarrays, SAGE or quantitative mass spectrometry, which operate on collections of cells or whole tissues, obscure potential heterogeneity in the sample. They do not discriminate, for example, between a 100% increase in expression of a gene in every cell and a 200% increase in its expression in 50% of the cells. With technologies such as fluorescent cell imaging and flow cytometry, however, the state of each cell can be ascertained. As a result, one can determine whether the cell population is homogeneous or whether it comprises a set of subpopulations, each undergoing different dynamical behaviors corresponding to different growth strategies, differentiation endpoints, etc. The other necessary ingredient is a method for experimental manipulation of some system parameter(s) or environmental condition(s), in order to study how subpopulations change under varying conditions. This may mean changing the concentration of ligands or nutrients in the cellular environment or artificially manipulating the activity of regulatory factors inside individual cells. For example, Ozbudak et al.

The organization of this paper is as follows. First, we discuss traditional bifurcation analysis in greater detail, introducing in particular saddle-node bifurcations, a type of bifurcation widely associated with the dynamics of gene regulatory switches. We then describe the necessity of generalizing the notion of bifurcation behavior to account for the inherent noise (stochasticity) in cellular networks. Next, we present the data that motivated our study—single-cell flow cytometry data measuring activity in the yeast galactose utilization network over a range of extracellular galactose concentrations. We then report on two broad approaches to analyzing this data and extracting estimates of bifurcation structure, namely, mixture density modeling and conditional mixture density modeling. We evaluate the relative strengths of these approaches, and describe a number of novel qualitative and quantitative observations about switching in the galactose network.

Bifurcation analysis is a branch of dynamical systems theory concerned with steady-state or asymptotic behaviors of a dynamical system

(A) Bifurcation diagram of the system in Equation 1, an idealized model of a gene activated by signal

In contrast with deterministic models, real cellular networks can be significantly noisy, with system variables fluctuating over time for a variety of reasons, including, for example, fluctuations in biochemical reaction rates, random partitioning of cellular content at cell division, and variation in cell size and cell age (see e.g.,

In a bistable system, fluctuations can induce stochastic transitions between the two expression states such that some cells are expressing at low level while others express at high levels. The result is the emergence of a bimodal population distribution and subpopulations with distinct expression characteristics.
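The emergence of a bimodal population from noise-induced switching can be illustrated with a small simulation. The sketch below is not the model analyzed in this paper: the cubic drift, the noise level `sigma`, the time step and the population size are all assumptions chosen only to give a toy system with a "low" stable state at 1, a "high" stable state at 3 and an unstable state at 2 (arbitrary units).

```python
import math
import random


def simulate_cell(rng, steps=1000, dt=0.01, sigma=0.5, x0=2.0):
    """Euler-Maruyama integration of dx = -(x-1)(x-2)(x-3) dt + sigma dW.

    The drift is a hypothetical toy model (not the paper's Equation 1) with
    stable states at x = 1 ("low") and x = 3 ("high") and an unstable
    state at x = 2.
    """
    x = x0
    noise_scale = sigma * math.sqrt(dt)
    for _ in range(steps):
        drift = -(x - 1.0) * (x - 2.0) * (x - 3.0)
        x += drift * dt + noise_scale * rng.gauss(0.0, 1.0)
    return x


def simulate_population(n_cells=300, seed=0, **kwargs):
    """Simulate independent cells and return their final expression levels."""
    rng = random.Random(seed)
    return [simulate_cell(rng, **kwargs) for _ in range(n_cells)]


population = simulate_population()
low = [x for x in population if x < 2.0]    # cells nearer the low state
high = [x for x in population if x >= 2.0]  # cells nearer the high state
fraction_low = len(low) / len(population)
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
print(f"fraction low: {fraction_low:.2f}, means: {mean_low:.2f}, {mean_high:.2f}")
```

Although every simulated cell obeys the same equation and starts from the same state, the final population splits into two groups clustered around the two stable states, which is the qualitative behavior described above.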

Fluorescence is reported as a function of galactose level in culture (expressed as percent weight per volume; 1% = 10 g/L), under the galactose pregrowth condition (A), and the raffinose pregrowth condition (B). All four biological replicates are shown stacked on each other. The blue area represents the number of cells counted in each fluorescence channel in replicate 1, the next lighter blue area is the sum of the counts in the first two replicates, and so on.

How can we capture the bifurcation behavior of a stochastic dynamical system? Suppose that

The number of distinguishable subpopulations

Some notion of the “location” of those subpopulations, in terms of the observable variable

Some notion of the variability in

The fractions of the whole population that are represented by each subpopulation

This is not a formal definition of stochastic bifurcation structure; these are principles, which might be formalized in a number of different ways. For example, as mentioned above, the modes of the steady state distribution of a stochastic dynamical system have previously been proposed as analogs to the steady states of a deterministic model. Thus, one might use the modes of the distribution

If one can assign every cell to a subpopulation, then the variance of

Our thoughts on stochastic bifurcation structure and methods to estimate it were motivated, indeed necessitated, by data we collected on activity in the galactose utilization network in

A natural approach to modeling multi-modal data is to employ mixture distributions. We model data from each biological replicate separately, in order to avoid conflating replicate-to-replicate variation with cell-to-cell variation within a replicate. Consider a replicate,

For a given replicate, fitting mixture distributions to each galactose concentration

The only downside of this approach is that it does not explicitly model the dependence of these features of stochastic bifurcation structure on the external controllable bifurcation parameter—the galactose concentration

For a given replicate

We used two different approaches to fit mixture models and one approach to fit a conditional mixture model to the data. In all approaches, the mixtures contained one or more Gaussian components as well as a single uniform component. The uniform component was given a fixed mixture coefficient of

(A) Means of subpopulations, as extracted by: mixture models estimated by the expectation-maximization algorithm (EM), mixture models estimated by a combination of mode estimation and expectation-maximization (ME+EM), and a conditional mixture model estimated by expectation-maximization (CEM). The x-axis represents the 17 levels of galactose tested, in order of increasing concentration. The y-axis represents fluorescence channels of the flow cytometer, which are proportional to the logarithm of fluorescent intensity. Darker background shading represents more cells counted in the channel at the given galactose level. (B) Estimated mixture coefficients (prior probabilities) of the low subpopulation as a function of galactose concentration. (C) Estimated standard deviations of the Gaussian distributions representing low (darker) and high (lighter) subpopulations as a function of galactose concentration.

In

(A) Subpopulation means as extracted by the three fitting methods, in all four replicates of the gal-pregrowth condition. (B) Subpopulation means in the four raf-pregrowth replicates. (C,D) Estimated sizes of the low subpopulations in the gal-pregrowth and raf-pregrowth conditions respectively.

Empirical count distributions for the four replicates are shown, smoothed using a width-11 moving average to improve visibility. (A) At the third galactose concentration (0.0022%). (B) At the fourth galactose concentration (0.0033%). (C) At the fifth galactose concentration (0.0038%).

Empirical count distributions for the four replicates are shown, smoothed using a width-11 moving average to improve visibility. (A) At the

While the three fitting methods produce qualitatively similar results in many respects, a question arises as to whether any of the methods is better than the others in a quantitative sense. The first way we examined this question was to compare the log likelihood of the data under different models and replicates.
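A log-likelihood comparison of this kind can be sketched as follows. This is an illustrative sketch rather than our MATLAB code: the model parameters, the channel range of 0 to 1024, the uniform weight and the data values are invented for the example. The mixture combines Gaussian components with a single uniform component, and the same fitted model is scored on the data it was fit to ("training") and on data from another replicate ("testing").

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds, uniform_weight, lo, hi):
    """Density of Gaussian components plus one uniform component on [lo, hi]."""
    density = uniform_weight / (hi - lo) if lo <= x <= hi else 0.0
    for w, m, s in zip(weights, means, sds):
        density += w * gaussian_pdf(x, m, s)
    return density


def mean_neg_log_likelihood(data, model):
    """Average of -log p(x) over the data; lower values mean a better fit.

    Assumes every point has nonzero density under the model.
    """
    return -sum(math.log(mixture_pdf(x, **model)) for x in data) / len(data)


# Hypothetical fitted model for one replicate (all numbers are made up).
model = {
    "weights": [0.599, 0.4],      # Gaussian mixture coefficients
    "means": [200.0, 600.0],      # low and high subpopulation locations
    "sds": [20.0, 30.0],
    "uniform_weight": 0.001,      # small outlier component; value assumed
    "lo": 0.0, "hi": 1024.0,      # assumed fluorescence channel range
}
train_data = [185.0, 200.0, 214.0, 570.0, 600.0, 633.0]  # same replicate
test_data = [170.0, 205.0, 230.0, 540.0, 610.0, 660.0]   # another replicate

train_nll = mean_neg_log_likelihood(train_data, model)
test_nll = mean_neg_log_likelihood(test_data, model)
print(f"training: {train_nll:.3f}, testing: {test_nll:.3f}")
```

A model that scores much worse on held-out replicates than on its own training data is likely capturing replicate-specific detail rather than reproducible structure.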

(A) For each method, the mean negative log likelihood of the data. “Training” means each model is evaluated on the same data to which it is fit, whereas “testing” means each model is evaluated on the data from the other three replicates having the same pregrowth condition. Black bars indicate 95% confidence intervals. (B,C) Variability in the estimated locations of four subpopulations: the low (and only) subpopulation at the zero galactose concentration (P1), the low subpopulation at the

In

We have defined a notion of stochastic bifurcation structure suitable for studying the behavior of stochastic genetic switches, and we have generated an extensive map of the response of the canonical bistable yeast galactose utilization network to variation in external galactose concentrations. While the data broadly conforms to our expectations for stochastic switching between low and high expression states within the network, several additional properties are noteworthy. The establishment of a “high” expressing subpopulation occurs rather abruptly and fairly consistently at a concentration of approximately 0.003% galactose, although this state initially overlaps the low expressing subpopulation. By contrast, the low subpopulation fades away more gradually at higher concentrations, while maintaining clear separation from the high subpopulation. Activity within the high subpopulation, in terms of fluorescent intensity, increases substantially as a function of galactose concentration, by approximately 300% over the range of concentrations tested. Activity within the low subpopulation is fairly constant, and is, in most cases, indistinguishable from that of cells not expressing the reporter gene (data not shown), though there may be a mild increase in expression as the galactose concentration increases. Hence, the response of the network to varying conditions appears to combine a Boolean-type “binary” switch between “on” and “off” expression states with a continuous “graded” modulation of activity within the “on” state.

From a methodological point of view, we proposed that mixture density estimation and conditional mixture density estimation are ideally suited to extracting stochastic bifurcation structure from real, noisy data. Our tests of two different mixture fitting methods and one conditional mixture fitting method suggested that, in most respects, the methods are equally accurate in fitting the data. The conditional mixture model may, however, be somewhat less accurate: visually, it appears to overestimate the location of the high subpopulation at smaller galactose concentrations, and to underestimate it at higher concentrations (see

Conditional mixture models have several additional advantages compared to fitting the data at each galactose level separately: they use fewer total parameters, and are thus less likely to overfit the data, and they explicitly represent and make predictions for the bifurcation structure at all values of the bifurcation parameter, not only the values tested experimentally. This approach worked well on our data. The drawback of this approach is that it requires choosing functional forms to represent the dependence of mixture probabilities and mixture component parameters on the bifurcation parameter. In this case, a proper means of representing mixture probabilities only became clear after doing the individual fits. In early conditional mixture model fits, we assumed the mixture probabilities were independent of galactose concentration. This had the unfortunate side effect that the high component would start to “capture” cells from the low subpopulation at low galactose levels, dragging down the whole mean curve for the high subpopulation until it intersected and overlapped with the low subpopulation. The form we chose for the mixture probabilities avoids this problem by definitively assigning cells to the low component at all galactose levels below some threshold. This illustrates that the strength of using few parameters and explicitly generalizing across bifurcation parameter values also implies a danger of poor performance if an inappropriate representation is chosen. While this is a truism in the statistics and machine learning communities, it is all the more important to keep in mind in systems biology, where there is a greater focus on interpreting models, as opposed to, say, being concerned only about prediction accuracy.

Despite our focus on mixture modeling, one can imagine other approaches for estimating stochastic bifurcation structure. For example, clustering methods such as K-means or self-organizing maps could readily be applied in much the same way as we applied mixture density estimation. Nonparametric density estimation techniques might also be applied, although it would take extra effort to extract subpopulations from a nonparametric density estimate. Investigating such alternative approaches is an important topic for future research.
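As a rough illustration of the clustering alternative, the sketch below applies one-dimensional K-means (Lloyd's algorithm) to fluorescence values and reports, for each cluster, a location, a spread and a population fraction. It is a sketch under stated assumptions, not a method we evaluated: the function names, the quantile-based initialization and the example channel values are all hypothetical, and the number of clusters `k` has to be supplied by the user rather than estimated from the data.

```python
import statistics


def kmeans_1d(values, k=2, iterations=50):
    """Lloyd's algorithm in one dimension; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic initialization at evenly spaced quantiles (an assumption).
    centres = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(statistics.mean(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def summarize_subpopulations(values, k=2):
    """Mean, standard deviation and fraction of each non-empty cluster."""
    labels = kmeans_1d(values, k)
    summary = []
    for j in range(k):
        members = [v for v, lab in zip(values, labels) if lab == j]
        if members:
            summary.append({
                "mean": statistics.mean(members),
                "sd": statistics.pstdev(members),
                "fraction": len(members) / len(values),
            })
    return sorted(summary, key=lambda s: s["mean"])


# Made-up fluorescence channel values for a single galactose concentration.
channels = [100, 110, 90, 105, 95, 400, 420, 380]
subpopulations = summarize_subpopulations(channels)
for s in subpopulations:
    print(s)
```

Unlike the mixture models, this hard assignment has no outlier component and no likelihood, so choosing the number of subpopulations would need a separate criterion.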

Part of our contribution is in specifying four types of information that should be included in a stochastic bifurcation analysis: the number of distinct subpopulations, the fraction of cells they contain, the level of expression and the variance within each subpopulation. Our notion of stochastic bifurcation structure is considerably different from ideas employed in stochastic bifurcation theory, which addresses the behavior of explicitly stochastic dynamical models, such as stochastic differential equations

Stochastic bifurcation structure may provide useful information for the development of quantitative regulatory network models; however, this remains to be investigated. The exact relationship between stochastic observables and model features is not yet clearly established. For example, models of gene regulatory networks are usually derived from molecular interactions within individual cells and rarely consider effects due to population dynamics. The gradual fading of the low-expressing subpopulation observed in our experiments could be due to the stochastic dynamics of the regulatory network itself, or it could be due to a reduced growth rate of the low-expressing cells. Additionally, while we took steps to present the cells in each culture with homogeneous extracellular conditions (see

Careful quantitative estimation of stochastic bifurcation structure facilitates comparison between different experimental conditions or genetic backgrounds. For example, the yeast strain studied by Acar

The experiments use a diploid

Prior to quantification, yHP301 was streaked onto synthetic dropout medium (Wisent, Inc.) agar plates without leucine and histidine supplemented with 2% w/vol glucose and 1% w/vol adenine. Individual colonies were used to inoculate 3 mL rich media (YPR) containing 20 g/L Yeast Bacto-Peptone, 10 g/L yeast extract, 1% w/vol adenine and 2% w/vol raffinose (Wisent) or YPR media supplemented with 2% w/vol galactose (Becton, Dickinson). Following growth for 24 hours at

Reporter gene expression was quantified in individual cells using a Beckman-Coulter FC500 flow cytometer. A total of 60,000 events were collected for each condition and filtered with a custom-written script that applied a fixed elliptical forward/side-scatter autogate, which captured approximately 50% of the events in each sample. The fluorescence intensity (488 nm excitation, 510–550 nm emission) associated with these events was used to generate representative expression distributions for each sample condition. A total of four replicates were obtained for each final galactose concentration and both pre-growth conditions.
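The gating step can be sketched as below. This is a minimal illustration of an elliptical forward/side-scatter gate, not the script used in this study: the function names, the ellipse centre, semi-axes and rotation angle, and the example events are all hypothetical, and the sketch does not show how a gate would be tuned to retain roughly half of the events.

```python
import math


def in_ellipse(fsc, ssc, centre, semi_axes, angle_rad=0.0):
    """True if the (forward, side) scatter point lies inside the ellipse."""
    dx, dy = fsc - centre[0], ssc - centre[1]
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    # Rotate the point into the coordinate frame of the ellipse axes.
    u = dx * cos_a + dy * sin_a
    v = -dx * sin_a + dy * cos_a
    return (u / semi_axes[0]) ** 2 + (v / semi_axes[1]) ** 2 <= 1.0


def gate_events(events, centre, semi_axes, angle_rad=0.0):
    """Return fluorescence values of events whose scatter passes the gate.

    Each event is a (forward_scatter, side_scatter, fluorescence) tuple.
    """
    return [
        fluorescence
        for fsc, ssc, fluorescence in events
        if in_ellipse(fsc, ssc, centre, semi_axes, angle_rad)
    ]


# Made-up events and gate parameters, for illustration only.
events = [(500, 500, 210), (520, 490, 230), (900, 900, 800), (100, 500, 50)]
kept = gate_events(events, centre=(500, 500), semi_axes=(100, 50))
print(kept)
```

Only the fluorescence values of events inside the gate are passed on to build the expression distributions; events outside it are discarded.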

Mixture density estimation by EM used 100 runs per fit to reduce the risk of stopping at a solution that was only locally optimal. Each of the 100 runs began from different random initial parameters. The mean of each Gaussian component was chosen uniformly between the lowest and highest data points. Standard deviations were initialized to 50 (roughly the level observed at single-subpopulation galactose concentrations), and initial mixture probabilities for the Gaussians were set to

The EM fitting employed cross-validation to determine the proper number of Gaussian components to have in the mixture for each replicate and at each galactose level. After fitting a model with

Mode estimation for the mode-estimation-plus-EM approach began by smoothing the data with a running average over a window of 71 channels. Call this

For fitting the conditional mixture density models, we used only a single run of EM, as further runs did not improve accuracy. Updates are standard, as given in Bishop

All code is written in MATLAB. Code and raw data are available upon request, as well as on TJP's website:

We thank Peter Swain for feedback on an earlier draft of this manuscript.