The authors have declared that no competing interests exist.

Conceived and designed the experiments: CKF PM. Analyzed the data: CKF. Wrote the paper: CKF PM.

Human-associated microbial communities exert tremendous influence over human health and disease. With modern metagenomic sequencing methods it is now possible to follow the relative abundance of microbes in a community over time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer the ecological interactions between species directly from sequence data. Any algorithm for inferring ecological interactions must overcome three major obstacles: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units, bias inferences of species interactions due to a statistical problem called “errors-in-variables”. Here we introduce an approach, Learning Interactions from MIcrobial Time Series (LIMITS), that overcomes these obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut microbiomes of two individuals and found that the interaction networks varied significantly between individuals. Furthermore, we found that the interaction networks of the two individuals are dominated by distinct “keystone species”,

Metagenomic sequencing technologies have revolutionized the study of the human-associated microbial consortia making up the human microbiome. Sequencing methods now allow researchers to estimate the relative abundance of the species in a community without having to culture individual species

A microbial community consists of a vast number of species, all of which must compete for space and resources. In addition to competition, there are also many symbiotic interactions where certain species benefit from the presence of other microbial species. For example, a small molecule that is secreted by one species can be metabolized by another

There are two approaches to inferring dependencies between microbial species from metagenomic studies: cross-sectional analysis, and timeseries analysis

Any method for making reliable inferences about species interactions from metagenomic studies must overcome three major obstacles. First, as shown below, a correlation between the abundances of two species does not imply that those species are interacting. Second, metagenomic methods measure the relative, not absolute, abundances of the microbial species in a community. This makes it difficult to infer the parameters in timeseries models. Finally, experimental measurement errors and/or mis-assignment of sequencing reads into operational taxonomic units (OTUs) bias inferences of species interactions due to a statistical problem called “errors-in-variables”

Many previous works use the correlation between the relative abundances of two microbial species in an environment (e.g. the gut) as a proxy for how much the species interact. In particular, a high degree of correlation between the abundances of two species is often taken as a proxy for a strong mutualistic interaction, and a large anti-correlation as indicative of a strong competitive interaction. Using correlations as a proxy for interactions suffers from several drawbacks. First, there are important subtleties involved in calculating correlations between species from relative abundances, but previous studies have presented algorithms (e.g. SparCC) to mitigate these problems

The problems with using correlations in species abundances as proxies for species interactions can be illustrated with a simple numerical simulation. We used the dLV model (Eq. 2) to simulate timeseries of the absolute abundances of 10 species for 1000 timesteps, starting from 100 different initial conditions, for two arbitrary species interaction matrices (see

a) A symmetric interaction matrix and the corresponding correlation matrix. b) There is no relation between the interaction parameters and the correlations in abundance for the symmetric interaction matrix. c) An asymmetric interaction matrix and the corresponding correlation matrix. d) There is no relation between the interaction parameters and the correlations in abundance for the asymmetric interaction matrix. Points from above the diagonal in the interaction matrix are gray circles, whereas points from below the diagonal are black squares. In a and c, matrix elements have been scaled so that the smallest negative element is

In general, the relationship between the interaction coefficients (

Cross-sectional studies that pool data across individuals and utilize the correlations between the abundances of different species as proxies for species dependencies are especially affected by this problem. This suggests that time-series data is likely to be more suited for inferring ecological interactions than cross-sectional data.
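The gap between correlations and interactions can be explored with a small toy simulation. The sketch below is not the authors' code: it assumes a dLV update of the form x_i(t+1) = eta_i(t) x_i(t) exp(sum_j c_ij (x_j(t) - x_eq_j)) with log-normal noise, and the interaction matrix `C`, the equilibrium abundances, the noise level, and the run length are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_dlv(C, x_eq, n_steps, rng, noise_sd=0.1):
    """Simulate an assumed dLV form:
    x_i(t+1) = eta_i(t) * x_i(t) * exp(sum_j C[i, j] * (x_j(t) - x_eq[j]))
    with log-normal multiplicative noise eta."""
    n = len(x_eq)
    x = np.empty((n_steps, n))
    x[0] = x_eq * rng.lognormal(0.0, 0.3, size=n)  # perturbed starting point
    for t in range(n_steps - 1):
        eta = rng.lognormal(0.0, noise_sd, size=n)
        x[t + 1] = eta * x[t] * np.exp(C @ (x[t] - x_eq))
    return x

rng = np.random.default_rng(0)
n_species = 10

# Hypothetical interaction matrix: self-limitation on the diagonal plus
# a sparse set of weak off-diagonal interactions.
C = rng.uniform(-0.1, 0.1, size=(n_species, n_species))
C *= rng.random((n_species, n_species)) < 0.2
np.fill_diagonal(C, -0.5)

x = simulate_dlv(C, np.ones(n_species), n_steps=500, rng=rng)
corr = np.corrcoef(x, rowvar=False)  # species-by-species correlation matrix

# Compare off-diagonal interaction coefficients with abundance correlations.
off_diag = ~np.eye(n_species, dtype=bool)
agreement = np.corrcoef(C[off_diag], corr[off_diag])[0, 1]
print(f"correlation between c_ij and abundance correlations: {agreement:.2f}")
```

Plotting `C[off_diag]` against `corr[off_diag]` gives a scatter of interaction coefficients against abundance correlations for this toy system.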

Even though the species interaction coefficients cannot be inferred from the correlations in species abundances, it is possible to reliably infer the interaction matrix using timeseries models. To do so, one utilizes a discrete time Lotka-Volterra Model (dLV) that relates the abundance of species

As shown in the Materials and Methods, given a time-series of the absolute abundances of the microbes in an ecosystem, one can learn the interaction coefficients by performing a linear regression of

Recall that most modern metagenomic techniques can only measure the relative abundances of microbes, not absolute abundances. This introduces additional complications into the problem of inferring species interactions using timeseries data. Although it is straightforward to infer species interactions by applying linear regression to a timeseries of absolute abundances, it is not

This insight motivates the use of a forward stepwise regression for selecting the covariate species that explain the changes in abundance of species

a) In forward stepwise regression, interactions are added to the model one at a time as long as including the additional covariate lowers the prediction error by a pre-defined threshold. b) The prediction error used for variable selection is evaluated by randomly partitioning the data into a training set used for the regression and a test set used to evaluate the prediction error. c) Multiple models are built by repeatedly applying forward stepwise regression to random partitions of the data, each containing half the data points. The models are aggregated, or “bagged”, by taking the median, which improves the stability of the fit while preserving the sparsity of the inferred interactions.

The procedure for forward stepwise regression is illustrated in

Forward stepwise regression is a greedy algorithm, which results in a well-known instability

a) A symmetric interaction matrix (left), the corresponding matrix inferred from absolute abundance data (middle), and the corresponding matrix inferred from relative abundance data (right). b) There is good agreement between the true and inferred interactions, from both absolute (black) and relative (gray) abundances, for the symmetric interaction matrix. c) An asymmetric interaction matrix (left), the corresponding matrix inferred from absolute abundance data (middle), and the corresponding matrix inferred from relative abundance data (right). d) There is good agreement between the true and inferred interactions, from both absolute (black) and relative (gray) abundances, for the asymmetric interaction matrix. The prediction error threshold was set to 5% for all fits.

To ensure that the exceptional performance of our sparse linear regression approach to inferring species interactions was not a fluke due to a particular choice of interaction matrices, we calculated the correlation between the true and inferred parameters for many randomly generated interaction matrices (see

a) Performance on absolute (red) and relative (black) abundances as a function of sample size for symmetric interaction matrices. b) Performance on absolute (red) and relative (black) abundances as a function of sample size for asymmetric interaction matrices. c) Performance on absolute (red) and relative (black) abundances as a function of the out-of-bag error threshold for symmetric interaction matrices. d) Performance on absolute (red) and relative (black) abundances as a function of the out-of-bag error threshold for asymmetric interaction matrices. Error bars correspond to

Up to this point, our analyses have ignored the impact of “measurement noise” on the inferred species interactions. There are two important sources of measurement noise in metagenomic data. The first source is experimental noise introduced by sequencing errors. The second, and perhaps larger source of noise, is the mis-classification of sequencing reads into operational taxonomic units (OTUs). Most metagenomic studies rely on the sequencing of 16S rRNA to estimate species composition and diversity in a community. These 16S sequences are binned into groups, or OTUs, that contain sequences with a predetermined degree of similarity. By comparing the sequences in an OTU to known sequences in an annotated database, it is often possible to assign OTUs to particular species or strains. In general, this is an extremely difficult bioinformatics problem
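Noise of this kind matters because it enters the covariates of a regression, not only the response. A toy example, unrelated to the paper's simulations, illustrates the effect: the same response is regressed once on the true covariate and once on a version corrupted by multiplicative log-normal noise. All numbers below (sample size, slope, noise levels) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_slope = -0.5

# True covariate and a response that depends on it linearly.
x_true = rng.uniform(1.0, 2.0, size=n)
y = true_slope * (x_true - 1.5) + rng.normal(0.0, 0.02, size=n)

# Observed covariate: the true value times log-normal measurement noise.
x_obs = x_true * rng.lognormal(0.0, 0.3, size=n)

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_clean = np.polyfit(x_true, y, 1)[0]
slope_noisy = np.polyfit(x_obs, y, 1)[0]
print(f"slope from true covariate:  {slope_clean:.3f}")
print(f"slope from noisy covariate: {slope_noisy:.3f}")
```

The slope estimated from the noisy covariate keeps its sign but is pulled toward zero, which is the bias referred to as “errors-in-variables”.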

At first glance, it is tempting to assume that measurement noise, which we assume is multiplicative, simply adds to the stochastic (

Although the errors-in-variables bias cannot be eliminated, the topology of the interaction network can still be reliably inferred using our sparse linear regression approach even when the measurements of the relative species abundances are very noisy. Knowledge of which interactions are beneficial (

Specificity refers to the fraction of species pairs correctly identified as non-interacting, while sensitivity refers to the fraction of species pairs correctly identified as interacting. Both measures range from

Emboldened by the success of our algorithm on synthetic data, we applied LIMITS to infer the species interactions in the gut microbiomes of two individuals. The data from Caporaso et al

The size of a node denotes the median relative species abundance, beneficial interactions are shown as solid red arrows, and competitive interactions are shown as dashed blue arrows. In individual a) species 4

The species interaction network of the gut microbiome of individual (a) (shown in

Additionally, we observed that the species interaction topology of the gut microbiome of individual (a) differs substantially from the species interaction topology of the gut microbiome of individual (b), as is clear from

Metagenomic methods are providing an unprecedented window into the composition and structure of microbial communities. They are revolutionizing our knowledge of microbial ecology and highlight the important roles played by the human microbiome in health and disease. Nevertheless, it is important to carefully consider the tools used to analyze these data and to address their associated challenges. We have highlighted three major obstacles that must be addressed by any study designed to use metagenomic data to analyze species interactions: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units (OTUs), bias inferences of species interactions due to a statistical problem called “errors-in-variables”.

To overcome these obstacles, we have introduced a novel algorithm, LIMITS, for inferring species interaction coefficients that combines sparse linear regression with bootstrap aggregation (Bagging). Our method provides reliable estimates for the topology of the species interaction network even when faced with significant measurement noise. The interaction networks constructed using our approach are sparse, including only the strongest ecological interactions. Regularizing the inference of the interaction network by favoring sparse solutions has the benefit that the results are easily interpretable, enabling the identification of keystone species with many important interactions. Furthermore, our work suggests that it is difficult to learn species interactions from cross-sectional studies that pool samples of the relative abundances of the microbial species from multiple individuals. This highlights the importance of collecting extended time-series data for understanding microbial ecological dynamics.

We applied LIMITS to time-series data to infer ecological interaction networks of two individuals and found that the interaction networks are dominated by distinct keystone species. This motivated us to propose a hypothesis: that the abundances of certain keystone species are responsible for individuality of the human gut microbiome. While more data will be required to confirm or reject this hypothesis, it is intriguing to examine its potential consequences for the human microbiome. The keystone species hypothesis implies that even small perturbations to an environment can have a large impact on the composition of its resident microbial consortia if those perturbations affect a small number of important “keystone” species. Moreover, relatively small differences in individual diets, or minor differences in the interaction between the host immune system and the gut microbiota, that affect keystone species may be sufficient to organize gut microbial consortia into distinct types of communities, or “enterotypes”

Our analysis identified the closely related species

The keystone species hypothesis can be experimentally tested by perturbing the abundance of individual species in a microbial consortium and observing the effect on the composition of the community. Our prediction is that most perturbations will have little impact on the overall structure of the microbial community, but perturbations applied to a small number of keystone species will have a large impact on the structure of the community. Due to ethical concerns, it is difficult to envision a direct experimental test of the keystone species hypothesis in human microbiota and, therefore, to test our specific predictions with regard to the keystone species

Metagenomic sequencing methods have made it possible to follow the time evolution of a microbial population by determining the relative abundances of the species in a community at discrete intervals (e.g. one day). Given the discrete nature of these data, it is most sensible to use a discrete-time model of population dynamics. The discrete-time Lotka-Volterra (dLV) model of population dynamics (sometimes called the Ricker model) relates the abundance of species

Thus far, we have assumed that it is possible to directly measure the absolute abundances

In all of the simulations discussed in the main text the stochasticity was set to

Suppose we are given data consisting of the absolute (or relative) abundances of the species in a population of

Here, we present a high-level outline of the LIMITS algorithm (see

First, we randomly partition the data into a training set and a test set, each containing half the data points.

A set of active coefficients is initialized to

A linear regression including only species

For each coefficient

Next, the inferred coefficients are used to calculate the prediction errors for the test dataset. The particular species

If

Interaction matrices (of size


We would like to acknowledge useful conversations with Sara Collins and Daniel Segré and thank Alex Lang and Javad Noorbakhsh for comments on the manuscript.