The authors have declared that no competing interests exist.
Conceived and designed the experiments: SS YSS. Performed the experiments: SS. Analyzed the data: SS YSS. Contributed reagents/materials/analysis tools: SS YSS. Wrote the paper: SS YSS. Developed the software used in analysis: SS.
Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. Deep learning makes use of multilayer neural networks to learn a feature-based function from the input (e.g., hundreds of correlated summary statistics of data) to the output (e.g., population genetic parameters of interest). We demonstrate that deep learning can be effectively employed for population genetic inference and learning informative features of data. As a concrete application, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by
Deep learning is an active area of research in machine learning which has been applied to various challenging problems in computer science over the past several years, breaking long-standing records of classification accuracy. Here, we apply deep learning to develop a novel likelihood-free inference framework to estimate population genetic parameters and learn informative features of DNA sequence data. As a concrete example, we focus on the challenging problem of jointly inferring natural selection and demographic history.
With the advent of large-scale whole-genome variation data, population geneticists are currently interested in considering increasingly more complex models. However, statistical inference in this setting is a challenging task, as computing the likelihood of a complex population genetic model is a difficult problem both theoretically and computationally.
In this paper, we introduce a novel likelihood-free inference framework for population genomics by applying
As a concrete example, we consider models of non-equilibrium demography and natural selection, for which multi-locus full-likelihood computation is prohibitive, and apply deep learning to the challenging problem of jointly inferring demographic history and selection (see [
Our focus on the joint inference of demography and selection is motivated by
Several machine learning methods have been developed for selection only, often focusing on classifying the genome into neutral versus selected regions. Examples include methods based on support vector machines (SVMs) [
Many methods have been developed to infer ancestral population size changes, including PSMC [
Few previous works have addressed both population size changes and selection. Galtier
Approximate Bayesian Computation (ABC) [
We see our deep learning method as a complementary approach to, rather than a direct replacement of, other likelihood-free inference methods such as ABC or SVMs. Deep learning has several advantages, such as making full use of the simulated datasets, elegantly handling correlations between summary statistics, and producing interpretable features of the data. It also produces an “inverse” function from the input summary statistics to the population genetic parameters of interest. However, in contrast to ABC, it does not provide a posterior distribution.
Deep learning has its beginnings in neural networks, which were originally inspired by the way neurons are connected in the brain [
The
A major breakthrough was made in 2006 when Hinton and Salakhutdinov [
Following the work of Hinton and Salakhutdinov, deep learning has been applied to various challenging problems in computer science over the last several years, making groundbreaking progress and breaking long-standing accuracy records that had been set by approaches based on hand-coded rules. Well-known examples include automatic speech recognition (transforming spoken words into typed text) [
Many variations have been developed, including
Rejection-based methods have been used since the late 1990s (see [
In its simplest form, ABC does not naturally handle correlated or weakly informative summary statistics, which can add noise to the inference. To tackle this problem, many methods for reducing dimensionality or judiciously selecting summary statistics have been proposed (see [
A second issue with ABC is the rejection step, which does not make optimal use of the datasets that are not retained. The more statistics and parameters used, the more datasets must be simulated and rejected to properly explore the space, making the interaction between these two issues problematic. To address the problem of rejecting datasets, different weighting approaches have been proposed (see [
One final issue with ABC is the black-box nature of the output. Given the distances between the simulated datasets and the target dataset, and the posterior, there is no clear way to tell which statistics were the most informative.
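The rejection step discussed above is simple to state in code. The sketch below is a generic illustration of rejection ABC, not the specific setup used in this paper: a toy one-parameter model, the sample mean as the single summary statistic, and an absolute-distance acceptance rule are all assumptions made for the example.

```python
import random

def rejection_abc(observed_stat, simulate, prior_draw, n_sims, tolerance):
    """Basic rejection ABC: keep parameter draws whose simulated summary
    statistic falls within `tolerance` of the observed one."""
    accepted = []
    for _ in range(n_sims):
        theta = prior_draw()              # draw a parameter from the prior
        stat = simulate(theta)            # simulate data, reduce to a summary
        if abs(stat - observed_stat) <= tolerance:
            accepted.append(theta)        # retained draws approximate the posterior
    return accepted

# Toy example: estimate the mean of a normal with known sd = 1,
# using the sample mean of 50 draws as the summary statistic.
random.seed(0)
true_mean = 2.0
observed = sum(random.gauss(true_mean, 1) for _ in range(50)) / 50

posterior = rejection_abc(
    observed_stat=observed,
    simulate=lambda th: sum(random.gauss(th, 1) for _ in range(50)) / 50,
    prior_draw=lambda: random.uniform(-5, 5),
    n_sims=20000,
    tolerance=0.1,
)
print(len(posterior), sum(posterior) / len(posterior))
```

Note that the vast majority of the 20,000 simulated datasets are rejected and contribute nothing to the estimate, which is exactly the inefficiency described above.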
We first describe the simulated dataset we generated to investigate joint inference of demographic history and selection. In the Methods section, we describe how to transform each region (as well as each region of the real data) into a set of 345 summary statistics, and also detail our deep learning algorithm. In what follows, we present the results of our method in a variety of different contexts.
To create a simulated dataset that is appropriate for our scenario of interest, we first define the response variables we would like to estimate. For simplicity, we consider piecewise-constant effective population size histories with three epochs: a recent effective population size
For each demographic history (bottleneck), we simulated many different genomic regions. Each region can either have no selection (neutral), one site with an
To simulate data, we used the program
Recent effective population size scaling factor: λ_{1} ∼ Unif(3, 14).
Bottleneck effective population size scaling factor: λ_{2} ∼ Unif(0.5, 6).
Ancient effective population size scaling factor: λ_{3} ∼ Unif(2, 10).
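These three draws can be written out directly; the sketch below samples one demographic history from the uniform ranges above (the dictionary keys are illustrative names, not identifiers from our software):

```python
import random

def draw_demography():
    """Draw one piecewise-constant, three-epoch demographic history by
    sampling the three scaling factors from their uniform priors."""
    return {
        "recent": random.uniform(3, 14),       # lambda_1
        "bottleneck": random.uniform(0.5, 6),  # lambda_2
        "ancient": random.uniform(2, 10),      # lambda_3
    }

random.seed(0)
demo = draw_demography()
print(demo)
```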
For the selection classes, the different types are shown below:
Using this strategy, we simulated 2500 different demographic histories, with 160 regions for each one, for a total of 400,000 datasets. To build 160 regions for each demography, we simulated 40 datasets for each of the selection classes above.
Of the 400,000 total datasets, we used 75% for training and left out 25% for testing. In
We evaluate the results in three ways, using the relative error of the estimates: |true size − estimated size| / true size.
| Deep learning predictions | Recent size | Bottleneck size | Ancient size |
|---|---|---|---|
| Average stat prediction | 0.051 | 0.074 | 0.487 |
| Final prediction | 0.098 | 0.077 | 0.569 |
| Neutral regions prediction | 0.072 | 0.083 | 0.566 |
The rows below each dotted line are the three different size predictions for this dataset.
| | Recent size | Bottleneck size | Ancient size |
|---|---|---|---|
| True sizes | 878,675 | 235,199 | 713,001 |
| Average stat prediction | 872,541 | 264,817 | 600,650 |
| Final prediction | 840,784 | 243,180 | 622,359 |
| Quantiles (2.5th, 97.5th) | (659,231, 954,631) | (209,797, 279,496) | (597,402, 649,887) |
| Standard deviation | 74,477 | 37,238 | 54,165 |
| Neutral regions prediction | 873,289 | 237,686 | 619,943 |
In both cases, we have the largest uncertainty for the most recent population size. For the simulated dataset, the true population sizes are shown in blue, and we note that our point estimate of the most ancient population size is the least accurate.
To analyze the selection results, we calculated a confusion matrix in
Each row represents the datasets that truly belong to each selection class. Each column represents the datasets that were actually classified as each selection class. Ideally we would like all 1’s down the diagonal, and 0’s in the off-diagonal entries. The largest number in each row is shown in boldface. We can see that neutral datasets are the easiest to classify, and sometimes regions under selection (hard sweeps in particular) look neutral as well (first column). The overall percentage of misclassified datasets was 6.2%.
| True Class | Called: Neutral | Called: Hard Sweep | Called: Soft Sweep | Called: Balancing |
|---|---|---|---|---|
| Neutral | **0.9995** | 0.0002 | 0.0003 | 0.0000 |
| Hard Sweep | 0.1434 | **0.8333** | 0.0032 | 0.0201 |
| Soft Sweep | 0.0096 | 0.0010 | **0.9891** | 0.0003 |
| Balancing | 0.0301 | 0.0356 | 0.0056 | **0.9287** |
We defined “low” frequency as
| | | Called: Neutral | Called: Hard Sweep | Called: Soft Sweep | Called: Balancing |
|---|---|---|---|---|---|
| Low | (18.4% of datasets) | 0.3137 | 0.0009 | 0.0035 | |
| Moderate | (12.5% of datasets) | 0.0988 | 0.0000 | 0.0793 | |
| High | (69.1% of datasets) | 0.0084 | 0.0044 | 0.0138 | |
We also wanted to test the impact of unsupervised pretraining using autoencoders (see
For the results in the first row, the weights of the entire network were initialized randomly and then optimized. In the second row, the weights were initialized using autoencoders for each layer. The positive impact of unsupervised pretraining is clear; random initialization causes the optimization to get stuck in a poor local minimum.
| Initialization Type | Recent size | Bottleneck size | Ancient size |
|---|---|---|---|
| Random | 0.429 | 0.421 | 0.710 |
| Autoencoder | 0.061 | 0.166 | 0.577 |
Again, ideally we would like all 1’s down the diagonal, and 0’s in the off-diagonal entries. The largest number in each row is shown in boldface. When the network is initialized randomly, almost every dataset is classified as neutral; the network has not really learned anything meaningful from the input data. The overall percentage of misclassification is 74.8% for random initialization, while it is only 6.1% for autoencoder initialization.
| True Class | Called: Neutral | Called: Hard Sweep | Called: Soft Sweep | Called: Balancing |
|---|---|---|---|---|
| *Random Initialization* | | | | |
| Neutral | **1.000** | 0.000 | 0.000 | 0.000 |
| Hard Sweep | **0.978** | 0.007 | 0.000 | 0.015 |
| Soft Sweep | **1.000** | 0.000 | 0.000 | 0.000 |
| Balancing | **1.000** | 0.000 | 0.000 | 0.000 |
| *Autoencoder Initialization* | | | | |
| Neutral | **1.000** | 0.000 | 0.000 | 0.000 |
| Hard Sweep | 0.145 | **0.830** | 0.004 | 0.021 |
| Soft Sweep | 0.011 | 0.001 | **0.988** | 0.000 |
| Balancing | 0.030 | 0.028 | 0.001 | **0.941** |
Lastly, we wanted to assess the impact of training the deep network on data that was created under different conditions than the testing data. In particular, we simulated testing data where the ratio of the recombination rate to the mutation rate was equal to 4 (instead of 1 as in our previous results). The results of this experiment are shown in
We also simulated testing data with bottlenecks that were more extreme than the training data, to assess the impact of a misspecified range of demographic models. Specifically, the most recent population size was outside the range of the training data for all the testing datasets. The results are shown in
We then ran the real
For the demographic history, the population size results are shown in
The first row (predictions based on averaging the statistics for each region in the
| Prediction | Recent size | Bottleneck size | Ancient size |
|---|---|---|---|
| Average stat prediction | 544,200 | 145,300 | 652,700 |
| Final prediction | 694,000 | 178,400 | 638,300 |
| Quantiles (2.5th, 97.5th) | (359,300, 1,004,300) | (63,800, 265,000) | (593,200, 694,900) |
| Standard deviation | 170,400 | 53,800 | 27,100 |
| Neutral regions prediction | 635,900 | 170,700 | 646,700 |
In blue is our history from the first line of
The number of windows classified as Neutral, Hard Sweep, Soft Sweep, and Balancing Selection are 1101, 2455, 179, 404, respectively. See
Each of the 4 subplots is a chromosome arm (2L, 2R, 3L and 3R). We restricted the plot to selected sites with a probability of at least 0.95.
Upon examining the genes in the top regions classified to be under selection, we find several notable results. In the hard sweep category, a few of the top regions harbor many genes involved in chromatin assembly or disassembly. Further, there are two regions each containing exactly one gene, where the gene has a known function. These are
In the soft sweep category, we find many genes related to transcription, and several related to pheromone detection (chemosensory proteins). We also find four regions each containing exactly one gene, with that gene having a known function. These are
Finally in the balancing selection category, one interesting gene we find is
One of the advantages of deep learning is that it outputs an optimal weight on each edge connecting two nodes of the network, starting with the input statistics and ending with the output variables. By investigating this network, we are able to determine which statistics are the “best” or most helpful for estimating our parameters of interest. Here we use two different methods, one using permutation testing and the other using a perturbation approach described in
To perform permutation testing, we randomly permuted the values of each statistic across all the test datasets and then measured the resulting decrease in accuracy. For each output (3 population sizes and selection category), we then retained the 25 statistics that caused the largest accuracy decreases. The results are shown in the 4-way Venn diagram in
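Permutation testing of this kind does not depend on deep learning specifically; it works for any fitted predictor. The sketch below shows the generic procedure; the simple linear model, two-statistic toy data, and squared-error accuracy measure are illustrative assumptions, not our actual network or statistics.

```python
import random

def permutation_importance(predict, X, y, n_stats):
    """For each input statistic, shuffle its column across datasets and
    measure how much the prediction error increases."""
    def mse(Xm):
        return sum((predict(row) - t) ** 2 for row, t in zip(Xm, y)) / len(y)

    baseline = mse(X)
    importance = []
    for j in range(n_stats):
        permuted = [row[:] for row in X]
        col = [row[j] for row in permuted]
        random.shuffle(col)                  # break the statistic/output link
        for row, v in zip(permuted, col):
            row[j] = v
        importance.append(mse(permuted) - baseline)  # error increase = importance
    return importance

# Toy data: the output depends on statistic 0 only; statistic 1 is noise.
random.seed(1)
X = [[random.random(), random.random()] for _ in range(500)]
y = [3 * row[0] for row in X]
predict = lambda row: 3 * row[0]             # a "trained" model, for illustration

imp = permutation_importance(predict, X, y, n_stats=2)
print(imp)   # statistic 0 is important; permuting statistic 1 changes nothing
```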
For each variable, the top 25 statistics were chosen using permutation testing. The Venn diagram captures statistics common to each subset of output variables, with notably less-informative statistics shown in the lower right. Close, mid, and far denote the genomic region where the statistic was calculated. The numbers after each colon refer to the position of the statistic within its distribution or order; for the SFS statistics, it is the number of minor alleles. For each region, there are 50 SFS statistics, 16 BET statistics (distribution between segregating sites), 30 IBS statistics, and 16 LD statistics.
The results (
It is interesting that the list of notably less-informative statistics is very similar for the permutation and perturbation methods. Overall, each method identified roughly the same set of informative statistics, but the precise position of each statistic in the Venn diagram shifted.
Although ABC is not well suited for our scenario of interest and deep learning is a complementary method, we wanted to find a setting in which the performance of the two methods could be compared. To this end, we restricted the analysis to estimating (continuous) demographic parameters only. We used the popular ABCtoolbox [
We tested two scenarios, one with the full set of summary statistics (345 total), and the other with a reduced set of summary statistics (100 total). For the reduced set of summary statistics, we chose statistics which seemed to be informative: the number of segregating sites, Tajima’s
Out of 1000 demographies (160,000 datasets total), 75% were used for training and 25% for testing. In this scenario, deep learning generally outperforms ABCtoolbox, as measured by the relative error: |true size − estimated size| / true size.
| Dataset | Method | Recent size | Bottleneck size | Ancient size |
|---|---|---|---|---|
| Full summary statistics | ABCtoolbox | 0.062 | 0.043 | 0.218 |
| Full summary statistics | Deep learning | 0.044 | 0.028 | 0.221 |
| Filtered summary statistics | ABCtoolbox | 0.161 | 0.035 | 0.311 |
| Filtered summary statistics | Deep learning | 0.065 | 0.055 | 0.319 |
In deep learning, one hyper-parameter that should be investigated closely is the regularization parameter λ in the cost function, also called the weight decay parameter. (See
In terms of runtime, the majority of the time is spent simulating the data. During training, most of the runtime is spent fine-tuning the deep network, which requires computing the cost function and its derivatives many times. To speed up this computation, our deep learning implementation is parallelized across datasets, since each dataset contributes to the cost function independently. This significantly improves the training time for deep learning, which can be run in a few days on our simulated dataset (400,000 regions) with a modest number of hidden nodes/layers. Once training is completed, an arbitrary number of datasets can be tested more or less instantaneously. In contrast, each of the “training” datasets for ABC must be examined for
Simulating the data (which is the same for both methods) and computing the summary statistics is the most time consuming part of the process, although it is highly parallelized. ABCtoolbox could also be parallelized, but we did not implement that here. For the “filtered statistics” row, the 345 statistics have been downsampled to 100 for all the datasets.
| Task | ABCtoolbox | Deep Learning |
|---|---|---|
| Simulating data | 370 hrs (10–15 cores) | 370 hrs (10–15 cores) |
| Computing summary statistics | 1800 hrs (1 core) | 1800 hrs (1 core) |
| Training and testing (filtered statistics) | 114 hrs (1 core) | 3.75 hrs (20 cores) |
| Training and testing (unfiltered) | 336 hrs (1 core) | 11 hrs (20 cores) |
| Training | N/A | 74 hrs (20 cores) |
| Testing | N/A | 3 min (1 core) |
In this paper, we have sought to demonstrate the potential of deep learning as a new inference framework for population genomic analysis. This investigation is still in its infancy, and there are many directions of future work. One important advantage of deep learning is that it provides a way to distinguish informative summary statistics from uninformative ones. Here we have presented two methods for extracting informative statistics given a trained deep network. Other methods are possible, and it is an open question which one would produce the most accurate results. It would be interesting to down-sample statistics using a variety of methods, then compare the results. Overall, learning more about how various summary statistics relate to parameters would be useful for population genetics going forward.
The prospect of using deep learning to classify regions as neutral or selected is very appealing for subsequent demographic inference. There are other machine learning methods that perform such classification, but they are generally limited to two classes (selected or neutral). One exception is a study in humans [
We infer many hard sweeps in African
One aspect of the simulations that warrants future study is the effect of the ratio of recombination to mutation rate,
Deep learning can make efficient use of even a limited number of simulated datasets. In this vein, it would be interesting to use an approach such as ABC MCMC [
We would also like to apply deep learning to a wider variety of scenarios in population genetics. Population structure and splits would be natural examples, although these scenarios would most likely require very different sets of summary statistics.
On the computer science side, deep learning has almost exclusively been used for classification, not continuous parameter inference. It would be interesting to see the type of continuous parameter inference presented here used in other fields and applications.
Finally, machine learning methods have been criticized for their “black-box” nature. In some sense they discard much of the coalescent modeling that we know to be realistic, although some of it is retained through the expert-designed summary statistics of the data. It would be advantageous to combine the strengths of coalescent theory with the strengths of machine learning to create a robust method for population genetic inference.
In this section we provide the theory behind training deep networks. The notation in this section follows that of [
An example deep network is shown in
The first layer is the input data (each dataset has 5 statistics), and the last layer predicts the 2 response variables. The last node in each input layer (+1) represents the bias term. Here the number of layers
To learn this function, we first describe how the values of the hidden nodes are computed, given a trial weight vector. The value of hidden node
Hence, given the input data and a set of weights, we can
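This forward computation can be sketched in a few lines. The tiny network and weights below are made up for illustration; the hidden layer uses the logistic activation and the output is left linear, as for continuous response variables.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Feed an input vector through the network.  Hidden layers use the
    logistic activation; the final layer is linear."""
    a = x
    n_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = z if l == n_layers - 1 else [sigmoid(v) for v in z]
    return a

# Tiny network: 3 inputs -> 2 hidden nodes -> 1 output (weights are made up).
W1 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
b2 = [0.5]
out = forward([1.0, 2.0, 3.0], [W1, W2], [b1, b2])
print(out)   # a single continuous prediction
```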
Consider a single training example
It is possible to train a deep network by attempting to minimize the cost function described above directly, but in practice, this proved difficult due to the high-dimensionality and non-convexity of the optimization problem. Initializing the weights randomly before training resulted in poor local minima. Hinton and Salakhutdinov [
Training an autoencoder is an optimization procedure in itself. As before, let
When the number of hidden nodes is large, we would like to constrain an autoencoder such that only a fraction of the hidden nodes are “firing” at any given time. This corresponds to the idea that only a subset of the neurons in our brains are firing at once, depending on the input stimulus. To create a similar phenomenon for an autoencoder, we create a
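This behavior is typically encouraged with a Kullback–Leibler penalty comparing each hidden node's average activation to a small target value ρ. The sketch below computes only that penalty term (the target ρ = 0.05 and the toy activations are illustrative choices):

```python
import math

def sparsity_penalty(hidden_activations, rho=0.05):
    """KL-divergence sparsity penalty.  `hidden_activations` is a list of
    per-dataset hidden-layer activation vectors; rho is the target
    average activation for each hidden node."""
    n_hidden = len(hidden_activations[0])
    m = len(hidden_activations)
    penalty = 0.0
    for j in range(n_hidden):
        rho_hat = sum(a[j] for a in hidden_activations) / m   # average firing rate
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty

# A node firing at the target rate contributes essentially nothing;
# a node that is almost always on is penalized heavily.
sparse_unit = [[0.05], [0.05], [0.05]]
busy_unit = [[0.9], [0.95], [0.85]]
print(sparsity_penalty(sparse_unit), sparsity_penalty(busy_unit))
```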
In addition, a regularization term is included, which prevents the magnitude of the weights from becoming too large. To accomplish this, we add a penalty to the cost function that is the sum of the squares of all weights (excluding the biases), weighted by a well-chosen constant λ, which is often called the
We also regularize the weights on the last layer during fine-tuning, so our deep learning cost function becomes J(W, b) = (1/m) Σ_{i=1}^{m} ½ ‖h_{W,b}(x^{(i)}) − y^{(i)}‖^{2} + (λ/2) Σ (W^{(l)}_{ji})^{2}, where the final sum runs over all non-bias weights in every layer.
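A minimal sketch of such a regularized squared-error cost is shown below; the hypothesis function and toy data are placeholders, not our actual network.

```python
def regularized_cost(hypothesis, X, Y, weight_matrices, lam):
    """Mean squared-error data term plus an L2 weight-decay penalty.
    Biases are deliberately excluded from the penalty."""
    m = len(X)
    data_term = sum(
        0.5 * sum((h - y) ** 2 for h, y in zip(hypothesis(x), y_vec))
        for x, y_vec in zip(X, Y)
    ) / m
    decay = 0.5 * lam * sum(w ** 2 for W in weight_matrices for row in W for w in row)
    return data_term + decay

# Toy check: a perfect predictor pays only the weight-decay penalty.
X = [[1.0], [2.0]]
Y = [[2.0], [4.0]]
W = [[[2.0]]]                        # a single 1x1 weight matrix
predict = lambda x: [2.0 * x[0]]
print(regularized_cost(predict, X, Y, W, lam=0.1))   # 0.5 * 0.1 * 2^2 = 0.2
```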
In population genetics, often we want to estimate continuous response variables. To compute our hypothesis for a response variable, based on a set of weights, we could use a logistic activation function,
For such classification results, if we had two classes, we could use logistic regression to find the probability a dataset was assigned to each class. With more than two classes, we can extend this concept and use
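The softmax computation itself is short: it turns the K output-layer scores into class probabilities that sum to one. A minimal sketch with four classes matching our selection categories (the scores are made up):

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into class probabilities.
    Subtracting the max score first keeps exp() numerically stable."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for (neutral, hard sweep, soft sweep, balancing) -- made up.
probs = softmax([2.0, 0.5, 0.1, -1.0])
print(probs)       # the highest probability goes to the first (neutral) class
print(sum(probs))  # probabilities sum to 1
```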
For many deep learning applications, the raw data can be used directly (the pixels of an image, for example). Unfortunately, we currently cannot input raw genomic data into a deep learning method. As with ABC, we need to transform the data into summary statistics that are potentially informative about the parameters of interest. Unlike ABC, however, deep learning should not be negatively affected by correlated or uninformative summary statistics. Thus we sought to include a large number of potentially informative summary statistics of the data. To account for the impact of selection, we divided each 100 kb region into three smaller regions: 1) close to the selected site (40–60 kb), 2) mid-range from the selected site (20–40 kb and 60–80 kb), and 3) far from the selected site (0–20 kb and 80–100 kb). These regions are based on the simulation scenario in Peter
Note that the selected site was chosen randomly within region 1.
For all the statistics described below,
Number of segregating sites within each smaller region,
Tajima’s
Folded site frequency spectrum (SFS):
Length distribution between segregating sites: let
Identity-by-state (IBS) tract length distribution: for each pair of samples, an IBS tract is a contiguous genomic region where the samples are identical at every base (delimited by bases where they differ). For all pairs, let
Linkage disequilibrium (LD) distribution: LD is a measure of the correlation between two segregating sites. For example, if there was no recombination between two sites, their alleles would be highly correlated and LD would be high in magnitude. If there was infinite recombination between two sites, they would be independent and have LD close to 0. For two loci, let
H1, H12, and H2 statistics, as described in Garud
This gives us a total of 345 statistics.
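Several of these statistics are straightforward to compute from the data directly. Below is a self-contained sketch of three of them — the folded SFS, pairwise IBS tract lengths, and the r² measure of LD — on a toy 0/1 genotype matrix (rows are haploid samples, columns are segregating sites). The data and conventions are illustrative, not our exact implementation.

```python
def folded_sfs(genotypes):
    """Folded SFS: entry k-1 counts segregating sites whose minor allele
    appears in exactly k of the n samples."""
    n = len(genotypes)
    sfs = [0] * (n // 2)
    for j in range(len(genotypes[0])):
        count = sum(row[j] for row in genotypes)
        minor = min(count, n - count)          # fold on the minor allele
        if minor > 0:
            sfs[minor - 1] += 1
    return sfs

def ibs_tract_lengths(diff_positions, seq_length):
    """Lengths of contiguous identical (IBS) tracts between a pair of
    samples, delimited by the positions where they differ."""
    lengths, prev = [], 0
    for pos in sorted(diff_positions):
        lengths.append(pos - prev)
        prev = pos + 1
    lengths.append(seq_length - prev)          # tract to the end of the region
    return lengths

def ld_r_squared(site_a, site_b):
    """r^2 between two segregating sites: D^2 / (pA(1-pA) pB(1-pB)),
    where D = pAB - pA*pB and the inputs are 0/1 alleles per sample."""
    n = len(site_a)
    p_a, p_b = sum(site_a) / n, sum(site_b) / n
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

genotypes = [
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
print(folded_sfs(genotypes))                     # two singletons, one doubleton
print(ibs_tract_lengths([10, 50], 100))          # tracts of length 10, 39, 49
print(ld_r_squared([1, 1, 0, 0], [1, 1, 0, 0]))  # perfectly linked sites: 1.0
```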
To modify our deep learning method to accommodate this type of inference problem, during training we use an outer loop that changes the demography as necessary, and an inner loop that accounts for differences in selection for each region. During testing, we compare three different methods for predicting the effective population sizes: (1)
One final complication is that we estimate continuous response variables for the population sizes, but consider selection to be a discrete response variable. This involves a linear activation function for the population sizes and a softmax classifier for selection. A diagram of our deep learning method is shown in
One advantage of deep learning is that the weights of the optimal network give us an interpretable link between the summary statistics and the output variables. However, from these weights it is not immediately obvious which statistics are the most informative or “best” for a given output variable. If we wanted to use only a small subset of the statistics, which ones would give us the best results? To answer this question, we use both a permutation testing approach and a perturbation method described in
An open-source software package (called evoNet) that implements deep learning algorithms for population genetic inference is available at
In this scenario, the testing data is simulated with a recombination rate that is 4 times higher than that of the training data. The effective population size results (top table) are still generally accurate, but selection (bottom table) is harder to predict. Neutral regions are predicted correctly, but soft sweeps are also often classified as neutral. The classifier has difficulty distinguishing between hard sweeps and balancing selection. The top table should be compared to
(TIFF)
In this scenario, the testing data is simulated with bottleneck parameters that are more severe than the training data. This has a slight negative impact on the population size results (largely on the most recent size which was outside the training range), but has little effect on the selection results. The overall percentage of misclassified regions is 7.8%. The top table should be compared to
(TIFF)
(XLS)
(XLS)
The single hidden layer serves to learn informative combinations of the inputs, remove correlations, and typically reduce the dimension of the data. After the optimal weight on each connecting arrow is learned through labeled training data, unlabeled data can be fed through the network to estimate the response variables.
(TIFF)
For each variable, the top 25 statistics were chosen, according to the procedure in
(TIFF)
The x-axis shows increasing values of λ, and the y-axis shows the error on the validation dataset. The curve has a characteristic shape, with low and high λ producing poorer results than an intermediate value. For these hidden layer sizes and this dataset,
(TIFF)
The input data (
(TIFF)
(TIFF)
We thank Andrew Kern and Daniel Schrider for helpful discussion on classifying hard and soft sweeps, as well as Yuh Chwen (Grace) Lee for many insightful comments on our