The authors have declared that no competing interests exist.

Conceived and designed the experiments: VG SAKN STKL MG. Performed the experiments: VG. Analyzed the data: VG STKL. Contributed reagents/materials/analysis tools: VG STKL. Wrote the paper: VG MG STKL SAKN. Software development: VG.

Bayesian inference methods are extensively used to detect the presence of population structure given genetic data. The primary output of software implementing these methods is the set of ancestry profiles of sampled individuals. While these profiles robustly partition the data into subgroups, there is currently no objective method to determine whether the fixed factor of interest (e.g. geographic origin) correlates with the inferred subgroups and, if so, which populations are driving this correlation. We present O

When there is a lack of free gene flow within sexual populations, neutral and selective forces will erode population homogeneity and tend to establish population structure

The optimal number of inferred populations may be estimated from these analyses

The main aspect we consider here is how the assignment of individuals to inferred populations relates to the factor of interest. The Bayesian methods derive subgroups and assign individuals to them without knowledge of the origin of individuals. Imagine one samples individuals from three geographic locations, and hence location is hypothesised to be a driver of structure. An analysis of the genotypes using Bayesian methods suggests that the optimal number of inferred populations is four. What does this mean? Any number of possibilities are biologically feasible: one location might harbor two or more populations, or perhaps geographic origin bears no relationship to the inferred populations, but some other factor does. The issue is that the current visualisation methods, while informative, do not allow population assignments and ancestry profiles to be objectively analysed: only a subjective assessment of the plots is possible.

Our method statistically analyses these ancestry profiles and allows one to determine whether inferred population assignment and the factor of interest (e.g. origin of individuals) are significantly correlated. Having determined the extent to which a factor of interest defines observed population structure, one may then conduct finer scale analyses to ascertain the relative contribution of each sampled and inferred population to overall population structure. Which sampled and inferred populations are most differentiated or contribute the most to overall structure? We set about applying a statistical procedure which can objectively quantify the level of structure in these ancestry profiles, test the sources of structure, and determine statistical significance using a permutation approach. Our method complements visualisation with distruct, adds a final step to the pipeline for population structure analyses, and allows one to analyse the factors driving population structure within ancestry profiles and the extent to which these factors explain the variability seen within ancestry profiles as a whole.

O

O

Our aim is to determine the extent to which the factor of interest (encoded as the predefined populations) is reflected in the ancestry profiles. We use the

Our null hypothesis states that the inferred ancestries do not reflect our predefined populations, i.e. individuals inferred to share a high proportion of ancestry (forming a population within the data) appear randomly scattered among the predefined populations or, alternatively, all individuals have equal ancestries to all inferred populations. In short, this indicates the factor of interest does not account for or drive inferred population structure. An established way of assessing how well the predefined populations are represented by the inferred populations is by evaluating the variation within and across predefined populations (e.g.,

The

where

i.e., we assess how much of the total variability in the data is accounted for by grouping the data points according to the predefined populations. The

The usual way of assessing the amount of evidence against the null hypothesis is by computing a
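As a rough illustration of this testing step, the sketch below computes the share of total variability explained by the predefined populations and estimates a p-value by shuffling the population labels. The function names, the between/total sum-of-squares form of the statistic, the number of permutations and the toy data are all assumptions made for this example; they are not taken from the software described here.

```python
import random


def variance_explained(profiles, groups):
    """Between-group sum of squares divided by total sum of squares.

    profiles: list of rows, one per individual, one column per inferred population.
    groups:   list of predefined population labels, one per individual.
    """
    n = len(profiles)
    k = len(profiles[0])
    grand = [sum(row[c] for row in profiles) / n for c in range(k)]
    total = sum((row[c] - grand[c]) ** 2 for row in profiles for c in range(k))
    members = {}
    for row, label in zip(profiles, groups):
        members.setdefault(label, []).append(row)
    between = 0.0
    for rows in members.values():
        for c in range(k):
            group_mean = sum(r[c] for r in rows) / len(rows)
            between += len(rows) * (group_mean - grand[c]) ** 2
    return between / total if total > 0 else 0.0


def permutation_p_value(profiles, groups, n_perm=999, seed=0):
    """Estimate how often shuffled labels explain at least as much variability."""
    rng = random.Random(seed)
    observed = variance_explained(profiles, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so that equivalent partitions are not lost to rounding.
        if variance_explained(profiles, shuffled) >= observed - 1e-12:
            hits += 1
    # Count the observed labelling itself as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two predefined populations, two inferred populations.
profiles = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
            [0.10, 0.90], [0.20, 0.80], [0.15, 0.85]]
groups = ["A", "A", "A", "B", "B", "B"]
stat, p = permutation_p_value(profiles, groups)
```

With only six individuals the smallest attainable p-value is limited by the number of distinct label arrangements, which is one reason the sample size matters when interpreting significance.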

Since we assess the null hypothesis using the variability in the data, and the data are proportions, we have to account for heterogeneity in variance. Proportions close to 0 or close to 1 tend to have a smaller variance than proportions close to 0.5. Not accounting for this can lead to observing effects that are due to the heterogeneity in the variance rather than similarities in the population structure. To address this problem we follow the common approach and use the logit transform on the proportions, i.e. we replace
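A minimal sketch of the logit transform applied to one ancestry profile is given below. The clipping constant `eps`, used to keep proportions of exactly 0 or 1 finite, is an assumption of this example; the text does not state how such boundary values are handled.

```python
import math


def logit_profile(proportions, eps=1e-6):
    """Map proportions in [0, 1] to the real line via log(p / (1 - p)).

    eps is a hypothetical clipping constant that keeps 0 and 1 finite.
    """
    transformed = []
    for p in proportions:
        p = min(max(p, eps), 1.0 - eps)
        transformed.append(math.log(p / (1.0 - p)))
    return transformed


# Proportions near the boundaries are stretched, those near 0.5 barely move.
example = logit_profile([0.0, 0.25, 0.5, 0.75, 1.0])
```

The transform is symmetric about 0.5, so a proportion and its complement map to values of equal size and opposite sign.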

Having established the overall level of structure within the dataset, we can apply the statistic to examine the relative contributions of each of the predefined and inferred populations to this structure. To assess the contribution we remove one predefined or one inferred population at a time and recalculate
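The leave-one-out idea can be sketched as follows: drop one predefined population (rows) or one inferred population (a column) and record how the statistic changes. The statistic is written here as between-group over total sum of squares, and the remaining columns are not renormalised after a column is dropped; both choices, along with all names and the toy data, are assumptions of this example.

```python
def variance_explained(profiles, groups):
    """Between-group sum of squares divided by total sum of squares."""
    n, k = len(profiles), len(profiles[0])
    grand = [sum(r[c] for r in profiles) / n for c in range(k)]
    total = sum((r[c] - grand[c]) ** 2 for r in profiles for c in range(k))
    members = {}
    for row, label in zip(profiles, groups):
        members.setdefault(label, []).append(row)
    between = sum(
        len(rows) * (sum(r[c] for r in rows) / len(rows) - grand[c]) ** 2
        for rows in members.values() for c in range(k)
    )
    return between / total if total > 0 else 0.0


def contributions(profiles, groups):
    """Change in the statistic when each population is removed in turn."""
    full = variance_explained(profiles, groups)
    by_predefined = {}
    for label in sorted(set(groups)):
        kept = [(r, g) for r, g in zip(profiles, groups) if g != label]
        rows = [r for r, _ in kept]
        labels = [g for _, g in kept]
        by_predefined[label] = variance_explained(rows, labels) - full
    by_inferred = {}
    for col in range(len(profiles[0])):
        reduced = [[v for c, v in enumerate(r) if c != col] for r in profiles]
        by_inferred[col] = variance_explained(reduced, groups) - full
    return full, by_predefined, by_inferred


# Hypothetical toy data: population C is admixed between the first two clusters.
profiles = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10],
            [0.05, 0.90, 0.05], [0.10, 0.80, 0.10],
            [0.40, 0.40, 0.20], [0.30, 0.50, 0.20]]
groups = ["A", "A", "B", "B", "C", "C"]
full, by_predefined, by_inferred = contributions(profiles, groups)
```

A negative change means the removed population was contributing to the structure, while a positive change means the remaining populations are better separated without it.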

The primary units of interest are populations deriving from the predefined factor level (e.g. geographic location) so we next focus on further analysing the patterns of structure between them. This allows us to identify similarities and differences between predefined populations and test the significance of these relationships. To do this we apply the

An integral part of every statistical analysis is plotting the data to visualise the outcome of the analysis. To visualise the structure derived from the inferred populations and their relation to the predefined populations we use canonical discriminant analysis (CDA, see e.g.,

The output of O

The first figure visualises all individuals coloured according to the predefined population they belong to. The two axes are labelled with the two variables explaining the highest proportion of variability in the data. This proportion is part of the axis label, thus providing the user with information about the amount of variability visualised in the 2D-plot. Further, the plot shows two ellipsoids centred at the hypothetical average over all data points. The inner ellipse contains approximately 50% of the individuals while the outer ellipse contains about 95% of the individuals.

The second figure summarises the information of the first by drawing 66% ellipsoids for every predefined population centred at the respective population mean. This type of plot indicates the position of the predefined populations relative to each other when given the transformed variables.
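To make the coverage levels concrete, the sketch below shows how the size of a data ellipse relates to its nominal coverage when the two plotted variables are assumed to be bivariate normal. Under that assumption the squared Mahalanobis distance follows a chi-square distribution with two degrees of freedom, whose quantile has the closed form -2 ln(1 - coverage). The function names are hypothetical, and the plotting software may construct its ellipses differently.

```python
import math


def ellipse_scale(coverage):
    """Mahalanobis radius of a bivariate normal ellipse with the given coverage."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    return math.sqrt(-2.0 * math.log(1.0 - coverage))


def inside_ellipse(point, mean, cov, coverage):
    """Check whether a 2D point falls inside the coverage ellipse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Squared Mahalanobis distance using the explicit 2x2 inverse.
    dist_sq = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return dist_sq <= ellipse_scale(coverage) ** 2


# Radii for the coverage levels mentioned in the text.
radii = {level: ellipse_scale(level) for level in (0.50, 0.66, 0.95)}
```

For real data the stated percentages are approximate, since individuals need not follow a bivariate normal distribution on the plotted variables.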

The final figure is called an HE plot, where H stands for hypothesis and E stands for error. The plot uses the same axis labels as the previous plots, but the predefined populations are reduced to their centres. In addition, inferred populations are visualised by arrows indicating their relation to the transformed variables. Possibly the most important feature of the plot is the pair of ellipsoids. The one labelled group indicates the range of individuals, while the red ellipsoid labelled error indicates the range of variation between the group means if predefined and inferred populations do not resemble each other. If there is no resemblance in structure, the red ellipsoid will be large and may exceed the group ellipsoid, while a strong resemblance will lead to a small error ellipsoid within the group ellipsoid.

To test the effectiveness of our method, we applied it to simulated and experimental data. Simulations using coalescent theory allow the demographic modeling of population structure backwards in time. This means we can directly compare the divergence of populations with known parameters with the performance of our method to describe population division processes throughout the simulation. The application of our method to experimental data then serves to illustrate the practical benefits of the method and provides further useful information on the data.

The fastsimcoal software

We chose to analyse two published datasets showing high and low levels of population structure in order to evaluate the effectiveness of our method across a range of conditions encountered in nature. The first of these datasets comprises 1,484 humans genotyped at 678 microsatellite loci in 78 worldwide populations from 7 distinct geographic continents

The human data were analysed with structure as per

The O

The

Error bars denote the standard error of three separate simulations for each sampling time point.

Our analysis shows I

Applying the O

We recapitulated the analyses of data from

Dataset | Scale | |

Human | Continental | |

Regional | ||

Regional |

Ancestry profiles were generated using

Predefined Population | Inferred Population
Africa | 6 (Purple)
East Asia | 5 (Green)
Europe | 4 (Pink)
Middle East | 1 (Orange)
Oceania | 3 (Yellow)
Central South Asia | 2 (Blue)
America |

Ancestry profiles were generated using InStruct with

The relatively low

Finally, the changes to

(a) Mapping the individual data according to the CDA variables. The inner ellipsoid contains 50% of all individuals, the outer ellipsoid contains 95% of all individuals. Individuals are colour- and shape-coded according to their respective sampled region. (b) The HE plot shows the relation of variation in the group means on two variables relative to the error variance. The coloured arrows indicate the position of the inferred populations relative to the axes obtained by the canonical discriminant analysis. The black points indicate predefined populations (WA = West Auckland; WI = Waiheke Island; HB = Hawke's Bay) while numbers at the arrows indicate inferred populations.

Note that not every dataset will show enough discrimination with just two CDA variables. The variation indicated at the axis is a good indicator of how much variation has been covered by the CDA variables. If a third variable is useful, use the

We have presented a novel application of a classic statistical tool to analyse ancestry profiles produced from the Bayesian methods implemented in

The steady increase in structure through time since divergence seen in the simulation data would not be easy to determine using present methods in population genetics. Our method quickly and easily captures this information and tests significance using a permutation approach. We found that within our simulations significant population structure could be detected in all three replicates after 50 generations by I

The application of our method to experimental data showed the comparability of the overall

The absolute

The method described in this work has wide-ranging applications to any field employing population genetic techniques, and we feel that it is a valuable addition to a pipeline for the analysis of population structure. An objective quantification of population structure means that disparate datasets may now be compared. This opens up the ability to conduct theoretical and practical tests on the nature of population structure and the factors that influence its inception and perpetuation. The ability to look within a dataset at the causes of structure helps to determine the relative differences between populations and allows further interpretation of the data. We believe that objectively quantifying the level of structure in data, while taking into account important characteristics such as population size, number of predefined populations and statistical significance, is a significant addition to the currently available analyses.


We thank Brian McArdle for many fruitful discussions on multivariate analysis.