
Conceived and designed the experiments: GM. Performed the experiments: GM. Analyzed the data: GM. Wrote the paper: GM.

The author has declared that no competing interests exist.

Principal components analysis (PCA) is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to draw inferences about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's F_{st}

Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.

The distribution of genetic variation across geographical location and ethnic background provides a rich source of information about the historical demographic events and processes experienced by a species. However, while colonization, isolation, migration and admixture all lead to a structuring of genetic variation, in which groups of individuals show greater or lesser relatedness to other groups, making inferences about the nature and timing of such processes is notoriously difficult. There are three key problems. First, there are many different processes that one might want to consider as explanations for patterns of structure in empirical data, and efficient inference, even under simple models, can be difficult. Second, different processes can lead to similar patterns of structure. For example, equilibrium models of restricted migration can give similar patterns of differentiation to non-equilibrium models of population splitting events (at least in terms of some data summaries such as Wright's

An alternative approach to directly fitting models is to use dimension-reduction and data summary techniques to identify key components of the structure within the data in a model-free manner. Perhaps the most widely used technique, and the most important from a historical perspective, is principal components analysis (PCA). Technical descriptions of PCA can be found elsewhere; its key feature, however, is that it can be used to project samples onto a series of orthogonal axes, each of which is made up of a linear combination of allelic or genotypic values across SNPs or other types of variant. These axes are chosen such that the projection of samples along the first axis (or first principal component) explains the greatest possible variance in the data among all possible axes. Likewise, projection of samples onto the second axis maximizes the variance among all possible axes perpendicular to the first, and so on for the subsequent components. Typically, the positions of samples along the first two or three axes are presented, although methods for obtaining the statistical significance of any given axis have been developed

Although PCA is explicitly a non-parametric data summary, it is nevertheless attractive to attempt to use the projections to make inferences about underlying events and processes. For example, dispersion of sample projections along a line is thought to be diagnostic of the samples being admixed between the two populations at the ends of the line, though these need not always be present

In this paper I develop a framework for understanding how PCA relates to underlying processes and events. I show that the expected location of samples on the principal components can, for single nucleotide polymorphism (SNP) data, be predicted directly from the pairwise coalescence times between samples. Because it is often relatively easy to obtain analytical or numerical solutions to expected coalescence times under explicit population genetics models, it is also possible to obtain expressions for the PCA projections of samples under diverse scenarios, including island models, models with isolation and founder events and historical admixture. The result also highlights some key limitations of PCA. For example, it follows that PCA cannot be used to distinguish between models that lead to the same mean coalescence times (for example models with migration or isolation). Furthermore, PCA projections are strongly influenced by uneven sampling. Using examples from human genetics I discuss the implications of these results for making inferences from PCA of genetic variation data.

In this section I provide a brief summary of how PCA is carried out and describe the key result concerning the relationship between PCA and average coalescence time. In what follows I assume that

The transformed data matrix,

The value associated with a given individual in

The correlation between any two rows of

The sum of the variances of the rows equals the variance in the original data.

The variances of the rows are monotonically decreasing.

The variance of the first row is the largest of any possible projection of the original data on a linear combination of the SNPs.
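The properties listed above can be checked numerically on a toy data set. The following is a minimal sketch under stated assumptions, not the author's implementation: it uses a hypothetical matrix of allelic values with only two SNPs (rows) and six haploid samples (columns), so that the eigenvectors of the 2 x 2 SNP covariance matrix have a closed form. All names (`Z`, `X`, `Y`, `pc_var`, `var_along`) are illustrative.

```python
import math

# Hypothetical allelic values: rows are SNPs, columns are haploid samples.
Z = [
    [0, 0, 0, 1, 1, 1],
    [0, 1, 0, 1, 1, 1],
]

def centre(row):
    """Subtract the row mean so that each SNP has mean zero."""
    m = sum(row) / len(row)
    return [x - m for x in row]

def cov(u, v):
    """Sample covariance of two rows that are already centred."""
    return sum(p * q for p, q in zip(u, v)) / (len(u) - 1)

X = [centre(row) for row in Z]

# Entries of the symmetric 2 x 2 covariance matrix of the SNPs.
a, b, d = cov(X[0], X[0]), cov(X[0], X[1]), cov(X[1], X[1])

# Closed-form eigenvectors of [[a, b], [b, d]]: a rotation by theta.
theta = 0.5 * math.atan2(2 * b, a - d)
axes = [
    (math.cos(theta), math.sin(theta)),   # first principal axis
    (-math.sin(theta), math.cos(theta)),  # second principal axis
]

# Transformed data: each row holds the sample projections on one axis.
Y = [[w0 * x0 + w1 * x1 for x0, x1 in zip(X[0], X[1])] for w0, w1 in axes]

pc_var = [cov(row, row) for row in Y]
total_var = a + d

def var_along(angle):
    """Variance of the samples projected on an arbitrary unit direction."""
    proj = [math.cos(angle) * x0 + math.sin(angle) * x1
            for x0, x1 in zip(X[0], X[1])]
    return cov(proj, proj)

best_on_grid = max(var_along(k * math.pi / 180) for k in range(180))

print("variances of the rows of Y:", pc_var)
print("covariance between rows of Y:", cov(Y[0], Y[1]))
print("total variance in the centred data:", total_var)
print("largest variance on a 1-degree grid of directions:", best_on_grid)
```

With more than two SNPs the same checks apply, but the eigenvectors would normally be obtained from a numerical eigen-decomposition routine rather than a closed-form rotation.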

The principal components can be obtained directly by finding the eigenvectors of the covariance matrix

The probability that samples

The probability that sample

The probability that sample

The probability that two samples,

In the case of a low mutation rate, where polymorphic sites are the result of a single historical mutation, expressions can be obtained for the above quantities in terms of features of the genealogical tree

The chart shows a genealogical tree describing the history of a sample of size five. Two samples,

For diploid individuals the genotypic value for an individual at a given SNP is typically given by the sum of the allelic values; i.e.

The implication of Equation 10 is that if the structure of pairwise coalescence times in a given data set can be understood, then the projection of the samples on the principal components can be predicted directly. To illustrate this idea, consider the simple model of a population split shown in

(A) Consider a sample of

These results refer explicitly to the expected value of

A direct consequence of Equation 10 is that PCA predominantly reflects structure in the expected (or mean realized) coalescence time. Consequently, any two demographic models that give the same structure of expected coalescence times will also give the same projections. To illustrate this result, consider a fully general model with two homogeneous populations where the expected coalescence time for two samples from population A is

One connection that is worth exploring further is the link between the results shown here and those of Slatkin

As has been shown previously

PCA projection of samples taken from a set of nine populations arranged in a lattice, each of which exchanges migrants at rate

The principal components identified through PCA can be used to project not just those samples from which the PCs were obtained, but also additional samples. The appeal of such analyses is that they allow structural features identified in one data set to be transferred to another. For example, where data from two source populations and a set of possibly admixed samples are available, projection of the admixed samples onto the axes defined by the source populations can identify the extent of mixed ancestry. The advantage of this approach over simply performing PCA on all samples together is that other structural features within the admixed samples (e.g. admixture from a third population or relatedness) will have little influence on the projection. In the light of the above results showing how the PCA projection of samples can be interpreted in terms of coalescence times, it is interesting to ask how the projection of additional samples onto the same axes also relates to coalescence times.

Consider the case of the general two-population model where the positions of the samples on the first PC are

There are three important points to note when applying this result. First, only if the admixture event was very recent are the source populations likely to be available. Rather, samples may be available only for descendants of these source populations. Consequently, the average divergence between the population A part of an individual's genome and other samples from population A might typically be greater than for two samples taken directly from population A. However, this effect is likely to be very similar for the two source populations and, given Equation 23, these effects largely cancel out.

The second point to note is that if samples are admixed between more than two populations, the result generalises so that an individual whose genome is derived from several source populations will have a projected position (along each significant PC) defined by the weighted sum of the positions of its source populations. Informally, the result arises because of the linearity in Equation 22. Those parts of the genome with ancestry from a given population will have a PC projection that matches samples taken directly from the source population. If there is mixed ancestry, the effect is simply to average the PC projections.
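The averaging described above follows from the projection being a linear function of the allelic values. The sketch below illustrates this with made-up numbers: the loadings, SNP means, population allele frequencies and ancestry proportions are all hypothetical, and the admixed genome is represented only by its expected allelic value at each SNP.

```python
# Hypothetical loadings (one weight per SNP) and SNP means from a reference PCA.
loadings = [0.5, -0.3, 0.8, 0.1]
snp_means = [0.4, 0.5, 0.3, 0.6]

# Hypothetical expected allelic values (allele frequencies) in three sources.
freqs = {
    "A": [0.9, 0.2, 0.7, 0.6],
    "B": [0.1, 0.8, 0.1, 0.5],
    "C": [0.3, 0.5, 0.2, 0.8],
}

# Hypothetical ancestry proportions of one admixed genome (they sum to one).
ancestry = {"A": 0.5, "B": 0.3, "C": 0.2}

def project(values):
    """Position on the PC: weighted sum of mean-centred allelic values."""
    return sum(w * (z - m) for w, z, m in zip(loadings, values, snp_means))

# Expected allelic value of the admixed genome at each SNP.
admixed = [
    sum(ancestry[p] * freqs[p][s] for p in freqs)
    for s in range(len(loadings))
]

direct = project(admixed)
weighted = sum(ancestry[p] * project(freqs[p]) for p in freqs)

print("projection of the admixed genome:", direct)
print("ancestry-weighted sum of source projections:", weighted)
```

The two printed values agree up to floating-point rounding because the ancestry proportions sum to one, so subtracting the SNP means commutes with the weighted average. For a single real genome the agreement would hold only on average across SNPs.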

Finally, it is important to note that projection of non-admixed individuals can also lead to their location being intermediate between the two original populations. For example, samples from a third population that either diverged from population A since the split with population B or that come from a population that diverged before the A/B split will (in both cases) be projected between the locations of samples from populations A and B. It may, however, be possible to distinguish between such cases by carrying out PCA on all data combined.

As has already been shown through simulation

Consider a sample of individuals who are the result of an historical admixture event between two populations A and B. In order to define the matrix

Given these considerations there are two situations under which none of the structure between the two source populations is expected to be reflected in the matrix

Initially an admixed population is formed by random mating from two populations, each fixed for a different allele at each locus, with a 40% contribution from one population. In the simulated population there are 1000 individuals, each of which has 20 chromosomes with 50 markers each, a genetic map length of 1 per chromosome and a uniform recombination rate. Subsequent generations are formed by random mating of the ancestral population. (A) Projections of 100 randomly chosen samples on the first PC over time show a decay in the fraction of variance explained by the first PC (note that the total variance in the population decays little over the time-scale of the simulation). (B) Admixture proportions for the same individuals as in part A (blue points) as well as the average heterozygosity (red line) and the fraction of the variance in PC1 explained by admixture proportions (black line). While there is a strong association between admixture proportion and location on PC1 for the first few generations, after 15 generations recombination has eliminated any signal, even though there is still strong admixture LD between nearby markers (data not shown).

The primary result of this paper is that the locations of samples on the principal components identified from genome-wide data on genetic variation can be predicted from an understanding of the average coalescent time for pairs of samples. This gives a direct route to understanding the influence various demographic scenarios can have on the relationships between samples identified from PCA and how PCA can be used to make inference about processes of interest such as admixture. However, the results also demonstrate the way in which sampling schemes can influence PC projections and how similar projections can arise from very different demographic scenarios. Consequently, using these results to motivate inference from PCA about underlying demographic process may prove difficult.
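The dependence of the projection on coalescence times and on sample composition can be illustrated numerically. The sketch below is not the author's code. It assumes that, up to a constant factor, the expected covariance between samples i and j takes the double-centred form (mean time for i) + (mean time for j) - (overall mean time) - (time for the pair i, j), and it applies this to a hypothetical two-population model with arbitrary within- and between-population coalescence times. The function name `coalescent_pc1` and all parameter values are illustrative, and the leading eigenvector is found by plain power iteration.

```python
import math

def coalescent_pc1(n_a, n_b, t_within, t_between, iters=500):
    """Leading eigenvector of the double-centred coalescence-time matrix
    for n_a samples from population A followed by n_b from population B."""
    n = n_a + n_b
    pop = [0] * n_a + [1] * n_b

    # Expected pairwise coalescence times; zero for a sample with itself.
    T = [[0.0 if i == j else (t_within if pop[i] == pop[j] else t_between)
          for j in range(n)] for i in range(n)]

    row_mean = [sum(r) / n for r in T]
    grand_mean = sum(row_mean) / n

    # Assumed form of the expected sample-by-sample covariance matrix.
    M = [[row_mean[i] + row_mean[j] - grand_mean - T[i][j]
          for j in range(n)] for i in range(n)]

    # Power iteration from an arbitrary non-constant starting vector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Fix the arbitrary sign so that population A is on the positive side.
    if v[0] < 0:
        v = [-x for x in v]
    return v

balanced = coalescent_pc1(5, 5, t_within=1.0, t_between=1.5)
uneven = coalescent_pc1(8, 2, t_within=1.0, t_between=1.5)

print("balanced sampling, PC1:", [round(x, 3) for x in balanced])
print("uneven sampling, PC1:  ", [round(x, 3) for x in uneven])
```

In this toy model the two populations sit symmetrically about the origin under balanced sampling, whereas with eight samples from A and two from B the B samples lie four times further from the origin than the A samples, even though the coalescence times are unchanged.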

There are, however, situations in which PCA can be used to infer demographic parameters directly. For example, in cases of simple two- or three-way admixture, where populations close to the source populations can be identified and sampled from, estimation of admixture proportions can be achieved from projecting samples onto the PCs identified from the source populations. To illustrate this,

(A) For each of the autosomes (chromosome 1 is the lowest) the points indicate the locations of sampled haplotypes (the transmitted and untransmitted haplotypes inferred from trios) on the first principal component (each chromosome is analysed separately; blue = CEU, orange = YRI, green = ASW). Importantly, PCA is carried out only on the haplotypes from CEU and YRI and all samples are subsequently projected onto the first PC identified from this analysis. Lines connect the transmitted (or untransmitted) haplotypes for each individual across chromosomes. Note the uniformity of the locations of samples on the first PC for CEU and YRI. Individual chromosomes within the ASW, however, show a great range of locations on the first PC. (B) The genome-wide admixture proportions (separately for transmitted and untransmitted chromosomes) can be inferred directly from the location of admixed samples on the first PC between the two source populations. Colours are as for (A). The vertical spacing of points is arbitrary.
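Reading an admixture proportion off the first PC amounts to a linear interpolation between the two source populations. The sketch below shows one way this could be done. The function name `admixture_fraction` and the coordinates are hypothetical (they are not values from the HapMap analysis), and the sketch assumes that the mean PC1 position of each source sample is an adequate end point.

```python
def admixture_fraction(x, source_a, source_b):
    """Fraction of ancestry from source A implied by a PC1 position x,
    using the mean PC1 positions of the two source samples as end points."""
    mean_a = sum(source_a) / len(source_a)
    mean_b = sum(source_b) / len(source_b)
    return (x - mean_b) / (mean_a - mean_b)

# Hypothetical PC1 coordinates for the source and admixed samples.
pc1_source_a = [0.98, 1.02, 1.00]
pc1_source_b = [-1.01, -0.99, -1.00]
pc1_admixed = [-0.60, -0.20, 0.00]

fractions = [
    admixture_fraction(x, pc1_source_a, pc1_source_b) for x in pc1_admixed
]

for x, f in zip(pc1_admixed, fractions):
    print(f"PC1 = {x:+.2f}  ->  estimated fraction from source A = {f:.2f}")
```

The estimate is not clipped to the interval from 0 to 1, so a sample projected slightly beyond either source mean gives a value just outside that range, which can serve as a rough indication of noise in the projection.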

One important issue in the application of these ideas to the analysis of empirical data is the extent to which SNP ascertainment will influence the outcome. SNP discovery in a small panel will typically lead to the under-representation of rare SNPs in the genotyped data and, depending on the geographical distribution of the samples used for discovery, can also lead to biases in the representation of variation from different areas. The quantities in Equation 8 are therefore conditional not just on segregation in the genotyped sample, but also on segregation within the SNP discovery panel. Consider the joint genealogy of the genotyped and discovery samples shown in

(A) In the joint genealogy of the ascertainment (black circles) and genotyped samples (grey circles), only mutations occurring on the intersection of the two genealogies (shown in black) will be detected in both samples. For small discovery panels and large experimental samples, this may be considerably less than half the total genealogy length. (B) Model used to simulate data from three populations linked by two vicariance events, each of which is associated with a bottleneck; the model is an approximation to the demographic history of the HapMap populations

Finally, it is worth pointing out that because PCA effectively summarizes structure in the matrix of average pairwise coalescent times, but in a manner that is influenced by sample composition, more direct inferences can potentially be made from the matrix of pairwise differences (which are trivially related to pairwise coalescent times). This is not to say that eigenvalue analysis of the pairwise distance matrix will correct for the effects of biased sampling demonstrated in

Coalescent simulations were carried out using scripts written by the author in the R language (

Many thanks to Niall Cardin, Peter Donnelly, Stephen Leslie, Simon Myers, John Novembre, Nick Patterson, and Molly Przeworski for discussion and comments on the manuscript.