Conceived and designed the experiments: DJL GH SM DF. Analyzed the data: DJL GH SM DF. Wrote the paper: DJL GH SM DF. Implemented CHROMOPAINTER: GH. Implemented fineSTRUCTURE: DJL. Derived propositions: SM.

The authors have declared that no competing interests exist.

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. The software used for this article, ChromoPainter and fineSTRUCTURE, is available from

The first step in almost every genetic analysis is to establish how sample members are related to each other. High relatedness between individuals can arise if they share a small number of recent ancestors (e.g. if they are distant cousins) or a larger number of more distant ones (e.g. if their ancestors come from the same region). The most popular methods for investigating these relationships analyse successive markers independently, simply adding the information they provide. This works well for studies involving hundreds of markers scattered around the genome but is less appropriate now that entire genomes can be sequenced. We describe a “chromosome painting” approach to characterising shared ancestry that takes into account the fact that DNA is transmitted from generation to generation as a linear molecule in chromosomes. We show that the approach increases resolution relative to previous techniques, allowing differences in ancestry profiles among individuals to be resolved at the finest scales yet. We provide mathematical, statistical, and graphical machinery to exploit this new information and to characterize relationships at continental, regional, local, and family scales.

Technologies such as high density genotyping arrays and next generation resequencing have recently facilitated the production of an enormous quantity of data with which to investigate genetic relationships in humans and in other organisms. These data have the potential to provide a new level of insight into patterns of dispersal and mating, and recent and ancient historical events. However there are challenges, in terms of computational burden and statistical modelling, that are yet to be fully addressed. Two of the most popular approaches to investigate population structure using genetic data are exemplified by principal components analysis (PCA)

Model-based methods attempt to more directly reconstruct historical events. In the simplest version of the STRUCTURE approach

The central issue that we address in this work is the fact that both PCA and the most popular STRUCTURE-like approaches analyse single mutations individually, and do not use information about the relative positions of these mutations in the genome. However, the advent of high-density variation data, together with both computational

Here we develop and apply both non-model-based and model-based approaches, analogous to the PCA and STRUCTURE approaches described above, that aim to use much of the information present in haplotype structure. Both approaches are based on analysing the same matrix, which we call the coancestry matrix. Although our main aim is to introduce a framework to exploit LD information where present, our methods can also treat markers independently as a limiting case. We show theoretically and in practice that, in this setting, the coancestry matrix approximately contains all the information used by both PCA and the model-based STRUCTURE-like approaches, unifying these apparently different approaches. Moreover, we show that in some settings our model-based approach can be more sensitive than either STRUCTURE or ADMIXTURE, and is able to reliably infer over 100 populations simultaneously. When dense marker sets are available, our haplotype-based algorithm performs substantially and uniformly better than all methods treating markers independently. We illustrate our approach using the Human Genome Diversity Panel (HGDP) dataset, comprising over 600,000 markers typed on 938 individuals. Worldwide, we show that the use of haplotype information improves separation of groups and reveals differences in genetic ancestry, even among individuals coming from the same labelled population, that are not detectable by the equivalent non-LD-based approaches.

Our approach attempts to capture the most relevant genealogical information about ancestry in compact form. We construct and motivate the approach using an example (

We show the process by which a haplotype (haplotype 1, black) is painted using the others. A) True underlying genealogies for eight simulated sequences at three locations along a genomic segment, produced using the program ‘ms’

Because the set of genealogies consistent with a given dataset is complex to describe, and typically huge, approximate methods are required in order to make inference computationally practical

One important special case is when markers are widely enough spaced as to be effectively unlinked, i.e. the recombination rate between any pair of markers is infinite. It is straightforward to produce our coancestry matrix in this setting by setting the recombination rate

We developed and implemented an approach to perform principal components analysis (PCA), by eigenanalysis of a normalised version of our coancestry matrix (

As stated above, our coancestry matrix

To formalise this idea, we consider

Our model is now defined by our earlier stated assumption that donated chunks within an individual are independent, and no additional information is carried in their size (which for example determines the number of chunks in the genome). For individuals

In our Bayesian approach, we must model the number and distribution of the underlying populations via a prior for

We have implemented our approach as a software package we refer to as fineSTRUCTURE. Because we have chosen the prior of

The statistical model that we have derived has a likelihood depending on the terms

Different chunks will not in practice be fully independent of each other, tending to decrease the ‘effective’ number of chunks and therefore increase

Conversely, by averaging over chunk assignment uncertainty in the painting step we smooth the chunk count distribution for each individual, decreasing

Although we have not been able to derive such a formula for linked data, we can estimate

Our estimation of

One helpful property of this approach is that by attempting to correct for the true underlying variance of the

Since the fineSTRUCTURE algorithm can identify fine subdivisions, it is often important in practice to have some indication of historical relationships amongst the inferred populations. We have found that performing inference under the full model using successively reducing values of

We introduce a new approach, described in detail in Models and Methods and

Given the linked or unlinked coancestry matrix, we have described how this can be used to learn about population structure: firstly, by performing PCA, and secondly, by using a model-based analysis to identify clusters of individuals with similar historical ancestry, corresponding to genetically related populations. In this section, we extensively evaluate properties of our approach in theory and using simulated data, and perform a new analysis of the HGDP dataset. We also explain how in conjunction with the clustering algorithm, analysis of the coancestry matrix reveals both differences, and details of historical interactions, among human populations in unprecedented detail.
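As a rough illustration of the first of these uses, the sketch below runs an eigenanalysis on a small coancestry matrix of chunk counts. It is not the normalisation used by the authors, which is given in Models and Methods. Three choices here are assumptions made for illustration only: the matrix is symmetrised by averaging donor and recipient counts, the empty self-copying diagonal is filled with each row's off-diagonal mean, and the result is double-centred. The function name `coancestry_pca` is hypothetical.

```python
import numpy as np

def coancestry_pca(chunk_counts, n_components=2):
    """Eigenanalysis of a symmetrised, centred coancestry matrix (illustrative only).

    chunk_counts[i, j] is the number of chunks recipient i copies from donor j.
    """
    x = np.asarray(chunk_counts, dtype=float)
    n = x.shape[0]

    # Assumption: symmetrise by averaging donor and recipient counts.
    sym = 0.5 * (x + x.T)

    # Assumption: individuals do not copy from themselves, so fill the
    # diagonal with each row's off-diagonal mean to avoid an artificial dip.
    off_diag_mean = (sym.sum(axis=1) - np.diag(sym)) / (n - 1)
    sym[np.diag_indices(n)] = off_diag_mean

    # Assumption: double-centre so that overall copying levels are removed.
    centred = (sym
               - sym.mean(axis=0, keepdims=True)
               - sym.mean(axis=1, keepdims=True)
               + sym.mean())

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(centred)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]
```

With two groups of individuals that copy more from their own group than from the other, the leading eigenvector returned by this sketch takes opposite signs in the two groups.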

To understand the properties and performance of our approach in the simplest possible setting, we begin by analysing the case where markers are treated as unlinked, i.e. our unlinked coancestry matrix. In this setting, markers may be truly unlinked, or there may be LD information being ignored. We began by analysing datasets simulated under a setting where there was no underlying population structure, both with and without tight linkage between markers (

To compare the PCA approaches in practice, we constructed a simulated dataset designed to represent realistic levels of subtle population structure. We simulated data for 100 individuals according to a model containing 5 populations related in a tree-like manner with three major historical splits forming populations A, B and C, two of which subsequently split (

A) Effective population size and B) population splits used for creating the simulated data. C) Coancestry heatmaps for linked and unlinked model with

We next turn to our fineSTRUCTURE model-based analysis, again considering the unlinked coancestry matrix even though strong and variable LD exists in the dataset. We first compared performance of our unlinked model to the popular ADMIXTURE

A) Pairwise coincidence matrix output by fineSTRUCTURE using chunk counts calculated using (top right) the linked and (bottom left) unlinked model, for the datasets from

We sought to understand mathematically why our approach, based on only a summary of the original variation data – the unlinked coancestry matrix – equivalent to the matrix used for Eigenstrat's version of PCA, appears to perform so well relative to the earlier approaches, which carefully model each individual SNP marker (

What explains the different behaviour of the model-based approaches? We believe it is differences in the prior models used. Both STRUCTURE and ADMIXTURE assume that all underlying populations undergo separate genetic drift from some original founder group. This prior model therefore penalises shared drift at every individual marker, and does so increasingly strongly as the number of loci increases. Our simulation framework (realistically, we believe) incorporates drift separate to each group, but also shared drift common to clusters of populations (caused, for example, by being closer geographical neighbours). By using a more flexible prior model of structure, fineSTRUCTURE is able to separate populations C1 from C2, and B1 from B2, which the existing model-based approaches have difficulty separating even with sufficient data. By not assuming any particular form for the population-level coancestry matrix

To examine improvements offered by utilising LD information, we used our linked coancestry matrix as the basis of new PCA and model-based analyses. The genetic maps used to simulate the sequence data were also used for inference in the linked model, though we note (not shown) that the conclusions still hold without this requirement. We estimated

In the model-based setting, linked fineSTRUCTURE strongly outperforms the unlinked version (

We analysed the pattern of population structure in the Human Genome Diversity Project (HGDP) dataset

A) Relationship between populations for the whole world data. Each tip corresponds to a population; labels include the number of individuals and are coloured red if all individuals within that label are found in a single clade. See text for an interpretation of the values on the edges; the cut defines the ‘sub-continents’ discussed in the text. B) Transposed coancestry matrix for the Hazara and Burusho (in full:

In general, many groups are not related through simple hierarchical ‘tree-like’ drift, but also through complex admixture events. These relationships are captured directly in our representation by the coancestry matrix. Although this is high-dimensional even after clustering individuals into groups, and in future we think it is important to incorporate admixture in our modelling framework, we nevertheless believe the very complex structure of the data itself means visual examination of the coancestry matrix provides important insights using linkage information. Previous analysis of the worldwide HGDP using ADMIXTURE, and to an extent PCA, has identified signals of admixture

Although fineSTRUCTURE performs well on the global dataset, for easier visualisation of results, we developed an approach analysing structure in only sub-regions of the data, but based on the same (worldwide) coancestry matrix as before. In practice, we found this had the second advantage of a small increase in resolution, while retaining the ability to identify many long-range population relationships. This increase in power is related to our prior model – we assume ancestry proportions are independent across groups, while in fact worldwide historical relationships among populations result in correlations in these vectors. Although the prior is overwhelmed by the data for clear splits (unlike that used by other approaches), our algorithm nevertheless can merge very similar groups. Within a subregion of the world, however, differences in ancestry proportions are much closer to independent, potentially improving precision.
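The data handling behind such a regional analysis can be sketched as below. This shows only how rows for one region might be taken from the worldwide coancestry matrix while keeping the chunks received from elsewhere. It is not the fineSTRUCTURE implementation, and the function name `regional_view` and the choice to sum outside donors by region are assumptions made for illustration.

```python
import numpy as np

def regional_view(chunk_counts, region_labels, region):
    """Split one region's rows into within-region and outside-region chunk counts.

    chunk_counts[i, j]: chunks recipient i copies from donor j (worldwide matrix).
    region_labels[i]:   region name of individual i.
    """
    x = np.asarray(chunk_counts, dtype=float)
    labels = np.asarray(region_labels)
    inside = labels == region
    other_regions = [r for r in np.unique(labels) if r != region]

    # Individual-by-individual counts among members of the chosen region.
    within = x[np.ix_(inside, inside)]

    # Assumption: donors outside the region are pooled, one column per region.
    outside = np.zeros((int(inside.sum()), len(other_regions)))
    for k, r in enumerate(other_regions):
        outside[:, k] = x[np.ix_(inside, labels == r)].sum(axis=1)

    return within, outside, other_regions
```

Because both blocks come from the same worldwide painting, each recipient's total chunk count is unchanged; only the resolution at which outside donors are recorded is reduced.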

For a regional analysis, we chose to split the dataset into eight regions, approximately corresponding to ‘sub-continents’, based only on the results of the merging algorithm used to produce the population tree (

We focus on the European results as an example (

A) (bottom left) population averages, (top right) the raw data matrix, and (left) chunks from other sub-continents. To symmetrise the matrices we show the average of the donor/recipient chunk counts; read the row

We applied ADMIXTURE to the same HGDP European data as analysed by fineSTRUCTURE (

In addition to using fineSTRUCTURE, we also used our linked (and unlinked) PCA approaches to analyse the data for Europe and other continents (

The first 2 PCA components of the East Asian ‘continent’ as defined in

The strongest advantage of the linked model is in separating subtly different groups, and we see many cases in our data where labelled groups are split into smaller populations by fineSTRUCTURE. Although these show features consistent with their representing genuine ancestral differences, we do not have additional information, for example on geography, to confirm these populations. We therefore devised a scheme to overcome our incomplete information, using the fact that, although completely unlinked, two approximately equally sized halves ‘A’ and ‘B’ of an individual's genome automatically share all sampling details, and thus have the same underlying ancestry. Examining similarity in ancestral profiles for the two halves thus provides an indication of whether ancestry differences observed (from half the genome) are genuine, at the finest possible scale. Specifically, we analysed half of the individuals at a time (splitting the dataset approximately evenly for each label), painting their chromosomes using an identical donor set consisting of the other half of the sample, so chunk counts for individual ‘A’ or ‘B’ halves are comparable across individuals. For each individual ‘A’ half, we paired it with the most correlated individual ‘B’ half, and recorded the fraction of times this ‘B’ half came from the same individual (

For each continent, we show the proportion of times in which two sets of chromosomes of a particular individual are matched correctly based on similarity of their coancestry profile. Coancestry profiles are calculated using a training set as described in the text. Results for coancestry matrices are calculated using correlation between individuals based on the linked and unlinked models. Also shown are the expected success in clustering if individuals within the same label or same inferred (linked results) fineSTRUCTURE population each had the same ancestry profile.
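The matching step of this evaluation can be sketched as follows. The sketch assumes the painting has already been done, so that each individual has two chunk-count profiles (one per genome half) against the same donor set. Pearson correlation is used as the similarity measure. The function name `split_half_match_rate` and the array layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def split_half_match_rate(profiles_a, profiles_b):
    """Fraction of 'A' halves whose most correlated 'B' half is their own.

    profiles_a, profiles_b: arrays of shape (n_individuals, n_donors); row i
    holds the chunk counts that one genome half of individual i receives
    from a shared donor set.
    """
    a = np.asarray(profiles_a, dtype=float)
    b = np.asarray(profiles_b, dtype=float)

    # Centre and scale each row so that a dot product is a Pearson correlation.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    b_n = b_c / np.linalg.norm(b_c, axis=1, keepdims=True)
    corr = a_n @ b_n.T  # corr[i, j]: 'A' half of i versus 'B' half of j

    # For each 'A' half, pick the most correlated 'B' half.
    best = corr.argmax(axis=1)
    return float(np.mean(best == np.arange(a.shape[0])))
```

A rate close to 1 indicates that ancestry profiles are specific to individuals, whereas a rate near the reciprocal of the group size is what would be expected if all individuals in a group shared the same profile.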

Partial or complete barriers to mating create groups with distinct genetic ancestry, or, in the present terminology, populations. In our approach, we assume that chromosomes within a particular population have characteristic probabilities of sharing stretches of similar DNA from individuals in their own and in other populations, and view these probabilities as defining population composition and relationships. To infer groups, we first reduce data dimensionality by estimating the relationships among all pairs of individuals using a “coancestry matrix”, which is central to our method and based on ‘painting’ the chromosomes of each individual

We have also shown that the linked approach substantively improves performance, where LD information is present among tightly packed markers, achieving a resolution in the HGDP that is to our knowledge unprecedented. Intuitively, we believe that the underlying reason is that using haplotype sharing identifies relationships among individuals in the recent past much more strongly than individual ancient SNP sharing, enabling more subtle, recent population structure to be captured

In practical implementation, our approach uses two initial, parallelizable analyses: a phasing step, common in modern population genetic analyses, and a subsequent chromosome painting step. Both are run once on a given dataset, and are feasible for datasets with millions of markers using computer clusters. Subsequent steps using the resulting coancestry matrix have computational time depending only on the number of individuals, which, with our efficient algorithmic implementation, enables us to analyse far larger numbers of populations (hundreds in the HGDP) than other approaches that reanalyse each mutation at each iteration. We observed a substantial performance improvement for the linked model, when applied to the HGDP data phased jointly using fastPHASE

In the model-based approaches discussed here, we have described how the coancestry matrix captures key relationships among groups. However, future approaches may aid interpretation of results, and power, by explicitly modelling the processes of drift, and subsequent admixture, among identified populations and their effect on this matrix. The theory developed here for the unlinked case suggests a close connection between population level genetic drift and the coancestry matrix. Although this (like average pairwise coalescent times

Overall, our results demonstrate that we have not yet reached the limits of the information available from genetic data, particularly in the precision with which ancestry sources can be determined. As full sequence data and larger sample sizes become increasingly available, we anticipate resolution will improve further beyond the level of countries, to regions within countries in many cases, and this will be of value in a range of settings. The methods described here can produce highly accurate clustering and sensible choices of the number of populations in humans and other species, and can be applied to full genome sequences for thousands of individuals.

The algorithms described in this article have been implemented in computer software packages ChromoPainter and fineSTRUCTURE, which are available at

Correlation with truth for Unlinked data. 15000 non-rare (

(TIFF)

Correlation with the truth for linked data. A varying number of individuals with varying chunk scaling

(TIFF)

Correlations within the coancestry matrix for unlinked data. Left: the raw coancestry matrix for the same scenario as simulated in the main text but with 15000 unlinked SNPs. Centre: the renormalized coancestry matrix based on the true population distribution. Right: The difference, highlighting the correlated nature of the error terms for the coancestry matrix (there are differences for the merged B1 and B2 populations only). Top: These matrices based on the ‘true’ population structure given by the labels. Bottom: These matrices based on merging the most recent split, setting

(TIFF)

Correlation with truth for Unlinked data with strong population structure. This is a demonstration of how our model breaks down in the presence of strong population structure and

(TIFF)

Correlation with truth for fineSTRUCTURE and STRUCTURE. (black) is fineSTRUCTURE and (red) is STRUCTURE, considered as a function of the number of unlinked SNPs. Data are simulated as described, with all SNPs having minor frequency

(TIFF)

ADMIXTURE results for simulated data at 25 linked regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 3.

(TIFF)

ADMIXTURE results for simulated data at 50 linked regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 3.

(TIFF)

ADMIXTURE results for simulated data at 75 linked regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 3.

(TIFF)

ADMIXTURE results for simulated data at 100 regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 4.

(TIFF)

ADMIXTURE results for simulated data at 150 regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 4.

(TIFF)

ADMIXTURE results for simulated data at 200 regions. Top: cross-validation error (lower is better). True populations are separated by a black line. The maximum correlation with truth is obtained at K = 4.

(TIFF)

ADMIXTURE results for the HGDP Europe dataset. A range of K is considered as described in the text. Dashed lines separate fineSTRUCTURE populations, solid lines separate labelled populations. fineSTRUCTURE agrees with all labelled populations with the exception of the Tuscan/French.

(TIFF)

ADMIXTURE cross validation error as a function of

(TIFF)

Whole world HGDP coancestry matrix. Some population labels are omitted for clarity; this has only been done when the neighbouring population contains the same labels and the exact distribution is recoverable from the tree and

(TIFF)

“Sub-continental” tree for all HGDP populations. Inference was performed in separate subcontinents groupings as defined in

(TIFF)

“Sub-continental” coancestry matrix. Groupings as defined in

(TIFF)

“Sub-continent” of Africa coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of CentralSouthAsia coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of Druze coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of EastAsia coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of Europe coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of MiddleEast coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of NorthEastAsia coancestry matrix. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

“Sub-continent” of “Other” populations. ‘Other’ is defined as America, Oceania and some Asian individuals. (bottom left) the Population coancestry matrix and (top right) the Individual coancestry matrix.

(TIFF)

Whole HGDP pairwise coincidence matrix. (bottom left) run 1 and (top right) an independent run 2. It is recommended to view this figure online and use zoom tools.

(TIFF)

Africa pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

CentralSouthAsia pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

Druze pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

EastAsia pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

Europe pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

MiddleEast pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

NorthEastAsia pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

“Other” populations pairwise coincidence matrix. (bottom left) run 1 and (top right) independent run 2.

(TIFF)

PCA for the continent of Africa. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of America. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of CentralSouthAsia. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of EastAsia. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of Europe. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of MiddleEast. The first two components are shown; further structure will be present in the higher components.

(TIFF)

PCA for the continent of Oceania. The first two components are shown; further structure will be present in the higher components.

(TIFF)

Population labels assigned to “continents” for PCA.

(PDF)

Mathematical description of the Painting algorithm.

(PDF)

Derivation of the fineSTRUCTURE Partition Posterior probability.

(PDF)

Mathematical details of the fineSTRUCTURE MCMC moves and acceptance probabilities.

(PDF)

Theory linking PCA, STRUCTURE and fineSTRUCTURE. This includes Propositions 1–4 and a brief summary of what they imply.

(DOCX)

Simulation procedure for linked data using SFS_CODE.

(PDF)

Empirical evaluation procedure for the scaling parameter

(PDF)

Empirical comparison of fineSTRUCTURE to STRUCTURE.

(PDF)

Details of the ADMIXTURE linked simulation evaluation procedure.

(PDF)

Details of the ADMIXTURE HGDP analysis.

(PDF)

Results for HGDP data. These comments interpret

(PDF)

We thank David Alexander for help and discussion and Graham Coop, Jonathan Pritchard, Peter Ralph, Chris Spencer, and David Reich for reading the manuscript. We also thank two anonymous reviewers for comments on the paper and Ryan Hernandez for help with the SFS_CODE simulation parameters.