I have read the journal’s policy and the authors of this manuscript have the following competing interests: CDB and AGI are co-founders of Galatea Bio Inc.

The estimation of genetic clusters from genomic data has applications ranging from genome-wide association studies (GWAS) and demographic history to polygenic risk scores (PRS), and it is expected to play an important role in the analysis of increasingly diverse, large-scale cohorts. However, existing methods are computationally intensive, prohibitively so for nationwide biobanks. Here we explore Archetypal Analysis as an efficient, unsupervised approach for identifying genetic clusters and for associating individuals with them. Such unsupervised approaches help avoid conflating socially constructed ethnic labels with genetic clusters by eliminating the need for exogenous training labels. We show that Archetypal Analysis yields a cluster structure similar to that of existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. More importantly, we show that because Archetypal Analysis can be used with lower-dimensional representations of genetic data, significant reductions in computational time and memory requirements are possible. When run in this fashion, Archetypal Analysis takes several orders of magnitude less compute time than the current standard, ADMIXTURE. Finally, we demonstrate uses ranging across datasets from humans to canids.

This work introduces a method that combines the singular value decomposition (SVD) with Archetypal Analysis to perform fast and accurate genetic clustering by first reducing the dimensionality of the space of genomic sequences. Each sequence is described as a convex combination (admixture) of archetypes (cluster representatives) in the reduced-dimensional space. We compare this interpretable approach to the widely used genetic clustering algorithm ADMIXTURE and show that Archetypal Analysis offers shorter run times and representational advantages without significant degradation in performance. We include theoretical, qualitative, and quantitative comparisons of the two methods.

This is a

Estimating ancestry cluster allele frequencies and cluster membership from single nucleotide polymorphism (SNP) data is important for many applications in population genetics and applying such methods to characterize diverse human cohorts has become an essential part of large-scale genomic studies. With the growing number of samples in whole genome databases, efficient population clustering techniques that can handle such sample sizes have become increasingly important. Existing techniques for the clustering of genomes include STRUCTURE [

Dimensionality reduction techniques such as multidimensional scaling (MDS), principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) have been used to overcome the high dimensionality of genomic data [

The complete proposed pipeline is presented in

The allele counts from both haplotypes of each of ^{T}). Archetypal analysis models the individual genotypes as originating from the admixture of

If we observe _{i} indicates the average number of alternate alleles found for each _{i} the vectors for all individuals, we obtain an _{1},…,_{N}]. We center the columns of _{c} of centered genotype vectors and then compute the SVD:

This yields

Because the subspace spanned by the centered genotype vectors can have no more than
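The centering and SVD step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes NumPy, a genotype matrix with one row per individual and one column per SNP (the orientation is an assumption), and a hypothetical function name `svd_reduce`. Because centering removes one degree of freedom, N individuals yield at most N − 1 non-zero singular values, so the reduced coordinates have at most N − 1 columns regardless of the number of SNPs.

```python
import numpy as np

def svd_reduce(G, rel_tol=1e-10):
    """Project centered genotypes onto their non-trivial singular directions.

    G : (N, M) array, one row per individual, one column per SNP (assumed layout).
    Returns (Z, Vt): Z is (N, r) reduced coordinates, Vt is (r, M) directions,
    with Z @ Vt equal to the centered genotype matrix.
    """
    G = np.asarray(G, dtype=float)
    Gc = G - G.mean(axis=0, keepdims=True)        # subtract the mean individual
    U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
    r = int(np.sum(s > rel_tol * s[0]))           # numerical rank, at most N - 1
    return U[:, :r] * s[:r], Vt[:r]

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(6, 50))              # toy 0/1/2 genotypes
Z, Vt = svd_reduce(G)
print(Z.shape)                                    # second dimension is at most 5
```

Clustering can then operate on `Z`, whose width depends on the number of individuals rather than on the number of SNPs.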

This non-negative matrix factorization method was first developed by Cutler and Breiman in 1994 [

The samples are approximated as convex combinations of the archetypes such that the residual sum of squares (RSS) between the approximation and original data is minimized:
_{ij} ≥ 0 for

The archetypes are convex combinations of the samples:
_{ij} indicating the weight of sample _{ij} ≥ 0.

By combining Eqs
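In matrix form, the combined objective is the squared Frobenius norm of X − A(BX), and it can be evaluated as in the sketch below. The symbol names are assumptions made for illustration (`X` for the samples, `A` for the sample-to-archetype weights, `B` for the archetype-to-sample weights); they follow the usual presentation of Archetypal Analysis, not necessarily this paper's notation.

```python
import numpy as np

def archetypes(B, X):
    """Archetypes as convex combinations of samples: Z = B @ X, shape (K, d)."""
    return B @ X

def rss(X, A, B):
    """Residual sum of squares ||X - A @ (B @ X)||_F^2.

    X : (N, d) samples
    A : (N, K) weights, each row non-negative and summing to 1
    B : (K, N) weights, each row non-negative and summing to 1
    """
    return float(np.sum((X - A @ archetypes(B, X)) ** 2))

# Toy example: three corner samples plus their centroid, K = 3 archetypes.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1 / 3, 1 / 3]])
B = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)  # archetypes = corners
A = np.vstack([np.eye(3), np.full((1, 3), 1 / 3)])                     # centroid = equal mix
print(rss(X, A, B))   # essentially zero: every sample is reproduced

A_bad = np.full((4, 3), 1 / 3)                                          # everyone = equal mix
print(rss(X, A_bad, B))   # strictly positive
```

Fitting the model amounts to searching for the `A` and `B` that make this quantity as small as possible under the simplex constraints.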

The optimization problem presented in Eqs _{ij} ≥ 0 and _{ij} ≥ 0, where an extra dimension is added to enforce

Unlike ADMIXTURE, Archetypal Analysis permits the use of rotated and projected (dimensionally reduced) representations of SNP data. If all singular vectors are used, the residual sum of squares of the decomposition (

This is because the projection matrix

Thus, as discussed earlier, using the singular value decomposition permits us to perform AA clustering on a matrix having dimensions of only
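The invariance described here can be checked numerically. The sketch below is an illustration under assumptions (NumPy, individuals as rows, randomly drawn simplex weights `A` and `B`, all names hypothetical). It evaluates the RSS once on the raw genotype matrix and once on the SVD coordinates. The two agree because the right singular vectors are orthonormal and because weights that sum to one make the objective insensitive to centering.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 8, 200, 3
G = rng.integers(0, 3, size=(N, M)).astype(float)   # toy genotypes, rows = individuals

Gc = G - G.mean(axis=0)                              # centered genotypes
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
Z = U * s                                            # (N, N) coordinates, all singular vectors kept

A = rng.dirichlet(np.ones(K), size=N)                # (N, K), rows sum to 1
B = rng.dirichlet(np.ones(N), size=K)                # (K, N), rows sum to 1

def rss(X):
    """RSS of the archetypal approximation A @ (B @ X) for fixed A and B."""
    return float(np.sum((X - A @ (B @ X)) ** 2))

rss_full, rss_reduced = rss(G), rss(Z)
print(np.isclose(rss_full, rss_reduced))             # True
```

In this toy case the optimization would see an 8 × 8 matrix instead of an 8 × 200 one while measuring exactly the same error.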

Non-negative least squares (NNLS) is a constrained least squares problem in which coefficients are always non-negative (

Given an

We make use of the implementation in [
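To make the role of NNLS concrete, the sketch below solves for the weights of a single sample given fixed archetypes. It is not the implementation cited above: it uses `scipy.optimize.nnls` as a stand-in solver, the function name `convex_weights` is hypothetical, and the penalty value `M = 200` is an assumed illustrative choice. The sum-to-one constraint is imposed softly by appending one extra row to the system, scaled by `M`, so that any deviation of the weight sum from 1 is heavily penalized.

```python
import numpy as np
from scipy.optimize import nnls

def convex_weights(x, Z, M=200.0):
    """Approximate x as a convex combination of the rows of Z.

    x : (d,) sample; Z : (K, d) archetypes; M : penalty weight (assumed value).
    Solves min ||[Z.T; M * 1] a - [x; M]||_2 subject to a >= 0.
    """
    K = Z.shape[0]
    A_aug = np.vstack([Z.T, M * np.ones((1, K))])   # extra row enforces sum(a) ~ 1
    b_aug = np.concatenate([x, [M]])
    a, _residual_norm = nnls(A_aug, b_aug)
    return a

Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # three archetypes in 2-D
x = 0.3 * Z[0] + 0.7 * Z[1]                          # a known mixture
a = convex_weights(x, Z)
print(np.round(a, 6))                                # close to [0.3, 0.7, 0.0]
```

Larger values of `M` enforce the constraint more tightly at the price of a less well-conditioned system.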

Archetypal analysis was run with the following parameters (with code adapted from [

Tolerance: defines when to stop optimization when alternating between finding the best _{c} and the previous iteration _{p}, and

Maximum number of iterations for the residual sum of squares (

Constraint coefficient

Initialization method: we use FurthestSum [
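The initialization idea can be sketched as a greedy search for mutually distant samples. The code below is a simplified illustration of a FurthestSum-style selection, not the cited implementation; the function name, the `seed_index` argument, and the dense distance matrix are assumptions that suit small examples only.

```python
import numpy as np

def furthest_sum(X, K, seed_index=0):
    """Greedy FurthestSum-style selection of K candidate archetype indices.

    X : (N, d) samples. The seed only bootstraps the search: it is dropped
    after K - 1 further picks and replaced by one more greedy pick.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    chosen = [seed_index]
    sum_dist = D[seed_index].copy()          # summed distance to the chosen set

    def pick():
        masked = sum_dist.copy()
        masked[chosen] = -np.inf             # never pick an index twice
        return int(np.argmax(masked))

    for _ in range(K - 1):
        j = pick()
        chosen.append(j)
        sum_dist += D[j]

    sum_dist -= D[seed_index]                # forget the arbitrary seed
    chosen.pop(0)
    chosen.append(pick())
    return chosen

X = np.array([[0.0], [1.0], [2.0], [10.0]])
print(sorted(furthest_sum(X, K=2, seed_index=1)))   # [0, 3]
```

Starting from extreme samples such as these places the initial archetypes near the boundary of the data, which is where the fitted archetypes are expected to lie.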

Whole genomes from the Human Genome Diversity Project [

The heterogeneous data set of dog breeds from [

We first compute the principal components of the human data set and display the first two components in a plot coloured by continental population (

To compare the ancestry estimates derived from ADMIXTURE and Archetypal Analysis, we display the proportional ancestry cluster assignments, the

Overall, Archetypal Analysis provides estimates that qualitatively match ethnolinguistic and geographical labels. Additionally, AA properly captures the wide variation within African populations, assigning more than one cluster to this diverse continent; however, because the number of clusters is fixed, this comes at the cost of lacking a further unique cluster for Oceanians. Because its constraints are stronger than those of ADMIXTURE, AA also obtains cluster centroids that could represent real individuals, lying either on or within the convex hull of the observations. In contrast, ADMIXTURE cluster centers can represent population frequencies that have never existed in the past and that cannot be realized in the present by any combination (admixture) of populations (see

We compute the principal components of the dog breed data set and display the first two components in a plot coloured by dog clades (

The dog breed dataset was used to benchmark the computation times and clustering quality of both ADMIXTURE and Archetypal Analysis. Running times and explained variances (defined as

Algorithm \ K (number of clusters / archetypes) | [2–6) | [6–10) | [10–14) | [14–18) | [18–22) | [22–26) | [26–30) |
---|---|---|---|---|---|---|---|
ADMIXTURE | 43 | 64 | 97 | 150 | 247 | 250 | 319 |
AA | 0.5 | 0.48 | 0.7 | 1 | 1.4 | 1.9 | 2.4 |
Speed-up (ADMIXTURE / AA) | 86× | 133× | 139× | 150× | 176× | 132× | 132× |
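The speed-up factors in the last row are the ratios of the two running-time rows. The short check below recomputes them from the tabulated values. It is only an arithmetic illustration, and small differences from the reported factors are expected because the tabulated times are themselves rounded.

```python
# Running times per K range, copied from the table (same unit in both rows).
admixture = [43, 64, 97, 150, 247, 250, 319]
aa = [0.5, 0.48, 0.7, 1, 1.4, 1.9, 2.4]
reported = [86, 133, 139, 150, 176, 132, 132]

speedups = [a / b for a, b in zip(admixture, aa)]
for ratio, rep in zip(speedups, reported):
    print(f"computed {ratio:6.1f}x   reported {rep}x")
```

Every recomputed ratio falls within one unit of the reported factor, and all of them exceed 80×.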

Explained variances increased linearly in the number of clusters for both algorithms (

Archetypal Analysis was able to capture the high genetic variability in African populations by identifying three ancestral clusters in this large and diverse super-population, compared to only two clusters assigned by ADMIXTURE. This had an effect on the clustering of the proportionally over-represented European populations, mostly collected under a single ancestral cluster in Archetypal Analysis but given two clusters in ADMIXTURE for the same

Archetypal Analysis proved to be an interpretable alternative to ADMIXTURE. It assigned separate regional archetypes that associated predominantly with Europeans, with South Asians, and with East Asians, and it recognized the high genetic variability of African populations. Differences within regions were also detectable (

The red archetype, which is modal in Europeans, is also seen in North African peoples (Saharawi, population 10, and Mozabites, population 9) at a smaller fraction due to geographic proximity and migration. As also observed in previous studies, American populations, such as Puerto Ricans in Puerto Rico, population 23, and Colombians in Medellín (Colombia), population 24, showed a European associated cluster due to Spanish colonization. The effects of this historical event can also be observed in (

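The popular algorithm ADMIXTURE estimates individual ancestries by computing maximum likelihood estimates in a parametric model. Specifically, it maximizes the biconcave log-likelihood of the model using block relaxation:

$$\mathcal{L}(Q, F) = \sum_{i}\sum_{j}\left\{ g_{ij}\ln\left[\sum_{k} q_{ik} f_{kj}\right] + (2 - g_{ij})\ln\left[1 - \sum_{k} q_{ik} f_{kj}\right] \right\}$$

where, in the standard ADMIXTURE notation, $g_{ij}$ is the observed number of copies of allele 1 at marker $j$ for individual $i$. The genotype $g_{ij} \sim \mathrm{Bin}(2, p_{ij})$ depends on the fraction $q_{ik}$ of the ancestry of individual $i$ attributable to population $k$ and on the frequency $f_{kj}$ of the allele 1 in population $k$, with $p_{ij} = \sum_k q_{ik} f_{kj}$; $q_{ik}$ and $f_{kj}$ are the entries of the matrices $Q$ and $F$.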
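The binomial log-likelihood that ADMIXTURE maximizes can be evaluated directly for given parameters, as in the sketch below. It uses the standard ADMIXTURE symbols as an assumption (`G` for allele counts in {0, 1, 2}, `Q` for ancestry fractions, `F` for cluster allele frequencies), omits the constant binomial coefficient, and does not reproduce the block relaxation optimizer; the function name and the clipping constant `eps` are hypothetical.

```python
import numpy as np

def admixture_loglik(G, Q, F, eps=1e-12):
    """Log-likelihood of G under g_ij ~ Bin(2, p_ij) with P = Q @ F.

    G : (N, M) allele counts; Q : (N, K) rows on the simplex;
    F : (K, M) allele frequencies in [0, 1]. Constant terms are dropped.
    """
    P = np.clip(Q @ F, eps, 1.0 - eps)               # avoid log(0)
    return float(np.sum(G * np.log(P) + (2 - G) * np.log(1 - P)))

G = np.array([[2.0], [0.0]])                         # two individuals, one SNP
Q = np.eye(2)                                        # each belongs to one cluster
F_good = np.array([[0.9], [0.1]])                    # frequencies matching the data
F_bad = np.array([[0.1], [0.9]])                     # frequencies contradicting it

print(admixture_loglik(G, Q, F_good))                # 4 * ln(0.9)
print(admixture_loglik(G, Q, F_bad))                 # 4 * ln(0.1), much lower
```

Unlike the RSS used by Archetypal Analysis, this objective is defined on allele counts and frequencies, so it cannot be evaluated on rotated or projected coordinates.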

ADMIXTURE and Archetypal Analysis share similar modeling assumptions. Both _{kj} ADMIXTURE and _{kj} and archetype coordinates _{kj}) is a parameter that needs to be learnt. Instead, in AA, cluster centroids have

The likelihood function of ADMIXTURE can be understood as an error or distance metric between the input sequences

Therefore, the likelihood function resembles the

Another shared aspect of both methods is the alternating nature of the optimization procedure. In both methods, cluster centers and cluster assignments are optimized in an iterative manner. Once the cluster assignments are fixed, optimizing centers becomes a convex problem, and vice versa, allowing for fast convergences. A summary of this comparison can be found in

ADMIXTURE | Archetypal Analysis | |
---|---|---|

^{T} |
||

Loss Function | log-likelihood | RSS |

Free-parameters | ( |
2 |

CA Dimensions | ||

CA Free-parameters | ||

CA Constraints |
_{ij} ≥ 0 |
_{ij} ≥ 0 |

^{T} |
||

CC Dimensions | ||

CC Free-parameters | ||

CC Constraints | 0 ≤ _{ij} ≤ 1 |
_{ij} ≥ 0 |

Archetypal Analysis and ADMIXTURE hold a strong relationship with K-Means and K-Medoids. As already stated in [_{ij} ∈ {0, 1} and _{ij}, _{ij} ∈ {0, 1}, Archetypal Analysis becomes equivalent to K-Medoids. Therefore, AA can be understood as a smooth or fuzzy version of K-Medoids. Note that both K-Means and K-Medoids are also typically optimized in an iterative, alternating manner, similar to AA and ADMIXTURE.
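The hard-assignment special cases can be illustrated with a small numerical example. In the sketch below, all names are assumptions: `A` holds one-hot cluster assignments, `B_mean` spreads uniform weight over the members of each cluster so that the centers are cluster means (the K-Means objective), and `B_medoid` puts all weight on a single sample per cluster (the K-Medoids objective). The same RSS expression used in Archetypal Analysis then reduces to the familiar within-cluster sums of squares.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1, 1])
K = 2

A = np.eye(K)[labels]                        # (N, K) one-hot assignments
B_mean = (A / A.sum(axis=0)).T               # (K, N) uniform over members: centers are means
B_medoid = np.zeros((K, len(X)))
B_medoid[0, 0] = 1.0                         # cluster 0 represented by sample 0
B_medoid[1, 3] = 1.0                         # cluster 1 represented by sample 3

def rss(B):
    """RSS of X approximated by A @ (B @ X)."""
    return float(np.sum((X - A @ (B @ X)) ** 2))

print(rss(B_mean))     # 10.0, the K-Means within-cluster sum of squares
print(rss(B_medoid))   # 12.0, the K-Medoids cost for these representatives
```

Relaxing the entries of `A` and `B` from {0, 1} to the simplex recovers the soft assignments of Archetypal Analysis.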

Cluster centers learned by ADMIXTURE, ADMIXTURE with sparsity regularization, Archetypal Analysis, K-Means, and K-Medoids for K = 4 are plotted as solid circles while the underlying samples are plotted as small blue points. Regularization in ADMIXTURE is introduced with lambda = 500 and epsilon = 0.1.

In this paper we show how Archetypal Analysis (AA) can be used as a fast alternative to ADMIXTURE for population clustering. We also show that the Archetypal Analysis model has fewer degrees of freedom, constraining the centroids of clusters to be within the convex hull of the training samples, leading to lower explained variance than ADMIXTURE, but providing more interpretable cluster centroids that represent realizable populations. We apply our proposed system to human and dog genotypes, showing that AA can perform more than two orders of magnitude faster than ADMIXTURE while still properly capturing the population structure of the data.

Detailed information and additional experiments are provided.

(PDF)

We would like to thank Inés de Vilallonga for her dog breed illustrations.

Dear Dr. Ioannidis,

Thank you very much for submitting your manuscript "Archetypal Analysis for Population Genetics" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Heather E. Wheeler, Ph.D.

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Reviewer #1: In this manuscript, Gimbernat-Mayol et al. propose a method for determining the ancestry composition of cohort data by using Archetypal Analysis (AA). They argue that using AA is more computationally efficient than traditionally used methods such as ADMIXTURE, and that it results in similar genetic cluster assignments. Their computational efficiency improvement is impressive, but I think an expansion of the analyses presented in the manuscript is necessary to provide the reader with a fuller perspective on the benefits and drawbacks of the methodology proposed. Expansion of the discussion as it pertains to the biological analyses would also be helpful. I detail these and other specific questions below.

Broader Comments:

1. The authors note that several programs represent competing methods that could be used instead of AA for the purposes of ancestry clustering. However in the manuscript only ADMIXTURE is benchmarked. It would be useful to see comparisons to the other two methods listed (line 8) – STRUCTURE and FRAPPE – to assist in justification of the improvement of AA compared to other existing tools. At least efficiency benchmarking would be helpful if the authors do not wish to present full results from the empirical datasets.

2. How are users to determine the best fit value of k in their data with AA? With ADMIXTURE, users can run a cross-validation procedure to determine the value of k that has the best predictive accuracy. These CV errors are typically plotted across increasing ks to visualize the elbow in the dataset. Such a plot would also be a helpful additional figure to show the concordance between AA and ADMIXTURE in terms of the best fit k to the data.

3. The authors note in their introductory discussion of PCA that “interpretation can often be misleading if sampling designs are irregular” (line 26). However no discussion of the impact of sample composition for ADMIXTURE or AA is presented in the manuscript. Both will be impacted to at least some degree by sample composition, and this should at least be described for downstream users in clear terms in the manuscript, if not explored empirically. For example, sample size imbalances will impact the ordering of pulling out genetic clusters with increasing values of k. Additionally, both methods will be affected by the inclusion of related individuals in the analysis. Discussion/exploration of the impact of sample composition and recommendations for QC/data inputs by downstream users is necessary to ensure proper utilization of the method.

4. Some of the phrasing in the text related to the human populations should be revisited, particularly in sections prior to the Discussion.

a. Specifically, the populations of the Americas are referred to throughout interchangeably as “Native American populations” (Figure 3) or “indigenous individuals from the Americas” (line 144) but include many populations and individuals who would not self-identify as such, but may be better described as ‘Latinx’, ‘Latin American’, or simply ‘American.’ While there is a Native American ancestry component present in these populations of the Americas, it is incorrect to refer to many of these groups themselves as ‘Native American’ populations. The authors should revisit the phrasing surrounding such groups to be sensitive to this distinction.

b. In a similar vein, the authors refer to individuals from distinct areas of the African continent as being part of the same population at several points in the text – i.e. “The African population displays the highest genetic variability...” (line 154). There are of course many different populations within Africa with substantial genetic and ethnolinguistic diversity across them which this phrasing currently glosses over. Referring to continental groupings instead as ‘super-population’ or even simply making the population plural (--> “The African super-population displays” or “The African populations display…”) when discussing them would ensure this comes through.

5. The Discussion currently is very thin on biological takeaways from the two empirical datasets. This would be a good place to expand on both the interpretation of the differences seen between AA and ADMIXTURE, and what can be inferred based on the patterns you observe. Cite previous research to justify that the AA determinations are logical based on what we know about the population structure and history, particularly if they differ from the determinations by ADMIXTURE.

Other Specific Comments:

6. I did not receive any Supplementary Information. Please ensure supplementary information is included in the revision.

7. Author Summary – you note that AA has ‘representational advantages’ over ADMIXTURE. What is meant by this? This should be described in the main text if this is perceived as a primary benefit of AA over ADMIXTURE.

8. It appears that only bi-allelic variants can be included in the AA analysis. Is this the case? Can indels be used or just SNPs? Discussion of the specific dataset filtering requirements would be useful.

9. Line 141 – you used a MAF cutoff of 10% here. This is extremely high. Would results be the same had you included slightly less common variants in the analysis? Justify threshold choice.

10. Line 155 – you note that “all principal components” were used but do not note how many you computed. Include the number of PCs.

11. Figure 3 presentation – I have several suggestions to aid in the parsing of this figure.

a. The color scheme is not consistent across panels, which would aid viewer interpretation.

b. It is currently very difficult in panel B to determine the specific populations. The figure legend should at least point to Table 3, which contains the key to the population numbering system. A potential expansion of the x axis to allow for a more detailed population labeling may also help.

12. Fig 3 interpretation: There seem to be several notable differences in the ancestry component determination between ADMIXTURE and AA.

a. Humans: The text notes the 2 vs 1 components seen in ‘Europeans’ and ‘Native Americans’, but does not discuss which orientation is more reasonable based on prior research into the population structure and history of these areas. Additionally, there’s also a distinction in Africa where ADMIXTURE appears to identify a unique San component (though it is hard to see if it is indeed the San given the population labeling system), and AA picks up a dark blue component present in many African populations that appears to be driven by the Luhya and San. Further discussion of the interpretation of differences from the two methods would be useful to include in the Discussion.

13. Related, in the dog dataset, are the patterns you observe compatible with what is known about the dog phylogeny? That is, do the trends fit with the expectations from their demography?

14. How correlated are the ancestry fractions across AA and ADMIXTURE? A quantitative comparison would complement the qualitative comparison.

15. How did you decide on the best fit value of k for your empirical datasets? Justify your choice of the k chosen to be presented in the primary figures (8 for humans, 15 for dogs).

16. In studies examining admixture it is typical to run multiple ADMIXTURE runs with different seeds to confirm the results are consistent. I would suggest the authors do this, as the ancestry composition determined is a primary focus of their paper. Running 10x at different seeds for each value of k, for example, would show if there are different likely modes in the data or if the same result is always determined. It appears this was done to generate Figure 5, but that Figures 3 and 4 may be based only on one run each.

17. Fig 5 – how does runtime scale with increasing sample size?

18. Line 264 – “a gradient of relatedness to Europeans.” This phrasing is a bit confusing as you don’t mean relatedness in the technical genetic meaning. I would change the wording to something like ‘European ancestry component’ or ‘cline of European admixture’ or similar.

Reviewer #2: Review is uploaded as an attachment


Dear Dr. Ioannidis,

We are pleased to inform you that your manuscript 'Archetypal Analysis for Population Genetics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated. Please also consider making the minor additions/changes suggested by the reviewers in your final edit.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Heather E. Wheeler, Ph.D.

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Reviewer #1: Summary:

In this manuscript, Gimbernat-Mayol et al. propose a method for determining the ancestry composition of cohort data by using Archetypal Analysis (AA). They argue that using AA is more computationally efficient than traditionally used methods such as ADMIXTURE, and that it results in similar genetic cluster assignments. Their computational efficiency improvement is impressive. Their method is also able to resolve ancestral archetypes that represent truly possible individuals, which they point out as a benefit over competing methods.

In their revisions, the authors have expanded the scope of analyses to address the technical questions brought up by the other review and myself and have revised some of their text. A couple small points remain but the bulk of concerns have been addressed. Specifically:

• The authors now test the question of different initialization modifying results, which both reviewers brought up. There seems to be quite a big effect, which may be worth expanding on even further, but is briefly mentioned in the discussion.

• Importantly, the authors also now include a discussion of the impact of sample size imbalance on component/archetype definitions, and argue that in fact AA does better at resolving the complex ancestral diversity in Africa as compared to ADMIXTURE in spite of the bias in sample size towards the latter.

• The authors however have not commented on the impact of inclusion of related individuals.

• Runtime scaling across sample sizes is now also shown, as well as a benchmark against FRAPPE.

• The authors now also include a plot on how to interpret the compositional plots, which I think many readers will find extremely useful, as this is a somewhat non-standard way to visualize ancestry proportions.

• The authors include a formal comparison of the similarity between AA and ADMIXTURE output.

• The authors have appropriately revised their description of cohorts throughout the manuscript.

• A much fuller discussion of the interpretation of the dog ancestry inference results is now also presented, which is interesting and helpful to confirm that the parsing of clusters makes demographic sense.

• The human discussion is also expanded, though still feels a bit meager in comparison to the expanded dog discussion.

• Related, the authors find that using AA “the American populations are represented by two archetypes (A6 and A7) and have a gradient running to the European/West Asian archetype as a result of colonial admixture. Example populations found along this gradient are the Puerto Ricans in Puerto Rico and Colombians in Medellin (Colombia).” Similar to the point brought up by reviewer 1, prior research has demonstrated a west African component in Caribbean populations, which is further supported by historical records of the transatlantic slave trade. The authors do not note this additional component in some American populations in their discussion, which is a point I expect many applied readers would be interested in.

Reviewer #2: Thank you for your hard work on improving the manuscript. I have only minor fixes:

* Line 207: Typo - "the the"

* Line 288: Typo - there is a stray "("


PCOMPBIOL-D-21-02146R1

Archetypal Analysis for Population Genetics

Dear Dr Ioannidis,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom