
The authors have declared that no competing interests exist.

Conceived and designed the experiments: GA. Performed the experiments: GA. Analyzed the data: GA MI. Contributed reagents/materials/analysis tools: GA. Wrote the paper: GA MI.

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.

Principal component analysis (PCA) is a widely-used tool in genomics and statistical genetics, employed to infer cryptic population structure from genome-wide data such as single nucleotide polymorphisms (SNPs).

However, traditional approaches to computing the PCA, such as those employed by the popular EIGENSOFT suite, become prohibitively slow as datasets grow to tens of thousands of individuals.

In recent years, research into randomized matrix algorithms has yielded alternative approaches for performing PCA and producing these top PCs, while being far more computationally tractable and maintaining high accuracy relative to the traditional “exact” algorithms.

Here we present flashpca, an efficient tool for performing PCA on large genome-wide data, based on randomized algorithms. Our approach is highly efficient, allowing the user to perform PCA on large datasets (100,000 individuals or more), extracting the top principal components while achieving identical results to traditional methods.

First, we used an LD-pruned HapMap3 genotype dataset to compare flashpca with existing tools.

(a) The first two principal components from analyzing the HapMap3 dataset. (b) Scatter plots showing near-perfect absolute Pearson correlation (lower left-hand corner) between the 1st PCs estimated by smartpca, flashpca, shellfish, and R’s prcomp (using the standardization from

Next, we analyzed an LD-pruned celiac disease Immunochip dataset consisting of 16,002 individuals and 43,049 SNPs after thinning by LD.

Due to the substantial time required by shellfish and smartpca to complete the largest runs, we only ran flashpca on the larger datasets.

Each experiment was repeated three times. All programs used multi-threaded mode with 8 threads. All experiments were run on a machine with

shellfish did not complete PCA for the

Importantly, the time taken by flashpca to compute PCA on

Principal component analysis is an important tool in genomics for discovery of population structure or other latent structure in the data, such as batch effects. Early approaches such as smartpca from EIGENSOFT have proven useful for this goal and have been widely used for analysis of SNP datasets. However, many current datasets assay tens of thousands of individuals, making traditional approaches extremely time-consuming. In contrast, our approach, flashpca, is based on a careful combination of randomized algorithms for PCA together with parallelization, and allows the analyst to easily perform PCA on large datasets consisting of many thousands of individuals in a matter of minutes to hours. Despite relying on an approximation strategy, this approach suffers from essentially no loss in accuracy for the top eigenvalues/eigenvectors compared with traditional approaches. One practical limitation of the current implementation of flashpca is its memory requirements for large datasets: using 15,000 individuals with 43K SNPs requires ∼14 GiB RAM, and 150,000 individuals requires ∼145 GiB. Future work will involve reducing these memory requirements without incurring a substantial performance penalty. Randomized PCA also provides the potential to de-correlate samples (“whitening”), thus essentially removing the effects of population from data prior to further downstream association analysis.

More generally, the approach behind flashpca could prove useful for accelerating other methods that depend on performing a large number of eigen-decompositions across many samples, such as varLD.

All subjects included in the celiac disease dataset provided written informed consent. For details, see the original publication.

The HapMap phase 3 dataset consists of 1184 human individuals across 11 populations (ASW: African ancestry in Southwest USA; CEU: Utah residents with Northern and Western European ancestry from the CEPH collection; CHB: Han Chinese in Beijing, China; CHD: Chinese in Metropolitan Denver, Colorado; GIH: Gujarati Indians in Houston, Texas; JPT: Japanese in Tokyo, Japan; LWK: Luhya in Webuye, Kenya; MEX: Mexican ancestry in Los Angeles, California; MKK: Maasai in Kinyawa, Kenya; TSI: Toscani in Italia; YRI: Yoruba in Ibadan, Nigeria) assayed for 1,440,616 SNPs.

The celiac disease Immunochip dataset

We represent the genotypes as an n × p matrix X, where n is the number of individuals and p is the number of SNPs.

PCA relies on finding the eigenvectors of the p × p covariance matrix S = XᵀX / (n − 1), which can be obtained either via singular value decomposition (SVD) of X or via direct eigen-decomposition of S.

The SVD of X is X = U D Vᵀ, where the columns of U and V are the left and right singular vectors, respectively, and D is a diagonal matrix containing the singular values d₁ ≥ d₂ ≥ … ≥ 0.

In the eigen-decomposition approach, the covariance matrix S is formed explicitly and decomposed as S = V Λ Vᵀ; the eigenvectors V coincide with the right singular vectors of X, with eigenvalues λᵢ = dᵢ² / (n − 1).

The principal components (PCs) are then given by P = X V = U D.
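To make this equivalence concrete, the following is a small numpy sketch (illustrative only; flashpca itself is not implemented this way) showing that the PCs obtained via the SVD of X and via eigen-decomposition of XᵀX agree up to an arbitrary sign per component:

```python
import numpy as np

# Hypothetical toy matrix standing in for a column-centered genotype matrix X.
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)

# SVD route: X = U D V^T, so the PCs are P = X V = U D.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
pcs_svd = U * d

# Eigen-decomposition route: the eigenvectors of X^T X are the columns of V.
evals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]          # eigh returns ascending eigenvalues
pcs_eig = X @ V[:, order]

# Identical up to an arbitrary sign flip of each component.
print(np.allclose(np.abs(pcs_svd[:, :5]), np.abs(pcs_eig[:, :5]), atol=1e-6))
```

The sign ambiguity is inherent to PCA: negating any singular vector (and its PC) leaves the decomposition valid, which is why tools are compared by absolute correlation, as in the figure above.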

Note that some tools, such as smartpca and shellfish, output the eigenvectors U rather than the PCs P = U D; the two differ only by scaling of each component by its singular value.

In traditional PCA, such as that implemented in R’s prcomp, prior to running the SVD/eigen-decomposition itself, the matrix X is standardized so that each column (SNP) has zero mean; in genetic applications each SNP j is additionally scaled by √(2 pⱼ (1 − pⱼ)), where pⱼ is its allele frequency.
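This standardization step can be sketched as follows (a toy numpy example with made-up genotypes; the √(2p(1−p)) binomial scaling follows the standard EIGENSTRAT convention and is stated here as an assumption about the details elided above):

```python
import numpy as np

# Hypothetical genotype matrix: rows = individuals, columns = SNPs,
# entries coded as 0/1/2 counts of the reference allele.
G = np.array([
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [1, 2, 1, 0],
], dtype=float)

# Allele frequency per SNP: mean genotype divided by 2.
freq = G.mean(axis=0) / 2.0

# Center each SNP by 2*freq and scale by the binomial standard
# deviation sqrt(2 * freq * (1 - freq)).
X = (G - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))

print(np.allclose(X.mean(axis=0), 0.0))  # prints True: columns are centered
```

In practice, monomorphic SNPs (freq of 0 or 1) must be removed beforehand, since their scaling factor is zero.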

Performing PCA on large matrices (with both n and p in the tens of thousands or more) is computationally demanding in both time and memory.

Focusing on the eigen-decomposition approach, flashpca employs a randomized algorithm: the data are first projected onto a small random subspace that captures most of the variance, and the top eigenvectors are then recovered from this far smaller problem.
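The randomized idea can be sketched in a few lines of numpy (a minimal illustration of a Halko-style randomized range finder, not flashpca's actual C++/Eigen implementation; the function name and default parameters are illustrative):

```python
import numpy as np

def randomized_pca(X, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top-k principal components of a centered matrix X
    via a randomized range finder plus power iterations."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    l = k + n_oversamples
    # A random projection captures the dominant column space of X.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((p, l)))
    # Power iterations sharpen the subspace when the spectrum decays slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    # Exact SVD on the small l x p projected matrix B = Q^T X.
    Ub, d, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
    U = Q @ Ub
    return (U * d)[:, :k]  # top-k principal components

# Compare against exact SVD on hypothetical low-rank-plus-noise data.
rng = np.random.default_rng(1)
n, p = 200, 1000
B = rng.standard_normal((n, 5)) * np.array([10.0, 8.0, 6.0, 4.0, 2.0])
X = B @ rng.standard_normal((5, p)) + 0.01 * rng.standard_normal((n, p))
X -= X.mean(axis=0)

pcs_approx = randomized_pca(X, k=5)
U, d, _ = np.linalg.svd(X, full_matrices=False)
pcs_exact = (U * d)[:, :5]
print(np.allclose(np.abs(pcs_approx), np.abs(pcs_exact), atol=1e-6))
```

The key cost saving is that the expensive decompositions are performed on matrices with only l = k + a few columns, rather than on the full n × p data; only cheap matrix-matrix products touch X itself.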

Another computational shortcut for the case where n ≪ p is to eigen-decompose the small n × n matrix X Xᵀ rather than the p × p matrix XᵀX, since the two share the same non-zero eigenvalues.
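This n ≪ p shortcut, standard in genomic PCA where SNPs vastly outnumber individuals, can be demonstrated with a short numpy sketch (toy data; the eigenvectors of X Xᵀ are the left singular vectors U of X):

```python
import numpy as np

# Hypothetical centered data with far more SNPs (columns) than individuals.
rng = np.random.default_rng(2)
n, p = 30, 500
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)

# Eigen-decompose the small n x n matrix X X^T instead of the p x p
# matrix X^T X; the non-zero eigenvalues of the two are identical.
evals, U = np.linalg.eigh(X @ X.T)
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

# PCs are U scaled by the singular values d_i = sqrt(eigenvalue_i);
# clip guards against tiny negative eigenvalues from rounding error.
pcs_small = U * np.sqrt(np.clip(evals, 0.0, None))

# Same result (up to sign) as the PCs from a full SVD of X.
Us, d, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(np.abs(pcs_small[:, :5]), np.abs((Us * d)[:, :5]), atol=1e-6))
```

Here the decomposed matrix is 30 × 30 instead of 500 × 500; with tens of thousands of individuals against hundreds of thousands of SNPs, the saving is correspondingly larger.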

While the SVD approach is generally recommended for reasons of better numerical stability and speed, we have found that when n ≪ p the eigen-decomposition approach can be faster in practice.

flashpca is implemented in C++ and relies on the Eigen numerical linear algebra library.

Prior to PCA, thinning of the SNPs by LD and removal of regions with known high LD or other artefacts such as inversions have been recommended.

Concordance in principal components 1–10 between smartpca, flashpca, shellfish, and R’s prcomp on the HapMap3 dataset.

(PDF)

We thank the chief investigators of the celiac disease Immunochip study, David van Heel and Cisca Wijmenga, for kindly providing the data.