The authors have declared that no competing interests exist.

Conceived and designed the experiments: TDO MJB SSR SML JMA. Performed the experiments: TDO. Analyzed the data: TDO JMA. Contributed reagents/materials/analysis tools: TDO AK MJB SSR JDS ET. Wrote the paper: TDO SML JMA.

¶ Membership of the NHLBI GO Exome Sequencing Project is provided in File S1.

Advances in next-generation sequencing technology have enabled systematic exploration of the contribution of rare variation to Mendelian and complex diseases. Although it is well known that population stratification can generate spurious associations with common alleles, its impact on rare variant association methods remains poorly understood. Here, we performed exhaustive coalescent simulations with demographic parameters calibrated from exome sequence data to evaluate the performance of nine rare variant association methods in the presence of fine-scale population structure. We find that all methods have an inflated spurious association rate for parameter values that are consistent with levels of differentiation typical of European populations. For example, at a nominal significance level of 5%, some test statistics have a spurious association rate as high as 40%. Finally, we empirically assess the impact of population stratification in a large data set of 4,298 European American exomes. Our results have important implications for the design, analysis, and interpretation of rare variant genome-wide association studies.

Population structure can be a strong confounding factor in association studies

To this end, a large number of rare variant association test statistics have been developed (reviewed in Bansal et al.

Two recent studies have explored how rare variant association methods perform in the presence of population stratification. Tintle et al.

Although these two studies have provided insights into the behavior of rare variant association studies in the presence of population structure, several important questions remain. In particular, the quantitative impact of fine-scale population structure is not well defined. Indeed, as large sample sizes are necessary to detect associations with rare variants

We evaluated nine rare variant association methods: the collapsed χ^{2} test, the collapsed Fisher’s Exact Test (FET), the Weighted Sum Statistic (WSS) _{α}

For all analyses, statistical significance was determined empirically using permutations. We performed 1,000 permutations for each test statistic unless specified otherwise, which is sufficient to evaluate an α = 0.05 significance threshold. For computational efficiency, we used a rejection procedure that stops permuting once more than α×1,000 permuted statistics exceed the observed test statistic. The p-values are thus not always estimated from the full 1,000 permutations, but the approximation is sufficient for testing significance at a particular threshold.
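The early-stopping permutation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the carrier-count statistic, and the toy data are all hypothetical, and any statistic that takes genotypes and case/control labels could be substituted.

```python
import random


def carrier_difference(carriers, labels):
    """Hypothetical statistic: rare-allele carriers among cases minus controls."""
    cases = sum(c for c, lab in zip(carriers, labels) if lab == 1)
    controls = sum(c for c, lab in zip(carriers, labels) if lab == 0)
    return cases - controls


def permutation_significant(stat_fn, genotypes, labels, alpha=0.05,
                            n_perm=1000, rng=None):
    """Test significance at level alpha, stopping early when rejection is impossible.

    Returns (significant, permutations_done, n_exceeding).
    """
    rng = rng or random.Random()
    observed = stat_fn(genotypes, labels)
    max_exceed = alpha * n_perm          # e.g. 50 for alpha = 0.05, n_perm = 1,000
    shuffled = list(labels)
    n_exceed = 0
    for done in range(1, n_perm + 1):
        rng.shuffle(shuffled)            # permute case/control labels
        if stat_fn(genotypes, shuffled) > observed:
            n_exceed += 1
            if n_exceed > max_exceed:    # p-value must exceed alpha: stop permuting
                return False, done, n_exceed
    return True, n_perm, n_exceed


# Toy data: 50 carriers followed by 50 non-carriers.
carriers = [1] * 50 + [0] * 50
all_carriers_are_cases = [1] * 50 + [0] * 50
all_carriers_are_controls = [0] * 50 + [1] * 50

strong = permutation_significant(carrier_difference, carriers,
                                 all_carriers_are_cases, rng=random.Random(1))
weak = permutation_significant(carrier_difference, carriers,
                               all_carriers_are_controls, rng=random.Random(1))
```

When the observed statistic is unremarkable, the loop exits after only a few dozen permutations, which is where the computational saving comes from; the trade-off is that the returned count no longer yields an exact p-value for non-significant tests.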

To simulate a confounding effect due to population structure, we adapt a previously described approach ^{th} subpopulation. Furthermore, we denote the probability that an individual is drawn from the ^{th} subpopulation as ^{th} subpopulation given either case (

For each simulated scenario, we randomly paired haplotypes within each subpopulation to produce diploid individuals. Unless otherwise noted, we randomly sampled without replacement 1,000 cases and 1,000 controls from the subpopulations based on their disease probability:_{c|i}^{th}

We used the strategy of Schaffner et al. ^{2}) (with bins of nucleotide distance spaced by 100 kb sections), Tajima’s D, nucleotide diversity (π), and the site frequency spectrum. In addition, we estimated the mean and mean squared error of F_{ST} for variants with a minor allele frequency (MAF) ≥0.05 from eight European populations (N = 158) of the Human Genome Diversity Project (HGDP) data set

Parameter values were inferred by calibrating to patterns of variation in exome data from 316 European Americans.

To calculate these statistics in the exome data, we divided the genome into 1 Mb windows. Ten of these windows were randomly selected, with replacement, and concatenated to form a “genomic region”. This procedure was repeated 10,000 times in order to estimate a genome-wide distribution for the various statistics to compare to the simulated data. In the simulated data, we followed a similar procedure by calculating the statistics defined above on ten 1 Mb windows. In total, we performed 21 independent replicates in order to get a good estimate of the parameters and allow for variation in the number of segregating sites per region (see below). The average value of each statistic was used to calculate the RMSE. The RMSE function was calculated similarly to that described by Schaffner et al. ^{th} statistic is:^{2} by physical genomic distance [
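The window-resampling step described above (ten 1 Mb windows drawn with replacement, repeated 10,000 times) can be sketched as follows. The function name and inputs are hypothetical. As a simplifying assumption, the statistic for a concatenated region is approximated by the mean of its per-window values, which is exact only for per-site averages such as π over equal-length windows; statistics such as Tajima's D would need to be recomputed on the concatenated region.

```python
import random


def resample_regions(window_stats, n_regions=10_000, windows_per_region=10,
                     rng=None):
    """Build a genome-wide distribution of a statistic from per-window values.

    window_stats: one value of the statistic per 1 Mb window (hypothetical input).
    Each pseudo-region is formed from `windows_per_region` windows drawn with
    replacement; its statistic is approximated by the mean of those windows.
    """
    rng = rng or random.Random()
    regions = []
    for _ in range(n_regions):
        picks = rng.choices(window_stats, k=windows_per_region)  # with replacement
        regions.append(sum(picks) / len(picks))
    return regions
```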

We also performed Kolmogorov-Smirnov (K-S) tests to compare the distributions of the number of segregating sites in the observed and simulated data. To achieve a better fit for this statistic, the scaled population mutation rate _{e}µ_{k}_{p}_{k} =

Using the parameter values inferred as described above (see

We selected 1,000 cases and 1,000 controls using ^{th} subpopulation,

We also evaluated the power of logistic regression based methods, which can incorporate covariates. To this end, we used the same simulations generated with the five-population model and the same population risk confounding framework (see Figure S1C in

To explicitly evaluate how the magnitude of population structure influences the SAR, we also considered a simple model with two subpopulations and varied the time of population splitting (see Figure S1D in _{e}×[1.5, 2.0, …4.0]×10^{−3}) as well as a divergence time of zero (i.e. a single panmictic population). We also included the migration rate estimate and other parameters of the calibrated model. For each time of population splitting, we simulated 4,000 haplotypes (2,000 diploid individuals) in each of the two populations. We then sampled 1,000 cases and 1,000 controls from

To empirically evaluate the SAR, we analyzed exome sequences from a sample of 4,298 European Americans with modest, but statistically significant, levels of population stratification (

To calculate the _{PC1}_{PC2}_{PC} is a function of the distance between the minimum and maximum values along a PC axis and allows us to vary the strength of PC confounding, and OR denotes the odds ratio. Values of δ_{PC} considered were 1, ½, or ¼ of the intervening distance between the most distant individuals, and smaller values indicate larger differences in disease risk among individuals in PC space.
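One way to read the PC-based confounding scheme is sketched below. It rests on an assumption that is not stated explicitly in the surviving text: that the logistic slope for a PC is ln(OR) divided by δ_{PC} times the range of that PC, so that two individuals separated by that fraction of the range differ in disease odds by the stated OR. The function names, the zero intercept, and the toy PC values are hypothetical.

```python
import math


def pc_slope(odds_ratio, delta_fraction, pc_values):
    """Slope such that a shift of delta_fraction * range changes the odds by odds_ratio."""
    pc_range = max(pc_values) - min(pc_values)
    return math.log(odds_ratio) / (delta_fraction * pc_range)


def case_probability(pc1, pc2, beta1, beta2, intercept=0.0):
    """Logistic model: probability of being a case given PC1 and PC2 coordinates."""
    logit = intercept + beta1 * pc1 + beta2 * pc2
    return 1.0 / (1.0 + math.exp(-logit))


def odds(p):
    return p / (1.0 - p)


# Toy PC1 coordinates spanning a range of 0.1 (hypothetical values).
pc1_values = [-0.04, 0.0, 0.02, 0.06]
beta1 = pc_slope(odds_ratio=5.0, delta_fraction=0.25, pc_values=pc1_values)

# Two individuals separated by a fourth of the PC1 range, identical on PC2.
p_a = case_probability(0.0, 0.0, beta1, beta2=0.0)
p_b = case_probability(0.025, 0.0, beta1, beta2=0.0)
ratio = odds(p_b) / odds(p_a)
```

Under this reading, a smaller δ_{PC} gives a steeper slope and therefore larger differences in disease risk across PC space, which matches the description above.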

For each combination of δ_{PC}, OR, and PC, we performed ten replicates of an exome-wide analysis with the logistic T1 calculated for each gene. In total, we performed 490 exome scans with a median gene count of 14,360 (min = 14,313 and max = 14,401) where differences in gene number are due to the individuals sampled and a minimum of five rare variants per gene.
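As a consistency check on the number of scans, assume each PC takes the seven settings that appear as row and column labels in the table of SAR values (three δ_{PC} values at OR = 1/5, the null OR = 1, and three δ_{PC} values at OR = 5). The count then follows directly:

```latex
\[
\underbrace{7}_{\text{PC1 settings}} \times \underbrace{7}_{\text{PC2 settings}} \times \underbrace{10}_{\text{replicates}} = 490 \ \text{exome scans}
\]
```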

Through extensive simulations, we inferred parameters of a demographic model that recapitulate patterns of variation present in observed exome data (Figure S1A in

Using the calibrated demographic model (Figure S1B in

Simulations where the disease prevalence was identical and individuals were sampled equally from all subpopulations (X = 0 and Y = 0) yielded expected type I error rates (

Each square represents a confounding scenario set by different values of disease risks, parameterized by

Among the association methods considered here, the logistic regression based methods (i.e. T1, CMC, StepUp, and StepDown) are able to incorporate covariates. By including ten PCs, the SAR for each of these four methods was reduced to nominal levels. As an example,

The top panel shows the spurious association rate (SAR) of CMC without correcting for population structure. The middle panel shows the SAR of CMC when a single PC is included as a covariate. The bottom panel shows the SAR of CMC when 10 PCs are included as covariates. Each square represents a confounding scenario parameterized by

We also evaluated the T1 test at a lower p-value threshold (α = 0.0001) and found comparable results (see Figure S7 and S8 in

We next tested how correcting for population structure influences power of rare variant association methods (Figure S1C in

We tested for spurious association rates at various divergence times, presented as F_{ST} estimates for comparison with European populations in HGDP (light blue shading). The various lines represent differences in disease risk according to the equations

Nonetheless, it is still possible to have good power in samples with population structure. As expected, the logistic T1 test performed the best, as this is the same model we used to generate the genotype-phenotype map. CMC also performed well, but the StepUp and StepDown optimization methods had the lowest power compared to the other methods, consistent with our estimates of power in the absence of population structure (Figure S3 in

To more precisely delineate the magnitude of population structure necessary to inflate the SAR, we considered a simpler demographic model of two subpopulations and varied the time of population splitting (see Materials and Methods; Figure S1D in _{ST} in the range of 0.01–0.025; light blue shading in _{ST} ∼ 0.005), RareCover, StepUp, StepDown, CMC, and the WSS have SARs of 0.16, 0.21, 0.19, 0.14, and 0.07, respectively. Thus, these results help refine the conditions in which spurious associations become an important issue to rare variant association analyses. For the logistic regression models, we evaluated the SAR when one or ten PCs were included as covariates. These methods recovered reasonable error rates with a single PC (Figure S9 in

The figure shows the power of logistic regression methods when including ten PC covariates. The x-axis shows the odds ratio (OR), where 1.0 is the null model. “No Structure” indicates simulations where power was estimated from sampling cases and controls from a single panmictic population, but still corrected for structure. The dashed black line represents α = 0.05 and the dotted lines represent the 95% bootstrap confidence intervals.

The NHLBI Exome Sequencing Project recently described a large, high-quality sequence data set consisting of exomes (approximately 15,000 protein-coding genes) from 4,298 European Americans and 2,017 African Americans

We generated phenotypes that are confounded with population structure using a PCA approach as described in Materials and Methods. For example,

Individuals (dots) are colored according to the logistic regression model, with β values scaled so that, in this example, an odds ratio (OR) of 5 corresponds to a distance of one fourth of the range between the minimum and maximum values along each axis. In other words, individuals separated by a fourth of the PC distance will have an OR of 5 compared to each other. The probability of being a case is thus indicated by the color of each dot on a scale from 0.06 to 1, as indicated by the gradient (lower right corner).

The highest average SAR value from these scans was 7.07%, which is only slightly elevated above the expected value of 5%. We did not attempt to correct this SAR using PCA as that was how the confounding was generated. Even with the most extreme parameters considered in _{ST} between extreme groups from the first and second PC (as identified in Figure S10 in _{ST} of 0.011, is lower than the minimum pairwise F_{ST} observed from the HGDP populations of 0.012. However, we note that our simulations suggest that with larger sample sizes, and hence higher power to detect structure, the magnitude of population structure present in European Americans could result in elevated rates of spurious associations.

PC2 (rows) / PC1 (columns) | OR 1/5 Fourth | OR 1/5 Half | OR 1/5 Full | OR 1 | OR 5 Full | OR 5 Half | OR 5 Fourth |
--- | --- | --- | --- | --- | --- | --- | --- |
OR 1/5 Fourth | 0.0686 | 0.0548 | 0.0448 | 0.0400 | 0.0421 | 0.0501 | 0.0660 |
OR 1/5 Half | 0.0683 | 0.0543 | 0.0442 | 0.0402 | 0.0425 | 0.0521 | 0.0680 |
OR 1/5 Full | 0.0686 | 0.0544 | 0.0441 | 0.0397 | 0.0443 | 0.0524 | 0.0684 |
OR 1 | 0.0700 | 0.0555 | 0.0442 | 0.0389 | 0.0434 | 0.0527 | 0.0707 |
OR 5 Full | 0.0687 | 0.0522 | 0.0434 | 0.0391 | 0.0454 | 0.0541 | 0.0684 |
OR 5 Half | 0.0683 | 0.0530 | 0.0441 | 0.0395 | 0.0444 | 0.0530 | 0.0689 |
OR 5 Fourth | 0.0641 | 0.0494 | 0.0401 | 0.0393 | 0.0448 | 0.0539 | 0.0667 |

We have demonstrated that all rare variant association methods considered here can yield elevated rates of spurious associations in the presence of fine-scale population structure. Furthermore, we showed that incorporating PCs as covariates can mitigate the confounding effects of population structure and return spurious association rates to nominal type I error levels. The ability of PCA to correct for spurious associations in our demographic model is possibly attributable to the fact that rare and common variants possess correlated patterns of population structure (unpublished data). In demographic models where this is not true, PCA may not be sufficient to properly control for spurious associations

The differences in disease risk among populations that we found to generate increased SARs are plausible, and further underscore the importance of carefully designing and interpreting rare variant association methods. For instance, between populations of European men there is a 2.5% to >10% difference in rates of lung cancer, although the difference among women is less striking

Another issue for rare variant association methods is admixture. Discrete populations, as we have modeled here, can be viewed as a special case of an admixture model

In conclusion, although rare variant association tests are poised to provide new insights into the genetic architecture of complex traits, they are susceptible to spurious associations when individuals are sampled from even modestly differentiated populations. All methods considered here showed elevated SARs, suggesting this is a general phenomenon that should be considered in the design, analysis, and interpretation of rare variant association studies.

The authors wish to acknowledge the support of the National Heart, Lung, and Blood Institute (NHLBI) and the contributions of the research institutions, study investigators, field staff and study participants in creating this resource for biomedical research.