Conceived and designed the experiments: KH EC DMG EES. Performed the experiments: KH EC EES. Analyzed the data: KH EC EES. Contributed reagents/materials/analysis tools: KH EC EES. Wrote the paper: KH EES.

All authors were affiliated with Rosetta Inpharmatics LLC, a wholly-owned subsidiary of Merck & Co. Inc., when this research was carried out. There are no patents related to this work.

Genome-wide association studies (GWAS) may be biased by population stratification (PS). We empirically quantified the magnitude of PS among human populations and its impact on GWAS. Liver tissues were collected from 979, 59 and 49 Caucasian Americans (CA), African Americans (AA) and Hispanic Americans (HA), respectively, and genotyped using Illumina650Y (Ilmn650Y) arrays. RNA was also isolated and hybridized to Agilent whole-genome gene expression arrays. We propose a new method (hgdp-eigen) for detecting PS by projecting the genotype vector of each sample onto the eigenvector space defined by the Human Genetic Diversity Panel (HGDP). Further, we conducted GWAS to map expression quantitative trait loci (eQTL) for the ∼40,000 liver gene expression traits monitored by the Agilent arrays. Hgdp-eigen performed similarly to the conventional self-eigen method in capturing PS. However, leveraging the HGDP offered a significant advantage in revealing the origins, directions and magnitude of PS. Adjusting for eigenvectors had minor impacts on eQTL detection rates in CA. In contrast, for AA and HA, adjustment dramatically reduced association findings. At an FDR of 10%, we identified 65 eQTLs in AA with the unadjusted analysis, but only 18 eQTLs after the eigenvector adjustment. Strikingly, 55 of the 65 unadjusted AA eQTLs were validated in CA, indicating that the adjustment procedure significantly reduced GWAS power. A number of the 55 AA eQTLs validated in CA overlapped with published disease-associated SNPs. For example, rs646776 and rs10903129 have previously been associated with lipid levels and coronary heart disease risk; however, the rs10903129 eQTL was missed in the eigenvector-adjusted analysis.

Genome-wide association studies (GWAS) have emerged as an important approach to identify common polymorphisms underlying complex traits. Allele frequency disparity due to systematic ancestry differences, otherwise known as population stratification (PS), can bias testing results and lead to artifactual associations, although there is not yet consensus on how significant such a bias could be. Two general strategies were developed to address the PS risk. First, family-based design is robust against PS

As a drawback, self-eigen does not directly infer the origin and magnitude of the PS, although such information is important, especially for recently admixed populations. Therefore, we developed a new eigenvector-based method (termed “hgdp-eigen”) to overcome this challenge. It constructs the eigenvector space using the Human Genetic Diversity Panel (HGDP) ^{th} dimension we were able to see the finer separation among populations in Africa. In the second step, we projected the genotypes of the study subjects onto the PC space built in step 1 and derived the subjects' coordinates on each PC dimension.
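The two-step idea (build a PC space from a reference panel, then project study subjects onto it) can be sketched in Python with NumPy. This is a minimal illustration on simulated genotypes, not the authors' implementation: the function names `fit_reference_pcs` and `project`, the toy allele frequencies, and the per-SNP scaling by sqrt(p(1-p)) are all assumptions made for the example.

```python
import numpy as np

def fit_reference_pcs(ref_genotypes, n_pcs):
    """Step 1: build a PC space from reference genotypes (samples x SNPs, coded 0/1/2)."""
    ref = np.asarray(ref_genotypes, dtype=float)
    mean = ref.mean(axis=0)              # per-SNP mean genotype in the reference
    p = mean / 2.0                       # per-SNP reference allele frequency
    scale = np.sqrt(p * (1.0 - p))       # assumed normalisation for this sketch
    scale[scale == 0] = 1.0              # guard against monomorphic SNPs
    z = (ref - mean) / scale
    # Rows of vt are the SNP loadings of successive principal components.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, scale, vt[:n_pcs].T     # loadings: SNPs x n_pcs

def project(genotypes, mean, scale, loadings):
    """Step 2: place samples in the reference PC space using the reference centring/scaling."""
    z = (np.asarray(genotypes, dtype=float) - mean) / scale
    return z @ loadings

# Toy data (hypothetical): two reference populations and one admixed study group.
rng = np.random.default_rng(0)
n_snps = 200
pop_a = rng.binomial(2, 0.1, size=(20, n_snps))
pop_b = rng.binomial(2, 0.9, size=(20, n_snps))
reference = np.vstack([pop_a, pop_b])

mean, scale, loadings = fit_reference_pcs(reference, n_pcs=2)

admixed = rng.binomial(2, 0.5, size=(10, n_snps))
coords = project(admixed, mean, scale, loadings)

pc1_a = project(pop_a, mean, scale, loadings)[:, 0].mean()
pc1_b = project(pop_b, mean, scale, loadings)[:, 0].mean()
pc1_admixed = coords[:, 0].mean()
```

Because the study samples never enter the decomposition, their coordinates are read against fixed reference axes. In this toy setting the admixed group falls between the two reference clusters on the first PC, which is the kind of origin and magnitude information the text attributes to hgdp-eigen.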

To assess hgdp-eigen, we assembled a human liver-specific cohort (HLC) comprising over 1,000 individuals, broadly representing three ethnic groups: Caucasian American (CA), Hispanic American (HA), and African American (AA). DNA and RNA were isolated from each of the liver samples. Genotype data for each DNA sample were generated using the Illumina 650Y genotyping array, and expression data for each RNA sample were generated using a custom whole-genome Agilent gene expression array. To minimize the effects of assay artifacts, we applied very stringent data quality filters

The hgdp-eigen took advantage of the HGDP PC space by providing us with a global context of geographically defined world populations (^{nd} and 3^{rd} HGDP-PC dimensions, we concluded there were no Asian or Native American ancestries represented in AA (^{rd} HGDP-PC dimension, ^{th} and 6^{th} dimensions reveal the seven populations collected in Africa (

The HGDP-PC space can separate world populations with excellent resolution (

For comparison, we also applied the self-eigen to the HLC (

The African Americans (A and B), Hispanic Americans (C and D) and European Americans (E and F) showed stratification similar but not identical to those in

Given the acknowledged PS in even relatively homogeneous populations (e.g. European or Finnish), it would be natural to ask about the extent and impact of PS on GWAS in practical settings. More importantly, despite the routine adjustment for PS in GWAS (e.g. using self-eigen), no empirical studies have been carried out to date to assess the impact of these adjustments on statistical power. The large number of phenotypes scored in the HLC provides a path to estimate the impact of PS on GWAS empirically

Given the considerable population differences observed for gene expression traits

| Adjustment | Statistic | 10% FDR | 30% FDR |
|---|---|---|---|
| unadj | cis-eQTL p-value_{cutoff} | 7.9e-5 | 3.6e-4 |
| | trans-eQTL p-value_{cutoff} | 6.0e-9 | 3.2e-8 |
| | number of cis-eQTLs | 7101 | 10044 |
| | number of trans-eQTLs | 607 | 982 |
| Self-eigen | cis-eQTL p-value_{cutoff} | 7.4e-5 | 3.3e-4 |
| | trans-eQTL p-value_{cutoff} | 5.9e-9 | 2.6e-8 |
| | number of cis-eQTLs | 6958 | 9647 |
| | number of trans-eQTLs | 582 | 861 |
| Hgdp-eigen | cis-eQTL p-value_{cutoff} | 8.3e-5 | 3.6e-4 |
| | trans-eQTL p-value_{cutoff} | 8.0e-9 | 2.6e-8 |
| | number of cis-eQTLs | 7063 | 9847 |
| | number of trans-eQTLs | 613 | 836 |

| Adjustment | Statistic | 10% FDR | 30% FDR |
|---|---|---|---|
| unadj | cis-eQTL p-value_{cutoff} | 9.5e-6 | 2.9e-5 |
| | trans-eQTL p-value_{cutoff} | - | - |
| | number of cis-eQTLs | 65 | 132 |
| | number of trans-eQTLs | 0 | 0 |
| Self-eigen | cis-eQTL p-value_{cutoff} | 2.8e-6 | 4.7e-6 |
| | trans-eQTL p-value_{cutoff} | - | - |
| | number of cis-eQTLs | 18 | 22 |
| | number of trans-eQTLs | 0 | 0 |
| Hgdp-eigen | cis-eQTL p-value_{cutoff} | 6.2e-6 | 8.9e-6 |
| | trans-eQTL p-value_{cutoff} | - | - |
| | number of cis-eQTLs | 21 | 30 |
| | number of trans-eQTLs | 0 | 0 |

| Adjustment | Statistic | 10% FDR | 30% FDR |
|---|---|---|---|
| unadj | cis-eQTL p-value_{cutoff} | 6.1e-6 | 3.5e-5 |
| | trans-eQTL p-value_{cutoff} | 1.0e-7 | 1.0e-7 |
| | number of cis-eQTLs | 33 | 105 |
| | number of trans-eQTLs | 1 | 1 |
| Self-eigen | cis-eQTL p-value_{cutoff} | 7.9e-6 | 2.4e-5 |
| | trans-eQTL p-value_{cutoff} | - | 2.4e-7 |
| | number of cis-eQTLs | 21 | 50 |
| | number of trans-eQTLs | 0 | 3 |
| Hgdp-eigen | cis-eQTL p-value_{cutoff} | 9.3e-6 | 2.0e-5 |
| | trans-eQTL p-value_{cutoff} | - | - |
| | number of cis-eQTLs | 24 | 55 |
| | number of trans-eQTLs | 0 | 0 |

With a sample size of N = 979, we had excellent statistical power to detect cis-eQTLs in the CA (

Due to the modest sample sizes in the AA and HA, we only had statistical power to detect cis eQTLs (

In addition, we examined the effect sizes of the eQTLs (10% FDR). Because the test statistic of the Kruskal-Wallis test does not reflect the effect size, we used the r^{2} estimate from the robust linear model, Trait_{adj} ∼ genotype, to estimate effect sizes. (Here Trait_{adj} denotes the gene expression value already adjusted for age and gender.) Among AA eQTLs, the mean, median and standard deviation of r^{2} were 0.43, 0.46 and 0.14, respectively. Among HA eQTLs, the mean, median and standard deviation of r^{2} were 0.54, 0.56 and 0.16, respectively. For CA eQTLs that were confirmed in AA or HA, the mean, median and standard deviation of r^{2} were 0.52, 0.54 and 0.19, respectively.

Further, we investigated whether the adjustment reduced statistical power, leading to a failure to detect many true cis eQTL that would have been found without the adjustment. The large number of phenotypes (∼40,000 expression traits) provided a path to empirically estimate power (

Finally, if we assume that many of the cis-eQTLs in the AA and HA were caused by PS and consequently excluded by eigenvector adjustment, the false trait-SNP associations should have the following properties: (1) the gene expression trait should be differentially expressed between subpopulations, and (2) the SNP allele frequency should differ between the subpopulations. Indeed, for any gene whose expression varied significantly between subpopulations and any SNP whose frequency also varied significantly between subpopulations, we would detect an association for that trait-SNP pair. Because SNPs with different allele frequencies among subpopulations should be distributed uniformly throughout the genome, we would expect to see the same number of cis-eQTLs detected no matter where we placed the 2 Mbp window defining the cis region of interest. That is, if we randomly chose a 2 Mbp window for each gene expression trait and counted the number of pseudo cis-eQTLs detected, we would expect, under this assumption, a count close to the number of true cis-eQTLs. We performed this simulation in the HLC for the AA and HA, randomly placing the 2 Mbp window for each gene and conducting cis-eQTL mapping for over 1000 runs. We found an average of 9.92 and 2.03 pseudo cis-eQTLs in AA and HA, respectively, significantly fewer than the 65 and 33 true cis-eQTLs. Therefore, we reject the hypothesis that the cis-eQTLs detected in the AA and HA were mostly driven by PS, and the results support the hypothesis that the majority of cis-eQTLs in the unadjusted analysis were real.
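The random-window control can be sketched as follows. This is an illustrative outline only, under several assumptions that are not in the text: a single chromosome, a hypothetical `pvalue_fn(gene, snp_index)` callback standing in for the association test, and a counting rule in which a gene contributes one pseudo cis-eQTL if any SNP in its random window passes the p-value cutoff.

```python
import random
from bisect import bisect_left, bisect_right

WINDOW_BP = 2_000_000  # 2 Mbp window, as in the text

def snp_indices_in_window(sorted_positions, start, width=WINDOW_BP):
    """Indices of SNPs whose position lies in [start, start + width]."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_right(sorted_positions, start + width)
    return range(lo, hi)

def count_pseudo_cis_eqtls(genes, sorted_positions, chrom_length,
                           pvalue_fn, cutoff, rng):
    """One simulation run: place a random window per gene and count the hits."""
    count = 0
    for gene in genes:
        start = rng.uniform(0, chrom_length - WINDOW_BP)
        window = snp_indices_in_window(sorted_positions, start)
        # Assumed counting rule: one pseudo cis-eQTL per gene with any hit.
        if any(pvalue_fn(gene, i) <= cutoff for i in window):
            count += 1
    return count

def mean_pseudo_count(genes, sorted_positions, chrom_length,
                      pvalue_fn, cutoff, n_runs, seed=0):
    """Average pseudo cis-eQTL count over repeated random placements."""
    rng = random.Random(seed)
    runs = [count_pseudo_cis_eqtls(genes, sorted_positions, chrom_length,
                                   pvalue_fn, cutoff, rng)
            for _ in range(n_runs)]
    return sum(runs) / n_runs

# Toy setup (hypothetical): one SNP every 50 kb on a 10 Mbp chromosome.
toy_positions = list(range(0, 10_000_000, 50_000))
toy_genes = ["gene_%d" % k for k in range(5)]

never_hit = mean_pseudo_count(toy_genes, toy_positions, 10_000_000,
                              lambda gene, i: 1.0, 1e-5, n_runs=20)
always_hit = mean_pseudo_count(toy_genes, toy_positions, 10_000_000,
                               lambda gene, i: 0.0, 1e-5, n_runs=20)
```

In a real analysis `pvalue_fn` would run the same association test used for the true cis windows, and the averaged count would be compared with the observed number of cis-eQTLs at the same cutoff.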

To illustrate the impact adjusting for PS can have on identifying disease susceptibility loci, we intersected the HLC eQTLs with the set of SNPs in the public GWAS databases identified and replicated as associated with common human disease

In this study, we analyzed three major ethnic groups in the United States: Caucasian American, African American and Hispanic American. First, the hgdp-eigen method provided valuable information on the origin of the admixtures. For example, it revealed the African and Native American components in the HA genome. Second, HGDP-PC quantified the magnitude of the PS. In contrast, the self-eigen method simply detected the PS without inferring the origin or the magnitude of the gene flow. Lastly, the HGDP-PC space was constructed from the HGDP sample, capturing the primary allele frequency differences among world populations, and is therefore robust to the choice of study cohort. Comparing ^{rd} dimension is primarily defined by two outliers (

Assessing the extent of PS confounding is an important but challenging task. There were attempts to address this issue using simulation and small empirical datasets (e.g. the lactase gene)

In contrast, the eigenvector adjustment greatly reduced the number of cis-eQTL findings in AA and HA, due to a loss of statistical power. The underlying rationale is easy to understand. Many SNPs showed considerable allele frequency differences (e.g. ≥0.1) among ethnic groups. In our data, 60.1% of SNPs showed an allele frequency difference of ≥0.1 between AA and CA (

Liver tissue samples were collected from “Liver Study subjects”, whose detailed characteristics were reported in a separate article

Since the liver tissues

938 unrelated individuals from 51 populations (collected in Europe, Middle East, Central/South Asia, Africa, East Asia, America and Oceania) of the HGDP were successfully genotyped using Ilmn650Y

To avoid artifacts due to linkage disequilibrium, we thinned the data by excluding highly correlated SNPs. Then we conducted two versions of PCA. First, the standard methods as implemented in the Eigensoft package

Kruskal-Wallis (KW) one-way analysis of variance was employed to test association between gene expression traits and genotypes. The KW test can be considered the non-parametric counterpart to ANOVA for testing equality among groups (e.g., the three genotype groups corresponding to a given SNP). This test does not assume the traits are normally distributed and is therefore more robust to outliers and to violations of other assumptions important for successful application of parametric tests. In brief, the KW test was applied to a given trait-SNP pair by first ranking all trait values regardless of genotype, assigning tied values the average of the ranks they would have received had they not been tied. Then we computed the test statistic (K) as

$$K = (N-1)\,\frac{\sum_{i=1}^{g} n_{i}\,(\bar{r}_{i} - \bar{r})^{2}}{\sum_{i=1}^{g}\sum_{j=1}^{n_{i}} (r_{ij} - \bar{r})^{2}}$$

where n_{i} is the number of subjects for genotype i; r_{ij} is the rank of subject j who carried genotype i; \bar{r}_{i} is the mean rank within genotype i; \bar{r} = (N+1)/2 is the mean of all ranks; N is the entire sample size; and g denotes the number of genotype groups (either 2 or 3 for the groups tested). Finally, the p value was derived using the approximation Pr(χ^{2}_{g-1}≥K). Before testing the gene expression traits, we adjusted them for age, gender and PCs. This adjustment was carried out by fitting a robust linear model (using the rlm function in the R statistical software package) to each of the gene expression traits,
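The rank-based test described above can be illustrated with a short, standard-library-only Python sketch. It is not the authors' code: the function names are hypothetical, the statistic is written in the tie-aware form (N-1) × between-group rank variation / total rank variation, and the chi-square tail probability is evaluated in closed form for the two cases the text mentions (2 or 3 genotype groups, i.e. 1 or 2 degrees of freedom).

```python
import math

def average_ranks(values):
    """Rank values 1..N, giving tied values the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0      # mean of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def kruskal_wallis(trait, genotype):
    """Return (K, p) for one trait-SNP pair; supports 2 or 3 genotype groups."""
    n = len(trait)
    ranks = average_ranks(trait)
    groups = {}
    for r, g in zip(ranks, genotype):
        groups.setdefault(g, []).append(r)
    grand_mean = (n + 1) / 2.0
    between = sum(len(rs) * (sum(rs) / len(rs) - grand_mean) ** 2
                  for rs in groups.values())
    total = sum((r - grand_mean) ** 2 for r in ranks)   # zero only if all values tie
    k = (n - 1) * between / total
    df = len(groups) - 1
    if df == 1:
        p = math.erfc(math.sqrt(k / 2.0))   # chi-square upper tail, 1 d.f.
    elif df == 2:
        p = math.exp(-k / 2.0)              # chi-square upper tail, 2 d.f.
    else:
        raise ValueError("this sketch handles only 2 or 3 genotype groups")
    return k, p

# Toy example (hypothetical values): six subjects, two per genotype class.
k_stat, p_value = kruskal_wallis([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                 [0, 0, 1, 1, 2, 2])
```

With no ties this form agrees with the familiar 12/(N(N+1)) Σ R_i²/n_i − 3(N+1) expression; for the toy data both give K = 32/7.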

We repeated the eQTL mapping analyses on permuted gene expression data sets to empirically estimate FDR. In each permutation run, we first randomized the patient IDs in the expression file, breaking any association between expression traits and genotypes, while leaving the respective correlation structures among gene expression traits and SNP genotypes intact. Then we repeated the association tests for every expression trait and genotype pair in the permuted sets, leading to a set of null statistics for each permutation. A standard FDR estimator was then applied to the resulting association statistics, as previously carried out on observed and permutation null statistics
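A permutation-based FDR estimate of this kind can be sketched as below. The text does not spell out the estimator, so the form used here is an assumption: at a given p-value cutoff, the FDR is taken as the average number of null (permuted) tests passing the cutoff divided by the number of observed tests passing it. The function names and toy p-values are hypothetical.

```python
import random

def permute_sample_ids(sample_ids, rng):
    """Shuffle the sample IDs of the expression file as a block.

    Whole expression profiles move together, so correlations among traits
    (and among SNPs) are kept while trait-genotype links are broken.
    """
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled

def empirical_fdr(observed_p, permuted_p_runs, cutoff):
    """Assumed estimator: mean null discoveries / observed discoveries."""
    n_observed = sum(1 for p in observed_p if p <= cutoff)
    if n_observed == 0:
        return 0.0                       # no discoveries at this cutoff
    null_counts = [sum(1 for p in run if p <= cutoff) for run in permuted_p_runs]
    mean_null = sum(null_counts) / len(null_counts)
    return min(1.0, mean_null / n_observed)

# Toy p-values (hypothetical): one observed set and two permutation runs.
observed = [1e-6, 1e-5, 1e-4, 0.02, 0.3, 0.7]
permuted_runs = [
    [1e-4, 0.05, 0.2, 0.4, 0.6, 0.9],
    [0.01, 0.03, 0.2, 0.5, 0.8, 0.95],
]
fdr_at_cutoff = empirical_fdr(observed, permuted_runs, cutoff=1e-3)
```

Scanning `cutoff` over a grid and keeping the largest value whose estimated FDR stays at or below the target (for example 10%) would yield p-value cutoffs of the kind reported in the tables.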

Although we could not determine whether a particular discovery was true or false, at a given FDR (e.g. 10%) we knew the expected proportion (e.g. 90%) of discoveries that were true. Therefore, at a fixed FDR, when two methods resulted in different numbers of discoveries (termed N_{1} and N_{2}), there would be (1-FDR)*N_{1} and (1-FDR)*N_{2} true findings, respectively, so that the ratio N_{1}/N_{2} is proportional to the relative power of the two methods.
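As a worked illustration of this argument, the snippet below plugs in the AA cis-eQTL counts at 10% FDR reported in this paper (65 unadjusted, 18 after self-eigen adjustment). The helper names are hypothetical and the calculation is only the arithmetic implied by the paragraph above.

```python
def expected_true_findings(n_discoveries, fdr):
    """Expected number of true discoveries at a fixed FDR."""
    return (1.0 - fdr) * n_discoveries

def relative_power(n1, n2):
    """Ratio of discovery counts at the same FDR; the (1 - FDR) factor cancels."""
    return n1 / n2

FDR = 0.10
n_unadjusted, n_self_eigen = 65, 18      # AA cis-eQTLs at 10% FDR (from the paper)

true_unadjusted = expected_true_findings(n_unadjusted, FDR)   # about 58.5
true_self_eigen = expected_true_findings(n_self_eigen, FDR)   # about 16.2
power_ratio = relative_power(n_unadjusted, n_self_eigen)      # about 3.6
```

Read this way, the unadjusted analysis in AA recovered roughly 3.6 times as many expected true cis-eQTLs as the self-eigen adjusted analysis at the same FDR.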

Cis eQTL detected in the African American samples from the human liver-specific cohort.

(0.03 MB XLS)

Cis eQTL detected in the Hispanic American samples from the human liver-specific cohort.

(0.03 MB XLS)

eQTL mapping in African Americans, adjusted for the top 1 or 2 eigenvectors. Fifty-five of the 65 AA cis-eQTLs (detected using unadjusted traits at 10% FDR) were also found as CA cis-eQTLs (detected using self-eigen adjusted traits at 10% FDR), with an enrichment p-value = 1.45E-33.

(0.05 MB DOC)

eQTL mapping in Hispanic Americans, adjusted for the top 1 or 2 eigenvectors. Twenty-nine of the 33 HA cis-eQTLs (detected using unadjusted traits at 10% FDR) were also found as CA cis-eQTLs (detected using self-eigen adjusted traits at 10% FDR), with an enrichment p-value = 4.87E-20.

(0.04 MB DOC)

We conducted PCA on the HGDP dataset and observed results consistent with those of Li et al. A, the 1st PC separates African from non-African populations, and the 2nd PC separates East Asia, Native America and Oceania from other non-African populations; B, the 3rd PC separates Native America from other populations; C, the 4th PC separates Oceania from others; D, the 5th PC separates different populations in Africa; E, the 6th PC continues to separate different populations in Africa; F, in the space formed by the 5th and 6th HGDP-PCs, African populations of various cultures, languages and locations were well separated; G, we projected the AA subjects onto the 5th and 6th HGDP-PCs; interestingly, the AA samples were located very close to the Bantu groups.

(9.23 MB TIF)

We compared allele frequencies among the three ethnic groups of the liver study (A, African American vs. Caucasian American; B, African American vs. Hispanic American; and C, Caucasian American vs. Hispanic American). A considerable percentage of SNPs showed large allele frequency disparities (e.g., ≥0.1). Further, we applied a simple Chi-square test and found that many of the differences were significant (D, African American vs. Caucasian American; E, African American vs. Hispanic American; and F, Caucasian American vs. Hispanic American).

(8.82 MB TIF)

There are many genes near the SNP rs646776 locus, including PSRC1 and SORT1.

(1.80 MB TIF)