
Conceived and designed the experiments: JH CM DH. Performed the experiments: JH. Analyzed the data: JH. Wrote the paper: JH CM DH. Contributed source code and computational analysis tools: CK.

The authors are affiliated with Microsoft Corporation. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Understanding the role of genetic variation in human diseases remains an important problem in genomics. An important component of such variation consists of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs with phenotypes has been confounded by hidden factors such as population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors, such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. The runtime of our method scales quadratically in the number of individuals being studied, with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime than LMMs. We have implemented methods for fitting these models, which are available at

Population structure, family structure and/or cryptic relatedness are well-known confounding factors that cause spurious associations to be found in GWAS

The standard techniques for dealing with confounding factors fall into several classes. An effective class of methods includes approaches formulated as LMMs

Although LMMs have been shown to effectively model and correct for confounding factors in GWAS, an important problem that remains to be solved is how to minimize the computational costs of such methods. Methods based on LMMs typically incur high computational costs, particularly for studies with larger numbers of individuals, as the matrix operations required for parameter estimation scale cubically with the number of individuals. In the regime where the number of individuals grows large and where confounding factors exert strong effects, this may hinder the applicability of LMMs. One possible approach to the above problem is to turn to alternative classes of models that allow us to model similarities between individuals in order to account for confounding factors in the data (as do LMMs) while eschewing the need for costly matrix operations during parameter estimation. In particular, probabilistic graphical models are a natural class of statistical models that allow both for modeling similarities between individuals and fast parameter estimation. In this paper we propose a probabilistic graphical model and parameter estimation method for associating SNPs to phenotype that both accounts for confounding factors and runs significantly faster than current LMM-based methods for larger numbers of individuals, allowing the method to scale to larger study sizes. Unlike LMM-based methods (which present local optima in parameter estimation

We present a model that relates individuals' phenotypic labels to a given SNP marker and other covariates. The output of our model is a statistic for the SNP marker, so that we can perform a GWAS by applying our model to each SNP marker in a large set of interest. Given a set of individuals, we assume that phenotypes consist of binary labels corresponding to the absence/presence of a phenotype in an individual, although the model can easily be generalized to polytomous discrete or continuous phenotypes. For a given locus, our model specifies a joint probability over individuals' observed phenotypes, conditioned on each individual's SNP and covariates. The joint probability is a function of all pairs of individuals' phenotypes and each individual's SNP and covariates. Under our model, the contribution of each pair of individual phenotypes increases or decreases as a function of the genetic similarity between the pair of individuals. Analogously, the contribution of each individual's SNP and covariates varies as a function of how strongly the SNP and covariates influence that individual's phenotype, taking into account genetic similarity between individuals. The dependencies between individuals due to genetic similarity, in addition to the influence of genetic variation and covariates in generating phenotypes, can be modeled using a graph in which nodes correspond to observed phenotypes and covariates. Edges in the graph denote dependencies between phenotypes and covariates (

Nodes correspond to variables in the model and edges correspond to dependencies between variables under the model. Shaded nodes correspond to observed variables under the model. Conditioned on each individual's SNP and covariates, phenotypic labels are modeled using a fully connected undirected graphical model.
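The structure described above, a fully connected undirected model over binary phenotypes conditioned on each individual's SNP and covariates, can be illustrated with a minimal sketch. The specific potentials used here (a linear node term and a similarity-weighted agreement term), the parameter names `w` and `lam`, and the similarity matrix `K` are assumptions made for illustration and are not taken from the model definition in this paper. The normalizer is computed by brute-force enumeration, which is only feasible for a handful of individuals.

```python
import itertools
import numpy as np

def log_score(y, X, K, w, lam):
    """Unnormalized log-probability of binary labels under an assumed pairwise model.

    y   : length-n sequence of 0/1 phenotype labels
    X   : (n, d) array, one row of SNP and covariate values per individual
    K   : (n, n) symmetric genetic-similarity matrix (hypothetical input)
    w   : (d,) node weights (hypothetical parameterization)
    lam : scalar edge weight (hypothetical parameterization)
    """
    s = 2 * np.asarray(y) - 1                 # map labels {0, 1} to {-1, +1}
    node = np.sum(s * (X @ w))                # per-individual SNP/covariate term
    i, j = np.triu_indices(len(s), k=1)       # every unordered pair i < j
    edge = lam * np.sum(K[i, j] * s[i] * s[j])  # similar pairs are rewarded for agreeing
    return node + edge

def joint_probability(y, X, K, w, lam):
    """Normalized probability of y, enumerating all 2**n label configurations."""
    n = len(y)
    scores = np.array([
        log_score(c, X, K, w, lam)
        for c in itertools.product([0, 1], repeat=n)
    ])
    log_z = np.logaddexp.reduce(scores)       # log of the partition function
    return float(np.exp(log_score(y, X, K, w, lam) - log_z))
```

With a positive edge weight, two genetically similar individuals are more likely to share a label than to differ, which is the qualitative behavior the pairwise terms are meant to capture.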

The goal of associating SNPs to phenotypes then corresponds to parameter estimation under our model in which genetic similarity between individuals is accounted for (see

Given our probabilistic model for estimating associations between SNPs and phenotype, we would like to test two aspects of the model. The first is that of

The second aspect we wish to test is that of

To test the above two aspects, we used both real data and synthetic phenotypes generated from a generalized LMM (GLMM) using real human genotype and phenotype data. The GAW14

We applied our model to the above real datasets and to the synthetically-generated data, where for all datasets, individual age covariates were binned into five ranges

a),b) Histograms of p-values obtained from our method for the synthetic (a) and real (b) GAW14 data. For comparison, p-values obtained from a logistic regression that does not account for confounding factors and from Eigenstrat

a),b) Histograms of p-values obtained from our method for the synthetic (a) and real (b) GOLDN data. For comparison, p-values obtained from a logistic regression that does not account for confounding factors and from Eigenstrat

QQ plots of model negative

QQ plots of model negative

Negative

In addition to testing the calibration of our method, we would also like to test its ability to distinguish spurious associations from real ones, or its statistical power. A method that produces few significant p-values for data where

The plots are shown as receiver operating characteristic (ROC) curves of the true positive rate as a function of the false positive rate for the GAW14 dataset (a,b,c,d) and the GOLDN dataset (e,f,g,h) for various values of the SNP regression weight when using our method (blue), an LMM-based method (red), a PCA-based method (green) and random guessing (black dotted).
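As a rough illustration of how an ROC curve of this kind can be computed, the sketch below sweeps the significance threshold over a set of p-values for which the truly associated SNPs are known, as is the case for synthetic data. The function names and the input format are assumptions made for this example, not the evaluation code used in the study. Tied p-values are not grouped, and the inputs must contain at least one associated and one non-associated SNP.

```python
import numpy as np

def roc_from_pvalues(pvalues, is_associated):
    """Return (fpr, tpr) arrays obtained by sweeping the significance threshold."""
    p = np.asarray(pvalues, dtype=float)
    truth = np.asarray(is_associated, dtype=bool)
    order = np.argsort(p, kind="stable")   # most significant SNP first
    truth = truth[order]
    tp = np.cumsum(truth)                  # true positives when calling the top-k SNPs
    fp = np.cumsum(~truth)                 # false positives when calling the top-k SNPs
    tpr = np.concatenate([[0.0], tp / truth.sum()])
    fpr = np.concatenate([[0.0], fp / (~truth).sum()])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A method that ranks every truly associated SNP ahead of every null SNP yields an area of 1, while random guessing corresponds to the diagonal with an area of about 0.5.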

In addition to assessing the statistical power of our method, we also assessed the runtime of our method as a function of the study size, or number

Runtimes for the CRF and LMM models (in hours) are shown as a function of study size. All experiments were run on a machine with two 3.0 GHz CPUs and 64.0 GB of RAM.
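One way to read scaling behavior off runtime measurements like these is to fit a straight line to log runtime against log study size, since a runtime proportional to a power of the study size appears as a line whose slope is the exponent. The sketch below assumes this power-law form. The sizes and runtimes in it are idealized values constructed for illustration and are not measurements from this study.

```python
import numpy as np

def scaling_exponent(study_sizes, runtimes):
    """Least-squares slope of log(runtime) against log(study size)."""
    log_n = np.log(np.asarray(study_sizes, dtype=float))
    log_t = np.log(np.asarray(runtimes, dtype=float))
    slope, _intercept = np.polyfit(log_n, log_t, 1)
    return float(slope)

# Illustrative inputs only: exact power laws, NOT runtimes reported in the paper.
sizes = np.array([500.0, 1000.0, 2000.0, 4000.0])
quadratic_like = 1e-7 * sizes ** 2    # doubling the study size multiplies cost by 4
cubic_like = 1e-10 * sizes ** 3       # doubling the study size multiplies cost by 8

print(scaling_exponent(sizes, quadratic_like))
print(scaling_exponent(sizes, cubic_like))
```

On measured runtimes the fitted slope is only approximate, because constant overheads and lower-order terms dominate at small study sizes.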

We have presented a novel GWAS method that accounts for confounding factors such as population structure, family structure or cryptic relatedness. Like LMM-based and PCA-based methods for association, our model accounts for confounding factors through the use of pairwise similarities between individuals, which allows us to significantly reduce false positive rates when performing associations. In contrast to LMM-based and PCA-based methods, our method retains high statistical power and is relatively inexpensive even as the number of individuals in a study grows. Our experimental results on both real and synthetic genotype data demonstrate that our method can adequately account for confounding factors in order to reduce false positive rates, with a modest loss in statistical power as compared to LMM-based and PCA-based methods for data generated from a generalized LMM. We have shown that our method is significantly faster than methods based on LMMs, with the speedup increasing as the number of individuals in a study grows. As future studies grow to encompass tens of thousands of individuals

The GAW14 dataset consisted of a subset of the data provided to the Genetic Analysis Workshop 14 (GAW 14) as part of the Collaborative Study on the Genetics of Alcoholism (U10 AA008401), which is described in detail elsewhere

The GOLDN study has been described in detail elsewhere

The Wellcome Trust Case Control Consortium (WTCCC) data consisted of SNP data for about 1,900 individuals with Crohn's disease and about 1,500 controls from the UK Blood Service Control Group (NBS). SNPs were excluded from analysis using the more conservative SNP filter described by the WTCCC in

Given a set of individuals

For a given locus, our model consists of a probabilistic graphical model over individuals' observed phenotypes, conditioned on each individual's SNP and covariates. A probabilistic graphical model consists of two parts: the first is a graph

Given individuals' phenotypes

The gradient descent updates for parameter estimation under our method take the form

Given an estimate

Given

To gauge the calibration and discrimination of our model for both weaker and stronger associations, we generated data with different SNP regression weights using a GLMM. For each SNP, we generated SNP-phenotype associations by setting the SNP regression weight

Given a set of model p-values for synthetic data, we define a false positive (FP) to be a SNP that has a significant p-value for some significance level

The GAW14 data were provided by the Collaborative Studies on the Genetics of Alcoholism (U10 AA008401). The GOLDN dataset was provided by the Genetics Of Lipid Lowering Drugs and Diet Network (U01 HL72524). This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from