Conceived and designed the experiments: BMN MAR BFV DA BD SMP KR MJD. Performed the experiments: BMN MAR KR MJD. Analyzed the data: BMN MAR KR MJD. Contributed reagents/materials/analysis tools: BMN MAR MOM SK SMP KR MJD. Wrote the paper: BMN MAR KR MJD. Commented on the manuscript and aided in the development of the method: BFV BD. Commented on the manuscript and aided in the development of the methodological idea: DA. Contributed APOB data: MOM SK.

¶ These authors also contributed equally to this work.

The authors have declared that no competing interests exist.

Technological advances make it possible to use high-throughput sequencing as a primary discovery tool in medical genetics, specifically for assaying rare variation. Still, this approach faces the analytic challenge that the influence of very rare variants can be evaluated effectively only as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for detecting the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.

Developments in sequencing technology now enable us to assay all genetic variation, much of which is extremely rare. We propose to test the distribution of rare variants we observe in cases versus controls. To do so, we present a novel application of the C-alpha statistic to test these rare variants. C-alpha aims to determine whether the set of variants observed in cases and controls is a mixture, such that some of the variants confer risk or protection while others are phenotypically neutral. Risk variants are expected to be more common in cases; protective variants more common in controls. C-alpha is sensitive to this imbalance, regardless of its origin—risk, protective, or both—but is ideally suited for a mixture of protective and risk variants. Variation in APOB nicely illustrates a mixture, in that certain rare variants increase triglyceride levels while others decrease them. The hallmark feature of C-alpha is that it uses the distribution of variation observed in cases and controls to detect the presence of a mixture, thus implicating genes or pathways as risk factors for disease.

High-throughput sequencing of the human genome is now a reality: recent advances in sequencing technology permit near-complete ascertainment of genetic variation, including rare variants (<1% population frequency), across large portions of the genome in thousands of individuals. While this can, in principle, reveal the role of each gene in every medical phenotype, the analytic challenges are profound. Of particular concern are genetically complex common diseases, for which the role of any gene is expected to be quite modest and an individual rare variant would have relatively small impact on the common endpoint. Under this scenario there would be little power when testing one variant at a time, as in traditional association testing. For example, although low-frequency variants in PCSK9 can have a substantial effect on serum low-density lipoprotein cholesterol (LDL-C) in African Americans, such large single-variant effects are likely the exception rather than the rule.

Recently published methods show that power to detect rare risk variation can be greatly enhanced by combining information across variants in a target region, such as a gene or exon, when multiple variants influence phenotype. The “cohort allelic sums test” (CAST) and related burden tests collapse the rare variants in a region into a single per-individual score, testing whether cases carry an excess of rare alleles overall.

Even in a gene harboring phenotypically relevant variation, however, many variants will be phenotypically neutral. Indeed, the target region could include a handful of rare Mendelian mutations that cause disease, some variants that moderately increase or decrease risk, along with numerous variants of no effect. To gain insight into a new model for analysis, it is helpful to think of a coin toss associated with each variant. If the variant is phenotypically neutral, the coin is fair and the variant is as likely to appear in a case as it is in a control. In contrast, risk variants correspond to biased coins and are more likely to be observed in cases. Similarly, protective variants correspond to coins biased in the opposite direction and are more likely to be observed in controls, particularly if the controls are selected. See the figure below for an illustration.

(A) shows the distribution of the outcomes of coin tosses generated using an 80∶20 mixture of neutral coins and biased coins (probability of a head = .9), compared with the outcomes of a series of biased coin tosses (probability of a head = .58); the mixed coin toss (blue) has the same mean bias (p = .58) as the biased coin toss (black). (B) shows the distribution of a 10∶80∶10 mixture of a biased coin (probability of a head = .1), a neutral coin, and a biased coin (probability of a head = .9), compared with the outcomes of a series of neutral coin tosses. In both simulations, coins are selected and flipped 10 times and the resulting numbers of heads, ranging from 0 through 10, are shown. The increased variance of the outcomes in the mixture setting carries information about the presence of some non-neutral coins in the experiment.
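The over-dispersion that separates a mixture from uniformly biased coins is easy to verify directly. The sketch below (function names are illustrative, not from the paper) simulates the 80∶20 mixture of panel (A) against coins with the same mean bias of 0.58: the means of the head counts match, but the mixture's variance is clearly larger.

```python
import random

def head_counts(biases, n_flips=10, rng=None):
    """Flip one coin per bias value n_flips times; return head counts."""
    rng = rng or random.Random(0)
    return [sum(rng.random() < p for _ in range(n_flips)) for p in biases]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(42)
n_coins = 10_000

# 80:20 mixture of neutral (p=.5) and biased (p=.9) coins...
mixture = [0.9 if rng.random() < 0.2 else 0.5 for _ in range(n_coins)]
# ...versus coins that all share the same mean bias (.8*.5 + .2*.9 = .58)
uniform = [0.58] * n_coins

m_mix, v_mix = mean_var(head_counts(mixture, rng=rng))
m_uni, v_uni = mean_var(head_counts(uniform, rng=rng))

# Means agree (~5.8 heads of 10) but the mixture is over-dispersed.
print(f"mixture: mean={m_mix:.2f} var={v_mix:.2f}")
print(f"uniform: mean={m_uni:.2f} var={v_uni:.2f}")
```

Analytically, the uniform coins give variance 10(.58)(.42) ≈ 2.44 while the mixture adds a between-coin term, reaching ≈ 4.7; it is exactly this variance gap that a mean-based (burden) comparison cannot see.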

In (A), 100 high and 100 low extremes of triglyceride levels drawn from the Malmo Diet and Cancer Study – Cardiovascular Arm were sequenced for APOB; the counts of each rare variant in the two extremes are tabulated below.

Position | Annotation | High Lipid Level | Low Lipid Level
21078358 | Ala4481Thr | 2 | 5
21078359 | Ile4314Val | 3 | 0
21078990 | Arg4270Thr | 6 | 3
21079417 | Val4128Met | 1 | 7
21083082 | Thr3388Lys | 2 | 1
21083637 | Ser3203Tyr | 6 | 0
21086035 | Leu2404Ile | 2 | 3
21086072 | Glu2391Asp | 2 | 2
21086127 | Thr2373Asn | 2 | 2
21086308 | Val2313Ile | 2 | 1
21087477 | His1923Arg | 6 | 12
21087504 | Asn1914Ser | 0 | 5
21087634 | Asp1871Asn | 2 | 0
21091828 | Pro1143Ser | 0 | 6
21091872 | Arg1128His | 0 | 3
21091918 | Asp1113His | 1 | 3
21106140 | Thr498Asn | 2 | 0
— | — | 6 | 4

A well-established and powerful test for the presence of a mixture of biased and neutral coins is the C-alpha score test.

To illustrate where the information for C-alpha originates, consider a standard balanced case-control study. If the target region has no alleles associated with the phenotype, then the distribution of counts should follow a binomial distribution, indexed by n, the total number of copies of an observed variant. In the presence of a mixture of risk and protective variants, however, the counts become over-dispersed relative to this binomial expectation, and it is this excess variance that C-alpha detects.

We have tailored the C-alpha score test so that it is suitable for testing a set of rare variants for association. Let y_i denote the number of copies of variant i observed in cases, modeled as binomial(n_i, p_i), where n_i is the total number of copies of variant i in the sample. Under the null hypothesis, p_i = p_0 for every variant, where p_0 is the probability that a copy occurs in a case (1/2 for a balanced design). Under the alternative, the variants form a mixture: some are detrimental (p_i > p_0), some neutral (p_i = p_0), and some protective (p_i < p_0).

The C-alpha test statistic contrasts each variant's squared deviation from its null expectation with the corresponding binomial variance:

T = \sum_{i=1}^{m} \left[ (y_i - n_i p_0)^2 - n_i p_0 (1 - p_0) \right],

where m is the number of variants in the target region. To standardize this quantity we require its variance under the null,

c = \sum_{n=2}^{n_{\max}} m(n) \sum_{u=0}^{n} \left[ (u - n p_0)^2 - n p_0 (1 - p_0) \right]^2 f(u \mid n),

where m(n) is the number of variants observed in exactly n copies and f(u | n) is the binomial probability of u case copies among n. The resulting test statistic is defined as

Z = T / \sqrt{c},

which is asymptotically standard normal under the null hypothesis.
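A minimal implementation can make the statistic concrete. The sketch below (function name is illustrative) computes the score T = Σ[(y_i − n_i p_0)² − n_i p_0(1 − p_0)] and standardizes it by its exact binomial null variance, summing per variant rather than grouping by copy number, which is algebraically equivalent:

```python
import math

def c_alpha(y, n, p0=0.5):
    """C-alpha score test for over-dispersion across a set of variants.

    y[i] -- copies of variant i observed in cases
    n[i] -- total copies of variant i (cases + controls)
    p0   -- null probability a copy falls in a case (1/2 when balanced)

    Returns (T, c, Z): the score, its null variance, and the
    standardized statistic, asymptotically N(0, 1) under the null.
    """
    # Score: observed squared deviation minus its binomial expectation.
    T = sum((yi - ni * p0) ** 2 - ni * p0 * (1 - p0)
            for yi, ni in zip(y, n))
    # Null variance: for each variant, sum the squared score over all
    # possible case counts u, weighted by the binomial probability.
    c = 0.0
    for ni in n:
        for u in range(ni + 1):
            f = math.comb(ni, u) * p0 ** u * (1 - p0) ** (ni - u)
            c += ((u - ni * p0) ** 2 - ni * p0 * (1 - p0)) ** 2 * f
    return T, c, T / math.sqrt(c)
```

For example, counts split as extremely as 5/0 and 6/0 across cases and controls drive T, and hence Z, well above zero, while counts near the balanced expectation pull T slightly negative.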

Let's examine data from two genes; in one of them, C-alpha detects association at p < 10^{−6} from individual-level data.

It is additionally possible to estimate the underlying mixture model from the distribution of variants used for C-alpha testing. For instance, if the target region includes risk variants, phenotypically neutral variants, and variants that engender a modest protective effect, then a 3-component mixture will fit the data, corresponding to the 3 genetic components. Whenever C-alpha shows significance, we can estimate the number of mixture components and the posterior probability that a particular variant is detrimental or protective using the EM algorithm (see the supplementary methods).
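Such a fit can be sketched with a generic 3-component binomial mixture EM; the initialization, iteration count, and function name below are assumptions for illustration, not the paper's exact procedure:

```python
import math

def em_binomial_mixture(y, n, p_init=(0.1, 0.5, 0.9), n_iter=200):
    """EM for a K-component binomial mixture over variant counts.

    y[i]/n[i] -- case copies / total copies of variant i.
    Returns (weights, probs, resp), where resp[i][k] is the posterior
    probability that variant i belongs to component k (protective,
    neutral, risk for the 3-component default).
    """
    K = len(p_init)
    w = [1.0 / K] * K
    p = list(p_init)
    for _ in range(n_iter):
        # E-step: posterior component memberships for each variant.
        resp = []
        for yi, ni in zip(y, n):
            lik = [w[k] * math.comb(ni, yi) * p[k] ** yi
                   * (1 - p[k]) ** (ni - yi) for k in range(K)]
            s = sum(lik)
            resp.append([l / s for l in lik])
        # M-step: update mixing weights and success probabilities.
        for k in range(K):
            rk = sum(r[k] for r in resp)
            w[k] = rk / len(y)
            num = sum(r[k] * yi for r, yi in zip(resp, y))
            den = sum(r[k] * ni for r, ni in zip(resp, n))
            p[k] = min(max(num / den, 1e-6), 1 - 1e-6)  # keep p in (0,1)
    return w, p, resp
```

On simulated counts with clear risk (y ≈ n), protective (y ≈ 0), and neutral (y ≈ n/2) variants, the fitted components separate and each variant's posterior points to the appropriate component.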

A variant observed only once provides no direct information about over-dispersion; however, the distribution of singletons as a group reflects on the question of association between the target region and phenotype. Singletons can be pooled into a single binomial count that can be included in the C-alpha test. This treatment of singletons, which is essentially identical to a burden test (see the supplementary methods), allows them to contribute to the analysis rather than being discarded.

We conduct two main sets of simulations to compare C-alpha with the Li and Leal burden tests and Madsen and Browning's test. Tests proposed by Li and Leal are all built around a regression model, predicting phenotype based on a recoding of rare variation. For Li and Leal's approach, we sum the number of rare variants in the region for each individual as the predictor. For Madsen and Browning, the coding of variation is similar to the sum of rare variants, but a weighting scheme based on the inverse of the control allele frequency is included. The test statistic is evaluated as a nonparametric rank-sum test in which each individual is scored by a weighted sum of the mutation counts. We also include a variable threshold model, which is an implementation of a burden test that selects the threshold for inclusion of variants by optimizing the test statistic. Specifically, this burden test is calculated at all allele frequency cutoffs. The test statistic is then defined by the maximum of the test statistics over all cutoffs. The distribution of the test statistic is obtained empirically by random permutation of case/control status and recalculation of the test statistic. This approach is described in Price et al.
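The per-individual codings these comparison tests rely on can be sketched as follows; this is a simplified illustration (the smoothing constant, normalization, and function names are assumptions, following the spirit rather than the exact formulas of the original papers):

```python
import math

def mb_weights(control_genotypes):
    """Madsen-Browning-style weights estimated from controls: rarer
    variants get larger weights (inverse binomial SD of allele counts).
    control_genotypes[i][j] = minor-allele count (0/1/2) of control i
    at variant j."""
    n = len(control_genotypes)
    weights = []
    for j in range(len(control_genotypes[0])):
        m = sum(g[j] for g in control_genotypes)
        q = (m + 1) / (2 * n + 2)  # smoothed control allele frequency
        weights.append(1.0 / math.sqrt(n * q * (1 - q)))
    return weights

def burden_scores(genotypes, weights=None):
    """Per-individual burden score: unweighted sum of rare-allele
    counts (Li-Leal count coding) or a weighted sum (Madsen-Browning
    style)."""
    if weights is None:
        return [sum(g) for g in genotypes]
    return [sum(w * x for w, x in zip(weights, g)) for g in genotypes]
```

These scores then enter a regression on phenotype (Li-Leal) or a rank-sum comparison of cases versus controls (Madsen-Browning); the variable-threshold test recomputes such a score at every allele-frequency cutoff, takes the maximum statistic, and assesses significance by permutation.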

We first present a set of simulations for a population genetics model that incorporates purifying selection, with selection coefficients on the order of 10^{−3}. Two genes were simulated, and the mean liability value for an individual differed as a function of the number of rare alleles occurring at these genes.

For the second set of risk simulations, we assume a model with a disease prevalence of 1% and 50 sites in the region of analysis with allele frequency between 0.025% and 0.5% (this distribution is similar to that observed in the Crohn's data). These sites are variable in the population, but may be invariant in any given simulation, as they are probabilistically assigned for each member of the sample. Of these 50 variants, 6 are chosen at random for each simulation to affect the phenotype (regardless of whether they are present in the dataset), and each variant explains 0.1% of the variance of the disease under a liability threshold model (i.e., 2p(1−p)a^{2} = 0.001, where p is the risk allele frequency and a is the effect on mean liability).
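Fixing the variance explained pins down the per-allele effect as a function of frequency, a = sqrt(0.001 / (2p(1−p))); a quick check (helper name is hypothetical) shows why rarer variants must carry larger effects under this model:

```python
import math

def liability_effect(p, var_explained=0.001):
    """Per-allele shift in mean liability (in SD units) implied by
    2p(1-p)a^2 = var_explained for a variant of frequency p."""
    return math.sqrt(var_explained / (2 * p * (1 - p)))

# Across the simulated frequency range (0.025%-0.5%), the rarest
# variants need roughly 4x the per-allele effect of the most common:
print(round(liability_effect(0.00025), 2))  # ~1.41 SD
print(round(liability_effect(0.005), 2))    # ~0.32 SD
```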

Three scenarios are explored: all 6 variants confer risk; 3 variants confer risk and 3 variants confer protection; and all 6 variants confer protection. We also consider two different study designs: 1,000 cases (individuals who exceed the threshold on the liability distribution) and 1,000 controls (individuals selected for absence of disease); and 1,000 cases and 1,000 selected controls drawn from the lower 1% of the liability distribution.

A third set of simulations explores behavior of test statistics under the null hypothesis to determine whether the type I error rate is well calibrated for C-alpha. Like many tests, C-alpha relies on asymptotic properties consistent with the central limit theorem (CLT). Specifically, the test statistic converges to a normal distribution under the null hypothesis as the number of variants tends toward infinity, with convergence being potentially faster if the frequency of all variants is similar. Thus, we varied both the number of variants and the distribution of allele frequencies to explore type I error. For these simulations, we drew N variants from the empirically observed allele frequency distribution used in the second set of simulations above and performed 25,000 replicates for each value of N.

From the simulation results evaluating power, C-alpha retains power across the mixed risk/protective scenarios in which the burden tests decline, while remaining competitive when all effects act in a single direction.

(A) shows power comparisons for the population genetics model simulations. Power comparisons are for C-alpha, Madsen-Browning (MB), Variable threshold (VT), and Li-Leal's approach (presence/absence, Li-Leal_p, and count of rare variants, Li-Leal_c). These simulations reflect the presence of selection on the variation that predisposes to phenotype. As we increase the mixing proportion between risk and protective variants (moving from mixtures 1 to 6, which reflect a 0, 10, 20, 30, 40, and 50% chance that any given phenotypically relevant variant is protective rather than risk-conferring), C-alpha maintains power, while the other tests lose power. In (B), each of the 6 variants explains 0.1% of the variance of the phenotype. All approaches have high power when all the effects are detrimental. For burden tests, the power drops markedly when 3 variants are protective and 3 are detrimental. “Selected” controls are chosen from the lower 1% of the liability distribution. The solid (dashed) lines represent power for selected (unselected) controls.

Simulation results for the null calibration are summarized in the table below: the type I error of C-alpha is well calibrated when variants share similar frequencies ("Limited"), with some inflation at stringent α levels under the empirical allele frequency distribution ("Distribution") that diminishes as the number of variants grows.

# of variants | α = 0.1, Limited | α = 0.1, Distribution | α = 0.05, Limited | α = 0.05, Distribution | α = 0.01, Limited | α = 0.01, Distribution
5 | 0.118 | 0.105 | 0.049 | 0.071 | 0.012 | 0.033
10 | 0.132 | 0.111 | 0.066 | 0.071 | 0.011 | 0.028
20 | 0.087 | 0.110 | 0.048 | 0.064 | 0.012 | 0.026
50 | 0.105 | 0.110 | 0.045 | 0.064 | 0.012 | 0.020
100 | 0.090 | 0.104 | 0.0508 | 0.059 | 0.011 | 0.017

We have demonstrated here the adaptation of the C-alpha test statistic and its broad applicability to medical sequence data at the gene or pathway level. The approach, distinct from more traditional burden testing, has several advantages over previously proposed test statistics. Its primary advantage is sensitivity to risk and protective variants in the same gene or pathway. Yet, even if the effects of rare alleles are uniformly in one direction, such as increasing risk, C-alpha maintains power comparable to burden tests. Grouping genes together into pathways and testing rare variants falling into these groups of genes could provide greater statistical power and biological insight into the functionally relevant processes affecting the phenotype of interest. In such groups the presence of both risk and protective variants is even more likely. As demonstrated here, the C-alpha test is well calibrated to incorporate such divergent effects on risk. Moreover, because it is a single-degree-of-freedom test with normal asymptotic properties, C-alpha enhances power and allows for rapid and straightforward calculation.

As with other burden-style tests, C-alpha is designed for situations in which numerous rare variants are observed in the target region. We recommend permutation testing for accurate significance estimation in scenarios where the asymptotic behavior is not assured - in particular, for small numbers of variants, when LD is present, or when the test is driven by a single more common variant. In this latter scenario, as in the example of NOD2, it will likely be desirable to analyze the more common variant with Fisher's exact test and to reanalyze the remaining rarer variation to search for additional signal.
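The recommended permutation scheme can be sketched generically; in the code below, `stat_fn` stands in for any variant-set statistic (e.g., the standardized C-alpha statistic), and the function name and add-one correction are illustrative choices:

```python
import random

def permutation_pvalue(stat_fn, case_geno, control_geno,
                       n_perm=1000, seed=0):
    """Empirical two-sided p-value for a variant-set statistic,
    obtained by shuffling case/control labels. Intended for settings
    where the normal approximation is not trusted: few variants, LD,
    or a single more common variant driving the test."""
    rng = random.Random(seed)
    pooled = case_geno + control_geno
    n_cases = len(case_geno)
    observed = abs(stat_fn(pooled[:n_cases], pooled[n_cases:]))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign case/control labels
        if abs(stat_fn(pooled[:n_cases], pooled[n_cases:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

Because the labels are permuted while the genotypes (and hence LD structure) are held fixed, the resulting null distribution is valid even when the asymptotic normality of the statistic is not.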

While singleton variants can be combined in C-alpha, it seems reasonable to consider a distinct analysis of singletons in a more biologically motivated fashion. For example, when a truly rare and fully penetrant mutational origin is suspected, one might focus on variants seen only in cases and not seen in external reference data such as the 1000 Genomes Project. One might then further filter the identified variants to leave only putatively deleterious non-synonymous, splice, and obvious loss-of-function variants, and then compare their rate to the parallel set found only in controls. Such a Mendelian-style analysis (already successfully applied in several cases, such as Miller syndrome) would complement the C-alpha analysis of the remaining variation.

As with any statistical test, the presence of confounders can seriously bias the results. For instance, we have demonstrated that population stratification has a significant impact on rare variation tests. C-alpha assumes cases and controls have the same balance of ancestry, because the distribution of rare variants likely depends on ancestry; unequal representation could otherwise increase the rate of false positives for any rare-variant test. It is important to note that the balanced sampling assumption is not equivalent to requiring that samples be homogeneous. Balanced sampling requires solid experimental design. While statistical methods for controlling for bias can be effective, proper study design is far preferable. If a large set of genotypes were available, such as from a genome-wide association (GWA) study, principal component analyses and related procedures can be used to match or control for ancestry.

More generally, the choice of target region and set of variation affects power. A test based on all potentially functional variants (non-synonymous, splice, nonsense, etc.) within a single gene will often be effective, as it strikes a balance between sufficient numbers of variants and the enrichment of functional variation. In contrast, a single exon is not likely to have enough variants for the test to be powered adequately or to achieve asymptotic properties, and, while including introns and synonymous variants would dramatically increase the number of variants, the expectation is that the vast majority of these variants will be phenotypically neutral and thus will mute the signal. Alternatively, a group of exons from related genes (e.g., a biological pathway) could be analyzed jointly. The test would then determine whether some unspecified variants in the pathway are associated with the phenotype via a deviation from the expected distribution of variation. Using this strategy, if several of the genes have an effect, then the power will be enhanced; however, if only one gene in the pathway has an effect, including the other genes in the test will reduce power to detect the effect. For target regions that do show evidence of association, using a nonparametric mixture model we can estimate the distribution of the p_i, characterizing the relative contributions of risk, neutral, and protective variants.

Selecting a subset of variants, as suggested above, is just one form of weighting variants in the analysis. C-alpha allows for valid weights to be incorporated into the calculation of the test statistic (see the supplementary methods).

Sequencing technology will continue to develop and reduce in cost for the foreseeable future. Large medically-focused sequencing efforts involving thousands of exomes or whole genomes are now underway and are introducing a host of novel computational challenges not encountered in GWAS and previous large-scale medical genetics studies. By enabling powerful analyses of genes and pathways, without concern for effect direction, C-alpha promises to be a flexible and powerful approach for the identification of functionally relevant regions from experiments involving deep sequencing.

The four panels show the distribution of the mean and 2*variance of mixtures of binomial distributions under the following set of simulation scenarios. For each set we start with 50 random draws of size 2 from a binary variable with equal probability of each outcome, plus 10 "spiked-in variants", and compare that to 60 random draws of size 2. The spiked-in variants are either all protective (10 0/2's), all risk (10 2/0's), or a mixture, as indicated in the title of each panel. The distributions of the mean and 2*variance are shown in each panel, with black and blue representing the mean and 2*variance of the null simulations and red and yellow representing the mean and 2*variance of the simulations with spiked-in draws. Including a subset of variants with a protective and/or detrimental effect increases the variance of the overall data in a way that is not captured by a shift in the mean number of alleles in cases.


(A) Distribution of p-values under the null hypothesis of no disease association. The distribution of 1,000 p-values under the null hypothesis is consistent with a uniform distribution. The simulations were performed using 1,000 case versus 1,000 control individuals. (B) Distribution of p-values evaluated over exons in a pooled sequencing experiment. We artificially induce inflation in the overall test statistic by including an African American pool, which differs in allele distribution.


The relationship between the strength of the effect and the population minor allele frequency of the locus when the variance explained is fixed for all loci. The rarer the variant, the stronger its effect on phenotype.


We simulate mixture outcomes for each of the three fixed mixture components: protective (red diamonds), risk (green triangles), and neutral (blue triangles). We demonstrate that the Expectation-Maximization algorithm, outlined in the supplementary methods, recovers these mixture components from the simulated data.


Supplementary methods.


We thank Eric Lander, Nick Patterson, and Mark DePristo for useful feedback and comments on the manuscript. We thank Shamil Sunyaev, Alkes Price, and Grigory Kryukov for sharing and discussing population genetic simulations. We thank Helen Hobbs and Jonathan Cohen for PCSK9 sequencing data. We thank Christine Stevens, Candace Guiducci, Noel Burtt, Stacey Gabriel, and the Genome Sequencing Analysis Platform at the Broad Institute for their work on pooled sequencing studies from which illustrative examples were shown. We thank Su Chu for preparing the graphics.