DOT: Gene-set analysis by combining decorrelated association statistics

Historically, the majority of statistical association methods have been designed assuming availability of SNP-level information. However, modern genetic and sequencing data present new challenges to access and sharing of genotype-phenotype datasets, including cost of management, difficulties in consolidation of records across research groups, etc. These issues make methods based on SNP-level summary statistics particularly appealing. The most common form of combining statistics is a sum of SNP-level squared scores, possibly weighted, as in burden tests for rare variants. The overall significance of the resulting statistic is evaluated using its distribution under the null hypothesis. Here, we demonstrate that this basic approach can be substantially improved by decorrelating scores prior to their addition, resulting in remarkable power gains in situations that are most commonly encountered in practice; namely, under heterogeneity of effect sizes and diversity between pairwise LD. In these situations, the power of the traditional test, based on the added squared scores, quickly reaches a ceiling, as the number of variants increases. Thus, the traditional approach does not benefit from information potentially contained in any additional SNPs, while our decorrelation by orthogonal transformation (DOT) method yields steady gain in power. We present theoretical and computational analyses of both approaches, and reveal causes behind sometimes dramatic difference in their respective powers. We showcase DOT by analyzing breast cancer and cleft lip data, in which our method strengthened levels of previously reported associations and implied the possibility of multiple new alleles that jointly confer disease risk.


Introduction
During the recent years, genome-wide association studies (GWAS) uncovered a wealth of genetic susceptibility variants.The emergence of new statistical approaches for the analysis of GWAS have largely contributed to that success.The majority of these methods require access to individual-level data, yet methods that require only summary statistics have been developed as well.[3] Many types of association tests, including those originally developed for individual-level records, can be presented in terms of added summary statistics.For example, gene set analysis (GSA) tests or burden and overdispersion tests for rare variants, 2,4,5 can be written as a weighted sum of summary statistics.In GSA applications, methods based on combined summary statistics can be used to efficiently aggregate information across many potentially associated variants within individual genes, as well as over several genes that may represent a common etiological pathway.
When within-gene association statistics (or equivalently, P-values) are being combined, linkage disequilibrium (LD) needs to be accounted for, because LD induces correlation among statistics.
The correlation among association test statistics for individual SNPs without covariates is the same as the correlation between alleles at the corresponding SNPs.This fact allows one to model a set of statistics using a multivariate normal (MVN) distribution with the correlation matrix equal to the matrix of LD correlations.More generally, in the presence of covariates correlated with SNPs, MVN correlations among association statistics will depend not only on LD but also on other covariates in the model. 6,7en SNPs are coded as 0,1,2 values, reflecting the number of copies of the minor allele, the LD matrix of correlations can be obtained from SNP data as the sample correlation matrix.It can also be directly estimated from haplotype frequencies whenever those are available or reported.Specifically, the LD (i.e., the covariance between alleles i and j; D ij ) is defined by the difference between the di-locus haplotype frequency, P ij , and the product of the frequencies of two alleles, D ij = P ij − p i p j .Then, the correlation between a pair of SNPs is defined as r ij = D ij p i (1−p i )p j (1−p j ) .The di-locus P ij frequency is defined as the sum of frequencies of those haplotypes that carry both of the minor alleles for SNPs i and j.Similarly, p i allele frequency is the sum of haplotype frequencies that carry the minor allele of SNP i.
It is important to distinguish situations, in which the LD matrix is estimated using the same data that was used to compute the association statistics from those, in which the estimated LD matrix is obtained based on a suitable population reference panel.The reference panel approach is implemented in popular web-based association analysis platforms, such as "VEGAS" 8 or "Pascal." 9 Based on a user-provided list of L SNPs, with the corresponding association P-values, VEGAS queries an online reference panel resource to obtain the matrix of LD correlations.P-values are then transformed to normal scores P i → Z i , i = 1, . . ., L, and vector Z is assumed to follow zeromean MVN distribution under the null hypothesis of no association.The individual statistics in VEGAS are then combined as TQ = L i=1 Z 2 i , (where TQ stands for "Test by Quadratic form") and the overall SNP-set P-value is derived empirically by simulating a large number (j = 1, . . ., B) of zero-mean MVN vectors, adding their squared values to obtain statistics TQ (j) and computing the proportion of times when TQ (j) > TQ.The statistics similar to TQ are ubiquitous and appear in many proposed tests that aggregate association signals within a genetic region.
As exemplified by VEGAS, the distribution of TQ must explicitly incorporate LD.However, an alternative approach that implicitly incorporates LD can be based on first decorrelating the association summary statistics, and then exploiting the resulting independence to evaluate the distribution of the sum of decorrelated statistics, which we call Decorrelation by Orthogonal Transformation (DOT).This general idea is straightforward and have been used in many contexts.For instance, Zaykin et al. suggested a variation of this approach for combining P-values (or summary statistics) but have not studied power properties of the method in detail. 10re, we propose a new decorrelation-based method.We derive theoretical properties of our method and explore asymptotic power of both DOT and TQ type of statistics.To the best of our knowledge, we are the first ones to derive the asymptotic distributions of DOT and TQ under the alternative hypothesis.Our results show that decorrelation can provide surprisingly large power boost in biologically realistic scenarios.However, high statistical power is not the only advantage of the proposed framework.Once statistics are decorrelated, one can tap into a wealth of powerful methods developed for combining independent statistics.][12][13][14][15] Our theoretical analyses also reveal an unexpected result, showing that in many practical settings tests based on the statistic TQ do not gain power with the increase in L (assuming the same pattern of effect sizes for different values of L), while the proposed method steadily gains power under the same conditions.Specifically, the proposed decorrelation method gains power when the effect sizes and/or pairwise LD values become increasingly more heterogeneous.The reasons behind the respective behaviors of tests based on TQ and DOT are explored here theoretically and confirmed via simulations.We further derive power approximations that are useful for understanding power properties of the studied methods.
To showcase our method, we evaluate associations between breast cancer susceptibility and SNPs in estrogen receptor alpha (ESR1 ), fibroblast growth factor receptor 2 (FGFR2 ), RAD51 homolog B (RAD51B ), and TOX high mobility group box family member 3 (TOX3 ) genes, without access to raw genotype data.We first test for a joint association between SNPs in those four genes and breast cancer risk by decorrelating summary statistics based on the overall LD gene structure.
We then describe how to follow up on the joint association results and identify one or more SNPs that drive joint association with disease risk.Our study confirms previous associations and reveals new associations, suggesting new potential breast cancer SNP markers.

Material and Methods
Genetic association tests based on summary statistics are often presented as a weighted sum. 2,4Let w i denote the weight assigned to individual statistic.The weighted statistics can then be defined as The statistics Y 2 i are marginally distributed as one degree of freedom chi-square variables with noncentralities µ 2 i .The overall statistic is then typically defined as TQ = L i=1 Y 2 i .

Joint distribution of association summary statistics
In this section, we derive parameters µ and Σ of the joint MVN distribution of summary statistics.
Under the null hypothesis, when none of the SNPs are associated with an outcome, µ = 0.If individual SNP models do not include covariates, Σ Z equals the LD matrix, i.e., the correlation matrix between the SNP values coded as 0, 1, or 2, reflecting the number of minor alleles in a genotype.In the presence of covariates, Σ Z is a Schur complement of the submatrix of the matrix of all predictor variables. 6That is, the estimated correlation between association statistics ΣZ can be obtained by inverting the covariance or correlation matrix of all predictors, selecting the SNP submatrix, inverting it back, and standardizing the result to correlation.
Under the alternative hypothesis, when some SNPs are associated with a trait y, let β j be the regression coefficient for the j-th SNP.Then, a typical linear model that determines the trait value is defined as: where ∼ N (0, 1).The mean value of the summary statistics (i.e., noncentralities) can be expressed as: where Σ j is the j-th column of Σ, b j = cor(y, SNP j ) and N is the sample size.We note that Eq.
(2) is valid outside of the linear model settings.For example, consider a latent variable model, where the continuous unobserved (latent) variable y l is linear in predictors according to Eq. (1), and the observed variable (disease status) is y = 1 whenever y l > l and y = 0 otherwise.When such binary outcome is analyzed by logistic regression, a good approximation to the noncentrality values will be: If error terms are assumed to be normally distributed, the reduction in correlation due to dichotomization by the factor d can be expressed as d = φ(l)/ Φ(l)(1 − Φ(l)), where φ(•), Φ(•) are the probability and the cumulative densities of the standard normal distribution. 16der association, surprisingly, the correlation matrix between statistics is no longer Σ.Let σ ij be the i, j-th element of Σ.By using the multivariate delta method, we derived the i, j-th element of the correlation matrix R as follows: Note that when some of SNP pairs (i, j) are associated, summary statistics may become correlated even if there is no LD between the SNPs, due to the last term, −b i b j , in Eq. (5).Equations (2), ( 3), ( 4), ( 5) allow one to study power properties of the methods based on sums of association statistics, as well as to design realistic simulation experiments, where summary statistics can be sampled directly from the MVN distribution under the alternative hypothesis.That is, given effect sizes and the correlation matrix among predictors, statistics can be immediately sampled from the MVN(µ, R) distribution.This approach avoids both the data-generating step and the subsequent computation of summary statistics from that data, leading to a substantial gain in computation time.In certain situations, the difference in speed can be dramatic.For example, it is not trivial to simulate discrete (genotype) data given a specific LD matrix.Current state of the art methods tend to be slow, because they rely on ad hoc iterative techniques, such as generation of multiple random "proposal" data sets to fit the target correlation matrix. 17sults of simulation experiments presented here were performed based on effect sizes specified via the linear model (Eq.1).However, we verified (not presented here) the validity of the proposed theory assuming logistic, probit, and Poisson regression models.We also note that Conneely et al.
presented theoretical arguments supporting the validity of the MVN joint distribution of summary statistics under no association for a broad class of generalized regression models. 6

Distribution of sums of association summary statistics
As we noted at the beginning of the "Methods" section, weighted sums of summary statistics can be re-expressed as unweighted sums, where the mean and the correlation parameters are modified to absorb the weights.The distribution of L i=1 Y 2 i follows the weighted sum of independent one degree of freedom non-central chi-square random variables.Although this result is standard, the components of this weighted sum depend on the joint distribution of association summary statistics under the alternative hypothesis, and this distribution has not been previously derived.
In the previous section, we provide the components of µ and R that determine the weights and the noncentralities of chi-squares.Therefore, where the weights, λ, are the eigenvalues of R and γ is the vector of non-centrality parameters.
The columns of the matrix E are orthogonalized and normalized eigenvectors of R. The P-value for the statistic TQ = Y Y is obtained by setting µ to zero and then calculating this tail probability at the observed value TQ = t.Note that the elements in R, and therefore the eigenvalues λ i , explicitly depend on the β-coefficients through Eqs. ( 2) and ( 5).
Our decorrelation approach uses a symmetric orthogonal transformation of the vector of statistics Y to a new vector X, with the new joint statistic based on the sum of elements of X, The cumulative distribution of the new test statistic is thus, There are many ways to choose an orthogonal transformation, but a valid one for our purposes needs to have the following "invariance to order" property.Suppose we sample an equicorrelated MVN vector Y with a common correlation ρ for all pairs of variables.Before decorrelating the vector, we permute its values to a different order.A permutation in this example is a legitimate operation, because an equicorrelation structure does not suggest a particular order of Y values.
After an orthogonal transformation of Y to X, the order of X entries may change due to permutation but their values should remain the same.Moreover, for the method to be useful in practice, we need the invariance to hold for a more general class of statistics than a simple sum of chi-squares, L i=1 X 2 i .For example, the Rank Truncated Product (RTP) is a powerful P-value combination method 11 that emphasizes small P-values: the RTP statistic T RTP is the product of the k smallest P-values, k < L, or equivalently, where is no longer a one degree of freedom chi-square variable.
The "invariance to order" requirement implies that the value of DOT-statistic should not change due to a permutation of (equicorrelated) values in Y.Not all orthogonal transformations meet the invariance to order criteria.It can be easily verified that neither the inverse Cholesky factor (C −1 ) transformation, X = C −1 Y, nor another commonly used transformation X = E 1 √ λ I Y, have the invariance to order property, except in the special case of the sum of L chi-squared variables L i=1 X 2 i .To clarify, we call this statistic "the special case," because, for example, in the case of RTP with k = L, the statistic L i=1 − ln(P i ) is no longer the sum of one degree of freedom chi-squares.Moreover, some transformations of equicorrelated data to independence, such as the Helmert transformation, may change values of X depending on the order of values in Y, even in a special equicorrelation case of ρ = 0 (i.e., when variables in Y are independent).The proposed H, as defined above, has both the invariance to order property and can be used with P-value transformations other than that to the one degree of freedom chi-square.

Theoretical analysis of power
For exploration of power properties, it is useful to first consider the equicorrelation case, because in this case it is possible to derive illustrative equations that relate power to: (1) the number of SNPs, L; (2) the common correlation value for every pair of SNPs, ρ; and (3) the mean values of association statistics, µ.In the equicorrelation case, the correlation matrix can be expressed as For decorrelated statistic DOT, we derived a simple form of L noncentralities by utilizing the Helmert orthogonal eigenvectors 18,19 as follows: where μ is the average of the values in µ.Next, let where d is the average of , over all pairs of µ i and µ j , such that i < j.The values in d ij are the pairwise squared differences in the standardized effect values as captured by the vector µ.This representation yields the noncentrality of DOT as a function of the common correlation and the mean standardized effect size as: Note that as L increases, the first term in Eq. ( 13) approaches μ2 /ρ, while the sum of the remaining noncentralities, δ s , increases linearly with L, as long as the average of the squared effect size differences, d, does not depend on L. Thus, the noncentrality of the decorrelated statistic DOT is expected to steadily increase with L and become approximately μ2 /ρ where γ i 's are the noncentralities for TQ and δ i 's are the noncentralities of DOT.In the equicorrelation case, the distribution TQ reduces to the weighted sum of two chi-square variables, because there are only two distinct eigenvalues that correspond to R ρ , namely: The term 15) approaches the constant d(1−ρ) 2ρ 2 + 1−ρ ρ as L increases.Therefore, under the null hypothesis, the distribution of the quadratic form Y Y can be well approximated by the location-scale transformation of the one degree of freedom chi-squared random variable: where χ 2 α is 1 − α quantile of the one degree of freedom chi-square distribution.
To summarize, we just showed that the distribution of the decorrelated set of variables gains in the total noncentrality with L, while the distribution of the sum Y Y depends heavily only on the noncentrality of the first term, γ 1 .The approximate power of the test based on the statistic TQ = Y Y can be computed as: where Matrix B R provides a way to construct random correlation matrices in a controlled manner, where the degree of departure from the equicorrelation is controlled via the range of the elements in U.
The utility of B R is that it represents a perturbation of R ρ , and we expect our power results under equicorrelation case to hold approximately, at least for small jiggles around ρ. Nevertheless, it turns out that even for a more general correlation structure, our power approximations still hold, which we show via extensive simulation studies.

LD patterns from the 1000 Genome Project
In a separate set of simulation experiments, we utilized realistic LD patterns using data from the 1000 Genomes Project. 20For every simulation experiment, we selected a random set of consecutive SNPs from a chromosome 17 region, that was spanning over 100 Kb and included SNPs from the gene FGF11 to the gene NDEL1.Each stretch of consecutive SNPs contained from 10 to 200 SNPs with the minimum allele frequency 0.025.A random portion of SNPs in every set carried no effect on the outcome on its own, and we considered these SNPs to be "proxies" for causal variants due to LD.The median LD correlation varied from approximately -0.6 to 0.98 between random stretches of SNPs.The number of proxy SNPs varied from 3 to 197 across simulations.The sample size was also set to be random and varied from 500 to 3000 across simulations.Effect sizes for causal variants were modeled by β-coefficients, as given by Eq. ( 1), and drawn randomly from the interval [-0.4,0.4].Summary statistics were sampled from the MVN distribution with parameters given by Eqs. ( 2), (4).To check the validity of our approach of sampling the summary statistics directly, we first conducted a separate set of extensive simulation experiments, in which power and type-I error rates were obtained by simulating individual data and then TQ and DOT statistics were computed by running the actual regression analysis.We confirmed excellent agreement between the two approaches, thus most of the subsequent simulations were conducted by sampling the summary statistics directly (these results are not shown here).

Results
We conducted simulation experiments to study statistical power of the proposed method based on the decorrelation statistic DOT, and to compare it to the statistic TQ.We also included a recently proposed method "ACAT" by Liu and colleagues, 21 where association P-values for individual SNPs are transformed to Cauchy-distributed random variables, then added up to obtain the overall P-value.ACAT was included into comparisons because it has robust power across different models of association.3][24][25] A distinctive feature of ACAT is its good type-I error control in the presence of correlation between P-values, which, interestingly, improves as the α-level becomes smaller, due to its usage of transformation to a moment-free Cauchy distribution.
We used two distinct scenarios in our simulation experiments: 1. First, we assumed that the summary statistics and the sample correlation matrix among statistics are estimated from the same data set.This allowed us to validate power properties derived in "Methods." 2. Second, we assumed that the sample correlation LD matrix was obtained from external reference panel.We included this scenario into our simulations due to the concern that the type-I error rate of the methods considered here may be inflated if the correlation matrix is computed based on a separate data set.
Simulations assuming that the LD matrix and the summary statistics are obtained from the same data To compare methods with and without decorrelation of statistics, we considered several distinct settings.
Setting 1.The decorrelation method (DOT) is expected to gain power as the number of SNPs increases in scenarios where effect sizes vary markedly from SNP to SNP.However, if effect sizes for all SNPs are in fact very close to each other, the power of DOT may decrease.To illustrate this property, our first, and purposely contrived simulation setup is where the effect sizes were all non-zero but very close to each other in their magnitude, varying uniformly from 2.3 to 2.4 (these are the values of the means of normally distributed standardized statistics).Table 1 shows the results of the simulations study under this setting, in which the decorrelation method was deliberately set up to fail.In the table, the columns labeled "Theoretic."provide power calculated based on the distribution of the test statistics under the alternative hypothesis that we derived above.The columns labeled "Empiric."provide results based on the empirical evaluation of power by computing P-values under the null.The columns labeled "Approx."provide power calculated based on the Eq. ( 17).The column labeled γ provide the average noncentrality value.
The table illustrates that our analytical calculations under the alternative hypothesis are correct.That is, the empirical power of both TQ and DOT statistics matches nearly exactly the analytical calculations.The approximation based on Eq. ( 17) apparently works well as well, emphasizing the fact that the distribution of the TQ statistic can be well approximated by a one-degree of freedom chi-square distribution.
Further, the table confirms that the decorrelation method is under-performing relative to TQ if there is very little heterogeneity among SNP effect sizes.Nonetheless, this scenario is admittedly unrealistic in practice.Furthermore, the table also illustrates that as the average non-centrality value increases, the power of DOT increases as well, while the power of TQ is relatively constant and about 80%.Finally, Table 1 shows that the power of TQ (although higher than that of DOT) does not change with L, highlighting the ceiling property of this method and the fact that combining more SNPs would not lead to higher power of TQ.
Setting 2. One of the features of the decorrelation method is that it benefits from heterogeneity in pairwise LD.To illustrate this property, we added jiggle to the equicorrelation matrix as described in the "Methods" section, while keeping the effect size vector the same as in Setting 1 (within the range of 2.3 to 2.4).In this second set of simulations, uniformly distributed perturbations (in the range 0 to 5) were added through U, which made the pairwise correlations range from 0.14 to 0.98.
Table 2 summarizes the results and once again, illustrates the ceiling feature of TQ power.
However, the power of the statistic DOT now starts to climb up with L and the proposed test based on DOT eventually becomes more powerful than the one based on TQ.Moreover, note that although the average noncentrality value does not increase with L, the DOT-test still gains power with L !Setting 3.This setting is analogous to the equicorrelation scenario in Setting 1, except that the mean values were lowered: in Setting 1 the range in µ was 2.3 to 2.4, while here, the range was set to vary uniformly between 1 and 2.3.Thus, the maximum effect size was lower than that in the previous simulations but the heterogeneity among effect sizes was higher.We emphasize again that while the equicorrelation assumption is unrealistic, it serves as a very useful benchmark scenario that highlights power behavior and features of the statistics TQ and DOT and allows one to introduce departures from equicorrelation in a controlled manner.
Table 3 presents the results.The "Approx." column in this table was removed and replaced by power values based on a "P-value"-approximation to the distribution of TQ as in Eq. ( 16).
This switch highlights the idea that both the power and the P-value for the TQ test can be reliably estimated based on the one degree of freedom chi-squared approximation.Importantly, Table 3 demonstrates that the power of the DOT-test reaches 100% as L increases (despite the fact that effect sizes were lower than in the previous settings), while the power of the TQ-test stays in the range 51.2 to 52.5%, Setting 4. This setting is similar to the scenario in Setting 2, except that we allowed higher heterogeneity in pair-wise LD values.LD was constructed as perturbation of R ρ=0.7 + UU (as described in Methods), with U set to be a random sequence on the interval from -5 to 5.This resulted in LD values ranging from -0.93 to 0.99.The effect sizes were sampled randomly within each simulation from (-0.15, 0.15) interval.
Table 4 presents the results and shows that in this setting, the power of DOT is dramatically higher than that of TQ and ACAT.In fact, power values for the TQ and ACAT tests barely exceed the type-I error, while the power of the decorrelation method steadily increases with L, eventually exceeding 90%.
Settings 5-7.In these sets of simulations we used biologically realistic patterns of LD.Also, rather than specifying mean values of association statistics directly, we utilized a regression model for the effect sizes, as described in Eqs. ( 1) and ( 2).We re-iterate that when association of SNPs with a trait is present (under the alternative hypothesis), the correlation among statistics is not equal to LD, because it also has to incorporate effect sizes, as illustrated by Eq. ( 5).
This point is important if one wants to simulate statistics directly from the MVN distribution rather than computing them based on simulated data followed by regression.
The results are presented in Table 5. Columns labeled "Regr."represent scenarios, in which data were generated and statistics were computed.Columns labeled "MVN" represent scenarios, in which statistics were simulated directly.The rows of Table 5 show power values for three different α-levels.We expected the power values in "Regr."and "MVN" columns to match, and they do, highlighting another utility of our analytical derivation of the distribution of the test statistic under the alternative hypothesis.That is, using our results, one can significantly reduce computational and programming burden in genetic simulations.Also note that power values in Table 5 do not decrease as α-level becomes smaller (Settings 6 and 7).This is due to the fact that we deliberately discarded effect size and LD configurations where power was expected to be too low, because we wanted to assure a good range of power values across methods.
As in previous simulations, power values of TQ and ACAT are similar.The power approximation by Eq. ( 17) remains close to the predicted theoretical power, as well as to empirically estimated powers.We also observed that power of the decorrelation test, DOT, is substantially higher than the powers of either TQ or ACAT.
Patterns of LD and effect sizes in Settings 1-4 are not necessarily realistic biologically, however, they serve as benchmark scenarios that help to understand and highlight differences in the respective statistical power of the methods.Simulations for Settings 1-4 were performed at the 5% α-level based on 2 × 10 6 evaluations.Settings 5-7 used realistic patters of LD derived from the 1000 Genomes Project data.Test sizes varied from 0.001 to 10 −7 with at least 10,000 simulations for power estimates.Type-I error rates were well controlled for TQ and DOT.However, as noted by Liu et al., because the ACAT P-value is approximate, the null distribution of its statistic is evaluated under independence, and we found that at the nominal 5% α-level, the type-I error for the ACAT was somewhat higher and could reach 7% for some correlation settings.Nonetheless, the advantage of ACAT is that the approximation improves as the α-level becomes smaller.
Simulations assuming that the correlation matrix is estimated using external data When only summary statistics are available, the correlation matrix Σ can be estimated from a reference panel of genotyped individuals.However, the type-I error of tests based on both TQ and DOT may potentially be affected due to substituting the sample estimate Σ by an estimate obtained from external data.To study the effect of this mis-specification on the type-I error, we conducted a separate set of simulations.In these experiments, we again utilized LD structures derived from the 1000 Genomes Project data.The type-I error rates, given in Table 6, show that both ACAT and TQ have close to the nominal type-I error rates, but the error rate for the decorrelation method (DOT) can be inflated, unless the sample size of the reference panel is 50 to 100 times larger than the number of SNPs (L).For the statistic DOT, the type-I error rates appear to be more inflated at smaller α-levels, such as 10 −7 (according to preliminary simulations not shown here).Power values for TQ are not shown, however they closely followed predicted theoretical power for the scenarios where the same data are used for both LD estimation and computation of association statistics.
There was only 1 to 2% drop in power when the size of the panel was only 2 to 5 times larger than L.

Combining breast cancer association statistics within candidate genes
We applied our decorrelation method to a family-based GWAS study of breast cancer. 26,27The data set was comprised of complete trios, i.e., families where genotypes of both parents and the affected offspring were available.With complete trios, previously reported statistics 8 become equivalent to statistics from the transmission-disequilibrium test and correlation among them is expected to follow the LD among SNPs. 8We selected four candidate genes (TOX3, ESR1, FGFR2 and RAD51B ), for which Min et al. 26,27 replicated several previously reported risk SNPs in relation to breast cancer.
For the joint association, we restricted our analysis to blocks of SNPs surrounding breast cancer risk variants that were previously reported in the literature.Specifically, we selected TOX3 rs4784220, 28 ESR1 rs3020314, 29,30 FGFR2 rs2981579, 28 and RAD51B rs999737, [31][32][33] and then included blocks of SNPs around these 'anchor' risk variants with the LD correlation of at least 0.25.These blocks included 13 SNPs around rs4784220, 36 SNPs around rs3020314, 18 SNPs around rs2981579, and 30 SNPs around rs999737.As an illustration, Figure 1 displays 81 SNP P-values that were available for ESR1 gene, the vertical dashed line highlights the position of 'anchor' rs3020314, the red dots highlight 36 SNPs within LD-block of rs3020314, and the LD matrix displays sample correlation matrix among 36 SNPs.Once SNP blocks were identified for each gene, we applied four combination methods to assess their association with breast cancer.
Table 7 present the joint association analysis results.The first row of Table 7 shows P-values for the association between the LD block of 13 SNPs in TOX3 region and breast cancer.All methods conclude a statistically significant link but our decorrelation method provides the most robust evidence with a substantially lower P-value.The third row of Table 7 shows joint association breast cancer.
Finally, for the LD blocks in FGFR2 and RAD51B we repeated the procedure detailed above and also identified top-ranking SNPs.Table 8 reviews these results and points FGFR2 rs2981427 and RAD51B rs7359088 as two more additional newly found associations.

Discussion
In this research, we have proposed a new powerful decorrelation-based approach (DOT) for combining SNP-level summary statistics (or, equivalently, P-values) and derived its theoretical power properties.To the best our knowledge, we were the first to derive theoretical properties of the traditional approach, TQ (e.g., as implemented in VEGAS).Through extensive simulation studies, we have demonstrated that our decorrelation approach is a very powerful addition to the tools available for studying genetic susceptibility to disease.
Our analysis of breast cancer data illustrates unique properties of DOT.Our results revealed novel potential associations within four candidate genes that would have not been found by previously proposed methods.These novel SNPs were identified by examining the top three linearcombination contributors to the overall value of the DOT-statistic.We note that the top contributions may give large weights to genetic variants that are truly associated with the outcome or to SNPs in a high positive LD with true causal variants.Caution is needed when interpreting such results because our method cannot distinguish between causal and proxy associations.Further studies would be needed to confirm these findings.
The most important feature of the proposed method is that it provides substantial power boost across diverse settings, where power gain is amplified by heterogeneity of effect sizes and by increased diversity between pairwise LD values.Genetic architecture of complex traits is far from being homogeneous, making our method applicable in various settings.We have developed new theory to explain unexpected and remarkable boost in power.This theory allows one to predict behavior of the tests in simulations with high accuracy and to explain unexpected scenarios, where the decorrelation method may give dramatically higher power compared to the traditional approach.
Yet, there are important precautions to the decorrelation approach.When reference panel data are used to provide the LD information and, more generally, correlation estimates for all predictors, including SNPs and covariates, Σ, sample size of the external data should be several times larger than the number of predictors.Ideally, the same data set should be used to obtain association statistics, as well as Σ.Nevertheless, association statistics and Σ are compact summaries of data and are much more easily transferred between separate research groups than raw data, due to privacy considerations and potentially large size of the raw data sets.Also, caution is needed if missing data are present in the original data set because the estimate ( Σ) may no longer reflect the sample correlation between predictors.Imputation of missing values is a suitable solution, if missing values are independent of the outcome.With the usage of reference panel data, the type-I error inflation for the statistic DOT can be affected by many factors, and this statistic is expected to be sensitive not only to the size of a reference panel, but to population variations in LD, especially for highly correlated blocks of SNPs.Overall, it appears to be difficult to give specific recommendations, except that the reference panel size has to be at least 50 times larger than the number of SNPs to be combined.Therefore, we recommend to limit applications of the decorrelation method to situations, where the LD matrix is obtained from the same data set as the summary statistics.Note that all pairwise LD values can be obtained from sample haplotype frequencies of SNPs, thus the LD matrix can be reconstructed.Utility of this approach remains to be investigated, in particular, one concern is that the correlation between the SNP values reflect the composite disequilibrium values, 34 while frequencies of sample haplotypes are often reported following likelihood maximization, e.g., by the EM algorithm.
In our simulations, the recently proposed method ACAT and the test based on the distribution of the sum of correlated association statistics (VEGAS, or TQ) had similar power.In many situations, power of these two tests was substantially lower than that of the DOT.The main advantage of ACAT is that it does not require any LD information.Our theory and simulations also revealed previously unknown robustness of the TQ method with respect to LD mis-specification: the method is valid and remains nearly as powerful when the sample LD matrix is substituted by a single value, summarizing the extent of all pairwise correlations.TQ also remains valid when the LD summary is obtained from a representative reference panel of sample size as small as two to five times the number of SNPs to be combined.We stress again that compared to ACAT and TQ, our method's limitation is that in order to avoid possible bias, the LD information and the summary statistics should ideally come from the same data set and missing genotypes should be imputed prior to its

GWAS
SNP P-values q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q Significance Linkage Disequilibrium Matrix Overview of DOT method in application to breast cancer data.We compute gene-level score by first decorrelating SNP P-values using the invariant to order matrix H and then calculating sum of independent chi-squared statistics.We utilize our DOT method to obtain a gene-level P-value.In the breast cancer data application, we chose an anchor SNP -a SNP that has previously been reported as risk variant (highlighted by a vertical dashed line), -and then combine SNPs in an LD block with the anchor SNP by the DOT.SNP-level P-values highlighted in red are those in moderate to high LD with the anchor SNP.This SNP was previously reported to be associated with endocrine therapy efficacy in breast cancer, 38 as well as with the overall breast cancer risk. 39rs3003921 A new association with susceptibility to breast cancer.
This SNP was previously linked to the effectiveness of androgen deprivation therapy among prostate cancer patients. 40rs985695 A new association with susceptibility to breast cancer.rs2982689 A new association with susceptibility to breast cancer.rs3020424 A new association with susceptibility to breast cancer.rs926777 A new association with susceptibility to breast cancer.FGFR2 18 rs1219648 s2860197 This SNP was previously suggested to have an association with breast cancer. 46rs2981582 s3135730 This SNP was previously suggested to have an interaction between oral contraceptive use and breast cancer. 50rs2981427 A new association with susceptibility to breast cancer.s8016149 This SNP was previously suggested to have an association with breast cancer. 53rs999737 This SNP was previously reported in the literature to be associated with breast cancer.7 31-33 rs1023529 This SNP has been patented as one of susceptibility variants of breast cancer. 54rs2189517 This SNP was showed to be associated with breast cancer in Chinese population. 55rs7359088 A new association with susceptibility to breast cancer.
The orthogonal transformation is defined as follows.Let D = 1 √ λ I and define X = H Y, where H = E DE .The squared values, X 2 i , are one degree of freedom independent chi-square variables, thus DOT = X X is a chi-square random variable with L degrees of freedom and noncentrality value of: |µ| and Ψ(•) is a one degree of freedom chi-square CDF with the noncentrality Lµ * /((L−1)ρ * +1), evaluated at t.The ceiling noncentrality value, as L → ∞, is thus µ * /ρ * .Let us re-emphasize the point that a test based on the distribution of the TQ statistic is expected to be less powerful than DOT in the presence of heterogeneity among effect sizes.Heterogeneity in LD will contribute to the difference in power.Starting with an equicorrelation model, we can introduce perturbations to the common value, ρ > 0, by adding noise derived from a rank-one matrix UU , where U is a vector of random numbers.Specifically, perturbations can be added as B = R ρ +UU .Next, B should be standardized to correlation as B R = 1/Diag(B) I B 1/Diag(B) I .When elements in U are close to zero, the matrix B R deviates from R ρ by only a small jiggle around ρ.

Decorrelation of Z i by H DOT sum of χ 2 1 1 Figure 1 :
Figure1: Overview of DOT method in application to breast cancer data.We compute gene-level score by first decorrelating SNP P-values using the invariant to order matrix H and then calculating sum of independent chi-squared statistics.We utilize our DOT method to obtain a gene-level P-value.In the breast cancer data application, we chose an anchor SNP -a SNP that has previously been reported as risk variant (highlighted by a vertical dashed line), -and then combine SNPs in an LD block with the anchor SNP by the DOT.SNP-level P-values highlighted in red are those in moderate to high LD with the anchor SNP.

Table 4 :
Power comparison of TQ, DOT, and ACAT with effect sizes randomly sampled from -0.15 to 0.15 and heterogeneous LD.

Table 8 :
Breast cancer SNPs identified by DOT in the analysis of GWAS data.