Statistics¶
linear_regression (y, x[, covariates, root, …]) |
For each row, test an input variable for association with response variables using linear regression. |
logistic_regression (test, y, x[, …]) |
For each row, test an input variable for association with a binary response variable using logistic regression. |
linear_mixed_regression (kinship_matrix, y, x) |
Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association. |
pca (entry_expr[, k, compute_loadings]) |
Run principal component analysis (PCA) on numeric columns derived from a matrix table. |
-
hail.methods.
linear_regression
(y, x, covariates=(), root='linreg', block_size=16) → hail.matrixtable.MatrixTable[source]¶ For each row, test an input variable for association with response variables using linear regression.
Examples
>>> dataset_result = hl.linear_regression(y=dataset.pheno.height, ... x=dataset.GT.n_alt_alleles(), ... covariates=[dataset.pheno.age, dataset.pheno.is_female])
Warning
linear_regression()
considers the same set of columns (i.e., samples, points) for every response variable and row, namely those columns for which all response variables and covariates are defined. For each row, missing values of x are mean-imputed over these columns.Notes
With the default root and y a single expression, the following row-indexed fields are added.
- linreg.n (
tint32
) – Number of columns used. - linreg.sum_x (
tfloat64
) – Sum of input values x. - linreg.y_transpose_x (
tfloat64
) – Dot product of response vector y with the input vector x. - linreg.beta (
tfloat64
) – Fit effect coefficient of x, \(\hat\beta_1\) below. - linreg.standard_error (
tfloat64
) – Estimated standard error, \(\widehat{\mathrm{se}}_1\). - linreg.t_stat (
tfloat64
) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}_1\). - linreg.p_value (
tfloat64
) – \(p\)-value.
If y is a list of expressions, then the last five fields instead have type
tarray
oftfloat64
, with corresponding indexing of the list and each array.In the statistical genetics example above, the input variable x encodes genotype as the number of alternate alleles (0, 1, or 2). For each variant (row), genotype is tested for association with height controlling for age and sex, by fitting the linear regression model:
\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{genotype} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is_female} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]Boolean covariates like \(\mathrm{is_female}\) are encoded as 1 for
True
and 0 forFalse
. The null model sets \(\beta_1 = 0\).The standard least-squares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the t-statistic which follows the t-distribution with \(n - k - 2\) degrees of freedom, under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to
x
and the intercept.Parameters: - y (
Float64Expression
orlist
ofFloat64Expression
) – One or more column-indexed response expressions. - x (
Float64Expression
) – Entry-indexed expression for input variable. - covariates (
list
ofFloat64Expression
) – List of column-indexed covariate expressions. - root (
str
) – Name of resulting row-indexed field. - block_size (
int
) – Number of row regressions to perform simultaneously per core. Larger blocks require more memory but may improve performance.
Returns: MatrixTable
– Matrix table with regression results in a new row-indexed field.- linreg.n (
-
hail.methods.
logistic_regression
(test, y, x, covariates=(), root='logreg') → hail.matrixtable.MatrixTable[source]¶ For each row, test an input variable for association with a binary response variable using logistic regression.
Examples
Run the logistic regression Wald test per variant using a Boolean phenotype and two covariates stored in column-indexed fields:
>>> ds_result = hl.logistic_regression( ... test='wald', ... y=dataset.pheno.is_case, ... x=dataset.GT.n_alt_alleles(), ... covariates=[dataset.pheno.age, dataset.pheno.is_female])
Notes
This method performs, for each row, a significance test of the input variable in predicting a binary (case-control) response variable based on the logistic regression model. The response variable type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes columns for which the response variable and all covariates are defined. For each row, Hail imputes missing input values as the mean of the non-missing values.
The example above considers a model of the form
\[\mathrm{Prob}(\mathrm{is_case}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is_female} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{is_female}\) is coded as for
True
(female) and 0 forFalse
(male). The null model sets \(\beta_1 = 0\).The resulting variant annotations depend on the test statistic as shown in the tables below.
Test Field Type Value Wald logreg.beta float64 fit effect coefficient, \(\hat\beta_1\) Wald logreg.standard_error float64 estimated standard error, \(\widehat{\mathrm{se}}\) Wald logreg.z_stat float64 Wald \(z\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\) Wald logreg.p_value float64 Wald p-value testing \(\beta_1 = 0\) LRT, Firth logreg.beta float64 fit effect coefficient, \(\hat\beta_1\) LRT, Firth logreg.chi_sq_stat float64 deviance statistic LRT, Firth logreg.p_value float64 LRT / Firth p-value testing \(\beta_1 = 0\) Score logreg.chi_sq_stat float64 score statistic Score logreg.p_value float64 score p-value testing \(\beta_1 = 0\) For the Wald and likelihood ratio tests, Hail fits the logistic model for each row using Newton iteration and only emits the above annotations when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three variant annotations which summarize the iterative fitting process:
Test Field Type Value Wald, LRT, Firth logreg.fit.n_iterations int32 number of iterations until convergence, explosion, or reaching the max (25 for Wald, LRT; 100 for Firth) Wald, LRT, Firth logreg.fit.converged bool True
if iteration convergedWald, LRT, Firth logreg.fit.exploded bool True
if iteration explodedWe consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{-6}\). For Wald and LRT, up to 25 iterations are attempted; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to low-level numerical linear algebra exceptions caused by manipulating ill-conditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.
A more common situation in genetics is quasi-complete seperation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) p-values. To not miss such variants, consider using Firth logistic regression, linear regression, or group-based tests.
Here’s a concrete illustration of quasi-complete seperation in R. Suppose we have 2010 samples distributed as follows for a particular variant:
Status HomRef Het HomVar Case 1000 10 0 Control 1000 0 0 The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where
x
is genotype,y
is phenotype, andlogistf
is from the logistf package:x <- c(rep(0,1000), rep(1,1000), rep(1,10) y <- c(rep(0,1000), rep(0,1000), rep(1,10)) logfit <- glm(y ~ x, family=binomial()) firthfit <- logistf(y ~ x) linfit <- lm(y ~ x)
The resulting p-values for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasi-complete separation. Moving one of the 10 hets from case to control eliminates this quasi-complete separation; the p-values from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.
The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffrey’s invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffices. If the null model fails to converge, then the logreg.fit fields reflect the null model; otherwise, they reflect the full model.
See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.
Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
For Boolean covariate types,
True
is coded as 1 andFalse
as 0. In particular, for the sample annotation fam.is_case added by importing a FAM file with case-control phenotype, case is 1 and control is 0.Hail’s logistic regression tests correspond to the
b.wald
,b.lrt
, andb.score
tests in EPACTS. For each variant, Hail imputes missing input values as the mean of non-missing input values, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.Parameters: - test ({‘wald’, ‘lrt’, ‘score’, ‘firth’}) – Statistical test.
- y (
Float64Expression
) – Column-indexed response expression. All non-missing values must evaluate to 0 or 1. Note that aBooleanExpression
will be implicitly converted to aFloat64Expression
with this property. - x (
Float64Expression
) – Entry-indexed expression for input variable. - covariates (
list
ofFloat64Expression
) – List of column-indexed covariate expressions. - root (
str
, optional) – Name of resulting row-indexed field.
Returns: MatrixTable
– Matrix table with regression results in a new row-indexed field.
-
hail.methods.
linear_mixed_regression
(kinship_matrix, y, x, covariates=[], global_root='lmmreg_global', row_root='lmmreg', run_assoc=True, use_ml=False, delta=None, sparsity_threshold=1.0, n_eigenvectors=None, dropped_variance_fraction=None) → hail.matrixtable.MatrixTable[source]¶ Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.
We plan to change the interface to this method in Hail 0.2 while maintaining its functionality.
Note
Requires the column key to be one field of type
tstr
Examples
Compute a
KinshipMatrix
, and use it to test variants for association using a linear mixed model:>>> lmm_ds = hl.read_matrix_table("data/example_lmmreg.vds") >>> kinship_matrix = hl.realized_relationship_matrix(lmm_ds.filter_rows(lmm_ds.use_in_kinship)['GT']) >>> lmm_ds = hl.linear_mixed_regression(kinship_matrix, ... y=lmm_ds.pheno, ... x=lmm_ds.GT.n_alt_alleles(), ... covariates=[lmm_ds.cov1, lmm_ds.cov2])
Notes
Suppose the variant dataset saved at
data/example_lmmreg.vds
has a Boolean variant-indexed field use_in_kinship and numeric or Boolean sample-indexed fields pheno, cov1, and cov2. Then thelinear_mixed_regression()
function in the above example will execute the following four steps in order:- filter to samples in given kinship matrix to those for which ds.pheno, ds.cov, and ds.cov2 are all defined
- compute the eigendecomposition \(K = USU^T\) of the kinship matrix
- fit covariate coefficients and variance parameters in the sample-covariates-only (global) model using restricted maximum likelihood (REML), storing results in a global field under lmmreg_global
- test each variant for association, storing results in a row-indexed field under lmmreg
This plan can be modified as follows:
- Set run_assoc to
False
to not test any variants for association, i.e. skip Step 5. - Set use_ml to
True
to use maximum likelihood instead of REML in Steps 4 and 5. - Set the delta argument to manually set the value of \(\delta\) rather that fitting \(\delta\) in Step 4.
- Set the global_root argument to change the global annotation root in Step 4.
- Set the row_root argument to change the variant annotation root in Step 5.
linear_mixed_regression()
adds 13 global annotations in Step 4. These global annotations are stored under the prefix global_root, which is by defaultlmmreg_global
. The prefix is not displayed in the table below.Field Type Value use_ml bool true if fit by ML, false if fit by REML beta dict<str, float64> map from intercept and the given covariates expressions to the corresponding fit \(\beta\) coefficients sigma_g_squared float64 fit coefficient of genetic variance, \(\hat{\sigma}_g^2\) sigma_e_squared float64 fit coefficient of environmental variance \(\hat{\sigma}_e^2\) delta float64 fit ratio of variance component coefficients, \(\hat{\delta}\) h_squared float64 fit narrow-sense heritability, \(\hat{h}^2\) n_eigenvectors int32 number of eigenvectors of kinship matrix used to fit model dropped_variance_fraction float64 specified value of dropped_variance_fraction eigenvalues array<float64> all eigenvalues of the kinship matrix in descending order fit.standard_error_h_squared float64 standard error of \(\hat{h}^2\) under asymptotic normal approximation fit.normalized_likelihood_h_squared array<float64> likelihood function of \(h^2\) normalized on the discrete grid 0.01, 0.02, ..., 0.99
. Indexi
is the likelihood for percentagei
.fit.max_log_likelihood float64 (restricted) maximum log likelihood corresponding to \(\hat{\delta}\) fit.log_delta_grid array<float64> values of \(\mathrm{ln}(\delta)\) used in the grid search fit.log_likelihood_values array<float64> (restricted) log likelihood of \(y\) given \(X\) and \(\mathrm{ln}(\delta)\) at the (RE)ML fit of \(\beta\) and \(\sigma_g^2\) These global annotations are also added to
hail.log
, with the ranked evals and \(\delta\) grid with values in .tsv tabular form. Usegrep 'linear mixed regression' hail.log
to find the lines just above each table.If Step 5 is performed,
linear_mixed_regression()
also adds four linear regression row fields. These annotations are stored as row_root, which defaults tolmmreg
. Once again, the prefix is not displayed in the table.Field Type Value beta float64 fit genotype coefficient, \(\hat\beta_0\) sigma_g_squared float64 fit coefficient of genetic variance component, \(\hat{\sigma}_g^2\) chi_sq_stat float64 \(\chi^2\) statistic of the likelihood ratio test p_value float64 \(p\)-value Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
Performance
Hail’s initial version of
linear_mixed_regression()
scales beyond 15k samples and to an essentially unbounded number of variants, making it particularly well-suited to modern sequencing studies and complementary to tools designed for SNP arrays. Analysts have usedlinear_mixed_regression()
in research to compute kinship from 100k common variants and test 32 million non-rare variants on 8k whole genomes in about 10 minutes on Google cloud.While
linear_mixed_regression()
computes the kinship matrix \(K\) using distributed matrix multiplication (Step 2), the full eigendecomposition (Step 3) is currently run on a single core of master using the LAPACK routine DSYEVD, which we empirically find to be the most performant of the four available routines; laptop performance plots showing cubic complexity in \(n\) are available here. On Google cloud, eigendecomposition takes about 2 seconds for 2535 sampes and 1 minute for 8185 samples. If you see worse performance, check that LAPACK natives are being properly loaded (see “BLAS and LAPACK” in Getting Started).Given the eigendecomposition, fitting the global model (Step 4) takes on the order of a few seconds on master. Association testing (Step 5) is fully distributed by variant with per-variant time complexity that is completely independent of the number of sample covariates and dominated by multiplication of the genotype vector \(v\) by the matrix of eigenvectors \(U^T\) as described below, which we accelerate with a sparse representation of \(v\). The matrix \(U^T\) has size about \(8n^2\) bytes and is currently broadcast to each Spark executor. For example, with 15k samples, storing \(U^T\) consumes about 3.6GB of memory on a 16-core worker node with two 8-core executors. So for large \(n\), we recommend using a high-memory configuration such as highmem workers.
Linear mixed model
linear_mixed_regression()
estimates the genetic proportion of residual phenotypic variance (narrow-sense heritability) under a kinship-based linear mixed model, and then optionally tests each variant for association using the likelihood ratio test. Inference is exact.We first describe the sample-covariates-only model used to estimate heritability, which we simply refer to as the global model. With \(n\) samples and \(c\) sample covariates, we define:
- \(y = n \times 1\) vector of phenotypes
- \(X = n \times c\) matrix of sample covariates and intercept column of ones
- \(K = n \times n\) kinship matrix
- \(I = n \times n\) identity matrix
- \(\beta = c \times 1\) vector of covariate coefficients
- \(\sigma_g^2 =\) coefficient of genetic variance component \(K\)
- \(\sigma_e^2 =\) coefficient of environmental variance component \(I\)
- \(\delta = \frac{\sigma_e^2}{\sigma_g^2} =\) ratio of environmental and genetic variance component coefficients
- \(h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{1}{1 + \delta} =\) genetic proportion of residual phenotypic variance
Under a linear mixed model, \(y\) is sampled from the \(n\)-dimensional multivariate normal distribution with mean \(X \beta\) and variance components that are scalar multiples of \(K\) and \(I\):
\[y \sim \mathrm{N}\left(X\beta, \sigma_g^2 K + \sigma_e^2 I\right)\]Thus the model posits that the residuals \(y_i - X_{i,:}\beta\) and \(y_j - X_{j,:}\beta\) have covariance \(\sigma_g^2 K_{ij}\) and approximate correlation \(h^2 K_{ij}\). Informally: phenotype residuals are correlated as the product of overall heritability and pairwise kinship. By contrast, standard (unmixed) linear regression is equivalent to fixing \(\sigma_2\) (equivalently, \(h^2\)) at 0 above, so that all phenotype residuals are independent.
Caution: while it is tempting to interpret \(h^2\) as the narrow-sense heritability of the phenotype alone, note that its value depends not only the phenotype and genetic data, but also on the choice of sample covariates.
Fitting the global model
The core algorithm is essentially a distributed implementation of the spectral approach taken in FastLMM. Let \(K = USU^T\) be the eigendecomposition of the real symmetric matrix \(K\). That is:
- \(U = n \times n\) orthonormal matrix whose columns are the eigenvectors of \(K\)
- \(S = n \times n\) diagonal matrix of eigenvalues of \(K\) in descending order. \(S_{ii}\) is the eigenvalue of eigenvector \(U_{:,i}\)
- \(U^T = n \times n\) orthonormal matrix, the transpose (and inverse) of \(U\)
A bit of matrix algebra on the multivariate normal density shows that the linear mixed model above is mathematically equivalent to the model
\[U^Ty \sim \mathrm{N}\left(U^TX\beta, \sigma_g^2 (S + \delta I)\right)\]for which the covariance is diagonal (e.g., unmixed). That is, rotating the phenotype vector (\(y\)) and covariate vectors (columns of \(X\)) in \(\mathbb{R}^n\) by \(U^T\) transforms the model to one with independent residuals. For any particular value of \(\delta\), the restricted maximum likelihood (REML) solution for the latter model can be solved exactly in time complexity that is linear rather than cubic in \(n\). In particular, having rotated, we can run a very efficient 1-dimensional optimization procedure over \(\delta\) to find the REML estimate \((\hat{\delta}, \hat{\beta}, \hat{\sigma}_g^2)\) of the triple \((\delta, \beta, \sigma_g^2)\), which in turn determines \(\hat{\sigma}_e^2\) and \(\hat{h}^2\).
We first compute the maximum log likelihood on a \(\delta\)-grid that is uniform on the log scale, with \(\mathrm{ln}(\delta)\) running from -8 to 8 by 0.01, corresponding to \(h^2\) decreasing from 0.9995 to 0.0005. If \(h^2\) is maximized at the lower boundary then standard linear regression would be more appropriate and Hail will exit; more generally, consider using standard linear regression when \(\hat{h}^2\) is very small. A maximum at the upper boundary is highly suspicious and will also cause Hail to exit. In any case, the log file records the table of grid values for further inspection, beginning under the info line containing “linear mixed regression: table of delta”.
If the optimal grid point falls in the interior of the grid as expected, we then use Brent’s method to find the precise location of the maximum over the same range, with initial guess given by the optimal grid point and a tolerance on \(\mathrm{ln}(\delta)\) of 1e-6. If this location differs from the optimal grid point by more than 0.01, a warning will be displayed and logged, and one would be wise to investigate by plotting the values over the grid.
Note that \(h^2\) is related to \(\mathrm{ln}(\delta)\) through the sigmoid function. More precisely,
\[h^2 = 1 - \mathrm{sigmoid}(\mathrm{ln}(\delta)) = \mathrm{sigmoid}(-\mathrm{ln}(\delta))\]Hence one can change variables to extract a high-resolution discretization of the likelihood function of \(h^2\) over \([0, 1]\) at the corresponding REML estimators for \(\beta\) and \(\sigma_g^2\), as well as integrate over the normalized likelihood function using change of variables and the sigmoid differential equation.
For convenience, lmmreg.fit.normalized_likelihood_h_squared records the the likelihood function of \(h^2\) normalized over the discrete grid \(0.01, 0.02, \ldots, 0.98, 0.99\). The length of the array is 101 so that index
i
contains the likelihood at percentagei
. The values at indices 0 and 100 are left undefined.By the theory of maximum likelihood estimation, this normalized likelihood function is approximately normally distributed near the maximum likelihood estimate. So we estimate the standard error of the estimator of \(h^2\) as follows. Let \(x_2\) be the maximum likelihood estimate of \(h^2\) and let \(x_ 1\) and \(x_3\) be just to the left and right of \(x_2\). Let \(y_1\), \(y_2\), and \(y_3\) be the corresponding values of the (unnormalized) log likelihood function. Setting equal the leading coefficient of the unique parabola through these points (as given by Lagrange interpolation) and the leading coefficient of the log of the normal distribution, we have:
\[\frac{x_3 (y_2 - y_1) + x_2 (y_1 - y_3) + x_1 (y_3 - y_2))} {(x_2 - x_1)(x_1 - x_3)(x_3 - x_2)} = -\frac{1}{2 \sigma^2}\]The standard error \(\hat{\sigma}\) is then estimated by solving for \(\sigma\).
Note that the mean and standard deviation of the (discretized or continuous) distribution held in lmmreg.fit.normalized_likelihood_h_squared will not coincide with \(\hat{h}^2\) and \(\hat{\sigma}\), since this distribution only becomes normal in the infinite sample limit. One can visually assess normality by plotting this distribution against a normal distribution with the same mean and standard deviation, or use this distribution to approximate credible intervals under a flat prior on \(h^2\).
Testing each variant for association
Fixing a single variant, we define:
- \(v = n \times 1\) input vector, with missing values imputed as the mean of the non-missing values
- \(X_v = \left[v | X \right] = n \times (1 + c)\) matrix concatenating \(v\) and \(X\)
- \(\beta_v = (\beta^0_v, \beta^1_v, \ldots, \beta^c_v) = (1 + c) \times 1\) vector of covariate coefficients
Fixing \(\delta\) at the global REML estimate \(\hat{\delta}\), we find the REML estimate \((\hat{\beta}_v, \hat{\sigma}_{g, v}^2)\) via rotation of the model
\[y \sim \mathrm{N}\left(X_v\beta_v, \sigma_{g,v}^2 (K + \hat{\delta} I)\right)\]Note that the only new rotation to compute here is \(U^T v\).
To test the null hypothesis that the genotype coefficient \(\beta^0_v\) is zero, we consider the restricted model with parameters \(((0, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\) within the full model with parameters \((\beta^0_v, \beta^1_v, \ldots, \beta^c_v), \sigma_{g_v}^2)\), with \(\delta\) fixed at \(\hat\delta\) in both. The latter fit is simply that of the global model, \(((0, \hat{\beta}^1, \ldots, \hat{\beta}^c), \hat{\sigma}_g^2)\). The likelihood ratio test statistic is given by
\[\chi^2 = n \, \mathrm{ln}\left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}_{g,v}^2}\right)\]and follows a chi-squared distribution with one degree of freedom. Here the ratio \(\hat{\sigma}^2_g / \hat{\sigma}_{g,v}^2\) captures the degree to which adding the variant \(v\) to the global model reduces the residual phenotypic variance.
Kinship Matrix
FastLMM uses the Realized Relationship Matrix (RRM) for kinship. This can be computed with
rrm()
. However, any instance ofKinshipMatrix
may be used, so long assample_list()
contains the complete samples of the caller variant dataset in the same order.Low-rank approximation of kinship for improved performance
linear_mixed_regression()
can implicitly use a low-rank approximation of the kinship matrix to more rapidly fit delta and the statistics for each variant. The computational complexity per variant is proportional to the number of eigenvectors used. This number can be specified in two ways. Specify the parameter n_eigenvectors to use only the top n_eigenvectors eigenvectors. Alternatively, specify dropped_variance_fraction to use as many eigenvectors as necessary to capture all but at most this fraction of the sample variance (also known as the trace, or the sum of the eigenvalues). For example, setting dropped_variance_fraction to 0.01 will use the minimal number of eigenvectors to account for 99% of the sample variance. Specifying both parameters will apply the more stringent (fewest eigenvectors) of the two.Further background
For the history and mathematics of linear mixed models in genetics, including FastLMM, see Christoph Lippert’s PhD thesis. For an investigation of various approaches to defining kinship, see Comparison of Methods to Account for Relatedness in Genome-Wide Association Studies with Family-Based Data.
Parameters: - kinship_matrix (
KinshipMatrix
) – Kinship matrix to be used. - y (
Float64Expression
) – Column-indexed response expression. - x (
Float64Expression
) – Entry-indexed expression for input variable. - covariates (
list
ofFloat64Expression
) – List of column-indexed covariate expressions. - global_root (
str
) – Global field root. - row_root (
str
) – Row-indexed field root. - run_assoc (
bool
) – If true, run association testing in addition to fitting the global model. - use_ml (
bool
) – Use ML instead of REML throughout. - delta (
float
orNone
) – Fixed delta value to use in the global model, overrides fitting delta. - sparsity_threshold (
float
) – Genotype vector sparsity at or below which to use sparse genotype vector in rotation (advanced). - n_eigenvectors (
int
) – Number of eigenvectors of the kinship matrix used to fit the model. - dropped_variance_fraction (
float
) – Upper bound on fraction of sample variance lost by dropping eigenvectors with small eigenvalues.
Returns: MatrixTable
– Matrix table with regression results in new global and (optionally) row-indexed fields.
-
hail.methods.
pca
(entry_expr, k=10, compute_loadings=False) → Tuple[[List[float], hail.table.Table], hail.table.Table][source]¶ Run principal component analysis (PCA) on numeric columns derived from a matrix table.
Examples
For a matrix table with variant rows, sample columns, and genotype entries, compute the top 2 PC sample scores and eigenvalues of the matrix of 0s and 1s encoding missingness of genotype calls.
>>> eigenvalues, scores, _ = hl.pca(hl.int(hl.is_defined(dataset.GT)), ... k=2)
Warning
This method does not automatically mean-center or normalize each column. If desired, such transformations should be incorporated in entry_expr.
Hail will return an error if entry_expr evaluates to missing, nan, or infinity on any entry.
Notes
PCA is run on the columns of the numeric matrix obtained by evaluating entry_expr on each entry of the matrix table, or equivalently on the rows of the transposed numeric matrix \(M\) referenced below.
PCA computes the SVD
\[M = USV^T\]where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.
From the perspective of the rows of \(M\) as samples (data points), \(V_k\) contains the loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the Gramian \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. By default, Hail only computes the loadings if the
loadings
parameter is specified.Scores are stored in a
Table
with the column key of the matrix table as key and a field scores of typearray<float64>
containing the principal component scores.Loadings are stored in a
Table
with the row key of the matrix table as key and a field loadings of typearray<float64>
containing the principal component loadings.The eigenvalues are returned in descending order, with scores and loadings given the corresponding array order.
Parameters: - entry_expr (
Expression
) – Numeric expression for matrix entries. - k (
int
) – Number of principal components. - compute_loadings (
bool
) – IfTrue
, compute row loadings.
Returns: (
list
offloat
,Table
,Table
) – List of eigenvalues, table with column scores, table with row loadings.- entry_expr (