Statistics
linear_regression(y, x[, covariates, root, …])
    For each row, test an input variable for association with response variables using linear regression.

logistic_regression(test, y, x[, …])
    For each row, test an input variable for association with a binary response variable using logistic regression.

LinearMixedModel(py, px, s[, y, x])
    Class representing a linear mixed model.

linear_mixed_regression(kinship_matrix, y, x)
    Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.

pca(entry_expr[, k, compute_loadings])
    Run principal component analysis (PCA) on numeric columns derived from a matrix table.
hail.methods.linear_regression(y, x, covariates=(), root='linreg', block_size=16) → hail.matrixtable.MatrixTable

For each row, test an input variable for association with response variables using linear regression.
Examples
>>> dataset_result = hl.linear_regression(y=dataset.pheno.height,
...                                       x=dataset.GT.n_alt_alleles(),
...                                       covariates=[dataset.pheno.age, dataset.pheno.is_female])
Warning
linear_regression() considers the same set of columns (i.e., samples, points) for every response variable and row, namely those columns for which all response variables and covariates are defined. For each row, missing values of x are mean-imputed over these columns.

Notes
With the default root and y a single expression, the following row-indexed fields are added.
- linreg.n (tint32) – Number of columns used.
- linreg.sum_x (tfloat64) – Sum of input values x.
- linreg.y_transpose_x (tfloat64) – Dot product of response vector y with the input vector x.
- linreg.beta (tfloat64) – Fit effect coefficient of x, \(\hat\beta_1\) below.
- linreg.standard_error (tfloat64) – Estimated standard error, \(\widehat{\mathrm{se}}_1\).
- linreg.t_stat (tfloat64) – \(t\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}_1\).
- linreg.p_value (tfloat64) – \(p\)-value.
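For instance, the results can be pulled into a table for inspection; a minimal sketch using dataset_result from the example above:

>>> results = dataset_result.rows()          # one row per variant
>>> results.select(results.linreg).show()    # n, sum_x, ..., p_value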
If y is a list of expressions, then the last five fields instead have type tarray of tfloat64, with corresponding indexing of the list and each array.

In the statistical genetics example above, the input variable x encodes genotype as the number of alternate alleles (0, 1, or 2). For each variant (row), genotype is tested for association with height controlling for age and sex, by fitting the linear regression model:
\[\mathrm{height} = \beta_0 + \beta_1 \, \mathrm{genotype} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon, \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

Boolean covariates like \(\mathrm{is\_female}\) are encoded as 1 for True and 0 for False. The null model sets \(\beta_1 = 0\).

The standard least-squares linear regression model is derived in Section 3.2 of The Elements of Statistical Learning, 2nd Edition. See equation 3.12 for the t-statistic, which follows the t-distribution with \(n - k - 2\) degrees of freedom under the null hypothesis of no effect, with \(n\) samples and \(k\) covariates in addition to x and the intercept.
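As a sanity check, the reported p-value can be reproduced outside Hail from the t-statistic and these degrees of freedom; a sketch with hypothetical numbers:

>>> from scipy.stats import t as t_dist
>>> n, k = 100, 2                      # samples; covariates besides x and the intercept
>>> t_stat = 2.1                       # hypothetical value of linreg.t_stat
>>> p_value = 2 * t_dist.sf(abs(t_stat), df=n - k - 2)   # two-sided p-value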
Parameters:
- y (Float64Expression or list of Float64Expression) – One or more column-indexed response expressions.
- x (Float64Expression) – Entry-indexed expression for input variable.
- covariates (list of Float64Expression) – List of column-indexed covariate expressions.
- root (str) – Name of resulting row-indexed field.
- block_size (int) – Number of row regressions to perform simultaneously per core. Larger blocks require more memory but may improve performance.
Returns: MatrixTable – Matrix table with regression results in a new row-indexed field.
hail.methods.logistic_regression(test, y, x, covariates=(), root='logreg') → hail.matrixtable.MatrixTable

For each row, test an input variable for association with a binary response variable using logistic regression.
Examples
Run the logistic regression Wald test per variant using a Boolean phenotype and two covariates stored in column-indexed fields:
>>> ds_result = hl.logistic_regression(
...     test='wald',
...     y=dataset.pheno.is_case,
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[dataset.pheno.age, dataset.pheno.is_female])
Notes
This method performs, for each row, a significance test of the input variable in predicting a binary (case-control) response variable based on the logistic regression model. The response variable type must either be numeric (with all present values 0 or 1) or Boolean, in which case true and false are coded as 1 and 0, respectively.
Hail supports the Wald test (‘wald’), likelihood ratio test (‘lrt’), Rao score test (‘score’), and Firth test (‘firth’). Hail only includes columns for which the response variable and all covariates are defined. For each row, Hail imputes missing input values as the mean of the non-missing values.
The example above considers a model of the form
\[\mathrm{Prob}(\mathrm{is\_case}) = \mathrm{sigmoid}(\beta_0 + \beta_1 \, \mathrm{gt} + \beta_2 \, \mathrm{age} + \beta_3 \, \mathrm{is\_female} + \varepsilon), \quad \varepsilon \sim \mathrm{N}(0, \sigma^2)\]

where \(\mathrm{sigmoid}\) is the sigmoid function, the genotype \(\mathrm{gt}\) is coded as 0 for HomRef, 1 for Het, and 2 for HomVar, and the Boolean covariate \(\mathrm{is\_female}\) is coded as 1 for True (female) and 0 for False (male). The null model sets \(\beta_1 = 0\).

The resulting variant annotations depend on the test statistic as shown in the tables below.
Test        Field                  Type     Value
Wald        logreg.beta            float64  fit effect coefficient, \(\hat\beta_1\)
Wald        logreg.standard_error  float64  estimated standard error, \(\widehat{\mathrm{se}}\)
Wald        logreg.z_stat          float64  Wald \(z\)-statistic, equal to \(\hat\beta_1 / \widehat{\mathrm{se}}\)
Wald        logreg.p_value         float64  Wald p-value testing \(\beta_1 = 0\)
LRT, Firth  logreg.beta            float64  fit effect coefficient, \(\hat\beta_1\)
LRT, Firth  logreg.chi_sq_stat     float64  deviance statistic
LRT, Firth  logreg.p_value         float64  LRT / Firth p-value testing \(\beta_1 = 0\)
Score       logreg.chi_sq_stat     float64  score statistic
Score       logreg.p_value         float64  score p-value testing \(\beta_1 = 0\)

For the Wald and likelihood ratio tests, Hail fits the logistic model for each row using Newton iteration and only emits the above annotations when the maximum likelihood estimate of the coefficients converges. The Firth test uses a modified form of Newton iteration. To help diagnose convergence issues, Hail also emits three variant annotations which summarize the iterative fitting process:
Test              Field                    Type   Value
Wald, LRT, Firth  logreg.fit.n_iterations  int32  number of iterations until convergence, explosion, or reaching the max (25 for Wald, LRT; 100 for Firth)
Wald, LRT, Firth  logreg.fit.converged     bool   True if iteration converged
Wald, LRT, Firth  logreg.fit.exploded      bool   True if iteration exploded

We consider iteration to have converged when every coordinate of \(\beta\) changes by less than \(10^{-6}\). For Wald and LRT, up to 25 iterations are attempted; in testing we find 4 or 5 iterations nearly always suffice. Convergence may also fail due to explosion, which refers to low-level numerical linear algebra exceptions caused by manipulating ill-conditioned matrices. Explosion may result from (nearly) linearly dependent covariates or complete separation.
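These diagnostics make it easy to restrict downstream analysis to rows whose fit converged; a minimal sketch using ds_result from the example above:

>>> converged_only = ds_result.filter_rows(ds_result.logreg.fit.converged)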
A more common situation in genetics is quasi-complete separation, e.g. variants that are observed only in cases (or controls). Such variants inevitably arise when testing millions of variants with very low minor allele count. The maximum likelihood estimate of \(\beta\) under logistic regression is then undefined but convergence may still occur after a large number of iterations due to a very flat likelihood surface. In testing, we find that such variants produce a secondary bump from 10 to 15 iterations in the histogram of number of iterations per variant. We also find that this faux convergence produces large standard errors and large (insignificant) p-values. To avoid missing such variants, consider using Firth logistic regression, linear regression, or group-based tests.
Here’s a concrete illustration of quasi-complete separation in R. Suppose we have 2010 samples distributed as follows for a particular variant:
Status   HomRef  Het  HomVar
Case     1000    10   0
Control  1000    0    0

The following R code fits the (standard) logistic, Firth logistic, and linear regression models to this data, where x is genotype, y is phenotype, and logistf is from the logistf package:

x <- c(rep(0,1000), rep(0,1000), rep(1,10))  # genotype: 2000 HomRef, then 10 Het
y <- c(rep(0,1000), rep(1,1000), rep(1,10))  # phenotype: 1000 controls, then 1010 cases
logfit <- glm(y ~ x, family=binomial())
firthfit <- logistf(y ~ x)
linfit <- lm(y ~ x)
The resulting p-values for the genotype coefficient are 0.991, 0.00085, and 0.0016, respectively. The erroneous value 0.991 is due to quasi-complete separation. Moving one of the 10 hets from case to control eliminates this quasi-complete separation; the p-values from R are then 0.0373, 0.0111, and 0.0116, respectively, as expected for a less significant association.
The Firth test reduces bias from small counts and resolves the issue of separation by penalizing maximum likelihood estimation by the Jeffreys invariant prior. This test is slower, as both the null and full model must be fit per variant, and convergence of the modified Newton method is linear rather than quadratic. For Firth, 100 iterations are attempted for the null model and, if that is successful, for the full model as well. In testing we find 20 iterations nearly always suffice. If the null model fails to converge, then the logreg.fit fields reflect the null model; otherwise, they reflect the full model.
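For instance, the Wald example above can be rerun with the Firth test by changing only the test argument:

>>> ds_firth = hl.logistic_regression(
...     test='firth',
...     y=dataset.pheno.is_case,
...     x=dataset.GT.n_alt_alleles(),
...     covariates=[dataset.pheno.age, dataset.pheno.is_female])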
See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of the logistic Wald, LRT, score, and Firth tests. The theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes Statistical Theory. Firth introduced his approach in Bias reduction of maximum likelihood estimates, 1993. Heinze and Schemper further analyze Firth’s approach in A solution to the problem of separation in logistic regression, 2002.
Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
For Boolean covariate types, True is coded as 1 and False as 0. In particular, for the sample annotation fam.is_case added by importing a FAM file with case-control phenotype, case is 1 and control is 0.

Hail's logistic regression tests correspond to the b.wald, b.lrt, and b.score tests in EPACTS. For each variant, Hail imputes missing input values as the mean of non-missing input values, whereas EPACTS subsets to those samples with called genotypes. Hence, Hail and EPACTS results will currently only agree for variants with no missing genotypes.

Parameters:
- test ({'wald', 'lrt', 'score', 'firth'}) – Statistical test.
- y (Float64Expression) – Column-indexed response expression. All non-missing values must evaluate to 0 or 1. Note that a BooleanExpression will be implicitly converted to a Float64Expression with this property.
- x (Float64Expression) – Entry-indexed expression for input variable.
- covariates (list of Float64Expression) – List of column-indexed covariate expressions.
- root (str, optional) – Name of resulting row-indexed field.
Returns: MatrixTable – Matrix table with regression results in a new row-indexed field.
class hail.methods.LinearMixedModel(py, px, s, y=None, x=None)

Class representing a linear mixed model.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
LinearMixedModel represents a linear model of the form

\[y \sim \mathrm{N}(X \beta, \, \sigma^2 K + \tau^2 I)\]

where
- \(\mathrm{N}\) is an \(n\)-dimensional normal distribution.
- \(y\) is a known vector of \(n\) observations.
- \(X\) is a known \(n \times p\) design matrix for \(p\) fixed effects.
- \(K\) is a known \(n \times n\) positive semi-definite kernel.
- \(I\) is the \(n \times n\) identity matrix.
- \(\beta\) is a \(p\)-parameter vector of fixed effects.
- \(\sigma^2\) is the variance parameter on \(K\).
- \(\tau^2\) is the variance parameter on \(I\).
In particular, the residuals for the \(i^\mathit{th}\) and \(j^\mathit{th}\) observations have covariance \(\sigma^2 K_{ij}\) for \(i \neq j\).
This model is equivalent to a mixed model of the form
\[y = X \beta + Z u + \epsilon\]

by setting \(K = ZZ^T\) where
- \(Z\) is a known \(n \times r\) design matrix for \(r\) random effects.
- \(u\) is an \(r\)-vector of random effects drawn from \(\mathrm{N}(0, \sigma^2 I)\).
- \(\epsilon\) is an \(n\)-vector of random errors drawn from \(\mathrm{N}(0, \tau^2 I)\).
However, LinearMixedModel does not itself realize \(K\) as a linear kernel with respect to random effects, nor does it take \(K\) explicitly as input. Rather, via the eigendecomposition \(K = U S U^T\), the class leverages a third, decorrelated form of the model

\[Py \sim \mathrm{N}(PX \beta, \, \sigma^2 (\gamma S + I))\]

where
- \(P = U^T: \mathbb{R}^n \rightarrow \mathbb{R}^n\) is an orthonormal transformation that decorrelates the observations. The rows of \(P\) are an eigenbasis for \(K\).
- \(S\) is the \(n \times n\) diagonal matrix of corresponding eigenvalues.
- \(\gamma = \frac{\sigma^2}{\tau^2}\) is the ratio of variance parameters.
Hence, the triple \((Py, PX, S)\) determines the probability of the observations for any choice of model parameters, and is therefore sufficient for inference. This triple, with S encoded as a vector, is the default (“full-rank”) initialization of the class.
LinearMixedModel also provides an efficient strategy to fit the model above with \(K\) replaced by its rank-\(r\) approximation \(K_r = P_r^T S_r P_r\) where

- \(P_r: \mathbb{R}^n \rightarrow \mathbb{R}^r\) has orthonormal rows consisting of the top \(r\) eigenvectors of \(K\).
- \(S_r\) is the \(r \times r\) diagonal matrix of corresponding non-zero eigenvalues.
For this low-rank model, the quintuple \((P_r y, P_r X, S_r, y, X)\) is similarly sufficient for inference and corresponds to the “low-rank” initialization of the class. Morally, \(y\) and \(X\) are required for low-rank inference because the diagonal \(\gamma S + I\) is always full-rank.
If \(K\) actually has rank \(r\), then \(K = K_r\) and the low-rank and full-rank models are equivalent. Hence low-rank inference provides a more efficient, equally-exact algorithm for fitting the full-rank model. This situation arises, for example, when \(K\) is the linear kernel of a mixed model with fewer random effects than observations.
Even when \(K\) has full rank, using a lower-rank approximation may be an effective form of regularization, in addition to boosting computational efficiency.
Initialization
With full-rank initialization by \((Py, PX, S)\), the following class attributes are set:

Attribute  Type           Value
low_rank   bool           False
n          int            Number of observations \(n\)
f          int            Number of fixed effects \(p\)
r          int            Effective number of random effects, must equal \(n\)
py         numpy.ndarray  Rotated response vector \(P y\) with shape \((n)\)
px         numpy.ndarray  Rotated design matrix \(P X\) with shape \((n, p)\)
s          numpy.ndarray  Eigenvalues vector \(S\) of \(K\) with shape \((n)\)

With low-rank initialization by \((P_r y, P_r X, S_r, y, X)\), the following class attributes are set:

Attribute  Type           Value
low_rank   bool           True
n          int            Number of observations \(n\)
f          int            Number of fixed effects \(p\)
r          int            Effective number of random effects, must be less than \(n\)
py         numpy.ndarray  Projected response vector \(P_r y\) with shape \((r)\)
px         numpy.ndarray  Projected design matrix \(P_r X\) with shape \((r, p)\)
s          numpy.ndarray  Eigenvalues vector \(S_r\) of \(K_r\) with shape \((r)\)
y          numpy.ndarray  Response vector with shape \((n)\)
x          numpy.ndarray  Design matrix with shape \((n, p)\)
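A minimal sketch of both initializations with a toy kernel, using numpy; the import path follows this page's heading and is an assumption for this experimental interface:

>>> import numpy as np
>>> from hail.methods import LinearMixedModel   # import path is an assumption
>>> rng = np.random.RandomState(0)
>>> n, p, r = 500, 2, 100
>>> G = rng.normal(size=(n, n))
>>> K = G @ G.T / n                    # toy positive semi-definite kernel
>>> S, U = np.linalg.eigh(K)           # eigh returns ascending eigenvalues
>>> S, U = S[::-1], U[:, ::-1]         # reorder to descending
>>> P = U.T                            # rows of P form an eigenbasis for K
>>> X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])  # intercept + one covariate
>>> y = rng.normal(size=n)
>>> full_rank = LinearMixedModel(P @ y, P @ X, S)                   # (Py, PX, S)
>>> low_rank = LinearMixedModel(P[:r] @ y, P[:r] @ X, S[:r], y, X)  # (P_r y, P_r X, S_r, y, X)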
Fitting the model

fit() uses restricted maximum likelihood (REML) to estimate \((\beta, \sigma^2, \tau^2)\), adding the following attributes at this estimate.

Attribute            Type           Value
beta                 numpy.ndarray  \(\beta\)
sigma_sq             float          \(\sigma^2\)
tau_sq               float          \(\tau^2\)
gamma                float          \(\gamma = \frac{\sigma^2}{\tau^2}\)
log_gamma            float          \(\log{\gamma}\)
h_sq                 float          \(\mathit{h}^2 = \frac{\sigma^2}{\sigma^2 + \tau^2}\)
h_sq_standard_error  float          asymptotic estimate of standard error for \(\mathit{h}^2\)

Estimation proceeds by minimizing the function compute_neg_log_reml() with respect to the parameter \(\log{\gamma}\) governing the (log) ratio of the variance parameters \(\sigma^2\) and \(\tau^2\). For any fixed ratio, the REML estimate and log likelihood have closed-form solutions.
Testing alternative models

The model is also equivalent to its augmentation

\[y \sim \mathrm{N}\left(x_\star\beta_\star + X \beta, \, \sigma^2 K + \tau^2 I\right)\]

by an additional covariate of interest \(x_\star\) under the null hypothesis that the corresponding fixed effect \(\beta_\star\) is zero.
After running fit() to fit the null model, the methods fit_alternatives() and fit_alternatives_numpy() may be used to test the null hypothesis \(\beta_\star = 0\) versus the alternative hypothesis \(\beta_\star \neq 0\) for each \(n\)-vector \(x_\star\) in a collection of augmentations.

Testing uses the likelihood ratio test with both the null and alternative models constrained by the REML estimate of \(\log{\gamma}\) under the null hypothesis. The test statistic \(\chi^2\) equals \(n\) times the log ratio of the squared residuals and follows a chi-squared distribution with one degree of freedom.
When testing alternatives, full-rank inference only requires the vector \(P x_\star\), whereas low-rank inference requires both \(P_r x_\star\) and \(x_\star\).
Parameters:
- py (numpy.ndarray) – Projected response vector \(P_r y\) with shape \((r)\).
- px (numpy.ndarray) – Projected design matrix \(P_r X\) with shape \((r, p)\).
- s (numpy.ndarray) – Eigenvalues vector \(S\) with shape \((r)\).
- y (numpy.ndarray, optional) – Response vector with shape \((n)\). Include for low-rank inference.
- x (numpy.ndarray, optional) – Design matrix with shape \((n, p)\). Include for low-rank inference.
compute_neg_log_reml(log_gamma, return_parameters=False)

Compute negative log REML constrained to a fixed value of \(\log{\gamma}\).
This function computes the triple \((\beta, \sigma^2, \tau^2)\) with \(\gamma = \frac{\sigma^2}{\tau^2}\) at which the restricted likelihood is maximized and returns the negative of the log of the restricted likelihood at these parameters, shifted by a constant whose value is independent of the input.
To compute the actual negative log REML, add

\[\frac{1}{2}\left((n - p)(1 + \log(2\pi)) - \log(\det(X^T X))\right)\]

to the returned value.
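For example, undoing the shift in the sketch above:

>>> shifted = full_rank.compute_neg_log_reml(log_gamma=0.0)
>>> shift = 0.5 * ((n - p) * (1 + np.log(2 * np.pi))
...                - np.log(np.linalg.det(X.T @ X)))
>>> actual_neg_log_reml = shifted + shift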
Parameters:
- log_gamma (float) – Value of \(\log{\gamma}\).
- return_parameters – If True, also return \(\beta\), \(\sigma^2\), and \(\tau^2\).

Returns: float or (float, numpy.ndarray, float, float) – If return_parameters is False, returns (shifted) negative log REML. Otherwise, returns (shifted) negative log REML, \(\beta\), \(\sigma^2\), and \(\tau^2\).
fit(log_gamma=None, bounds=(-8.0, 8.0), tol=1e-08, maxiter=500)

Find the triple \((\beta, \sigma^2, \tau^2)\) maximizing REML.
This method sets the attributes beta, sigma_sq, tau_sq, gamma, log_gamma, and h_sq as described in the top-level class documentation.
If log_gamma is provided, fit() finds the REML solution with \(\log{\gamma}\) constrained to this value. Otherwise, fit() searches for the value of \(\log{\gamma}\) that minimizes compute_neg_log_reml(), and also sets the attribute optimize_result of type scipy.optimize.OptimizeResult.

Parameters:
- log_gamma (float, optional) – If provided, the solution is constrained to have this value of \(\log{\gamma}\).
- bounds (float, float) – Lower and upper bounds for \(\log{\gamma}\).
- tol (float) – Absolute tolerance for optimizing \(\log{\gamma}\).
- maxiter (float) – Maximum number of iterations for optimizing \(\log{\gamma}\).
fit_alternatives(pa_t_path, a_t_path=None, partition_size=None)

Fit and test alternative model for each augmented design matrix in parallel.
Notes
The resulting table has the following fields:

Field     Type     Value
idx       int64    Index of augmented design matrix.
beta      float64  \(\beta_\star\)
sigma_sq  float64  \(\sigma^2\)
chi_sq    float64  \(\chi^2\)
p_value   float64  p-value

\((P_r A)^T\) and \(A^T\) (if given) must have the same number of rows (augmentations). These rows are grouped into partitions for parallel processing. The number of partitions equals the ceiling of n_rows / partition_size, and should be at least the number of cores to make use of all cores. By default, there is one partition per row of blocks in \((P_r A)^T\). Setting the partition size to an exact (rather than approximate) divisor or multiple of the block size reduces superfluous shuffling of data.

The number of columns in each block matrix must be less than \(2^{31}\).
Warning

The block matrices must be stored in row-major format, as results from BlockMatrix.write() with force_row_major=True and from BlockMatrix.write_from_entry_expr(). Otherwise, this method will produce an error message.
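A sketch of preparing and testing augmentations in parallel, reusing the fitted full_rank model from the sketches above; BlockMatrix.from_numpy and the output path are illustrative assumptions:

>>> from hail.linalg import BlockMatrix   # import path is an assumption
>>> m = 3
>>> A = rng.normal(size=(n, m))           # columns are augmentations x_star
>>> BlockMatrix.from_numpy((P @ A).T).write('output/pa_t.bm', force_row_major=True)
>>> alt_results = full_rank.fit_alternatives('output/pa_t.bm')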
Parameters:
- pa_t_path (str) – Path to block matrix \((P_r A)^T\) with shape \((m, r)\). Each row is a projected augmentation \(P_r x_\star\) of \(P_r X\).
- a_t_path (str, optional) – Path to block matrix \(A^T\) with shape \((m, n)\). Each row is an augmentation \(x_\star\) of \(X\). Include for low-rank inference.
- partition_size (int, optional) – Number of rows to process per partition. Default given by block size of \((P_r A)^T\).

Returns: Table – Table of results for each augmented design matrix.
fit_alternatives_numpy(pa, a=None)

Fit and test alternative model for each augmented design matrix.
Notes
The resulting table has the following fields:

Field     Type     Value
idx       int64    Index of augmented design matrix.
beta      float64  \(\beta_\star\)
sigma_sq  float64  \(\sigma^2\)
chi_sq    float64  \(\chi^2\)
p_value   float64  p-value

Parameters:
- pa (numpy.ndarray) – Projected matrix \(P_r A\) of alternatives with shape \((r, m)\). Each column is a projected augmentation \(P_r x_\star\) of \(P_r X\).
- a (numpy.ndarray, optional) – Matrix \(A\) of alternatives with shape \((n, m)\). Each column is an augmentation \(x_\star\) of \(X\). Required for low-rank inference.

Returns: Table – Table of results for each augmented design matrix.
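The in-memory equivalent of the block-matrix sketch above (full-rank, so a is not needed):

>>> alt_results = full_rank.fit_alternatives_numpy(P @ A)   # pa has shape (n, m) since r = n here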
hail.methods.linear_mixed_regression(kinship_matrix, y, x, covariates=[], global_root='lmmreg_global', row_root='lmmreg', run_assoc=True, use_ml=False, delta=None, sparsity_threshold=1.0, n_eigenvectors=None, dropped_variance_fraction=None) → hail.matrixtable.MatrixTable

Use a kinship-based linear mixed model to estimate the genetic component of phenotypic variance (narrow-sense heritability) and optionally test each variant for association.
We plan to change the interface to this method in Hail 0.2 while maintaining its functionality.
Note
Requires the column key to be one field of type tstr.
Examples
Compute a KinshipMatrix, and use it to test variants for association using a linear mixed model:

>>> lmm_ds = hl.read_matrix_table("data/example_lmmreg.vds")
>>> kinship_matrix = hl.realized_relationship_matrix(lmm_ds.filter_rows(lmm_ds.use_in_kinship)['GT'])
>>> lmm_ds = hl.linear_mixed_regression(kinship_matrix,
...                                     y=lmm_ds.pheno,
...                                     x=lmm_ds.GT.n_alt_alleles(),
...                                     covariates=[lmm_ds.cov1, lmm_ds.cov2])
Notes
Suppose the variant dataset saved at data/example_lmmreg.vds has a Boolean variant-indexed field use_in_kinship and numeric or Boolean sample-indexed fields pheno, cov1, and cov2. Then the linear_mixed_regression() function in the above example will execute the following four steps in order:

1. filter the samples in the given kinship matrix to those for which lmm_ds.pheno, lmm_ds.cov1, and lmm_ds.cov2 are all defined
2. compute the eigendecomposition \(K = USU^T\) of the kinship matrix
3. fit covariate coefficients and variance parameters in the sample-covariates-only (global) model using restricted maximum likelihood (REML), storing results in a global field under lmmreg_global
4. test each variant for association, storing results in a row-indexed field under lmmreg
This plan can be modified as follows:
- Set run_assoc to False to not test any variants for association, i.e. skip Step 4.
- Set use_ml to True to use maximum likelihood instead of REML in Steps 3 and 4.
- Set the delta argument to manually set the value of \(\delta\) rather than fitting \(\delta\) in Step 3.
- Set the global_root argument to change the global annotation root in Step 3.
- Set the row_root argument to change the variant annotation root in Step 4.
linear_mixed_regression() adds the following global annotations in Step 3. These global annotations are stored under the prefix global_root, which is by default lmmreg_global. The prefix is not displayed in the table below.

Field                                Type                Value
use_ml                               bool                true if fit by ML, false if fit by REML
beta                                 dict<str, float64>  map from intercept and the given covariate expressions to the corresponding fit \(\beta\) coefficients
sigma_g_squared                      float64             fit coefficient of genetic variance, \(\hat{\sigma}_g^2\)
sigma_e_squared                      float64             fit coefficient of environmental variance, \(\hat{\sigma}_e^2\)
delta                                float64             fit ratio of variance component coefficients, \(\hat{\delta}\)
h_squared                            float64             fit narrow-sense heritability, \(\hat{h}^2\)
n_eigenvectors                       int32               number of eigenvectors of kinship matrix used to fit model
dropped_variance_fraction            float64             specified value of dropped_variance_fraction
eigenvalues                          array<float64>      all eigenvalues of the kinship matrix in descending order
fit.standard_error_h_squared         float64             standard error of \(\hat{h}^2\) under asymptotic normal approximation
fit.normalized_likelihood_h_squared  array<float64>      likelihood function of \(h^2\) normalized on the discrete grid 0.01, 0.02, ..., 0.99; index i is the likelihood for percentage i
fit.max_log_likelihood               float64             (restricted) maximum log likelihood corresponding to \(\hat{\delta}\)
fit.log_delta_grid                   array<float64>      values of \(\mathrm{ln}(\delta)\) used in the grid search
fit.log_likelihood_values            array<float64>      (restricted) log likelihood of \(y\) given \(X\) and \(\mathrm{ln}(\delta)\) at the (RE)ML fit of \(\beta\) and \(\sigma_g^2\)

These global annotations are also added to hail.log, with the ranked eigenvalues and the \(\delta\) grid values in .tsv tabular form. Use grep 'linear mixed regression' hail.log to find the lines just above each table.

If Step 4 is performed, linear_mixed_regression() also adds four linear regression row fields. These annotations are stored under row_root, which defaults to lmmreg. Once again, the prefix is not displayed in the table.

Field            Type     Value
beta             float64  fit genotype coefficient, \(\hat\beta_0\)
sigma_g_squared  float64  fit coefficient of genetic variance component, \(\hat{\sigma}_g^2\)
chi_sq_stat      float64  \(\chi^2\) statistic of the likelihood ratio test
p_value          float64  \(p\)-value

Those variants that don’t vary across the included samples (e.g., all genotypes are HomRef) will have missing annotations.
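A sketch of ranking variants by LMM p-value, assuming lmm_ds from the example above and the default row root:

>>> lmm_results = lmm_ds.rows()
>>> lmm_results = lmm_results.order_by(lmm_results.lmmreg.p_value)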
Performance
Hail’s initial version of linear_mixed_regression() scales beyond 15k samples and to an essentially unbounded number of variants, making it particularly well-suited to modern sequencing studies and complementary to tools designed for SNP arrays. Analysts have used linear_mixed_regression() in research to compute kinship from 100k common variants and test 32 million non-rare variants on 8k whole genomes in about 10 minutes on Google Cloud.

While the kinship matrix \(K\) is computed using distributed matrix multiplication, the full eigendecomposition (Step 2) is currently run on a single core of the master node using the LAPACK routine DSYEVD, which we empirically find to be the most performant of the four available routines; laptop performance plots showing cubic complexity in \(n\) are available here. On Google Cloud, eigendecomposition takes about 2 seconds for 2535 samples and 1 minute for 8185 samples. If you see worse performance, check that LAPACK natives are being properly loaded (see "BLAS and LAPACK" in Getting Started).

Given the eigendecomposition, fitting the global model (Step 3) takes on the order of a few seconds on the master node. Association testing (Step 4) is fully distributed by variant with per-variant time complexity that is completely independent of the number of sample covariates and dominated by multiplication of the genotype vector \(v\) by the matrix of eigenvectors \(U^T\) as described below, which we accelerate with a sparse representation of \(v\). The matrix \(U^T\) has size about \(8n^2\) bytes and is currently broadcast to each Spark executor. For example, with 15k samples, storing \(U^T\) consumes about 3.6GB of memory on a 16-core worker node with two 8-core executors. So for large \(n\), we recommend using a high-memory configuration such as highmem workers.
Linear mixed model
linear_mixed_regression() estimates the genetic proportion of residual phenotypic variance (narrow-sense heritability) under a kinship-based linear mixed model, and then optionally tests each variant for association using the likelihood ratio test. Inference is exact.

We first describe the sample-covariates-only model used to estimate heritability, which we simply refer to as the global model. With \(n\) samples and \(c\) sample covariates, we define:
- \(y = n \times 1\) vector of phenotypes
- \(X = n \times c\) matrix of sample covariates and intercept column of ones
- \(K = n \times n\) kinship matrix
- \(I = n \times n\) identity matrix
- \(\beta = c \times 1\) vector of covariate coefficients
- \(\sigma_g^2 =\) coefficient of genetic variance component \(K\)
- \(\sigma_e^2 =\) coefficient of environmental variance component \(I\)
- \(\delta = \frac{\sigma_e^2}{\sigma_g^2} =\) ratio of environmental and genetic variance component coefficients
- \(h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2} = \frac{1}{1 + \delta} =\) genetic proportion of residual phenotypic variance
Under a linear mixed model, \(y\) is sampled from the \(n\)-dimensional multivariate normal distribution with mean \(X \beta\) and variance components that are scalar multiples of \(K\) and \(I\):
\[y \sim \mathrm{N}\left(X\beta, \sigma_g^2 K + \sigma_e^2 I\right)\]

Thus the model posits that the residuals \(y_i - X_{i,:}\beta\) and \(y_j - X_{j,:}\beta\) have covariance \(\sigma_g^2 K_{ij}\) and approximate correlation \(h^2 K_{ij}\). Informally: phenotype residuals are correlated as the product of overall heritability and pairwise kinship. By contrast, standard (unmixed) linear regression is equivalent to fixing \(\sigma_g^2\) (equivalently, \(h^2\)) at 0 above, so that all phenotype residuals are independent.
Caution: while it is tempting to interpret \(h^2\) as the narrow-sense heritability of the phenotype alone, note that its value depends not only on the phenotype and genetic data, but also on the choice of sample covariates.
Fitting the global model
The core algorithm is essentially a distributed implementation of the spectral approach taken in FastLMM. Let \(K = USU^T\) be the eigendecomposition of the real symmetric matrix \(K\). That is:
- \(U = n \times n\) orthonormal matrix whose columns are the eigenvectors of \(K\)
- \(S = n \times n\) diagonal matrix of eigenvalues of \(K\) in descending order. \(S_{ii}\) is the eigenvalue of eigenvector \(U_{:,i}\)
- \(U^T = n \times n\) orthonormal matrix, the transpose (and inverse) of \(U\)
A bit of matrix algebra on the multivariate normal density shows that the linear mixed model above is mathematically equivalent to the model
\[U^Ty \sim \mathrm{N}\left(U^TX\beta, \sigma_g^2 (S + \delta I)\right)\]

for which the covariance is diagonal (i.e., unmixed). That is, rotating the phenotype vector (\(y\)) and covariate vectors (columns of \(X\)) in \(\mathbb{R}^n\) by \(U^T\) transforms the model to one with independent residuals. For any particular value of \(\delta\), the restricted maximum likelihood (REML) solution for the latter model can be solved exactly in time complexity that is linear rather than cubic in \(n\). In particular, having rotated, we can run a very efficient 1-dimensional optimization procedure over \(\delta\) to find the REML estimate \((\hat{\delta}, \hat{\beta}, \hat{\sigma}_g^2)\) of the triple \((\delta, \beta, \sigma_g^2)\), which in turn determines \(\hat{\sigma}_e^2\) and \(\hat{h}^2\).
We first compute the maximum log likelihood on a \(\delta\)-grid that is uniform on the log scale, with \(\mathrm{ln}(\delta)\) running from -8 to 8 by 0.01, corresponding to \(h^2\) decreasing from 0.9995 to 0.0005. If \(h^2\) is maximized at the lower boundary then standard linear regression would be more appropriate and Hail will exit; more generally, consider using standard linear regression when \(\hat{h}^2\) is very small. A maximum at the upper boundary is highly suspicious and will also cause Hail to exit. In any case, the log file records the table of grid values for further inspection, beginning under the info line containing “linear mixed regression: table of delta”.
If the optimal grid point falls in the interior of the grid as expected, we then use Brent’s method to find the precise location of the maximum over the same range, with initial guess given by the optimal grid point and a tolerance on \(\mathrm{ln}(\delta)\) of 1e-6. If this location differs from the optimal grid point by more than 0.01, a warning will be displayed and logged, and one would be wise to investigate by plotting the values over the grid.
Note that \(h^2\) is related to \(\mathrm{ln}(\delta)\) through the sigmoid function. More precisely,
\[h^2 = 1 - \mathrm{sigmoid}(\mathrm{ln}(\delta)) = \mathrm{sigmoid}(-\mathrm{ln}(\delta))\]

Hence one can change variables to extract a high-resolution discretization of the likelihood function of \(h^2\) over \([0, 1]\) at the corresponding REML estimators for \(\beta\) and \(\sigma_g^2\), as well as integrate over the normalized likelihood function using change of variables and the sigmoid differential equation.
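A quick numeric check of this identity:

>>> import numpy as np
>>> log_delta = 1.5
>>> h_sq = 1.0 / (1.0 + np.exp(log_delta))   # sigmoid(-ln(delta)) = 1 / (1 + delta)
>>> assert abs(h_sq - (1 - 1.0 / (1.0 + np.exp(-log_delta)))) < 1e-12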
For convenience, lmmreg.fit.normalized_likelihood_h_squared records the likelihood function of \(h^2\) normalized over the discrete grid \(0.01, 0.02, \ldots, 0.98, 0.99\). The length of the array is 101 so that index i contains the likelihood at percentage i. The values at indices 0 and 100 are left undefined.

By the theory of maximum likelihood estimation, this normalized likelihood function is approximately normally distributed near the maximum likelihood estimate. So we estimate the standard error of the estimator of \(h^2\) as follows. Let \(x_2\) be the maximum likelihood estimate of \(h^2\) and let \(x_1\) and \(x_3\) be just to the left and right of \(x_2\). Let \(y_1\), \(y_2\), and \(y_3\) be the corresponding values of the (unnormalized) log likelihood function. Setting equal the leading coefficient of the unique parabola through these points (as given by Lagrange interpolation) and the leading coefficient of the log of the normal distribution, we have:
\[\frac{x_3 (y_2 - y_1) + x_2 (y_1 - y_3) + x_1 (y_3 - y_2)}{(x_2 - x_1)(x_1 - x_3)(x_3 - x_2)} = -\frac{1}{2 \sigma^2}\]

The standard error \(\hat{\sigma}\) is then estimated by solving for \(\sigma\).
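A sketch of that solve with hypothetical grid values:

>>> x1, x2, x3 = 0.49, 0.50, 0.51          # h^2 grid points around the maximum
>>> y1, y2, y3 = -10.02, -10.00, -10.03    # hypothetical unnormalized log likelihoods
>>> lead = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) \
...        / ((x2 - x1) * (x1 - x3) * (x3 - x2))
>>> sigma_hat = (-1.0 / (2.0 * lead)) ** 0.5   # solve lead = -1 / (2 sigma^2)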
Note that the mean and standard deviation of the (discretized or continuous) distribution held in lmmreg.fit.normalized_likelihood_h_squared will not coincide with \(\hat{h}^2\) and \(\hat{\sigma}\), since this distribution only becomes normal in the infinite sample limit. One can visually assess normality by plotting this distribution against a normal distribution with the same mean and standard deviation, or use this distribution to approximate credible intervals under a flat prior on \(h^2\).
Testing each variant for association
Fixing a single variant, we define:
- \(v = n \times 1\) input vector, with missing values imputed as the mean of the non-missing values
- \(X_v = \left[v | X \right] = n \times (1 + c)\) matrix concatenating \(v\) and \(X\)
- \(\beta_v = (\beta^0_v, \beta^1_v, \ldots, \beta^c_v) = (1 + c) \times 1\) vector of covariate coefficients
Fixing \(\delta\) at the global REML estimate \(\hat{\delta}\), we find the REML estimate \((\hat{\beta}_v, \hat{\sigma}_{g, v}^2)\) via rotation of the model
\[y \sim \mathrm{N}\left(X_v\beta_v, \sigma_{g,v}^2 (K + \hat{\delta} I)\right)\]

Note that the only new rotation to compute here is \(U^T v\).
To test the null hypothesis that the genotype coefficient \(\beta^0_v\) is zero, we consider the restricted model with parameters \(((0, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\) within the full model with parameters \(((\beta^0_v, \beta^1_v, \ldots, \beta^c_v), \sigma_{g,v}^2)\), with \(\delta\) fixed at \(\hat\delta\) in both. The former fit is simply that of the global model, \(((0, \hat{\beta}^1, \ldots, \hat{\beta}^c), \hat{\sigma}_g^2)\). The likelihood ratio test statistic is given by
\[\chi^2 = n \, \mathrm{ln}\left(\frac{\hat{\sigma}^2_g}{\hat{\sigma}_{g,v}^2}\right)\]

and follows a chi-squared distribution with one degree of freedom. Here the ratio \(\hat{\sigma}^2_g / \hat{\sigma}_{g,v}^2\) captures the degree to which adding the variant \(v\) to the global model reduces the residual phenotypic variance.
Kinship Matrix
FastLMM uses the Realized Relationship Matrix (RRM) for kinship. This can be computed with rrm(). However, any instance of KinshipMatrix may be used, so long as sample_list() contains the complete samples of the caller variant dataset in the same order.

Low-rank approximation of kinship for improved performance
linear_mixed_regression() can implicitly use a low-rank approximation of the kinship matrix to more rapidly fit delta and the statistics for each variant. The computational complexity per variant is proportional to the number of eigenvectors used. This number can be specified in two ways. Specify the parameter n_eigenvectors to use only the top n_eigenvectors eigenvectors. Alternatively, specify dropped_variance_fraction to use as many eigenvectors as necessary to capture all but at most this fraction of the sample variance (also known as the trace, or the sum of the eigenvalues). For example, setting dropped_variance_fraction to 0.01 will use the minimal number of eigenvectors to account for 99% of the sample variance. Specifying both parameters will apply the more stringent (fewest eigenvectors) of the two.

Further background
For the history and mathematics of linear mixed models in genetics, including FastLMM, see Christoph Lippert’s PhD thesis. For an investigation of various approaches to defining kinship, see Comparison of Methods to Account for Relatedness in Genome-Wide Association Studies with Family-Based Data.
Parameters:
- kinship_matrix (KinshipMatrix) – Kinship matrix to be used.
- y (Float64Expression) – Column-indexed response expression.
- x (Float64Expression) – Entry-indexed expression for input variable.
- covariates (list of Float64Expression) – List of column-indexed covariate expressions.
- global_root (str) – Global field root.
- row_root (str) – Row-indexed field root.
- run_assoc (bool) – If true, run association testing in addition to fitting the global model.
- use_ml (bool) – Use ML instead of REML throughout.
- delta (float or None) – Fixed delta value to use in the global model, overrides fitting delta.
- sparsity_threshold (float) – Genotype vector sparsity at or below which to use sparse genotype vector in rotation (advanced).
- n_eigenvectors (int) – Number of eigenvectors of the kinship matrix used to fit the model.
- dropped_variance_fraction (float) – Upper bound on fraction of sample variance lost by dropping eigenvectors with small eigenvalues.
Returns: MatrixTable – Matrix table with regression results in new global and (optionally) row-indexed fields.
hail.methods.pca(entry_expr, k=10, compute_loadings=False) → Tuple[List[float], hail.table.Table, hail.table.Table]

Run principal component analysis (PCA) on numeric columns derived from a matrix table.
Examples
For a matrix table with variant rows, sample columns, and genotype entries, compute the top 2 PC sample scores and eigenvalues of the matrix of 0s and 1s encoding missingness of genotype calls.
>>> eigenvalues, scores, _ = hl.pca(hl.int(hl.is_defined(dataset.GT)),
...                                 k=2)
Warning
This method does not automatically mean-center or normalize each column. If desired, such transformations should be incorporated in entry_expr.
Hail will return an error if entry_expr evaluates to missing, nan, or infinity on any entry.
Notes
PCA is run on the columns of the numeric matrix obtained by evaluating entry_expr on each entry of the matrix table, or equivalently on the rows of the transposed numeric matrix \(M\) referenced below.
PCA computes the SVD

\[M = USV^T\]

where columns of \(U\) are left singular vectors (orthonormal in \(\mathbb{R}^n\)), columns of \(V\) are right singular vectors (orthonormal in \(\mathbb{R}^m\)), and \(S=\mathrm{diag}(s_1, s_2, \ldots)\) with ordered singular values \(s_1 \ge s_2 \ge \cdots \ge 0\). Typically one computes only the first \(k\) singular vectors and values, yielding the best rank \(k\) approximation \(U_k S_k V_k^T\) of \(M\); the truncations \(U_k\), \(S_k\) and \(V_k\) are \(n \times k\), \(k \times k\) and \(m \times k\) respectively.
From the perspective of the rows of \(M\) as samples (data points), \(V_k\) contains the loadings for the first \(k\) PCs while \(MV_k = U_k S_k\) contains the first \(k\) PC scores of each sample. The loadings represent a new basis of features while the scores represent the projected data on those features. The eigenvalues of the Gramian \(MM^T\) are the squares of the singular values \(s_1^2, s_2^2, \ldots\), which represent the variances carried by the respective PCs. By default, Hail only computes the loadings if compute_loadings is set to True.

Scores are stored in a Table with the column key of the matrix table as key and a field scores of type array<float64> containing the principal component scores.

Loadings are stored in a Table with the row key of the matrix table as key and a field loadings of type array<float64> containing the principal component loadings.

The eigenvalues are returned in descending order, with scores and loadings given the corresponding array order.
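A sketch of joining the scores back onto the dataset's columns, assuming the column key is the sample ID s:

>>> dataset = dataset.annotate_cols(pca_scores=scores[dataset.s].scores)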
Parameters:
- entry_expr (Expression) – Numeric expression for matrix entries.
- k (int) – Number of principal components.
- compute_loadings (bool) – If True, compute row loadings.
Returns: (list of float, Table, Table) – List of eigenvalues, table with column scores, table with row loadings.