Aggregators
The aggregators module is exposed as hl.agg, e.g. hl.agg.sum.
collect (expr) |
Collect records into an array. |
collect_as_set (expr) |
Collect records into a set. |
count ([expr]) |
Count the number of records. |
count_where (condition) |
Count the number of records where a predicate is True . |
counter (expr) |
Count the occurrences of each unique record and return a dictionary. |
any (condition) |
Returns True if condition is True for any record. |
all (condition) |
Returns True if condition is True for every record. |
take (expr, n[, ordering]) |
Take n records of expr, optionally ordered by ordering. |
min (expr) |
Compute the minimum expr. |
max (expr) |
Compute the maximum expr. |
sum (expr) |
Compute the sum of all records of expr. |
array_sum (expr) |
Compute the coordinate-wise sum of all records of expr. |
mean (expr) |
Compute the mean value of records of expr. |
stats (expr) |
Compute a number of useful statistics about expr. |
product (expr) |
Compute the product of all records of expr. |
fraction (predicate) |
Compute the fraction of records where predicate is True . |
hardy_weinberg_test (expr) |
Performs test of Hardy-Weinberg equilibrium. |
explode (expr) |
Explode an array or set expression to aggregate the elements of all records. |
filter (condition, expr) |
Filter records according to a predicate. |
inbreeding (expr, prior) |
Compute inbreeding statistics on calls. |
call_stats (call, alleles) |
Compute useful call statistics. |
info_score (gp) |
Compute the IMPUTE information score. |
hist (expr, start, end, bins) |
Compute binned counts of a numeric expression. |
linreg (y, x[, nested_dim, weight]) |
Compute multivariate linear regression statistics. |
corr (x, y) |
Computes the Pearson correlation coefficient between x and y. |
group_by (group, agg_expr) |
Compute aggregation statistics stratified by one or more groups. |
downsample (x, y[, label, n_divisions]) |
Downsample (x, y) coordinate datapoints. |
-
hail.expr.aggregators.collect(expr) → hail.expr.expressions.typed_expressions.ArrayExpression
Collect records into an array.
Examples
Collect the ID field where HT is greater than 68:
>>> table1.aggregate(agg.collect(agg.filter(table1.HT > 68, table1.ID)))
[2, 3]
Notes
The element order of the resulting array is not guaranteed, and in some cases is non-deterministic.
Use collect_as_set() to collect unique items.
Warning
Collecting a large number of items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to collect.
Returns: ArrayExpression – Array of all expr records.
-
hail.expr.aggregators.collect_as_set(expr) → hail.expr.expressions.typed_expressions.SetExpression
Collect records into a set.
Examples
Collect the unique ID field where HT is greater than 68:
>>> table1.aggregate(agg.collect_as_set(agg.filter(table1.HT > 68, table1.ID)))
set([2, 3])
Warning
Collecting a large number of items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to collect.
Returns: SetExpression – Set of unique expr records.
-
hail.expr.aggregators.count(expr=None) → hail.expr.expressions.typed_expressions.Int64Expression
Count the number of records.
Examples
Group by the SEX field and count the number of rows in each category:
>>> (table1.group_by(table1.SEX)
...        .aggregate(n=agg.count())
...        .show())
+-----+-------+
| SEX | n     |
+-----+-------+
| str | int64 |
+-----+-------+
| M   | 2     |
| F   | 2     |
+-----+-------+
Notes
If expr is not provided, then this method will count the number of records aggregated. If expr is provided, it should make use of filter() or explode() so that the number of records aggregated changes.
Parameters: expr (Expression, or None) – Expression to count.
Returns: Expression of type tint64 – Total number of records.
-
hail.expr.aggregators.count_where(condition) → hail.expr.expressions.typed_expressions.Int64Expression
Count the number of records where a predicate is True.
Examples
Count the number of individuals with HT greater than 68:
>>> table1.aggregate(agg.count_where(table1.HT > 68))
2
Parameters: condition (BooleanExpression) – Criteria for inclusion.
Returns: Expression of type tint64 – Total number of records where condition is True.
-
hail.expr.aggregators.counter(expr) → hail.expr.expressions.typed_expressions.DictExpression
Count the occurrences of each unique record and return a dictionary.
Examples
Count the number of individuals for each unique SEX value:
>>> table1.aggregate(agg.counter(table1.SEX))
{'M': 2L, 'F': 2L}
Notes
This aggregator method returns a dict expression whose key type is the same type as expr and whose value type is Expression of type tint64. This dict contains a key for each unique value of expr, and the value is the number of times that key was observed.
Ensure that the result can be stored in memory on a single machine.
Warning
Using counter() with a large number of unique items can cause out-of-memory exceptions.
Parameters: expr (Expression) – Expression to count by key.
Returns: DictExpression – Dictionary with the number of occurrences of each unique record.
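The counting behaviour described for counter() can be mimicked locally with Python's collections.Counter. This is an illustrative plain-Python sketch rather than Hail code; the sex_values list is a hypothetical stand-in for the SEX field of table1.

```python
from collections import Counter

# Hypothetical stand-in for the SEX field of table1 (two 'M', two 'F').
sex_values = ['M', 'F', 'M', 'F']

# One key per unique value; each value is the number of times it was observed.
counts = dict(Counter(sex_values))
print(counts)
```

As with counter() itself, the resulting dictionary holds one entry per unique value, so its size grows with the number of distinct keys.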
-
hail.expr.aggregators.any(condition) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if condition is True for any record.
Examples
>>> (table1.group_by(table1.SEX)
...        .aggregate(any_over_70 = agg.any(table1.HT > 70))
...        .show())
+-----+-------------+
| SEX | any_over_70 |
+-----+-------------+
| str | bool        |
+-----+-------------+
| M   | true        |
| F   | false       |
+-----+-------------+
Notes
If there are no records to aggregate, the result is False.
Missing records are not considered. If every record is missing, the result is also False.
Parameters: condition (BooleanExpression) – Condition to test.
Returns: BooleanExpression
-
hail.expr.aggregators.all(condition) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if condition is True for every record.
Examples
>>> (table1.group_by(table1.SEX)
...        .aggregate(all_under_70 = agg.all(table1.HT < 70))
...        .show())
+-----+--------------+
| SEX | all_under_70 |
+-----+--------------+
| str | bool         |
+-----+--------------+
| M   | false        |
| F   | false        |
+-----+--------------+
Notes
If there are no records to aggregate, the result is True.
Missing records are not considered. If every record is missing, the result is also True.
Parameters: condition (BooleanExpression) – Condition to test.
Returns: BooleanExpression
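The empty-input and missing-record rules stated for any() and all() can be sketched in plain Python, using None to stand for a missing record. The helper names agg_any and agg_all are hypothetical and only model the documented behaviour; they are not part of Hail.

```python
def agg_any(records):
    """True if some non-missing record is True; False if nothing remains."""
    return any(r for r in records if r is not None)


def agg_all(records):
    """True if every non-missing record is True; True if nothing remains."""
    return all(r for r in records if r is not None)


print(agg_any([]), agg_any([None, None]))  # no usable records
print(agg_all([]), agg_all([None, None]))  # no usable records
```

The two defaults differ because each is the identity of its operation: an empty "or" is False and an empty "and" is True.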
-
hail.expr.aggregators.take(expr, n, ordering=None) → hail.expr.expressions.typed_expressions.ArrayExpression
Take n records of expr, optionally ordered by ordering.
Examples
Take 3 elements of field X:
>>> table1.aggregate(agg.take(table1.X, 3))
[5, 6, 7]
Take the ID and HT fields, ordered by HT (descending):
>>> table1.aggregate(agg.take(hl.struct(ID=table1.ID, HT=table1.HT),
...                           3,
...                           ordering=-table1.HT))
[Struct(ID=2, HT=72), Struct(ID=3, HT=70), Struct(ID=1, HT=65)]
Notes
The resulting array can include fewer than n elements if there are fewer than n total records.
The ordering argument may be an expression, a function, or None.
If ordering is an expression, this expression’s type should be one with a natural ordering (e.g. numeric).
If ordering is a function, it will be evaluated on each record of expr to compute the value used for ordering. In the above example, ordering=-table1.HT and ordering=lambda x: -x.HT would be equivalent.
If ordering is None, then there is no guaranteed ordering on the elements taken, and the results may be non-deterministic.
Missing values are always sorted last.
Parameters:
- expr (Expression) – Expression to store.
- n (Expression of type tint32) – Number of records to take.
- ordering (Expression or function ((arg) -> Expression) or None) – Optional ordering on records.
Returns: ArrayExpression – Array of up to n records of expr.
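The ordering rules for take() can be illustrated with a small plain-Python model. This is a sketch, not Hail's implementation: the take function below is hypothetical, None stands for a missing value, the first four rows use HT values inferred from the examples on this page, and the fifth row is invented to show where a missing key lands.

```python
def take(records, n, ordering=None):
    """Return up to n records, optionally sorted by a key function."""
    if ordering is None:
        # No ordering requested: any n records would be acceptable.
        return records[:n]

    def sort_key(record):
        key = ordering(record)
        # Records with a missing key (None) sort after all others.
        return (key is None, 0 if key is None else key)

    return sorted(records, key=sort_key)[:n]


rows = [
    {'ID': 1, 'HT': 65},
    {'ID': 2, 'HT': 72},
    {'ID': 3, 'HT': 70},
    {'ID': 4, 'HT': 60},
    {'ID': 5, 'HT': None},  # invented row with a missing HT
]


def by_descending_ht(row):
    return None if row['HT'] is None else -row['HT']


top3 = take(rows, 3, ordering=by_descending_ht)
print([row['ID'] for row in top3])
```

Asking for more records than exist simply returns all of them, matching the note that the result may contain fewer than n elements.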
-
hail.expr.aggregators.min(expr) → hail.expr.expressions.typed_expressions.NumericExpression
Compute the minimum expr.
Examples
Compute the minimum value of HT:
>>> table1.aggregate(agg.min(table1.HT))
60
Notes
This method returns the minimum non-missing value. If there are no non-missing values, then the result is missing.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: NumericExpression – Minimum value of all expr records, same type as expr.
-
hail.expr.aggregators.max(expr) → hail.expr.expressions.typed_expressions.NumericExpression
Compute the maximum expr.
Examples
Compute the maximum value of HT:
>>> table1.aggregate(agg.max(table1.HT))
72
Notes
This method returns the maximum non-missing value. If there are no non-missing values, then the result is missing.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: NumericExpression – Maximum value of all expr records, same type as expr.
-
hail.expr.aggregators.sum(expr)
Compute the sum of all records of expr.
Examples
Compute the sum of field C1:
>>> table1.aggregate(agg.sum(table1.C1))
25
Notes
Missing values are ignored (treated as zero).
If expr is an expression of type tint32, tint64, or tbool, then the result is an expression of type tint64. If expr is an expression of type tfloat32 or tfloat64, then the result is an expression of type tfloat64.
Warning
Boolean values are cast to integers before computing the sum.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tint64 or tfloat64 – Sum of records of expr.
-
hail.expr.aggregators.array_sum(expr) → hail.expr.expressions.typed_expressions.ArrayExpression
Compute the coordinate-wise sum of all records of expr.
Examples
Compute the sum of C1 and C2:
>>> table1.aggregate(agg.array_sum([table1.C1, table1.C2]))
[25, 46]
Notes
All records must have the same length. Each coordinate is summed independently as described in sum().
Parameters: expr (ArrayNumericExpression)
Returns: ArrayExpression with element type tint64 or tfloat64
-
hail.expr.aggregators.mean(expr) → hail.expr.expressions.typed_expressions.Float64Expression
Compute the mean value of records of expr.
Examples
Compute the mean of field HT:
>>> table1.aggregate(agg.mean(table1.HT))
66.75
Notes
Missing values are ignored.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tfloat64 – Mean value of records of expr.
-
hail.expr.aggregators.stats(expr) → hail.expr.expressions.typed_expressions.StructExpression
Compute a number of useful statistics about expr.
Examples
Compute statistics about field HT:
>>> table1.aggregate(agg.stats(table1.HT))
Struct(min=60.0, max=72.0, sum=267.0, stdev=4.65698400255, n=4, mean=66.75)
Notes
Computes a struct with the following fields:
- min (tfloat64) - Minimum value.
- max (tfloat64) - Maximum value.
- mean (tfloat64) - Mean value.
- stdev (tfloat64) - Standard deviation.
- n (tfloat64) - Number of non-missing records.
- sum (tfloat64) - Sum.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: StructExpression – Struct expression with fields mean, stdev, min, max, n, and sum.
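The figures in the stats() example can be reproduced with Python's statistics module, which also shows that the reported stdev agrees with the population (divide by n) standard deviation rather than the sample one. This is a plain-Python sketch; the ht list is an assumption about table1's contents, inferred from the HT values that appear in the other examples on this page.

```python
import statistics

# HT values inferred from the examples on this page (assumed, not authoritative).
ht = [65.0, 72.0, 70.0, 60.0]

summary = {
    'min': min(ht),
    'max': max(ht),
    'sum': sum(ht),
    'mean': statistics.mean(ht),
    # Population standard deviation: divides by n, not n - 1.
    'stdev': statistics.pstdev(ht),
    'n': len(ht),
}
print(summary)
```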
-
hail.expr.aggregators.product(expr)
Compute the product of all records of expr.
Examples
Compute the product of field C1:
>>> table1.aggregate(agg.product(table1.C1))
440
Notes
Missing values are ignored (treated as one).
If expr is an expression of type tint32, tint64, or tbool, then the result is an expression of type tint64. If expr is an expression of type tfloat32 or tfloat64, then the result is an expression of type tfloat64.
Warning
Boolean values are cast to integers before computing the product.
Parameters: expr (NumericExpression) – Numeric expression.
Returns: Expression of type tint64 or tfloat64 – Product of records of expr.
-
hail.expr.aggregators.fraction(predicate) → hail.expr.expressions.typed_expressions.Float64Expression
Compute the fraction of records where predicate is True.
Examples
Compute the fraction of rows where SEX is “F” and HT > 65:
>>> table1.aggregate(agg.fraction((table1.SEX == 'F') & (table1.HT > 65)))
0.25
Notes
Missing values for predicate are treated as False.
Parameters: predicate (BooleanExpression) – Boolean predicate.
Returns: Expression of type tfloat64 – Fraction of records where predicate is True.
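Because a missing predicate counts as False, missing records still contribute to the denominator of fraction(). The plain-Python sketch below models that rule with None as the missing value. The helper name agg_fraction is hypothetical, and returning None for an empty input is an assumption, since the text does not say what Hail returns in that case.

```python
def agg_fraction(predicate_values):
    """Fraction of records whose predicate is True; missing counts as False."""
    total = len(predicate_values)
    if total == 0:
        # Assumption: the behaviour for zero records is not documented here.
        return None
    hits = sum(1 for p in predicate_values if p is True)
    return hits / total


# One True out of four records; the missing record stays in the denominator.
print(agg_fraction([True, False, None, False]))
```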
-
hail.expr.aggregators.hardy_weinberg_test(expr) → hail.expr.expressions.typed_expressions.StructExpression
Performs test of Hardy-Weinberg equilibrium.
Examples
Test each row of a dataset:
>>> dataset_result = dataset.annotate_rows(hwe = agg.hardy_weinberg_test(dataset.GT))
Test each row on a sub-population:
>>> dataset_result = dataset.annotate_rows(
...     hwe_eas = agg.hardy_weinberg_test(agg.filter(dataset.pop == 'EAS', dataset.GT)))
Notes
This method performs the test described in functions.hardy_weinberg_test() based solely on the counts of homozygous reference, heterozygous, and homozygous variant calls.
The resulting struct expression has two fields:
- het_freq_hwe (tfloat64) - Expected frequency of heterozygous calls under Hardy-Weinberg equilibrium.
- p_value (tfloat64) - p-value from test of Hardy-Weinberg equilibrium.
Hail computes the exact p-value with mid-p-value correction, i.e. the probability of a less-likely outcome plus one-half the probability of an equally-likely outcome. See this document for details on the Levene-Haldane distribution and references.
Warning
Non-diploid calls (ploidy != 2) are ignored in the counts. While the counts are defined for multiallelic variants, this test is only statistically rigorous in the biallelic setting; use split_multi() to split multiallelic variants beforehand.
Parameters: expr (CallExpression) – Call to test for Hardy-Weinberg equilibrium.
Returns: StructExpression – Struct expression with fields het_freq_hwe and p_value.
-
hail.expr.aggregators.explode(expr) → hail.expr.expressions.base_expression.Aggregable
Explode an array or set expression to aggregate the elements of all records.
Examples
Compute the mean of all elements in fields C1, C2, and C3:
>>> table1.aggregate(agg.mean(agg.explode([table1.C1, table1.C2, table1.C3])))
24.8333333333
Compute the set of all observed elements in the filters field (Set[String]):
>>> dataset.aggregate_rows(agg.collect_as_set(agg.explode(dataset.filters)))
set([u'VQSRTrancheSNP99.80to99.90',
     u'VQSRTrancheINDEL99.95to100.00',
     u'VQSRTrancheINDEL99.00to99.50',
     u'VQSRTrancheINDEL97.00to99.00',
     u'VQSRTrancheSNP99.95to100.00',
     u'VQSRTrancheSNP99.60to99.80',
     u'VQSRTrancheINDEL99.50to99.90',
     u'VQSRTrancheSNP99.90to99.95',
     u'VQSRTrancheINDEL96.00to97.00'])
Notes
This method can be used with aggregator functions to aggregate the elements of collection types (tarray and tset).
The result of the explode() and filter() methods is an Aggregable expression which can be used only in aggregator methods.
Parameters: expr (CollectionExpression) – Expression of type tarray or tset.
Returns: Aggregable – Aggregable expression.
-
hail.expr.aggregators.filter(condition, expr) → hail.expr.expressions.base_expression.Aggregable
Filter records according to a predicate.
Examples
Collect the ID field where HT >= 70:
>>> table1.aggregate(agg.collect(agg.filter(table1.HT >= 70, table1.ID)))
[2, 3]
Notes
This method can be used with aggregator functions to remove records from aggregation.
The result of the explode() and filter() methods is an Aggregable expression which can be used only in aggregator methods.
Parameters:
- condition (BooleanExpression or function ((arg) -> BooleanExpression)) – Filter expression, or a function to evaluate for each record.
- expr (Expression) – Expression to filter.
Returns: Aggregable – Aggregable expression.
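Conceptually, filter() narrows the stream of records and explode() flattens collection-valued records before an aggregator consumes them. The sketch below models both steps with plain Python comprehensions. It is only an analogy, not Hail code: the rows list uses HT values inferred from the examples on this page, and arrays is invented illustrative data.

```python
# HT values inferred from the examples on this page (assumed table1 contents).
rows = [
    {'ID': 1, 'HT': 65},
    {'ID': 2, 'HT': 72},
    {'ID': 3, 'HT': 70},
    {'ID': 4, 'HT': 60},
]

# filter + collect: keep only the IDs whose record passes the predicate.
kept_ids = [row['ID'] for row in rows if row['HT'] >= 70]

# explode + mean: flatten array-valued records, then aggregate the elements.
arrays = [[1, 2], [3], []]  # invented array-valued records
elements = [x for arr in arrays for x in arr]
mean_of_elements = sum(elements) / len(elements)

print(kept_ids, elements, mean_of_elements)
```

Unlike this list-based model, the order of a collected Hail array is not guaranteed.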
-
hail.expr.aggregators.inbreeding(expr, prior) → hail.expr.expressions.typed_expressions.StructExpression
Compute inbreeding statistics on calls.
Examples
Compute inbreeding statistics per column:
>>> dataset_result = dataset.annotate_cols(IB = agg.inbreeding(dataset.GT, dataset.variant_qc.AF[1]))
>>> dataset_result.cols().show()
+----------------+--------------+-------------+------------------+------------------+
| s              | IB.f_stat    | IB.n_called | IB.expected_homs | IB.observed_homs |
+----------------+--------------+-------------+------------------+------------------+
| str            | float64      | int64       | float64          | int64            |
+----------------+--------------+-------------+------------------+------------------+
| C1046::HG02024 | -1.23867e-01 | 338         | 2.96180e+02      | 291              |
| C1046::HG02025 | 2.02944e-02  | 339         | 2.97151e+02      | 298              |
| C1046::HG02026 | 5.47269e-02  | 336         | 2.94742e+02      | 297              |
| C1047::HG00731 | -1.89046e-02 | 337         | 2.95779e+02      | 295              |
| C1047::HG00732 | 1.38718e-01  | 337         | 2.95202e+02      | 301              |
| C1047::HG00733 | 3.50684e-01  | 338         | 2.96418e+02      | 311              |
| C1048::HG02024 | -1.95603e-01 | 338         | 2.96180e+02      | 288              |
| C1048::HG02025 | 2.02944e-02  | 339         | 2.97151e+02      | 298              |
| C1048::HG02026 | 6.74296e-02  | 338         | 2.96180e+02      | 299              |
| C1049::HG00731 | -1.00467e-02 | 337         | 2.95418e+02      | 295              |
+----------------+--------------+-------------+------------------+------------------+
Notes
E is the total number of expected homozygous calls, given by the sum of 1 - 2.0 * prior * (1 - prior) across records.
O is the observed number of homozygous calls across records.
N is the number of non-missing calls.
F is the inbreeding coefficient, and is computed by: (O - E) / (N - E).
This method returns a struct expression with four fields:
- f_stat (tfloat64): F, the inbreeding coefficient.
- n_called (tint64): N, the number of non-missing calls.
- expected_homs (tfloat64): E, the expected number of homozygous calls.
- observed_homs (tint64): O, the observed number of homozygous calls.
Parameters:
- expr (CallExpression) – Call expression.
- prior (Expression of type tfloat64) – Alternate allele frequency prior.
Returns: StructExpression – Struct expression with fields f_stat, n_called, expected_homs, observed_homs.
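The quantities E, O, N and F defined in the Notes can be computed by hand for a small input. The following is a plain-Python sketch of those formulas, not Hail's implementation. Calls are modelled as (allele, allele) tuples with None for a missing call; skipping records with a missing prior and returning None for F when N equals E are assumptions made for this sketch.

```python
def inbreeding(calls, priors):
    """Return (f_stat, n_called, expected_homs, observed_homs)."""
    n_called = 0
    expected_homs = 0.0
    observed_homs = 0
    for call, prior in zip(calls, priors):
        if call is None or prior is None:
            continue  # assumption: records with a missing input are skipped
        n_called += 1
        expected_homs += 1 - 2.0 * prior * (1 - prior)
        if call[0] == call[1]:
            observed_homs += 1
    denominator = n_called - expected_homs
    # Assumption: F is reported as missing when the denominator is zero.
    f_stat = None if denominator == 0 else (observed_homs - expected_homs) / denominator
    return f_stat, n_called, expected_homs, observed_homs


calls = [(0, 0), (0, 1), (1, 1), None]
priors = [0.5, 0.5, 0.5, 0.5]
f_stat, n_called, expected_homs, observed_homs = inbreeding(calls, priors)
print(f_stat, n_called, expected_homs, observed_homs)
```

With three called genotypes at prior 0.5, E is 1.5 and O is 2, so F is (2 - 1.5) / (3 - 1.5), or one third.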
-
hail.expr.aggregators.call_stats(call, alleles) → hail.expr.expressions.typed_expressions.StructExpression
Compute useful call statistics.
Examples
Compute call statistics per row:
>>> dataset_result = dataset.annotate_rows(gt_stats = agg.call_stats(dataset.GT, dataset.alleles))
>>> dataset_result.rows().key_by('locus').select('gt_stats').show()
+---------------+--------------+----------------+-------------+---------------------------+
| locus         | gt_stats.AC  | gt_stats.AF    | gt_stats.AN | gt_stats.homozygote_count |
+---------------+--------------+----------------+-------------+---------------------------+
| locus<GRCh37> | array<int32> | array<float64> | int32       | array<int32>              |
+---------------+--------------+----------------+-------------+---------------------------+
| 20:10579373   | [199,1]      | [0.995,0.005]  | 200         | [99,0]                    |
| 20:13695607   | [177,23]     | [0.885,0.115]  | 200         | [77,0]                    |
| 20:13698129   | [198,2]      | [0.99,0.01]    | 200         | [98,0]                    |
| 20:14306896   | [142,58]     | [0.71,0.29]    | 200         | [51,9]                    |
| 20:14306953   | [121,79]     | [0.605,0.395]  | 200         | [38,17]                   |
| 20:15948325   | [172,2]      | [0.989,0.012]  | 174         | [85,0]                    |
| 20:15948326   | [174,8]      | [0.956,0.043]  | 182         | [83,0]                    |
| 20:17479423   | [199,1]      | [0.995,0.005]  | 200         | [99,0]                    |
| 20:17600357   | [79,121]     | [0.395,0.605]  | 200         | [24,45]                   |
| 20:17640833   | [193,3]      | [0.985,0.015]  | 196         | [95,0]                    |
+---------------+--------------+----------------+-------------+---------------------------+
Notes
This method is meaningful for computing call metrics per variant, but not especially meaningful for computing metrics per sample.
This method returns a struct expression with four fields:
- AC (tarray of tint32) - Allele counts. One element for each allele, including the reference.
- AF (tarray of tfloat64) - Allele frequencies. One element for each allele, including the reference.
- AN (tint32) - Allele number. The total number of called alleles, or the number of non-missing calls * 2.
- homozygote_count (tarray of tint32) - Homozygote genotype counts for each allele, including the reference. Only diploid genotype calls are counted.
Parameters:
- call (CallExpression)
- alleles (ArrayStringExpression) – Variant alleles.
Returns: StructExpression – Struct expression with fields AC, AF, AN, and homozygote_count.
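The four fields returned by call_stats() follow directly from tallying alleles over the non-missing calls. The plain-Python sketch below shows that tally for diploid calls written as (allele, allele) tuples, with None for a missing call. The function is a hypothetical model rather than Hail's implementation, and returning None for AF when no alleles are called is an assumption.

```python
def call_stats(calls, n_alleles):
    """Return (AC, AF, AN, homozygote_count) for a list of calls."""
    ac = [0] * n_alleles
    homozygote_count = [0] * n_alleles
    for call in calls:
        if call is None:
            continue  # missing calls contribute nothing
        for allele in call:
            ac[allele] += 1
        # Only diploid calls are considered for homozygote counts.
        if len(call) == 2 and call[0] == call[1]:
            homozygote_count[call[0]] += 1
    an = sum(ac)
    # Assumption: AF is treated as missing when no alleles were called.
    af = [count / an for count in ac] if an > 0 else None
    return ac, af, an, homozygote_count


calls = [(0, 0), (0, 1), (1, 1), None, (0, 0)]
ac, af, an, homozygote_count = call_stats(calls, n_alleles=2)
print(ac, af, an, homozygote_count)
```

The same arithmetic can be checked against the table above: a row with homozygote_count [51,9] and AN 200 has 40 heterozygous calls, giving AC [142,58].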
-
hail.expr.aggregators.info_score(gp) → hail.expr.expressions.typed_expressions.StructExpression
Compute the IMPUTE information score.
Examples
Calculate the info score per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample')
>>> gen_mt = gen_mt.annotate_rows(info_score = hl.agg.info_score(gen_mt.GP))
Calculate group-specific info scores per variant:
>>> gen_mt = hl.import_gen('data/example.gen', sample_file='data/example.sample')
>>> gen_mt = gen_mt.annotate_cols(is_case = hl.rand_bool(0.5))
>>> gen_mt = gen_mt.annotate_rows(info_score = hl.agg.group_by(gen_mt.is_case, hl.agg.info_score(gen_mt.GP)))
Notes
The result of info_score() is a struct with two fields:
- score (float64) – Info score.
- n_included (int32) – Number of non-missing samples included in the calculation.
We implemented the IMPUTE info measure as described in the supplementary information from Marchini & Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics (2010). To calculate the info score \(I_{A}\) for one SNP:
\[\begin{split}I_{A} = \begin{cases} 1 - \frac{\sum_{i=1}^{N}(f_{i} - e_{i}^2)}{2N\hat{\theta}(1 - \hat{\theta})} & \text{when } \hat{\theta} \in (0, 1) \\ 1 & \text{when } \hat{\theta} = 0, \hat{\theta} = 1\\ \end{cases}\end{split}\]
- \(N\) is the number of samples with imputed genotype probabilities [\(p_{ik} = P(G_{i} = k)\) where \(k \in \{0, 1, 2\}\)]
- \(e_{i} = p_{i1} + 2p_{i2}\) is the expected genotype per sample
- \(f_{i} = p_{i1} + 4p_{i2}\)
- \(\hat{\theta} = \frac{\sum_{i=1}^{N}e_{i}}{2N}\) is the MLE for the population minor allele frequency
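The formula for \(I_{A}\) can be evaluated directly for a handful of samples. The sketch below is a plain-Python transcription of the definitions above, not Hail's implementation. Each sample is a list [p0, p1, p2], None stands for a missing probability array, and returning None for the score when no samples are included is an assumption made for this sketch.

```python
def info_score(gps):
    """Return (score, n_included) for a list of [p0, p1, p2] arrays."""
    included = [gp for gp in gps if gp is not None]
    n = len(included)
    if n == 0:
        return None, 0  # assumption: score is missing with no samples
    e = [gp[1] + 2 * gp[2] for gp in included]  # expected genotype e_i
    f = [gp[1] + 4 * gp[2] for gp in included]  # f_i
    theta = sum(e) / (2 * n)                    # estimated allele frequency
    if theta == 0 or theta == 1:
        return 1.0, n
    numerator = sum(f_i - e_i ** 2 for f_i, e_i in zip(f, e))
    return 1 - numerator / (2 * n * theta * (1 - theta)), n


# Fully certain genotypes carry full information.
certain = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(info_score(certain))

# Uniform probabilities carry none, and the score drops below zero.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
print(info_score(uniform))
```

For certain genotypes every \(f_{i} - e_{i}^2\) term is zero, so the score is 1; for uniform probabilities each term is 2/3 and the score works out to -1/3.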
Hail will not generate identical results to QCTOOL for the following reasons:
- Hail automatically removes genotype probability distributions that do not meet certain requirements on data import with import_gen() and import_bgen().
- Hail does not use the population frequency to impute genotype probabilities when a genotype probability distribution has been set to missing.
- Hail calculates the same statistic for sex chromosomes as autosomes while QCTOOL incorporates sex information.
- The floating point number Hail stores for each genotype probability is slightly different than the original data due to rounding and normalization of probabilities.
Warning
- The info score Hail reports will be extremely different from QCTOOL when a SNP has a high missing rate.
- The gp array must contain 3 elements, and its elements may not be missing.
- If the genotype data was not imported using the import_gen() or import_bgen() functions, then the results for all variants will be score = NA and n_included = 0.
- It only makes semantic sense to compute the info score per variant. While the aggregator will run in any context if its arguments are the right type, the results are only meaningful in a narrow context.
Parameters: gp (ArrayNumericExpression) – Genotype probability array. Must have 3 elements, all of which must be defined.
Returns: StructExpression – Struct with fields score and n_included.
-
hail.expr.aggregators.hist(expr, start, end, bins) → hail.expr.expressions.typed_expressions.StructExpression
Compute binned counts of a numeric expression.
Examples
Compute a histogram of field GQ:
>>> dataset.aggregate_entries(agg.hist(dataset.GQ, 0, 100, 10))
Struct(bin_edges=[0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0],
       bin_freq=[2194L, 637L, 2450L, 1081L, 518L, 402L, 11168L, 1918L, 1379L, 11973L],
       n_smaller=0,
       n_larger=0)
Notes
This method returns a struct expression with four fields:
- bin_edges (tarray of tfloat64): Bin edges. Bin i contains values in the left-inclusive, right-exclusive range [ bin_edges[i], bin_edges[i+1] ).
- bin_freq (tarray of tint64): Bin frequencies. The number of records found in each bin.
- n_smaller (tint64): The number of records smaller than the start of the first bin.
- n_larger (tint64): The number of records larger than the end of the last bin.
Parameters:
- expr (NumericExpression) – Target numeric expression.
- start (int or float) – Start of histogram range.
- end (int or float) – End of histogram range.
- bins (int or float) – Number of bins.
Returns: StructExpression – Struct expression with fields bin_edges, bin_freq, n_smaller, and n_larger.
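The binning rule for hist() can be modelled in a few lines of plain Python. This is a sketch of the documented semantics, not Hail's implementation. The text does not say where a value exactly equal to end is counted; this sketch places it in the last bin, which is an assumption. Missing values (None) are skipped, also as an assumption.

```python
def hist(values, start, end, bins):
    """Return (bin_edges, bin_freq, n_smaller, n_larger)."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = 0
    n_larger = 0
    for value in values:
        if value is None:
            continue  # assumption: missing values are not counted
        if value < start:
            n_smaller += 1
        elif value > end:
            n_larger += 1
        else:
            # Left-inclusive, right-exclusive bins; a value equal to `end`
            # is clamped into the last bin (assumption).
            index = min(int((value - start) / width), bins - 1)
            bin_freq[index] += 1
    return bin_edges, bin_freq, n_smaller, n_larger


bin_edges, bin_freq, n_smaller, n_larger = hist(
    [-1, 0, 5, 10, 99, 100, 150], start=0, end=100, bins=10)
print(bin_edges, bin_freq, n_smaller, n_larger)
```

The value 10 lands in the second bin, not the first, because each bin excludes its right edge.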
-
hail.expr.aggregators.linreg(y, x, nested_dim=1, weight=None) → hail.expr.expressions.typed_expressions.StructExpression
Compute multivariate linear regression statistics.
Examples
Regress HT against an intercept (1), SEX, and C1:
>>> table1.aggregate(agg.linreg(table1.HT, [1, table1.SEX == 'F', table1.C1]))
Struct(beta=[88.50000000000014, 81.50000000000057, -10.000000000000068],
       standard_error=[14.430869689661844, 59.70552738231206, 7.000000000000016],
       t_stat=[6.132686518775844, 1.365032746099571, -1.428571428571435],
       p_value=[0.10290201427537926, 0.40250974549499974, 0.3888002244284281],
       multiple_standard_error=4.949747468305833,
       multiple_r_squared=0.7175792507204611,
       adjusted_r_squared=0.1527377521613834,
       f_stat=1.2704081632653061,
       multiple_p_value=0.5314327326007864,
       n=4)
Regress blood pressure against an intercept (1), genotype, age, and the interaction of genotype and age:
>>> ds_ann = ds.annotate_rows(linreg =
...     hl.agg.linreg(ds.pheno.blood_pressure,
...                   [1,
...                    ds.GT.n_alt_alleles(),
...                    ds.pheno.age,
...                    ds.GT.n_alt_alleles() * ds.pheno.age]))
Warning
As in the example, the intercept covariate 1 must be included explicitly if desired.
Notes
In relation to lm.summary in R, linreg(y, x = [1, mt.x1, mt.x2]) computes summary(lm(y ~ x1 + x2)) and linreg(y, x = [mt.x1, mt.x2], nested_dim=0) computes summary(lm(y ~ x1 + x2 - 1)).
More generally, nested_dim defines the number of effects to fit in the nested (null) model, with the effects on the remaining covariates fixed to zero.
The returned struct has ten fields:
- beta (tarray of tfloat64): Estimated regression coefficient for each covariate.
- standard_error (tarray of tfloat64): Estimated standard error for each covariate.
- t_stat (tarray of tfloat64): t-statistic for each covariate.
- p_value (tarray of tfloat64): p-value for each covariate.
- multiple_standard_error (tfloat64): Estimated standard deviation of the random error.
- multiple_r_squared (tfloat64): Coefficient of determination for nested models.
- adjusted_r_squared (tfloat64): Adjusted multiple_r_squared taking into account degrees of freedom.
- f_stat (tfloat64): F-statistic for nested models.
- multiple_p_value (tfloat64): p-value for the F-test of nested models.
- n (tint64): Number of samples included in the regression. A sample is included if and only if y, all elements of x, and weight (if set) are non-missing.
All but the last field are missing if n is less than or equal to the number of covariates or if the covariates are linearly dependent.
If set, the weight parameter generalizes the model to weighted least squares, useful for heteroscedastic (diagonal but non-constant) variance.
Warning
If any weight is negative, the resulting statistics will be nan.
Parameters:
- y (Float64Expression) – Response (dependent variable).
- x (Float64Expression or list of Float64Expression) – Covariates (independent variables).
- nested_dim (int) – The null model includes the first nested_dim covariates. Must be between 0 and k (the length of x).
- weight (Float64Expression, optional) – Non-negative weight for weighted least squares.
Returns: StructExpression – Struct of regression results.
-
hail.expr.aggregators.corr(x, y) → hail.expr.expressions.typed_expressions.Float64Expression
Computes the Pearson correlation coefficient between x and y.
Examples
>>> ds.aggregate_cols(hl.agg.corr(ds.pheno.age, ds.pheno.blood_pressure))
0.159882536301
Notes
Only records where both x and y are non-missing will be included in the calculation.
In the case that there are no non-missing pairs, the result will be missing.
Parameters:
- x (Expression of type tfloat64)
- y (Expression of type tfloat64)
Returns: Float64Expression
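The pairing rule for corr(), under which a record is used only when both x and y are present, can be shown with a plain-Python Pearson correlation. This is a sketch rather than Hail's implementation. None stands for a missing value, and returning None when either variable has zero variance is an assumption, since the text only covers the case of no non-missing pairs.

```python
import math


def corr(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        return None  # no non-missing pairs: the result is missing
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # assumption: undefined correlation reported as missing
    return sxy / math.sqrt(sxx * syy)


# The fourth pair is dropped because x is missing there.
r = corr([1.0, 2.0, 3.0, None], [2.0, 4.0, 6.0, 8.0])
print(r)
```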
-
hail.expr.aggregators.group_by(group, agg_expr) → hail.expr.expressions.typed_expressions.DictExpression
Compute aggregation statistics stratified by one or more groups.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Examples
Compute linear regression statistics stratified by SEX:
>>> table1.aggregate(agg.group_by(table1.SEX,
...                               agg.linreg(table1.HT, table1.C1, nested_dim=0)))
{
'F': Struct(beta=[6.153846153846154],
            standard_error=[0.7692307692307685],
            t_stat=[8.000000000000009],
            p_value=[0.07916684832113098],
            multiple_standard_error=11.4354374979373,
            multiple_r_squared=0.9846153846153847,
            adjusted_r_squared=0.9692307692307693,
            f_stat=64.00000000000014,
            multiple_p_value=0.07916684832113098,
            n=2),
'M': Struct(beta=[34.25],
            standard_error=[1.75],
            t_stat=[19.571428571428573],
            p_value=[0.03249975499062629],
            multiple_standard_error=4.949747468305833,
            multiple_r_squared=0.9973961101073441,
            adjusted_r_squared=0.9947922202146882,
            f_stat=383.0408163265306,
            multiple_p_value=0.03249975499062629,
            n=2)
}
Compute call statistics stratified by population group and case status:
>>> ann = ds.annotate_rows(call_stats=hl.agg.group_by(hl.struct(pop=ds.pop, is_case=ds.is_case),
...                                                   hl.agg.call_stats(ds.GT, ds.alleles)))
Parameters:
- group (Expression or list of Expression) – Group to stratify the result by.
- agg_expr (Expression) – Aggregation or scan expression to compute per grouping.
Returns: DictExpression – Dictionary where the keys are group and the values are the result of computing agg_expr for each unique value of group.
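The shape of the result of group_by(), a dictionary with one aggregated value per unique group, can be modelled in plain Python by bucketing records before aggregating. The group_by function below is a hypothetical stand-in, not Hail's API, and the sex and ht lists are illustrative data only partly inferred from the examples on this page.

```python
from collections import defaultdict


def group_by(groups, values, aggregate):
    """Apply `aggregate` to the values of each unique group key."""
    buckets = defaultdict(list)
    for group, value in zip(groups, values):
        buckets[group].append(value)
    return {group: aggregate(bucket) for group, bucket in buckets.items()}


# Illustrative data: the assignment of SEX to rows is hypothetical.
sex = ['M', 'M', 'F', 'F']
ht = [65, 72, 70, 60]

max_ht_by_sex = group_by(sex, ht, max)
count_by_sex = group_by(sex, ht, len)
print(max_ht_by_sex, count_by_sex)
```

Any aggregation that accepts a list can be passed as the third argument, which mirrors how agg_expr can be any aggregation expression.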
-
hail.expr.aggregators.downsample(x, y, label=None, n_divisions=500) → hail.expr.expressions.typed_expressions.ArrayExpression
Downsample (x, y) coordinate datapoints.
Parameters:
- x (NumericExpression) – X-values to be downsampled.
- y (NumericExpression) – Y-values to be downsampled.
- label (StringExpression or ArrayExpression) – Additional data for each (x, y) coordinate. Can pass in multiple fields in an ArrayExpression.
- n_divisions (int) – Factor by which to downsample (default value = 500). A lower input results in fewer output datapoints.
Returns: ArrayExpression – Expression for downsampled coordinate points (x, y). The element type of the array is ttuple of tfloat64, tfloat64, and tarray of tstring.