GroupedMatrixTable
-
class hail.GroupedMatrixTable(parent: hail.matrixtable.MatrixTable, row_keys=None, col_keys=None)
Matrix table grouped by row or column that can be aggregated into a new matrix table.
The main operation on a grouped matrix table is GroupedMatrixTable.aggregate().
A grouped matrix table with a non-trivial grouping cannot be grouped again.
Methods
__init__ – Initialize self.
aggregate – Aggregate by group, used after MatrixTable.group_rows_by() or MatrixTable.group_cols_by().
describe – Print information about grouped matrix table.
group_cols_by – Group columns, used with GroupedMatrixTable.aggregate().
group_rows_by – Group rows, used with GroupedMatrixTable.aggregate().
partition_by – Set the partition key.
partition_hint – Set the target number of partitions for aggregation.
-
aggregate(**named_exprs) → hail.matrixtable.MatrixTable
Aggregate by group, used after MatrixTable.group_rows_by() or MatrixTable.group_cols_by().
Examples
Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:
>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))
Parameters: named_exprs (keyword args of Expression) – Aggregation expressions.
Returns: MatrixTable – Aggregated matrix table.
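To make the semantics of the example above concrete, here is a plain-Python sketch of the same computation. It does not use Hail: the toy `rows` and `entries` structures and the helpers `is_non_ref` and `aggregate_rows_by_gene` are hypothetical, and genotypes are modeled as pairs of allele indices in which 0 is assumed to be the reference allele.

```python
from collections import defaultdict

# Toy stand-in for a matrix table: 4 variants (rows) x 3 samples (columns).
# Each row carries a hypothetical `gene` row field; each entry is a genotype
# given as a pair of allele indices (0 = reference), or None if missing.
rows = [
    {"gene": "GENE_A"},
    {"gene": "GENE_A"},
    {"gene": "GENE_B"},
    {"gene": "GENE_B"},
]
entries = [
    [(0, 0), (0, 1), (1, 1)],
    [(0, 1), (0, 0), None],
    [(0, 0), (0, 0), (0, 0)],
    [(1, 1), None, (0, 1)],
]


def is_non_ref(gt):
    """True if the genotype is defined and has at least one non-reference allele."""
    return gt is not None and any(allele != 0 for allele in gt)


def aggregate_rows_by_gene(rows, entries):
    """Collapse rows that share a gene, counting non-ref calls per (gene, column)."""
    n_cols = len(entries[0])
    counts = defaultdict(lambda: [0] * n_cols)
    for row, row_entries in zip(rows, entries):
        gene_counts = counts[row["gene"]]  # creates the group even if it stays at zero
        for col_idx, gt in enumerate(row_entries):
            if is_non_ref(gt):
                gene_counts[col_idx] += 1
    return dict(counts)


# One row per gene, one count per sample.
n_non_ref = aggregate_rows_by_gene(rows, entries)
```

In this model the four variant rows collapse into two gene rows, while the three columns are left unchanged. That is the shape change a row grouping followed by an aggregation is meant to produce.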
-
group_cols_by(*exprs, **named_exprs) → GroupedMatrixTable
Group columns, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with cohort as column keys, computing the call rate as an entry field:
>>> dataset_result = (dataset.group_cols_by(dataset.cohort)
...                          .aggregate(call_rate = agg.fraction(hl.is_defined(dataset.GT))))
Notes
All complex expressions must be passed as named expressions.
Parameters:
- exprs (args of str or Expression) – Column fields to group by.
- named_exprs (keyword args of Expression) – Column-indexed expressions to group by.
Returns: GroupedMatrixTable – Grouped matrix. Can be used to call GroupedMatrixTable.aggregate().
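The column-grouping example can be modeled the same way in plain Python. This is a sketch without Hail: the `col_cohorts` list, the toy `entries`, and the helper `call_rate_by_cohort` are hypothetical, and a missing genotype is represented as `None`.

```python
from collections import defaultdict

# Hypothetical column field: the cohort of each of 4 samples.
col_cohorts = ["case", "case", "control", "control"]

# Toy entries: 2 variants (rows) x 4 samples (columns); None = missing genotype.
entries = [
    [(0, 1), None, (0, 0), (0, 0)],
    [(1, 1), (0, 1), None, None],
]


def call_rate_by_cohort(col_cohorts, entries):
    """For each row, map each cohort to its fraction of defined genotypes."""
    result = []
    for row_entries in entries:
        defined = defaultdict(int)
        total = defaultdict(int)
        for cohort, gt in zip(col_cohorts, row_entries):
            total[cohort] += 1
            defined[cohort] += gt is not None  # bool counts as 0 or 1
        result.append({c: defined[c] / total[c] for c in total})
    return result


# One dict per row; the four sample columns collapse into two cohort columns.
call_rate = call_rate_by_cohort(col_cohorts, entries)
```

Here the rows are untouched and the columns are collapsed, which mirrors the row-grouping case with the axes swapped.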
-
group_rows_by(*exprs, **named_exprs) → GroupedMatrixTable
Group rows, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:
>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))
Notes
All complex expressions must be passed as named expressions.
Parameters:
- exprs (args of str or Expression) – Row fields to group by.
- named_exprs (keyword args of Expression) – Row-indexed expressions to group by.
Returns: GroupedMatrixTable – Grouped matrix. Can be used to call GroupedMatrixTable.aggregate().
-
partition_by(*fields) → hail.matrixtable.GroupedMatrixTable
Set the partition key.
Parameters: fields (varargs of str) – Row partition key. Must be a prefix of the key. By default, the partition key is the entire key.
Returns: GroupedMatrixTable – Self.
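The prefix requirement can be sketched in plain Python. This does not call Hail and is not its validation logic: the function `is_key_prefix` and the key field names `locus` and `alleles` are assumptions chosen for illustration.

```python
def is_key_prefix(partition_fields, key_fields):
    """True if partition_fields equals the leading fields of key_fields, in order."""
    partition_fields = list(partition_fields)
    return partition_fields == list(key_fields)[:len(partition_fields)]


# Hypothetical row key made of two fields.
row_key = ["locus", "alleles"]

prefix_ok = is_key_prefix(["locus"], row_key)        # leading field only
full_key_ok = is_key_prefix(row_key, row_key)        # the entire key (the default)
not_prefix = is_key_prefix(["alleles"], row_key)     # skips the first key field
```

Under this reading, `["locus"]` and the full key are acceptable partition keys, while `["alleles"]` is not, because it skips the first key field.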
-
partition_hint(n: int) → hail.matrixtable.GroupedMatrixTable
Set the target number of partitions for aggregation.
Examples
Use partition_hint in a MatrixTable.group_rows_by() / GroupedMatrixTable.aggregate() pipeline:
>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .partition_hint(5)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))
Notes
Until Hail’s query optimizer is intelligent enough to sample records at all stages of a pipeline, it can be necessary in some places to provide some explicit hints.
The default number of partitions for GroupedMatrixTable.aggregate() is the number of partitions in the upstream dataset. If the aggregation greatly reduces the size of the dataset, providing a hint for the target number of partitions can accelerate downstream operations.
Parameters: n (int) – Number of partitions.
Returns: GroupedMatrixTable – Same grouped matrix table with a partition hint.
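A toy calculation shows why a hint can matter when an aggregation shrinks the dataset. This is a sketch under stated assumptions, not a description of Hail's partitioner: the helper `rows_per_partition` simply spreads rows as evenly as possible, and the figures of 1000 upstream partitions and 20 groups are hypothetical.

```python
def rows_per_partition(n_rows, n_partitions):
    """Spread n_rows as evenly as possible over n_partitions (toy model)."""
    base, extra = divmod(n_rows, n_partitions)
    return [base + (i < extra) for i in range(n_partitions)]


upstream_partitions = 1000  # hypothetical partition count of the input dataset
n_groups = 20               # hypothetical number of rows left after aggregation

# Default: keep the upstream partition count.
default_layout = rows_per_partition(n_groups, upstream_partitions)
empty_by_default = default_layout.count(0)

# With a hint of 5 partitions, as in the example above.
hinted_layout = rows_per_partition(n_groups, 5)
```

In this model the default layout leaves 980 of the 1000 partitions empty, while the hinted layout holds four rows in each of five partitions. Downstream steps then have far fewer partitions to schedule for the same amount of data.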
-