GroupedMatrixTable

class hail.GroupedMatrixTable(parent: hail.matrixtable.MatrixTable, row_keys=None, col_keys=None)

Matrix table grouped by row or column that can be aggregated into a new matrix table.

The main operation on a grouped matrix table is GroupedMatrixTable.aggregate().

A grouped matrix table with a non-trivial grouping cannot be grouped again.

Methods

  • __init__ – Initialize self.
  • aggregate – Aggregate by group, used after MatrixTable.group_rows_by() or MatrixTable.group_cols_by().
  • describe – Print information about the grouped matrix table.
  • group_cols_by – Group columns, used with GroupedMatrixTable.aggregate().
  • group_rows_by – Group rows, used with GroupedMatrixTable.aggregate().
  • partition_by – Set the partition key.
  • partition_hint – Set the target number of partitions for aggregation.
aggregate(**named_exprs) → MatrixTable

Aggregate by group, used after MatrixTable.group_rows_by() or MatrixTable.group_cols_by().

Examples

Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:

>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))
Parameters: named_exprs (keyword args of Expression) – Aggregation expressions.
Returns: MatrixTable – Aggregated matrix table.
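
To make the semantics of the example above concrete, here is a minimal plain-Python sketch of what grouping rows by gene and counting non-reference calls computes. It does not use Hail. The toy `entries` and `gene` dictionaries and the helper `aggregate_rows_by_gene` are hypothetical stand-ins, and genotypes are encoded as a count of non-reference alleles, with None for a missing call.

```python
from collections import defaultdict

# Toy stand-in for a matrix table (an assumption for illustration):
# entries[(variant, sample)] = number of non-reference alleles, None if missing.
entries = {
    ("v1", "s1"): 0, ("v1", "s2"): 1,
    ("v2", "s1"): 2, ("v2", "s2"): None,
    ("v3", "s1"): 1, ("v3", "s2"): 1,
}

# Hypothetical row field used as the grouping key.
gene = {"v1": "GENE_A", "v2": "GENE_A", "v3": "GENE_B"}


def aggregate_rows_by_gene(entries, gene):
    """Count non-reference calls per (gene, sample) pair.

    Rows (variants) sharing a gene collapse into one row; columns are kept.
    Missing calls are not counted, mirroring a count of true values only.
    """
    n_non_ref = defaultdict(int)
    for (variant, sample), n_alt in entries.items():
        key = (gene[variant], sample)
        n_non_ref[key] += 0  # make sure every (gene, sample) cell exists
        if n_alt is not None and n_alt > 0:
            n_non_ref[key] += 1
    return dict(n_non_ref)


result = aggregate_rows_by_gene(entries, gene)
```

The result has one cell per (gene, sample) pair, so the three variant rows become two gene rows while both sample columns are preserved.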
describe()

Print information about the grouped matrix table.

group_cols_by(*exprs, **named_exprs) → GroupedMatrixTable

Group columns, used with GroupedMatrixTable.aggregate().

Examples

Aggregate to a matrix with cohort as column keys, computing the call rate as an entry field:

>>> dataset_result = (dataset.group_cols_by(dataset.cohort)
...                          .aggregate(call_rate = agg.fraction(hl.is_defined(dataset.GT))))

Notes

All complex expressions must be passed as named expressions.

Parameters:
  • exprs (args of str or Expression) – Column fields to group by.
  • named_exprs (keyword args of Expression) – Column-indexed expressions to group by.
Returns: GroupedMatrixTable – Grouped matrix table, which can be used to call GroupedMatrixTable.aggregate().
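
As an illustration of the column-grouping example above, the following plain-Python sketch computes a per-cohort call rate for each row. It does not use Hail. The `genotypes` and `cohort` dictionaries and the helper `call_rate_by_cohort` are hypothetical, and a missing genotype is represented by None.

```python
from collections import defaultdict

# Toy stand-in for a matrix table (an assumption for illustration):
# genotypes[(variant, sample)] = genotype string, None if the call is missing.
genotypes = {
    ("v1", "s1"): "0/1", ("v1", "s2"): None, ("v1", "s3"): "0/0",
    ("v2", "s1"): "1/1", ("v2", "s2"): "0/1", ("v2", "s3"): None,
}

# Hypothetical column field used as the grouping key.
cohort = {"s1": "A", "s2": "A", "s3": "B"}


def call_rate_by_cohort(genotypes, cohort):
    """Fraction of defined genotypes per (variant, cohort) pair.

    Columns (samples) sharing a cohort collapse into one column; rows are kept.
    """
    defined = defaultdict(int)
    total = defaultdict(int)
    for (variant, sample), gt in genotypes.items():
        key = (variant, cohort[sample])
        total[key] += 1
        if gt is not None:
            defined[key] += 1
    return {key: defined[key] / total[key] for key in total}


call_rate = call_rate_by_cohort(genotypes, cohort)
```

Here the three sample columns become two cohort columns, and each cell holds the fraction of samples in that cohort with a defined genotype at that variant.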

group_rows_by(*exprs, **named_exprs) → GroupedMatrixTable

Group rows, used with GroupedMatrixTable.aggregate().

Examples

Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:

>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))

Notes

All complex expressions must be passed as named expressions.

Parameters:
  • exprs (args of str or Expression) – Row fields to group by.
  • named_exprs (keyword args of Expression) – Row-indexed expressions to group by.
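Returns: GroupedMatrixTable – Grouped matrix table, which can be used to call GroupedMatrixTable.aggregate().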

partition_by(*fields) → hail.matrixtable.GroupedMatrixTable

Set the partition key.

Parameters: fields (varargs of str) – Row partition key. Must be a prefix of the key. By default, the partition key is the entire key.
Returns: GroupedMatrixTable – Self.
partition_hint(n: int) → hail.matrixtable.GroupedMatrixTable

Set the target number of partitions for aggregation.

Examples

Use partition_hint in a MatrixTable.group_rows_by() / GroupedMatrixTable.aggregate() pipeline:

>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .partition_hint(5)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))

Notes

Hail’s query optimizer cannot yet sample records at every stage of a pipeline, so some operations benefit from an explicit hint.

The default number of partitions for GroupedMatrixTable.aggregate() is the number of partitions in the upstream dataset. If the aggregation greatly reduces the size of the dataset, providing a hint for the target number of partitions can accelerate downstream operations.
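
The note above gives no formula for choosing n. One possible heuristic, sketched below in plain Python, scales the upstream partition count by the fraction of rows that survive the aggregation. The helper `suggest_partition_hint` and its parameters are hypothetical and not part of Hail; treat the result as a starting point to tune, not a recommendation from the library.

```python
def suggest_partition_hint(upstream_partitions: int, n_rows: int, n_groups: int) -> int:
    """Hypothetical heuristic: shrink the partition count in proportion to
    the row reduction n_groups / n_rows, rounding up and never going below 1.
    """
    if upstream_partitions < 1 or n_rows < 1 or n_groups < 1:
        raise ValueError("all arguments must be positive integers")
    # Grouping cannot produce more rows than it started with.
    kept = min(n_groups, n_rows)
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (upstream_partitions * kept + n_rows - 1) // n_rows)


# Example: 500 upstream partitions, 1,000,000 variants grouped into 10,000 genes.
n = suggest_partition_hint(500, 1_000_000, 10_000)
```

With these example numbers the helper returns 5, which could then be passed to partition_hint as in the pipeline shown earlier.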

Parameters: n (int) – Number of partitions.
Returns: GroupedMatrixTable – Same grouped matrix table with a partition hint.