MatrixTable¶
-
class hail.MatrixTable(jvds)[source]¶
Hail’s distributed implementation of a structured matrix.
Use read_matrix_table() to read a matrix table that was written with MatrixTable.write().
Examples
Add annotations:
>>> dataset = dataset.annotate_globals(pli={'SCN1A': 0.999, 'SONIC': 0.014},
...                                    populations = ['AFR', 'EAS', 'EUR', 'SAS', 'AMR', 'HIS'])
>>> dataset = dataset.annotate_cols(pop = dataset.populations[hl.int(hl.rand_unif(0, 6))],
...                                 sample_gq = agg.mean(dataset.GQ),
...                                 sample_dp = agg.mean(dataset.DP))
>>> dataset = dataset.annotate_rows(variant_gq = agg.mean(dataset.GQ),
...                                 variant_dp = agg.mean(dataset.DP),
...                                 sas_hets = agg.count_where(dataset.GT.is_het()))
>>> dataset = dataset.annotate_entries(gq_by_dp = dataset.GQ / dataset.DP)
Filter:
>>> dataset = dataset.filter_cols(dataset.pop != 'EUR')
>>> dataset = dataset.filter_rows((dataset.variant_gq > 10) & (dataset.variant_dp > 5))
>>> dataset = dataset.filter_entries(dataset.gq_by_dp > 1)
Query:
>>> col_stats = dataset.aggregate_cols(hl.struct(pop_counts=agg.counter(dataset.pop),
...                                              high_quality=agg.fraction((dataset.sample_gq > 10) & (dataset.sample_dp > 5))))
>>> print(col_stats.pop_counts)
>>> print(col_stats.high_quality)
>>> het_dist = dataset.aggregate_rows(agg.stats(dataset.sas_hets))
>>> print(het_dist)
>>> entry_stats = dataset.aggregate_entries(hl.struct(call_rate=agg.fraction(hl.is_defined(dataset.GT)),
...                                                   global_gq_mean=agg.mean(dataset.GQ)))
>>> print(entry_stats.call_rate)
>>> print(entry_stats.global_gq_mean)
Attributes
col
Returns a struct expression of all column-indexed fields, including keys.
col_key
Column key struct.
col_value
Returns a struct expression including all non-key column-indexed fields.
entry
Returns a struct expression including all row-and-column-indexed fields.
globals
Returns a struct expression including all global fields.
row
Returns a struct expression of all row-indexed fields, including keys.
row_key
Row key struct.
row_value
Returns a struct expression including all non-key row-indexed fields.
Methods
__init__
Initialize self.
add_col_index
Add the integer index of each column as a new column field.
add_row_index
Add the integer index of each row as a new row field.
aggregate_cols
Aggregate over columns to a local value.
aggregate_entries
Aggregate over entries to a local value.
aggregate_rows
Aggregate over rows to a local value.
annotate_cols
Create new column-indexed fields by name.
annotate_entries
Create new row-and-column-indexed fields by name.
annotate_globals
Create new global fields by name.
annotate_rows
Create new row-indexed fields by name.
cache
Persist the dataset in memory.
choose_cols
Choose a new set of columns from a list of old column indices.
collect_cols_by_key
Collect values for each unique column key into arrays.
cols
Returns a table with all column fields in the matrix.
count
Count the number of rows and columns in the matrix.
count_cols
Count the number of columns in the matrix.
count_rows
Count the number of rows in the matrix.
describe
Print information about the fields in the matrix.
distinct_by_col
Remove columns with a duplicate column key.
distinct_by_row
Remove rows with a duplicate row key.
drop
Drop fields.
drop_cols
Drop all columns of the matrix.
drop_rows
Drop all rows of the matrix.
entries
Returns a matrix in coordinate table form.
explode_cols
Explodes a column field of type array or set, copying the entire column for each element.
explode_rows
Explodes a row field of type array or set, copying the entire row for each element.
filter_cols
Filter columns of the matrix.
filter_entries
Filter entries of the matrix.
filter_rows
Filter rows of the matrix.
from_rows_table
Construct matrix table with no columns from a table.
globals_table
Returns a table with a single row with the globals of the matrix table.
group_cols_by
Group columns, used with GroupedMatrixTable.aggregate().
group_rows_by
Group rows, used with GroupedMatrixTable.aggregate().
head
Subset matrix to first n rows.
index_cols
Expose the column values as if looked up in a dictionary, indexing with exprs.
index_entries
Expose the entries as if looked up in a dictionary, indexing with exprs.
index_globals
Return this matrix table’s global variables for use in another expression context.
index_rows
Expose the row values as if looked up in a dictionary, indexing with exprs.
key_cols_by
Key columns by a new set of fields.
key_rows_by
Key rows by a new set of fields.
make_table
Make a table from a matrix table with one field per sample.
n_partitions
Number of partitions.
naive_coalesce
Naively decrease the number of partitions.
persist
Persist this table in memory or on disk.
rename
Rename fields of a matrix table.
repartition
Increase or decrease the number of partitions.
rows
Returns a table with all row fields in the matrix.
sample_rows
Downsample the matrix table by keeping each row with probability p.
select_cols
Select existing column fields or create new fields by name, dropping the rest.
select_entries
Select existing entry fields or create new fields by name, dropping the rest.
select_globals
Select existing global fields or create new fields by name, dropping the rest.
select_rows
Select existing row fields or create new fields by name, dropping all other non-key fields.
transmute_cols
Similar to MatrixTable.annotate_cols(), but drops referenced fields.
transmute_entries
Similar to MatrixTable.annotate_entries(), but drops referenced fields.
transmute_globals
Similar to MatrixTable.annotate_globals(), but drops referenced fields.
transmute_rows
Similar to MatrixTable.annotate_rows(), but drops referenced fields.
union_cols
Take the union of dataset columns.
union_rows
Take the union of dataset rows.
unpersist
Unpersists this dataset from memory/disk.
write
Write to disk.
-
add_col_index
(name: str = 'col_idx') → MatrixTable[source]¶ Add the integer index of each column as a new column field.
Examples
>>> dataset_result = dataset.add_col_index()
Notes
The field added is type tint32.
The column index is 0-indexed; the values are found in the range [0, N), where N is the total number of columns.
Parameters: name (str) – Name for column index field.
Returns: MatrixTable – Dataset with new field.
-
add_row_index
(name: str = 'row_idx') → MatrixTable[source]¶ Add the integer index of each row as a new row field.
Examples
>>> dataset_result = dataset.add_row_index()
Notes
The field added is type tint64.
The row index is 0-indexed; the values are found in the range [0, N), where N is the total number of rows.
Parameters: name (str) – Name for row index field.
Returns: MatrixTable – Dataset with new field.
-
aggregate_cols
(expr) → Any[source]¶ Aggregate over columns to a local value.
Examples
Aggregate over columns:
>>> dataset.aggregate_cols(
...     hl.struct(fraction_female=agg.fraction(dataset.pheno.is_female),
...               case_ratio=agg.count_where(dataset.is_case) / agg.count()))
Struct(fraction_female=0.5102222, case_ratio=0.35156)
Notes
Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by column.
This method should be thought of as a more convenient alternative to the following:
>>> cols_table = dataset.cols()
>>> cols_table.aggregate(
...     hl.struct(fraction_female=agg.fraction(cols_table.pheno.is_female),
...               case_ratio=agg.count_where(cols_table.is_case) / agg.count()))
Note
This method supports (and expects!) aggregation over columns.
Parameters: expr (Expression) – Aggregation expression.
Returns: any – Aggregated value dependent on expr.
-
aggregate_entries
(expr) → Any[source]¶ Aggregate over entries to a local value.
Examples
Aggregate over entries:
>>> dataset.aggregate_entries(hl.struct(global_gq_mean=agg.mean(dataset.GQ),
...                                     call_rate=agg.fraction(hl.is_defined(dataset.GT))))
Struct(global_gq_mean=31.16200, call_rate=0.981682)
Notes
This method should be thought of as a more convenient alternative to the following:
>>> entries_table = dataset.entries()
>>> entries_table.aggregate(hl.struct(global_gq_mean=agg.mean(entries_table.GQ),
...                                   call_rate=agg.fraction(hl.is_defined(entries_table.GT))))
Note
This method supports (and expects!) aggregation over entries.
Parameters: expr (Expression) – Aggregation expression.
Returns: any – Aggregated value dependent on expr.
-
aggregate_rows
(expr) → Any[source]¶ Aggregate over rows to a local value.
Examples
Aggregate over rows:
>>> dataset.aggregate_rows(hl.struct(n_high_quality=agg.count_where(dataset.qual > 40),
...                                  mean_qual=agg.mean(dataset.qual)))
Struct(n_high_quality=100150224, mean_qual=50.12515572)
Notes
Unlike most MatrixTable methods, this method does not support meaningful references to fields that are not global or indexed by row.
This method should be thought of as a more convenient alternative to the following:
>>> rows_table = dataset.rows()
>>> rows_table.aggregate(hl.struct(n_high_quality=agg.count_where(rows_table.qual > 40),
...                                mean_qual=agg.mean(rows_table.qual)))
Note
This method supports (and expects!) aggregation over rows.
Parameters: expr (Expression) – Aggregation expression.
Returns: any – Aggregated value dependent on expr.
-
annotate_cols
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Create new column-indexed fields by name.
Examples
Compute statistics about the GQ distribution per sample:
>>> dataset_result = dataset.annotate_cols(sample_gq_stats = agg.stats(dataset.GQ))
Add sample metadata from a hail.Table:
>>> dataset_result = dataset.annotate_cols(population = s_metadata[dataset.s].pop)
Note
This method supports aggregation over rows. For instance, the usage:
>>> dataset_result = dataset.annotate_cols(mean_GQ = agg.mean(dataset.GQ))
will compute the mean per column.
Notes
This method creates new column fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a column field foo. However, it would be possible to create a column field foo and later create another column field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
Parameters: named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – Matrix table with new column-indexed field(s).
-
annotate_entries
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Create new row-and-column-indexed fields by name.
Examples
Compute the allele dosage using the PL field:
>>> def get_dosage(pl):
...     # convert to linear scale
...     linear_scaled = pl.map(lambda x: 10 ** - (x / 10))
...
...     # normalize to sum to 1
...     ls_sum = hl.sum(linear_scaled)
...     linear_scaled = linear_scaled.map(lambda x: x / ls_sum)
...
...     # multiply by [0, 1, 2] and sum
...     return hl.sum(linear_scaled * [0, 1, 2])
>>>
>>> dataset_result = dataset.annotate_entries(dosage = get_dosage(dataset.PL))
Note
This method does not support aggregation.
Notes
This method creates new entry fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create an entry field foo. However, it would be possible to create an entry field foo and later create another entry field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
Parameters: named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – Matrix table with new row-and-column-indexed field(s).
-
annotate_globals
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Create new global fields by name.
Examples
Add two global fields:
>>> pops_1kg = {'EUR', 'AFR', 'EAS', 'SAS', 'AMR'}
>>> dataset_result = dataset.annotate_globals(pops_in_1kg = pops_1kg,
...                                           gene_list = ['SHH', 'SCN1A', 'SPTA1', 'DISC1'])
Add global fields from another table and matrix table:
>>> dataset_result = dataset.annotate_globals(thing1 = dataset2.index_globals().global_field,
...                                           thing2 = v_metadata.index_globals().global_field)
Note
This method does not support aggregation.
Notes
This method creates new global fields, but can also overwrite existing fields. Only same-scope fields can be overwritten: for example, it is not possible to annotate a row field foo and later create a global field foo. However, it would be possible to create a global field foo and later create another global field foo, overwriting the first.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
Parameters: named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – Matrix table with new global field(s).
-
annotate_rows
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Create new row-indexed fields by name.
Examples
Compute call statistics for high quality samples per variant:
>>> high_quality_calls = agg.filter(dataset.sample_qc.gq_stats.mean > 20, dataset.GT)
>>> dataset_result = dataset.annotate_rows(call_stats = agg.call_stats(high_quality_calls, dataset.alleles))
Add functional annotations from a Table keyed by locus and alleles, and from another MatrixTable:
>>> dataset_result = dataset.annotate_rows(consequence = v_metadata[dataset.locus, dataset.alleles].consequence,
...                                        dataset2_AF = dataset2.index_rows(dataset.row_key).info.AF)
Note
This method supports aggregation over columns. For instance, the usage:
>>> dataset_result = dataset.annotate_rows(mean_GQ = agg.mean(dataset.GQ))
will compute the mean per row.
Notes
This method creates new row fields, but can also overwrite existing fields. Only non-key, same-scope fields can be overwritten: for example, it is not possible to annotate a global field foo and later create a row field foo. However, it would be possible to create a row field foo and later create another row field foo, overwriting the first, as long as foo is not a row key.
The arguments to the method should either be Expression objects, or should be implicitly interpretable as expressions.
Parameters: named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – Matrix table with new row-indexed field(s).
-
cache
() → hail.matrixtable.MatrixTable[source]¶ Persist the dataset in memory.
Examples
Persist the dataset in memory:
>>> dataset = dataset.cache()
Notes
This method is an alias for persist("MEMORY_ONLY").
Returns: MatrixTable – Cached dataset.
-
choose_cols
(indices: List[int]) → MatrixTable[source]¶ Choose a new set of columns from a list of old column indices.
Examples
Randomly shuffle column order:
>>> import random
>>> indices = list(range(dataset.count_cols()))
>>> random.shuffle(indices)
>>> dataset_reordered = dataset.choose_cols(indices)
Take the first ten columns:
>>> dataset_result = dataset.choose_cols(list(range(10)))
Parameters: indices (list of int) – List of old column indices.
Returns: MatrixTable
-
col
¶ Returns a struct expression of all column-indexed fields, including keys.
Examples
Get all column field names:
>>> list(dataset.col)
['s', 'sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
Returns: StructExpression – Struct of all column fields.
-
col_key
¶ Column key struct.
Examples
Get the column key field names:
>>> list(dataset.col_key)
['s']
Returns: StructExpression
-
col_value
¶ Returns a struct expression including all non-key column-indexed fields.
Examples
Get all non-key column field names:
>>> list(dataset.col_value)
['sample_qc', 'is_case', 'pheno', 'cov', 'cov1', 'cov2', 'cohorts', 'pop']
Returns: StructExpression – Struct of all column fields, minus keys.
-
collect_cols_by_key
() → hail.matrixtable.MatrixTable[source]¶ Collect values for each unique column key into arrays.
Examples
>>> mt = hl.utils.range_matrix_table(3, 3)
>>> col_dict = hl.literal({0: [1], 1: [2, 3], 2: [4, 5, 6]})
>>> mt = (mt.annotate_cols(foo = col_dict.get(mt.col_idx))
...         .explode_cols('foo'))
>>> mt = mt.annotate_entries(bar = mt.row_idx * mt.foo)
>>> mt.cols().show()
+---------+-------+
| col_idx |   foo |
+---------+-------+
|   int32 | int32 |
+---------+-------+
|       0 |     1 |
|       1 |     2 |
|       1 |     3 |
|       2 |     4 |
|       2 |     5 |
|       2 |     6 |
+---------+-------+
>>> mt.entries().show()
+---------+---------+-------+-------+
| row_idx | col_idx |   foo |   bar |
+---------+---------+-------+-------+
|   int32 |   int32 | int32 | int32 |
+---------+---------+-------+-------+
|       0 |       0 |     1 |     0 |
|       0 |       1 |     2 |     0 |
|       0 |       1 |     3 |     0 |
|       0 |       2 |     4 |     0 |
|       0 |       2 |     5 |     0 |
|       0 |       2 |     6 |     0 |
|       1 |       0 |     1 |     1 |
|       1 |       1 |     2 |     2 |
|       1 |       1 |     3 |     3 |
|       1 |       2 |     4 |     4 |
+---------+---------+-------+-------+
showing top 10 rows
>>> mt = mt.collect_cols_by_key()
>>> mt.cols().show()
+---------+--------------+
| col_idx |          foo |
+---------+--------------+
|   int32 | array<int32> |
+---------+--------------+
|       1 |        [2,3] |
|       0 |          [1] |
|       2 |      [4,5,6] |
+---------+--------------+
>>> mt.entries().show()
+---------+---------+--------------+--------------+
| row_idx | col_idx |          foo |          bar |
+---------+---------+--------------+--------------+
|   int32 |   int32 | array<int32> | array<int32> |
+---------+---------+--------------+--------------+
|       0 |       1 |        [2,3] |        [0,0] |
|       0 |       0 |          [1] |          [0] |
|       0 |       2 |      [4,5,6] |      [0,0,0] |
|       1 |       1 |        [2,3] |        [2,3] |
|       1 |       0 |          [1] |          [1] |
|       1 |       2 |      [4,5,6] |      [4,5,6] |
|       2 |       1 |        [2,3] |        [4,6] |
|       2 |       0 |          [1] |          [2] |
|       2 |       2 |      [4,5,6] |    [8,10,12] |
+---------+---------+--------------+--------------+
Notes
Each entry field and each non-key column field of type t is replaced by a field of type array<t>. The value of each such field is an array containing all values of that field sharing the corresponding column key. In each column, the newly collected arrays all have the same length, and the values of each pre-collection column are guaranteed to be located at the same index in their corresponding arrays.
Note
The order of the columns is not guaranteed.
Returns: MatrixTable
-
cols
() → hail.table.Table[source]¶ Returns a table with all column fields in the matrix.
Examples
Extract the column table:
>>> cols_table = dataset.cols()
Warning
Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the column key (which becomes the table key). To preserve the original column order as the table row order, first unkey the columns using key_cols_by() with no arguments.
Returns: Table – Table with all column fields from the matrix, with one row per column of the matrix.
-
count
() → Tuple[int, int][source]¶ Count the number of rows and columns in the matrix.
Examples
>>> dataset.count()
Returns: int, int – Number of rows, number of columns.
-
count_cols
() → int[source]¶ Count the number of columns in the matrix.
Examples
Count the number of columns:
>>> n_cols = dataset.count_cols()
Returns: int – Number of columns in the matrix.
-
count_rows
() → int[source]¶ Count the number of rows in the matrix.
Examples
Count the number of rows:
>>> n_rows = dataset.count_rows()
Returns: int – Number of rows in the matrix.
-
describe
(handler=<built-in function print>)[source]¶ Print information about the fields in the matrix.
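Examples
Print information about the fields of the dataset:
>>> dataset.describe()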
-
distinct_by_col
()[source]¶ Remove columns with a duplicate column key.
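Examples
Keep one column per unique column key (a minimal sketch):
>>> dataset_result = dataset.distinct_by_col()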
Returns: MatrixTable
-
distinct_by_row
()[source]¶ Remove rows with a duplicate row key.
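Examples
Keep one row per unique row key (a minimal sketch):
>>> dataset_result = dataset.distinct_by_row()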
Returns: MatrixTable
-
drop
(*exprs) → MatrixTable[source]¶ Drop fields.
Examples
Drop fields PL (an entry field), info (a row field), and pheno (a column field) using strings:
>>> dataset_result = dataset.drop('PL', 'info', 'pheno')
Drop fields PL (an entry field), info (a row field), and pheno (a column field) using field references:
>>> dataset_result = dataset.drop(dataset.PL, dataset.info, dataset.pheno)
Drop a list of fields:
>>> fields_to_drop = ['PL', 'info', 'pheno']
>>> dataset_result = dataset.drop(*fields_to_drop)
Notes
This method can be used to drop global, row-indexed, column-indexed, or row-and-column-indexed (entry) fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).
Key fields (belonging to either the row key or the column key) cannot be dropped using this method. In order to drop a key field, use key_rows_by() or key_cols_by() to remove the field from the key before dropping, as sketched below.
While many operations exist independently for rows, columns, entries, and globals, only one is needed for dropping due to the lack of any necessary contextual information.
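For example, a sketch of dropping the row key field 'alleles' by first removing it from the key:
>>> dataset_result = dataset.key_rows_by('locus').drop('alleles')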
Parameters: exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
Returns: MatrixTable – Matrix table without specified fields.
-
drop_cols
() → hail.matrixtable.MatrixTable[source]¶ Drop all columns of the matrix. This is equivalent to:
>>> dataset_result = dataset.filter_cols(False)
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Returns: MatrixTable – Matrix table with no columns.
-
drop_rows
() → hail.matrixtable.MatrixTable[source]¶ Drop all rows of the matrix. This is equivalent to:
>>> dataset_result = dataset.filter_rows(False)
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Returns: MatrixTable – Matrix table with no rows.
-
entries
() → hail.table.Table[source]¶ Returns a matrix in coordinate table form.
Examples
Extract the entry table:
>>> entries_table = dataset.entries()
Warning
The table returned by this method should be used for aggregation or queries, but never exported or written to disk without extensive filtering and field selection – the disk footprint of an entries_table could be 100x (or more!) larger than its parent matrix. This means that if you try to export the entries table of a 10 terabyte matrix, you could write a petabyte of data!
Warning
Matrix table columns are typically sorted by the order at import, and not necessarily by column key. Since tables are always sorted by key, the table which results from this command will have its rows sorted by the compound key (row key, column key), which becomes the table key. To preserve the original row-major entry order as the table row order, first unkey the columns using key_cols_by() with no arguments.
Returns: Table – Table with all non-global fields from the matrix, with one row per entry of the matrix.
-
entry
¶ Returns a struct expression including all row-and-column-indexed fields.
Examples
Get all entry field names:
>>> list(dataset.entry)
['GT', 'AD', 'DP', 'GQ', 'PL']
Returns: StructExpression – Struct of all entry fields.
-
explode_cols
(field_expr) → MatrixTable[source]¶ Explodes a column field of type array or set, copying the entire column for each element.
Examples
Explode columns by annotated cohorts:
>>> dataset_result = dataset.explode_cols(dataset.cohorts)
Notes
The new matrix table will have N copies of each column, where N is the number of elements that column contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated columns by the sequence of elements in the array or set. All other fields remain the same, including entry fields.
If the field referenced with field_expr is missing or empty, the column is removed entirely.
Parameters: field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
Returns: MatrixTable – Matrix table exploded column-wise for each element of field_expr.
-
explode_rows
(field_expr) → MatrixTable[source]¶ Explodes a row field of type array or set, copying the entire row for each element.
Examples
Explode rows by annotated genes:
>>> dataset_result = dataset.explode_rows(dataset.gene)
Notes
The new matrix table will have N copies of each row, where N is the number of elements that row contains for the field denoted by field_expr. The field referenced in field_expr is replaced in the sequence of duplicated rows by the sequence of elements in the array or set. All other fields remain the same, including entry fields.
If the field referenced with field_expr is missing or empty, the row is removed entirely.
Parameters: field_expr (str or Expression) – Field name or (possibly nested) field reference expression.
Returns: MatrixTable – Matrix table exploded row-wise for each element of field_expr.
-
filter_cols
(expr, keep: bool = True) → MatrixTable[source]¶ Filter columns of the matrix.
Examples
Keep columns where pheno.is_case is True and pheno.age is larger than 50:
>>> dataset_result = dataset.filter_cols(dataset.pheno.is_case &
...                                      (dataset.pheno.age > 50),
...                                      keep=True)
Remove columns where sample_qc.gq_stats.mean is less than 20:
>>> dataset_result = dataset.filter_cols(dataset.sample_qc.gq_stats.mean < 20,
...                                      keep=False)
Remove columns where s is found in a Python set:
>>> samples_to_remove = {'NA12878', 'NA12891', 'NA12892'}
>>> set_to_remove = hl.literal(samples_to_remove)
>>> dataset_result = dataset.filter_cols(~set_to_remove.contains(dataset['s']))
Notes
The expression expr will be evaluated for every column of the table. If keep is True, then columns where expr evaluates to False will be removed (the filter keeps the columns where the predicate evaluates to True). If keep is False, then columns where expr evaluates to True will be removed (the filter removes the columns where the predicate evaluates to True).
Warning
When expr evaluates to missing, the column will be removed regardless of keep.
Note
This method supports aggregation over rows. For instance,
>>> dataset_result = dataset.filter_cols(agg.mean(dataset.GQ) > 20.0)
will remove columns where the mean GQ of all entries in the column is smaller than 20.
Parameters:
- expr (bool or BooleanExpression) – Filter expression.
- keep (bool) – Keep columns where expr is true.
Returns: MatrixTable – Filtered matrix table.
-
filter_entries
(expr, keep: bool = True) → MatrixTable[source]¶ Filter entries of the matrix.
Examples
Keep entries where the sum of AD is greater than 10 and GQ is greater than 20:
>>> dataset_result = dataset.filter_entries((hl.sum(dataset.AD) > 10) & (dataset.GQ > 20))
Notes
The expression expr will be evaluated for every entry of the table. If keep is True, then entries where expr evaluates to False will be removed (the filter keeps the entries where the predicate evaluates to True). If keep is False, then entries where expr evaluates to True will be removed (the filter removes the entries where the predicate evaluates to True).
Note
“Removal” of an entry constitutes setting all its fields to missing. There is some debate about what removing an entry of a matrix means semantically, given the representation of a MatrixTable as a whole workspace in Hail.
Warning
When expr evaluates to missing, the entry will be removed regardless of keep.
Note
This method does not support aggregation.
Parameters:
- expr (bool or BooleanExpression) – Filter expression.
- keep (bool) – Keep entries where expr is true.
Returns: MatrixTable – Filtered matrix table.
-
filter_rows
(expr, keep: bool = True) → MatrixTable[source]¶ Filter rows of the matrix.
Examples
Keep rows where variant_qc.AF is below 1%:
>>> dataset_result = dataset.filter_rows(dataset.variant_qc.AF[1] < 0.01, keep=True)
Remove rows where filters is non-empty:
>>> dataset_result = dataset.filter_rows(dataset.filters.size() > 0, keep=False)
Notes
The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to False will be removed (the filter keeps the rows where the predicate evaluates to True). If keep is False, then rows where expr evaluates to True will be removed (the filter removes the rows where the predicate evaluates to True).
Warning
When expr evaluates to missing, the row will be removed regardless of keep.
Note
This method supports aggregation over columns. For instance,
>>> dataset_result = dataset.filter_rows(agg.mean(dataset.GQ) > 20.0)
will remove rows where the mean GQ of all entries in the row is smaller than 20.
Parameters:
- expr (bool or BooleanExpression) – Filter expression.
- keep (bool) – Keep rows where expr is true.
Returns: MatrixTable – Filtered matrix table.
-
classmethod
from_rows_table
(table: hail.table.Table) → MatrixTable[source]¶ Construct matrix table with no columns from a table.
Danger
This functionality is experimental. It may not be tested as well as other parts of Hail and the interface is subject to change.
Examples
Import a text table and construct a rows-only matrix table:
>>> table = hl.import_table('data/variant-lof.tsv')
>>> table = table.transmute(**hl.parse_variant(table['v'])).key_by('locus', 'alleles')
>>> sites_vds = hl.MatrixTable.from_rows_table(table)
Notes
All fields in the table become row-indexed fields in the result.
Parameters: table (Table) – The table to be converted.
Returns: MatrixTable
-
globals
¶ Returns a struct expression including all global fields.
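Examples
Get all global field names (the fields shown depend on the dataset):
>>> list(dataset.globals)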
Returns: StructExpression
-
globals_table
() → hail.table.Table[source]¶ Returns a table with a single row with the globals of the matrix table.
Examples
Extract the globals table:
>>> globals_table = dataset.globals_table()
Returns: Table – Table with the globals from the matrix, with a single row.
-
group_cols_by
(*exprs, **named_exprs) → GroupedMatrixTable[source]¶ Group columns, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with cohort as column keys, computing the call rate as an entry field:
>>> dataset_result = (dataset.group_cols_by(dataset.cohort)
...                          .aggregate(call_rate = agg.fraction(hl.is_defined(dataset.GT))))
Notes
All complex expressions must be passed as named expressions.
Parameters:
- exprs (args of str or Expression) – Column fields to group by.
- named_exprs (keyword args of Expression) – Column-indexed expressions to group by.
Returns: GroupedMatrixTable – Grouped matrix, can be used to call GroupedMatrixTable.aggregate().
-
group_rows_by
(*exprs, **named_exprs) → GroupedMatrixTable[source]¶ Group rows, used with GroupedMatrixTable.aggregate().
Examples
Aggregate to a matrix with genes as row keys, computing the number of non-reference calls as an entry field:
>>> dataset_result = (dataset.group_rows_by(dataset.gene)
...                          .aggregate(n_non_ref = agg.count_where(dataset.GT.is_non_ref())))
Notes
All complex expressions must be passed as named expressions.
Parameters:
- exprs (args of str or Expression) – Row fields to group by.
- named_exprs (keyword args of Expression) – Row-indexed expressions to group by.
Returns: GroupedMatrixTable – Grouped matrix. Can be used to call GroupedMatrixTable.aggregate().
-
head
(n: int) → MatrixTable[source]¶ Subset matrix to first n rows.
Examples
Subset to the first three rows of the matrix:
>>> dataset_result = dataset.head(3)
>>> dataset_result.count_rows()
3
Notes
The number of partitions in the new matrix is equal to the number of partitions containing the first n rows.
Parameters: n (int) – Number of rows to include.
Returns: MatrixTable – Matrix including the first n rows.
-
index_cols
(*exprs)[source]¶ Expose the column values as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.s).pheno)
Or equivalently:
>>> dataset_result = dataset.annotate_cols(pheno = dataset2.index_cols(dataset.col_key).pheno)
Parameters: exprs (variable-length args of Expression) – Index expressions.
Notes
index_cols(exprs) is equivalent to cols().index(exprs) or cols()[exprs].
The type of the resulting struct is the same as the type of col_value().
Returns: StructExpression
-
index_entries
(row_exprs, col_exprs)[source]¶ Expose the entries as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2.index_entries(dataset.row_key, dataset.col_key).GQ)
Or equivalently:
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)
Parameters:
- row_exprs (tuple of Expression) – Row index expressions.
- col_exprs (tuple of Expression) – Column index expressions.
Notes
The type of the resulting struct is the same as the type of entry().
Note
There is a shorthand syntax for MatrixTable.index_entries() using square brackets (the Python __getitem__ syntax). This syntax is preferred.
>>> dataset_result = dataset.annotate_entries(GQ2 = dataset2[dataset.row_key, dataset.col_key].GQ)
- row_exprs (tuple of
-
index_globals
() → hail.expr.expressions.base_expression.Expression[source]¶ Return this matrix table’s global variables for use in another expression context.
Examples
>>> dataset1 = dataset.annotate_globals(pli={'SCN1A': 0.999, 'SONIC': 0.014})
>>> pli_dict = dataset1.index_globals().pli
>>> dataset_result = dataset2.annotate_rows(gene_pli = dataset2.gene.map(lambda x: pli_dict.get(x)))
Returns: StructExpression
-
index_rows
(*exprs)[source]¶ Expose the row values as if looked up in a dictionary, indexing with exprs.
Examples
>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.locus, dataset.alleles).qual)
Or equivalently:
>>> dataset_result = dataset.annotate_rows(qual = dataset2.index_rows(dataset.row_key).qual)
Parameters: exprs (variable-length args of Expression) – Index expressions.
Notes
index_rows(exprs) is equivalent to rows().index(exprs) or rows()[exprs].
The type of the resulting struct is the same as the type of row_value().
Returns: StructExpression
-
key_cols_by
(*keys, **named_keys) → MatrixTable[source]¶ Key columns by a new set of fields.
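Examples
A minimal sketch, assuming a column field pop as annotated in the class-level examples; strings and expressions both work:
>>> dataset_result = dataset.key_cols_by('pop')
>>> dataset_result = dataset.key_cols_by(dataset['pop'])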
See Table.key_by() for more information on defining a key.
Parameters:
- keys (varargs of str or Expression) – Column fields to key by.
- named_keys (keyword args of Expression) – Column fields to key by.
Returns: MatrixTable
-
key_rows_by
(*keys, **named_keys) → MatrixTable[source]¶ Key rows by a new set of fields.
Examples
>>> dataset_result = dataset.key_rows_by('locus')
>>> dataset_result = dataset.key_rows_by(dataset['locus'])
>>> dataset_result = dataset.key_rows_by(**dataset.row_key.drop('alleles'))
All of these expressions key the dataset by the ‘locus’ field, dropping the ‘alleles’ field from the row key.
>>> dataset_result = dataset.key_rows_by(contig=dataset['locus'].contig,
...                                      position=dataset['locus'].position,
...                                      alleles=dataset['alleles'])
This keys the dataset by the newly defined fields, ‘contig’ and ‘position’, and the ‘alleles’ field. The old row key field, ‘locus’, is preserved as a non-key field.
Notes
See Table.key_by() for more information on defining a key.
Parameters:
- keys (varargs of str or Expression) – Row fields to key by.
- named_keys (keyword args of Expression) – Row fields to key by.
Returns: MatrixTable
-
make_table
(separator='.') → hail.table.Table[source]¶ Make a table from a matrix table with one field per sample.
Examples
Consider a matrix table with the following schema:
Global fields:
    'batch': str
Column fields:
    's': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
Entry fields:
    'GT': call
    'GQ': int32
Column key:
    's': str
Row key:
    'locus': locus<GRCh37>
    'alleles': array<str>
and three sample IDs: A, B and C. Then the result of make_table():
>>> ht = mt.make_table()
has the original row fields along with 6 additional fields, one for each sample and entry field:
Global fields:
    'batch': str
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'A.GT': call
    'A.GQ': int32
    'B.GT': call
    'B.GQ': int32
    'C.GT': call
    'C.GQ': int32
Key:
    'locus': locus<GRCh37>
    'alleles': array<str>
Notes
The table has one row for each row of the input matrix. The per-sample fields are formed by concatenating each sample ID with each entry field name, using separator. If the entry field name is empty, the separator is omitted.
The table inherits the globals from the matrix table.
Parameters: separator (str) – Separator between sample IDs and entry field names.
Returns: Table
-
n_partitions
() → int[source]¶ Number of partitions.
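Examples
Query the current partitioning:
>>> dataset.n_partitions()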
Notes
The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. Partitions are a core concept of distributed computation in Spark; see the Spark documentation for details.
Returns: int – Number of partitions.
-
naive_coalesce
(max_partitions: int) → MatrixTable[source]¶ Naively decrease the number of partitions.
Example
Naively repartition to 10 partitions:
>>> dataset_result = dataset.naive_coalesce(10)
Warning
naive_coalesce() simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance, unlike repartition(), so it can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.
Parameters: max_partitions (int) – Desired number of partitions. If the current number of partitions is less than or equal to max_partitions, do nothing.
Returns: MatrixTable – Matrix table with at most max_partitions partitions.
-
persist
(storage_level: str = 'MEMORY_AND_DISK') → MatrixTable[source]¶ Persist this table in memory or on disk.
Examples
Persist the dataset to both memory and disk:
>>> dataset = dataset.persist()
Notes
The MatrixTable.persist() and MatrixTable.cache() methods store the current dataset on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitute for MatrixTable.write(), which stores a permanent file.
Most users should use the "MEMORY_AND_DISK" storage level. See the Spark documentation for a more in-depth discussion of persisting data.
Parameters: storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Returns: MatrixTable – Persisted dataset.
-
rename
(fields: Dict[str, str]) → MatrixTable[source]¶ Rename fields of a matrix table.
Examples
Rename column key s to SampleID, still keying by SampleID.
>>> dataset_result = dataset.rename({'s': 'SampleID'})
You can rename a field to a field name that already exists, as long as that field also gets renamed (no name collisions). Here, we rename the column key s to info, and the row field info to vcf_info:
>>> dataset_result = dataset.rename({'s': 'info', 'info': 'vcf_info'})
Parameters: fields (dict from str to str) – Mapping from old field names to new field names.
Returns: MatrixTable – Matrix table with renamed fields.
-
repartition
(n_partitions: int, shuffle: bool = True) → MatrixTable[source]¶ Increase or decrease the number of partitions.
Examples
Repartition to 500 partitions:
>>> dataset_result = dataset.repartition(500)
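A coalesce-style sketch that avoids a full shuffle (assumes the current partition count exceeds 100; with shuffle=False the count can only decrease):
>>> dataset_result = dataset.repartition(100, shuffle=False)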
Notes
Check the current number of partitions with n_partitions().
The data in a dataset is divided into chunks called partitions, which may be stored together or across a network, so that each partition may be read and processed in parallel by available cores. When a matrix with \(M\) rows is first imported, each of the \(k\) partitions will contain about \(M/k\) of the rows. Since each partition has some computational overhead, decreasing the number of partitions can improve performance after significant filtering. Since it’s recommended to have at least 2 - 4 partitions per core, increasing the number of partitions can allow one to take advantage of more cores. Partitions are a core concept of distributed computation in Spark; see their documentation for details.
With shuffle=True, Hail does a full shuffle of the data and creates equal sized partitions. With shuffle=False, Hail combines existing partitions to avoid a full shuffle. These algorithms correspond to the repartition and coalesce commands in Spark, respectively. In particular, when shuffle=False, n_partitions cannot exceed the current number of partitions.
Note
If shuffle is False, the number of partitions may only be reduced, not increased.
Parameters:
- n_partitions (int) – Desired number of partitions.
- shuffle (bool) – If True, use full shuffle to repartition.
Returns: MatrixTable – Repartitioned dataset.
-
row
¶ Returns a struct expression of all row-indexed fields, including keys.
Examples
Get the first five row field names:
>>> list(dataset.row)[:5]
['locus', 'alleles', 'rsid', 'qual', 'filters']
Returns: StructExpression – Struct of all row fields.
-
row_key
¶ Row key struct.
Examples
Get the row key field names:
>>> list(dataset.row_key)
['locus', 'alleles']
Returns: StructExpression
-
row_value
¶ Returns a struct expression including all non-key row-indexed fields.
Examples
Get the first three non-key row field names:
>>> list(dataset.row_value)[:3]
['rsid', 'qual', 'filters']
Returns: StructExpression – Struct of all row fields, minus keys.
-
rows
() → hail.table.Table[source]¶ Returns a table with all row fields in the matrix.
Examples
Extract the row table:
>>> rows_table = dataset.rows()
Returns: Table – Table with all row fields from the matrix, with one row per row of the matrix.
-
sample_rows
(p: float, seed=None) → MatrixTable[source]¶ Downsample the matrix table by keeping each row with probability p.
Examples
Downsample the dataset to approximately 1% of its rows.
>>> small_dataset = dataset.sample_rows(0.01)
Parameters:
- p (float) – Probability of keeping each row.
- seed (int) – Random seed.
Returns: MatrixTable – Matrix table with approximately p * n_rows rows.
-
select_cols
(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]¶ Select existing column fields or create new fields by name, dropping the rest.
Examples
Select existing fields and compute a new one:
>>> dataset_result = dataset.select_cols(
...     dataset.sample_qc,
...     dataset.pheno.age,
...     isCohort1 = dataset.pheno.cohort_name == 'Cohort1')
Notes
This method creates new column fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method supports aggregation over rows. For instance, the usage:
>>> dataset_result = dataset.select_cols(mean_GQ = agg.mean(dataset.GQ))
will compute the mean per column.
Parameters:
- exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
- named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – MatrixTable with specified column fields.
-
select_entries
(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]¶ Select existing entry fields or create new fields by name, dropping the rest.
Examples
Drop all entry fields aside from GT:
>>> dataset_result = dataset.select_entries(dataset.GT)
Notes
This method creates new entry fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method does not support aggregation.
Parameters:
- exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
- named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – MatrixTable with specified entry fields.
-
select_globals
(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]¶ Select existing global fields or create new fields by name, dropping the rest.
Examples
Select one existing field and compute a new one:
>>> dataset_result = dataset.select_globals(dataset.global_field_1,
...                                         another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])
Notes
This method creates new global fields. If a created field shares its name with a differently-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.
Note
This method does not support aggregation.
Parameters:
- exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
- named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – MatrixTable with specified global fields.
-
select_rows
(*exprs, **named_exprs) → hail.matrixtable.MatrixTable[source]¶ Select existing row fields or create new fields by name, dropping all other non-key fields.
Examples
Select existing fields and compute a new one:
>>> dataset_result = dataset.select_rows(
...     dataset.variant_qc.gq_stats.mean,
...     high_quality_cases = agg.count_where((dataset.GQ > 20) &
...                                          dataset.is_case))
Notes
This method creates new row fields. If a created field shares its name with a differently-indexed field of the table, or with a row key, the method will fail.
Row keys are preserved. To drop or change a row key field, use MatrixTable.key_rows_by().
Note
See Table.select() for more information about using select methods.
Note
This method supports aggregation over columns. For instance, the usage:
>>> dataset_result = dataset.select_rows(mean_GQ = agg.mean(dataset.GQ))
will compute the mean per row.
Parameters:
- exprs (variable-length args of str or Expression) – Arguments that specify field names or nested field reference expressions.
- named_exprs (keyword args of Expression) – Field names and the expressions to compute them.
Returns: MatrixTable – MatrixTable with specified row fields.
-
transmute_cols
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Similar to MatrixTable.annotate_cols(), but drops referenced fields.
Notes
This method adds new column fields according to named_exprs, and drops all column fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
Note
transmute_cols() will not drop key fields.
Note
This method supports aggregation over rows.
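For example, a sketch using the pop column field annotated in the class-level examples; pop is referenced, so it is dropped:
>>> dataset_result = dataset.transmute_cols(is_eur = dataset.pop == 'EUR')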
Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: MatrixTable
-
transmute_entries
(**named_exprs)[source]¶ Similar to MatrixTable.annotate_entries(), but drops referenced fields.
Notes
This method adds new entry fields according to named_exprs, and drops all entry fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
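For example, a sketch replacing the GQ and DP entry fields with their ratio; both referenced fields are dropped:
>>> dataset_result = dataset.transmute_entries(gq_dp_ratio = dataset.GQ / dataset.DP)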
Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: MatrixTable
-
transmute_globals
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Similar to MatrixTable.annotate_globals(), but drops referenced fields.
Notes
This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
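For example, a sketch using the pli global annotated in the class-level examples; pli is referenced, so it is dropped:
>>> dataset_result = dataset.transmute_globals(pli_genes = dataset.pli.keys())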
Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: MatrixTable
-
transmute_rows
(**named_exprs) → hail.matrixtable.MatrixTable[source]¶ Similar to MatrixTable.annotate_rows(), but drops referenced fields.
Notes
This method adds new row fields according to named_exprs, and drops all row fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.
Note
transmute_rows() will not drop key fields.
Note
This method supports aggregation over columns.
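For example, a sketch using the row fields annotated in the class-level examples; variant_gq and variant_dp are referenced, so both are dropped:
>>> dataset_result = dataset.transmute_rows(dp_gq_ratio = dataset.variant_dp / dataset.variant_gq)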
Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: MatrixTable
-
union_cols
(other: MatrixTable) → MatrixTable[source]¶ Take the union of dataset columns.
Examples
Union the columns of two datasets:
>>> dataset_result = dataset_to_union_1.union_cols(dataset_to_union_2)
Notes
In order to combine two datasets, three requirements must be met:
- The row keys must match.
- The column key schemas and column schemas must match.
- The entry schemas must match.
The row fields in the resulting dataset are the row fields from the first dataset; the row schemas do not need to match.
This method performs an inner join on rows and concatenates entries from the two datasets for each row.
This method does not deduplicate; if a column key exists identically in two datasets, then it will be duplicated in the result.
Parameters: other (MatrixTable) – Dataset to concatenate.
Returns: MatrixTable – Dataset with columns from both datasets.
-
union_rows
(*datasets) → MatrixTable[source]¶ Take the union of dataset rows.
Examples
Union the rows of two datasets:
>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)
Given a list of datasets, take the union of all rows:
>>> all_datasets = [dataset_to_union_1, dataset_to_union_2]
The following three syntaxes are equivalent:
>>> dataset_result = dataset_to_union_1.union_rows(dataset_to_union_2)
>>> dataset_result = all_datasets[0].union_rows(*all_datasets[1:])
>>> dataset_result = hl.MatrixTable.union_rows(*all_datasets)
Notes
In order to combine two datasets, three requirements must be met:
- The column keys must be identical in type, value, and ordering.
- The row key schemas and row schemas must match.
- The entry schemas must match.
The column fields in the resulting dataset are the column fields from the first dataset; the column schemas do not need to match.
This method does not deduplicate; if a row exists identically in two datasets, then it will be duplicated in the result.
Warning
This method can trigger a shuffle, if partitions from two datasets overlap.
Parameters: datasets (varargs of MatrixTable) – Datasets to combine.
Returns: MatrixTable – Dataset with rows from each member of datasets.
-
unpersist
() → hail.matrixtable.MatrixTable[source]¶ Unpersists this dataset from memory/disk.
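Examples
Unpersist a previously persisted dataset:
>>> dataset = dataset.unpersist()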
Notes
This function will have no effect on a dataset that was not previously persisted.
Returns: MatrixTable – Unpersisted dataset.
-
write
(output: str, overwrite: bool = False, stage_locally: bool = False, _codec_spec: Union[str, NoneType] = None)[source]¶ Write to disk.
Examples
>>> dataset.write('output/dataset.mt')
Warning
Do not write to a path that is being read from in the same computation.
Parameters:
- output (str) – Path at which to write.
- stage_locally (bool) – If True, major output will be written to temporary local storage before being copied to output.
- overwrite (bool) – If True, overwrite an existing file at the destination.
-