Table

class hail.Table(jt)

Hail's distributed implementation of a dataframe or SQL table.
Use read_table() to read a table that was written with Table.write(). Use to_spark() and Table.from_spark() to inter-operate with PySpark's SQL and machine learning functionality.

Examples

The examples below use table1 and table2, which are imported from text files using import_table().

>>> table1 = hl.import_table('data/kt_example1.tsv', impute=True, key='ID')
>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
>>> table2 = hl.import_table('data/kt_example2.tsv', impute=True, key='ID')
>>> table2.show()
+-------+-------+--------+
|    ID |     A | B      |
+-------+-------+--------+
| int32 | int32 | str    |
+-------+-------+--------+
|     1 |    65 | cat    |
|     2 |    72 | dog    |
|     3 |    70 | mouse  |
|     4 |    60 | rabbit |
+-------+-------+--------+
Define new annotations:
>>> height_mean_m = 68
>>> height_sd_m = 3
>>> height_mean_f = 65
>>> height_sd_f = 2.5
>>>
>>> def get_z(height, sex):
...     return hl.cond(sex == 'M',
...                    (height - height_mean_m) / height_sd_m,
...                    (height - height_mean_f) / height_sd_f)
>>>
>>> table1 = table1.annotate(height_z = get_z(table1.HT, table1.SEX))
>>> table1 = table1.annotate_globals(global_field_1 = [1, 2, 3])
Filter rows of the table:
>>> table2 = table2.filter(table2.B != 'rabbit')
Compute global aggregation statistics:
>>> t1_stats = table1.aggregate(hl.struct(mean_c1 = agg.mean(table1.C1),
...                                       mean_c2 = agg.mean(table1.C2),
...                                       stats_c3 = agg.stats(table1.C3)))
>>> print(t1_stats)
Group by a field and aggregate to produce a new table:
>>> table3 = (table1.group_by(table1.SEX)
...           .aggregate(mean_height_data = agg.mean(table1.HT)))
>>> table3.show()
Join tables together inside an annotation expression:
>>> table2 = table2.key_by('ID')
>>> table1 = table1.annotate(B = table2[table1.ID].B)
>>> table1.show()
Attributes

globals
    Returns a struct expression including all global fields.
key
    Row key struct.
row
    Returns a struct expression of all row-indexed fields, including keys.
row_value
    Returns a struct expression including all non-key row-indexed fields.

Methods

__init__
    Initialize self.
add_index
    Add the integer index of each row as a new row field.
aggregate
    Aggregate over rows into a local value.
all
    Evaluate whether a boolean expression is true for all rows.
annotate
    Add new fields.
annotate_globals
    Add new global fields.
any
    Evaluate whether a Boolean expression is true for at least one row.
cache
    Persist this table in memory.
collect
    Collect the rows of the table into a local list.
collect_by_key
    Collect values for each unique key into an array.
count
    Count the number of rows in the table.
describe
    Print information about the fields in the table.
distinct
    Keep only one row for each unique key.
drop
    Drop fields from the table.
expand_types
    Expand complex types into structs and arrays.
explode
    Explode rows along a top-level field of the table.
export
    Export to a TSV file.
filter
    Filter rows.
flatten
    Flatten nested structs.
from_pandas
    Create table from Pandas DataFrame.
from_spark
    Convert PySpark SQL DataFrame to a table.
group_by
    Group by a new key for use with GroupedTable.aggregate().
head
    Subset table to first n rows.
index
    Expose the row values as if looked up in a dictionary, indexing with exprs.
index_globals
    Return this table's global variables for use in another expression context.
join
    Join two tables together.
key_by
    Key table by a new set of fields.
n_partitions
    Returns the number of partitions in the table.
order_by
    Sort by the specified fields.
parallelize
persist
    Persist this table in memory or on disk.
rename
    Rename fields of the table.
repartition
    Change the number of distributed partitions.
sample
    Downsample the table by keeping each row with probability p.
select
    Select existing fields or create new fields by name, dropping the rest.
select_globals
    Select existing global fields or create new fields by name, dropping the rest.
show
    Print the first few rows of the table to the console.
take
    Collect the first n rows of the table into a local list.
to_matrix_table
    Construct a matrix table from a table in coordinate representation.
to_pandas
    Converts this table to a Pandas DataFrame.
to_spark
    Converts this table to a Spark DataFrame.
transmute
    Add new fields and drop fields referenced.
transmute_globals
    Similar to Table.annotate_globals(), but drops referenced fields.
union
    Union the rows of multiple tables.
unpersist
    Unpersists this table from memory/disk.
write
    Write to disk.
add_index(name='idx')

Add the integer index of each row as a new row field.
Examples
>>> table_result = table1.add_index()
>>> table_result.show()
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |   idx |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 | int64 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |     0 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |     2 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |     3 |
+-------+-------+-----+-------+-------+-------+-------+-------+-------+
Notes
This method returns a table with a new field whose name is given by the name parameter, with type tint64. The value of this field is the integer index of each row, starting from 0. Methods that respect ordering (like Table.take() or Table.export()) will return rows in order.

This method is also helpful for creating a unique integer index for rows of a table so that more complex types can be encoded as a simple number for performance reasons.
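For instance, a minimal sketch of using the new index as a compact integer key (the field name row_id is illustrative, not part of the API):

>>> # key the table by a unique int64 row index; 'row_id' is a hypothetical name
>>> table_keyed = table1.add_index('row_id').key_by('row_id')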
Parameters: name (str) – Name of index field.
Returns: Table – Table with a new index field.
aggregate(expr)

Aggregate over rows into a local value.
Examples
Aggregate over rows:
>>> table1.aggregate(hl.struct(fraction_male=agg.fraction(table1.SEX == 'M'),
...                            mean_x=agg.mean(table1.X)))
Struct(fraction_male=0.5, mean_x=6.5)
Note
This method supports (and expects!) aggregation over rows.
Parameters: expr (Expression) – Aggregation expression.
Returns: any – Aggregated value dependent on expr.
all(expr)

Evaluate whether a boolean expression is true for all rows.
Examples
Test whether C1 equals 5 in all rows of the table:

>>> if table1.all(table1.C1 == 5):
...     print("All rows have C1 equal 5.")
Parameters: expr (BooleanExpression) – Expression to test.
Returns: bool
annotate(**named_exprs)

Add new fields.
Examples
Add field Y by computing the square of X:
>>> table_result = table1.annotate(Y = table1.X ** 2)
Add multiple fields simultaneously:
>>> table_result = table1.annotate(A = table1.X / 2,
...                                B = table1.X + 21)
Parameters: named_exprs (keyword args of Expression) – Expressions for new fields.
Returns: Table – Table with new fields.
annotate_globals(**named_exprs)

Add new global fields.
Examples
Add a new global field:
>>> table_result = table1.annotate_globals(pops = ['EUR', 'AFR', 'EAS', 'SAS'])
Note
This method does not support aggregation.
Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: Table – Table with new global field(s).
any(expr)

Evaluate whether a Boolean expression is true for at least one row.
Examples
Test whether C1 is equal to 5 in any row of the table:

>>> if table1.any(table1.C1 == 5):
...     print("At least one row has C1 equal 5.")
Parameters: expr (BooleanExpression) – Boolean expression.
Returns: bool – True if the predicate evaluated to True for any row, otherwise False.
cache()

Persist this table in memory.
Examples
Persist the table in memory:
>>> table = table.cache()
Notes
This method is an alias for persist("MEMORY_ONLY").

Returns: Table – Cached table.
collect()

Collect the rows of the table into a local list.
Examples
Collect a list of all X records:
>>> all_xs = [row['X'] for row in table1.select(table1.X).collect()]
Notes
This method returns a list whose elements are of type Struct. Fields of these structs can be accessed similarly to fields on a table, using dot methods (struct.foo) or string indexing (struct['foo']).

Warning

Using this method can cause out of memory errors. Only collect small tables.

Returns: list of Struct – List of rows.
collect_by_key(name='values')

Collect values for each unique key into an array.
Note
Requires a keyed table.
Examples
>>> t1 = hl.Table.parallelize([
...     {'t': 'foo', 'x': 4, 'y': 'A'},
...     {'t': 'bar', 'x': 2, 'y': 'B'},
...     {'t': 'bar', 'x': -3, 'y': 'C'},
...     {'t': 'quam', 'x': 0, 'y': 'D'}],
...     hl.tstruct(t=hl.tstr, x=hl.tint32, y=hl.tstr),
...     key='t')

>>> t1.show()
+------+-------+-----+
| t    |     x | y   |
+------+-------+-----+
| str  | int32 | str |
+------+-------+-----+
| foo  |     4 | A   |
| bar  |     2 | B   |
| bar  |    -3 | C   |
| quam |     0 | D   |
+------+-------+-----+

>>> t1.collect_by_key().show()
+------+------------------------------------+
| t    | values                             |
+------+------------------------------------+
| str  | array<struct{x: int32, y: str}>    |
+------+------------------------------------+
| bar  | [{"x":2,"y":"B"},{"x":-3,"y":"C"}] |
| foo  | [{"x":4,"y":"A"}]                  |
| quam | [{"x":0,"y":"D"}]                  |
+------+------------------------------------+
Notes
The order of the values array is not guaranteed.
Parameters: name (str) – Field name for all values per key.
Returns: Table
distinct()

Keep only one row for each unique key.
Note
Requires a keyed table.
Examples
>>> t1 = hl.Table.parallelize([
...     {'a': 'foo', 'b': 1},
...     {'a': 'bar', 'b': 5},
...     {'a': 'bar', 'b': 2}],
...     hl.tstruct(a=hl.tstr, b=hl.tint32),
...     key='a')

>>> t1.show()
+-----+-------+
| a   |     b |
+-----+-------+
| str | int32 |
+-----+-------+
| foo |     1 |
| bar |     5 |
| bar |     2 |
+-----+-------+

>>> t1.distinct().show()
+-----+-------+
| a   |     b |
+-----+-------+
| str | int32 |
+-----+-------+
| bar |     5 |
| foo |     1 |
+-----+-------+
Notes
The row chosen per distinct key is not guaranteed.
Returns: Table
drop(*exprs)

Drop fields from the table.
Examples
Drop fields C1 and C2 using strings:
>>> table_result = table1.drop('C1', 'C2')
Drop fields C1 and C2 using field references:
>>> table_result = table1.drop(table1.C1, table1.C2)
Drop a list of fields:
>>> fields_to_drop = ['C1', 'C2']
>>> table_result = table1.drop(*fields_to_drop)
Notes
This method can be used to drop global or row-indexed fields. The arguments can be either strings ('field'), or top-level field references (table.field or table['field']).

Parameters: exprs (varargs of str or Expression) – Names of fields to drop or field reference expressions.
Returns: Table – Table without specified fields.
expand_types()

Expand complex types into structs and arrays.
Examples
>>> table_result = table1.expand_types()
Notes
Expands the following types: tlocus, tinterval, tset, tdict, ttuple.

The only types that will remain after this method are: tbool, tint32, tint64, tfloat64, tfloat32, tarray, tstruct.

Returns: Table – Expanded table.
explode(field, name=None)

Explode rows along a top-level field of the table.

Each row is copied for each element of field. The explode operation unpacks the elements in a field of type Array or Set into its own row. If an empty Array or Set is exploded, the entire row is removed from the table.

Examples
people_table is a Table with three fields: Name, Age and Children.

>>> people_table.show()
+----------+-------+--------------------------+
| Name     |   Age | Children                 |
+----------+-------+--------------------------+
| str      | int32 | array<str>               |
+----------+-------+--------------------------+
| Alice    |    34 | ["Dave","Ernie","Frank"] |
| Bob      |    51 | ["Gaby","Helen"]         |
| Caroline |    10 | []                       |
+----------+-------+--------------------------+
Table.explode() can be used to produce a distinct row for each element in the Children field:

>>> exploded = people_table.explode('Children')
>>> exploded.show()
+-------+-------+----------+
| Name  |   Age | Children |
+-------+-------+----------+
| str   | int32 | str      |
+-------+-------+----------+
| Alice |    34 | Dave     |
| Alice |    34 | Ernie    |
| Alice |    34 | Frank    |
| Bob   |    51 | Gaby     |
| Bob   |    51 | Helen    |
+-------+-------+----------+
The name parameter can be used to produce more appropriate field names:
>>> exploded = people_table.explode('Children', name='Child')
>>> exploded.show()
+-------+-------+-------+
| Name  |   Age | Child |
+-------+-------+-------+
| str   | int32 | str   |
+-------+-------+-------+
| Alice |    34 | Dave  |
| Alice |    34 | Ernie |
| Alice |    34 | Frank |
| Bob   |    51 | Gaby  |
| Bob   |    51 | Helen |
+-------+-------+-------+
Notes
Empty arrays or sets produce no rows in the resulting table. In the example above, notice that the name “Caroline” is not found in the exploded table.
Missing arrays or sets are treated as empty.
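As a sketch of this rule on the data above, hl.or_missing can make Caroline's array missing; her row is then dropped just as if the array were empty:

>>> # make Children missing where Age <= 30, then explode
>>> with_missing = people_table.annotate(
...     Children = hl.or_missing(people_table.Age > 30, people_table.Children))
>>> with_missing.explode('Children').count()
5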
Parameters: - field (
str
orExpression
) – Top-level field name or expression. - name (
str
or None) – If not None, rename the exploded field to name.
Returns: Table
– Table with exploded field.- field (
export(output, types_file=None, header=True, parallel=None)

Export to a TSV file.
Examples
Export to a tab-separated file:
>>> table1.export('output/table1.tsv.bgz')
Note

It is highly recommended to export large files with a .bgz extension, which will use a block gzipped compression codec. These files can be read natively with any Hail method, as well as with Python's gzip.open and R's read.table.

Parameters:
- output (str) – URI at which to write exported file.
- types_file (str or None) – URI at which to write file containing field type information.
- header (bool) – Include a header in the file.
- parallel (str or None) – If None, a single file is produced, otherwise a folder of file shards is produced. If 'separate_header', the header file is output separately from the file shards. If 'header_per_shard', each file shard has a header. Exporting a single file (parallel=None) is slower than parallel export.
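A sketch of reading the exported file back locally; this works because block-gzipped files are valid gzip streams:

>>> table1.export('output/table1.tsv.bgz')
>>> import gzip
>>> with gzip.open('output/table1.tsv.bgz', 'rt') as f:
...     header_fields = f.readline().rstrip('\n').split('\t')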
filter(expr, keep=True)

Filter rows.
Examples
Keep rows where C1 equals 5:

>>> table_result = table1.filter(table1.C1 == 5)
Remove rows where C1 equals 10:

>>> table_result = table1.filter(table1.C1 == 10, keep=False)
Notes
The expression expr will be evaluated for every row of the table. If keep is True, then rows where expr evaluates to False will be removed (the filter keeps the rows where the predicate evaluates to True). If keep is False, then rows where expr evaluates to True will be removed (the filter removes the rows where the predicate evaluates to True).

Warning
When expr evaluates to missing, the row will be removed regardless of keep.
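Because missing predicates drop rows under either value of keep, handle missingness explicitly when it matters, as in this sketch using hl.is_defined:

>>> # keep rows where C1 is defined and equals 5; a missing C1 never passes
>>> table_result = table1.filter(hl.is_defined(table1.C1) & (table1.C1 == 5))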
Note
This method does not support aggregation.
Parameters: - expr (bool or
BooleanExpression
) – Filter expression. - keep (bool) – Keep rows where expr is true.
Returns: Table
– Filtered table.- expr (bool or
flatten()

Flatten nested structs.
Examples
Flatten table:
>>> table_result = table1.flatten()
Notes
Consider a table with signature

    a: struct{ p: int32, q: str },
    b: int32,
    c: struct{ x: str, y: array<struct{ y: str, z: str }> }

and key a. The result of flatten is

    a.p: int32
    a.q: str
    b: int32
    c.x: str
    c.y: array<struct{ y: str, z: str }>

with key a.p, a.q.

Note, structures inside collections like arrays or sets will not be flattened.
Warning
Flattening a table will produce fields that cannot be referenced using the table.<field> syntax, e.g. "a.b". Reference these fields using square bracket lookups: table['a.b'].
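A minimal sketch of the round trip (the struct field s is created here purely for illustration):

>>> # build a nested struct, flatten it, then reference the dotted names
>>> nested = table1.select(s = hl.struct(p=table1.X, q=table1.SEX))
>>> flat = nested.flatten()
>>> flat_result = flat.select(flat['s.p'], flat['s.q'])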
Returns: Table – Table with a flat schema (no struct fields).
static from_pandas(df, key=[])

Create table from Pandas DataFrame.
Examples
>>> t = hl.Table.from_pandas(df)
Parameters: - df (
pandas.DataFrame
) – Pandas DataFrame. - key (
str
orlist
ofstr
) – Key fields.
Returns: - df (
static from_spark(df, key=None)

Convert PySpark SQL DataFrame to a table.
Examples
>>> t = Table.from_spark(df)
Notes
Spark SQL data types are converted to Hail types as follows:
    BooleanType  => tbool
    IntegerType  => tint32
    LongType     => tint64
    FloatType    => tfloat32
    DoubleType   => tfloat64
    StringType   => tstr
    BinaryType   => TBinary
    ArrayType    => tarray
    StructType   => tstruct
Unlisted Spark SQL data types are currently unsupported.
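A sketch of the round trip, assuming a live SparkSession bound to the name spark (the session is not provided by Hail itself):

>>> # build a small Spark DataFrame and convert it to a keyed Hail table
>>> spark_df = spark.createDataFrame([(1, 'cat'), (2, 'dog')], ['ID', 'B'])
>>> t = hl.Table.from_spark(spark_df, key='ID')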
Parameters: - df (
pyspark.sql.DataFrame
) – PySpark DataFrame. - key (
str
orlist
ofstr
) – Key fields.
Returns: Table
– Table constructed from the Spark SQL DataFrame.- df (
globals

Returns a struct expression including all global fields.
Examples
The data type of the globals struct:
>>> table1.globals.dtype
dtype('struct{}')
The number of global fields:
>>> len(table1.globals)
0
Returns: StructExpression – Struct of all global fields.
group_by(*exprs, **named_exprs) → GroupedTable

Group by a new key for use with GroupedTable.aggregate().

Examples
Compute the mean value of X and the sum of Z per unique ID:
>>> table_result = (table1.group_by(table1.ID)
...                 .aggregate(meanX = agg.mean(table1.X), sumZ = agg.sum(table1.Z)))
Group by a height bin and compute sex ratio per bin:
>>> table_result = (table1.group_by(height_bin = table1.HT // 20)
...                 .aggregate(fraction_female = agg.fraction(table1.SEX == 'F')))
Notes
This function is always followed by GroupedTable.aggregate(). Follow the link for documentation on the aggregation step.

Note
Using group_by

group_by and its sibling methods (MatrixTable.group_rows_by() and MatrixTable.group_cols_by()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.
The following three usages are all equivalent, producing a GroupedTable grouped by fields C1 and C2 of table1.

First, variable-length string arguments:
>>> table_result = (table1.group_by('C1', 'C2')
...                 .aggregate(meanX = agg.mean(table1.X)))
Second, field reference variable-length arguments:
>>> table_result = (table1.group_by(table1.C1, table1.C2)
...                 .aggregate(meanX = agg.mean(table1.X)))
Last, expression keyword arguments:
>>> table_result = (table1.group_by(C1 = table1.C1, C2 = table1.C2)
...                 .aggregate(meanX = agg.mean(table1.X)))
Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:
>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))
The following two usages are equivalent, grouping by one field, x:
>>> table_result = (table3.group_by(table3.s.x)
...                 .aggregate(meanX = agg.mean(table3.X)))
>>> table_result = (table3.group_by(x = table3.s.x)
...                 .aggregate(meanX = agg.mean(table3.X)))
The keyword argument syntax permits arbitrary expressions:
>>> table_result = (table1.group_by(foo=table1.X ** 2 + 1)
...                 .aggregate(meanZ = agg.mean(table1.Z)))
These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.
>>> table_result = (table1.group_by(table1.C1, 'C2', height_bin = table1.HT // 20)
...                 .aggregate(meanX = agg.mean(table1.X)))
Note
This method does not support aggregation in key expressions.
Parameters: - exprs (varargs of type str or
Expression
) – Field names or field reference expressions. - named_exprs (keyword args of type
Expression
) – Field names and expressions to compute them.
Returns: GroupedTable
– Grouped table; useGroupedTable.aggregate()
to complete the aggregation.- exprs (varargs of type str or
head(n)

Subset table to first n rows.
Examples
Subset to the first three rows:
>>> table_result = table1.head(3)
>>> table_result.count()
3
Notes
The number of partitions in the new table is equal to the number of partitions containing the first n rows.
Parameters: n (int) – Number of rows to include.
Returns: Table – Table including the first n rows.
index(*exprs)

Expose the row values as if looked up in a dictionary, indexing with exprs.
Examples
In the example below, both table1 and table2 are keyed by one field ID of type int.

>>> table_result = table1.select(B = table2.index(table1.ID).B)
>>> table_result.B.show()
+-------+--------+
|    ID | B      |
+-------+--------+
| int32 | str    |
+-------+--------+
|     1 | cat    |
|     2 | dog    |
|     3 | mouse  |
|     4 | rabbit |
+-------+--------+
Using key as the sole index expression is equivalent to passing all key fields individually:
>>> table_result = table1.select(B = table2.index(table1.key).B)
It is also possible to use non-key fields or expressions as the index expressions:
>>> table_result = table1.select(B = table2.index(table1.C1 % 4).B)
>>> table_result.show()
+-------+-------+
|    ID | B     |
+-------+-------+
| int32 | str   |
+-------+-------+
|     1 | dog   |
|     2 | dog   |
|     3 | dog   |
|     4 | mouse |
+-------+-------+
Notes

Table.index() is used to expose one table's fields for use in expressions involving another table or matrix table's fields. The result of the method call is a struct expression that is usable in the same scope as exprs, just as if exprs were used to look up values of the table in a dictionary.

The type of the struct expression is the same as the indexed table's row_value() (the key fields are removed, as they are available in the form of the index expressions).

Note

There is a shorthand syntax for Table.index() using square brackets (the Python __getitem__ syntax). This syntax is preferred.

>>> table_result = table1.select(B = table2[table1.ID].B)
Parameters: exprs (variable-length args of Expression) – Index expressions.
Returns: StructExpression
index_globals()

Return this table's global variables for use in another expression context.
Examples
>>> table_result = table2.annotate(C = table2.A * table1.index_globals().global_field_1)
Returns: StructExpression
join(right, how='inner') → Table

Join two tables together.
Examples
Join table1 to table2 to produce table_joined:
>>> table_joined = table1.key_by('ID').join(table2.key_by('ID'))
Notes
Hail supports four types of joins specified by how:
- inner – Key must be present in both the left and right tables.
- outer – Key present in either the left or the right. For keys only in the left table, the right table’s fields will be missing. For keys only in the right table, the left table’s fields will be missing.
- left – Key present in the left table. For keys not found on the right, the right table’s fields will be missing.
- right – Key present in the right table. For keys not found on the left, the left table's fields will be missing.
Both tables must have the same number of keys and the corresponding types of each key must be the same (order matters), but the key names can be different. For example, if table1 is keyed by fields ['a', 'b'], both of type int32, and table2 is keyed by fields ['c', 'd'], both of type int32, then the two tables can be joined (their rows will be joined where table1.a == table2.c and table1.b == table2.d).

The key fields and order from the left table are preserved, while the key fields from the right table are not present in the result.
Note
These join methods implement a traditional Cartesian product join, and the number of records in the resulting table can be larger than the number of records on the left or right if duplicate keys are present.
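A sketch of a left join under these semantics; rows of table1 with no match in table2 get missing values for table2's fields:

>>> # left join: every row of table1 survives, matched or not
>>> table_left = table1.key_by('ID').join(table2.key_by('ID'), how='left')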
Parameters: - right (
Table
) – Table with which to join. - how (
str
) – Join type. One of “inner”, “outer”, “left”, “right”.
Returns: Table
– Joined table.
key

Row key struct.
Examples
List of key field names:
>>> list(table1.key)
['ID']
Number of key fields:
>>> len(table1.key)
1
Returns: StructExpression
key_by(*keys, **named_keys) → Table

Key table by a new set of fields.
Examples
Assume table1 is a Table with three fields: C1, C2 and C3.

Changing key fields:
>>> table_result = table1.key_by('C2', 'C3')
This keys the table by ‘C2’ and ‘C3’, preserving old keys as value fields.
>>> table_result = table1.key_by(table1.C1)
This keys the table by ‘C1’, preserving old keys as value fields.
>>> table_result = table1.key_by(C1 = table1.C2, foo = table1.C1)
This keys the table by fields named ‘C1’ and ‘foo’, which have values corresponding to the original ‘C2’ and ‘C1’ fields respectively. The original ‘C1’ field has been overwritten by the new assignment, but the original ‘C2’ field is preserved as a value field.
Remove key:
>>> table_result = table1.key_by()
Notes
This method is used to specify all the fields of a new row key. The old key fields may be overwritten by newly-assigned fields, as described in Table.annotate(). If not overwritten, they are preserved as non-key fields.

See Table.select() for more information about how to define new key fields.

Warning
Setting the key to the empty key using
>>> table1.key_by([])
will cause the entire table to condense into a single partition.
Parameters: keys (varargs of type str) – Field(s) to key by.
Returns: Table – Table with a new key.
order_by(*exprs)

Sort by the specified fields. Unkeys the table, if keyed.
Examples
Four equivalent ways to order the table by field HT, ascending:
>>> sorted_table = table1.order_by(table1.HT)
>>> sorted_table = table1.order_by('HT')
>>> sorted_table = table1.order_by(hl.asc(table1.HT))
>>> sorted_table = table1.order_by(hl.asc('HT'))
Notes
Missing values are sorted after non-missing values. When multiple fields are passed, the table will be sorted first by the first argument, then the second, etc.
Note
This method unkeys the table.
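Ascending and descending sorts can be mixed across fields, for example descending height with ties broken by ascending ID (a sketch using hl.desc and hl.asc):

>>> sorted_table = table1.order_by(hl.desc(table1.HT), hl.asc('ID'))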
Parameters: exprs (varargs of Ascending or Descending or Expression or str) – Fields to sort by.
Returns: Table – Table sorted by the given fields.
persist(storage_level='MEMORY_AND_DISK')

Persist this table in memory or on disk.
Examples
Persist the table to both memory and disk:
>>> table = table.persist()
Notes
The Table.persist() and Table.cache() methods store the current table on disk or in memory temporarily to avoid redundant computation and improve the performance of Hail pipelines. This method is not a substitution for Table.write(), which stores a permanent file.

Most users should use the "MEMORY_AND_DISK" storage level. See the Spark documentation for a more in-depth discussion of persisting data.
Parameters: storage_level (str) – Storage level. One of: NONE, DISK_ONLY, DISK_ONLY_2, MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, OFF_HEAP
Returns: Table – Persisted table.
rename(mapping)

Rename fields of the table.
Examples
Rename C1 to col1 and C2 to col2:
>>> table_result = table1.rename({'C1' : 'col1', 'C2' : 'col2'})
Parameters: mapping (dict of str, str) – Mapping from old field names to new field names.

Notes

Any field that does not appear as a key in mapping will not be renamed.

Returns: Table – Table with renamed fields.
repartition(n, shuffle=True)

Change the number of distributed partitions.
Examples
Repartition to 10 partitions:
>>> table_result = table1.repartition(10)
Warning
When shuffle is False, repartition can only decrease the number of partitions and simply combines adjacent partitions to achieve the desired number. It does not attempt to rebalance and so can produce a heavily unbalanced dataset. An unbalanced dataset can be inefficient to operate on because the work is not evenly distributed across partitions.

Parameters:
- n (int) – Desired number of partitions.
- shuffle (bool) – If True, shuffle data. Otherwise, naively coalesce.

Returns: Table – Repartitioned table.
row

Returns a struct expression of all row-indexed fields, including keys.
Examples
The data type of the row struct:
>>> table1.row.dtype
dtype('struct{ID: int32, HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')
The number of row fields:
>>> len(table1.row)
8
Returns: StructExpression – Struct of all row fields, including key fields.
row_value

Returns a struct expression including all non-key row-indexed fields.
Examples
The data type of the row struct:
>>> table1.row_value.dtype
dtype('struct{HT: int32, SEX: str, X: int32, Z: int32, C1: int32, C2: int32, C3: int32}')
The number of row fields:
>>> len(table1.row_value)
7
Returns: StructExpression – Struct of all non-key row fields.
sample(p, seed=0)

Downsample the table by keeping each row with probability p.

Examples
Downsample the table to approximately 1% of its rows.
>>> small_table1 = table1.sample(0.01)
Parameters: - p (
float
) – Probability of keeping each row. - seed (
int
) – Random seed.
Returns: Table
– Table with approximatelyp * n_rows
rows.- p (
select(*exprs, **named_exprs) → Table

Select existing fields or create new fields by name, dropping the rest.
Examples
Select a few old fields and compute a new one:
>>> table_result = table1.select(table1.C1, Y=table1.Z - table1.X)
Notes
This method creates new row-indexed fields. If a created field shares its name with a global field of the table, the method will fail.
Note

Using select

select and its sibling methods (Table.select_globals(), MatrixTable.select_globals(), MatrixTable.select_rows(), MatrixTable.select_cols(), and MatrixTable.select_entries()) accept both variable-length (f(x, y, z)) and keyword (f(a=x, b=y, c=z)) arguments.

Select methods will always preserve the key along that axis; e.g. for Table.select(), the table key will always be kept. To modify the key, use key_by().

Variable-length arguments can be either strings or expressions that reference a (possibly nested) field of the table. Keyword arguments can be arbitrary expressions.
The following three usages are all equivalent, producing a new table with fields C1 and C2 of table1, and the table key ID.
First, variable-length string arguments:
>>> table_result = table1.select('C1', 'C2')
Second, field reference variable-length arguments:
>>> table_result = table1.select(table1.C1, table1.C2)
Last, expression keyword arguments:
>>> table_result = table1.select(C1 = table1.C1, C2 = table1.C2)
Additionally, the variable-length argument syntax also permits nested field references. Given the following struct field s:
>>> table3 = table1.annotate(s = hl.struct(x=table1.X, z=table1.Z))
The following two usages are equivalent, producing a table with one field, x:
>>> table3_result = table3.select(table3.s.x)
>>> table3_result = table3.select(x = table3.s.x)
The keyword argument syntax permits arbitrary expressions:
>>> table_result = table1.select(foo=table1.X ** 2 + 1)
These syntaxes can be mixed together, with the stipulation that all keyword arguments must come at the end due to Python language restrictions.
>>> table_result = table1.select(table1.X, 'Z', bar = [table1.C1, table1.C2])
Note
This method does not support aggregation.
Parameters: - exprs (variable-length args of
str
orExpression
) – Arguments that specify field names or nested field reference expressions. - named_exprs (keyword args of
Expression
) – Field names and the expressions to compute them.
Returns: Table
– Table with specified fields.- exprs (variable-length args of
select_globals(*exprs, **named_exprs)

Select existing global fields or create new fields by name, dropping the rest.
Examples
Select one existing field and compute a new one:
>>> table_result = table1.select_globals(table1.global_field_1,
...                                      another_global=['AFR', 'EUR', 'EAS', 'AMR', 'SAS'])
Notes
This method creates new global fields. If a created field shares its name with a row-indexed field of the table, the method will fail.
Note
See Table.select() for more information about using select methods.

Note

This method does not support aggregation.
Parameters: - exprs (variable-length args of
str
orExpression
) – Arguments that specify field names or nested field reference expressions. - named_exprs (keyword args of
Expression
) – Field names and the expressions to compute them.
Returns: Table
– Table with specified global fields.- exprs (variable-length args of
show(n=10, width=90, truncate=None, types=True)

Print the first few rows of the table to the console.
Examples
Show the first lines of the table:
>>> table1.show()
+-------+-------+-----+-------+-------+-------+-------+-------+
|    ID |    HT | SEX |     X |     Z |    C1 |    C2 |    C3 |
+-------+-------+-----+-------+-------+-------+-------+-------+
| int32 | int32 | str | int32 | int32 | int32 | int32 | int32 |
+-------+-------+-----+-------+-------+-------+-------+-------+
|     1 |    65 | M   |     5 |     4 |     2 |    50 |     5 |
|     2 |    72 | M   |     6 |     3 |     2 |    61 |     1 |
|     3 |    70 | F   |     7 |     3 |    10 |    81 |    -5 |
|     4 |    60 | F   |     8 |     2 |    11 |    90 |   -10 |
+-------+-------+-----+-------+-------+-------+-------+-------+
Parameters: - n (
int
) – Maximum number of rows to show. - width (
int
) – Horizontal width at which to break fields. - truncate (
int
, optional) – Truncate each field to the given number of characters. IfNone
, truncate fields to the given width. - types (
bool
) – Print an extra header line with the type of each field.
- n (
take(n)

Collect the first n rows of the table into a local list.
Examples
Take the first three rows:
>>> first3 = table1.take(3)
>>> print(first3)
[Struct(HT=65, SEX=M, X=5, C3=5, C2=50, C1=2, Z=4, ID=1),
 Struct(HT=72, SEX=M, X=6, C3=1, C2=61, C1=2, Z=3, ID=2),
 Struct(HT=70, SEX=F, X=7, C3=-5, C2=81, C1=10, Z=3, ID=3)]
Notes
This method does not need to look at all the data in the table, and allows for fast queries of the start of the table.
This method is equivalent to Table.head() followed by Table.collect().

Parameters: n (int) – Number of rows to take.
Returns: list of Struct – List of row structs.
to_matrix_table(row_key, col_key, row_fields=[], col_fields=[], partition_key=None, n_partitions=None)

Construct a matrix table from a table in coordinate representation.
Notes
Any row fields in the table that do not appear in one of the arguments to this method are assumed to be entry fields of the resulting matrix table.
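A sketch under these semantics; the field names row_idx, col_idx, and value are illustrative, and value becomes the single entry field:

>>> coords = hl.Table.parallelize([
...     {'row_idx': 0, 'col_idx': 0, 'value': 1.0},
...     {'row_idx': 0, 'col_idx': 1, 'value': 2.0},
...     {'row_idx': 1, 'col_idx': 0, 'value': 3.0}],
...     hl.tstruct(row_idx=hl.tint32, col_idx=hl.tint32, value=hl.tfloat64))
>>> mt = coords.to_matrix_table(row_key=['row_idx'], col_key=['col_idx'])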
Parameters:
- row_key (Sequence[str]) – Fields to be used as row key.
- col_key (Sequence[str]) – Fields to be used as column key.
- row_fields (Sequence[str]) – Fields to be stored once per row.
- col_fields (Sequence[str]) – Fields to be stored once per column.
- partition_key (Sequence[str] or None) – Fields to be used as partition key.
- n_partitions (int or None) – Number of partitions.
Returns: MatrixTable
to_pandas(flatten=True)

Converts this table to a Pandas DataFrame.
Because conversion to Pandas is done through Spark, and Spark cannot represent complex types, types are expanded before flattening or conversion.
Parameters: flatten (bool) – If True, flatten() before converting to Pandas DataFrame.
Returns: pandas.DataFrame
to_spark(flatten=True)

Converts this table to a Spark DataFrame.
Because Spark cannot represent complex types, types are expanded before flattening or conversion.
Parameters: flatten (bool) – If True, flatten() before converting to Spark DataFrame.
Returns: pyspark.sql.DataFrame
transmute(**named_exprs)

Add new fields and drop fields referenced.
Examples
Create a single field from an expression of C1, C2, and C3.
>>> table4.show()
+-------+------+-------+-------+-------+-------+-------+-------+
|     A | B.B0 | B.B1  | C     | D.cat | D.dog |   E.A |   E.B |
+-------+------+-------+-------+-------+-------+-------+-------+
| int32 | bool | str   | bool  | int32 | int32 | int32 | int32 |
+-------+------+-------+-------+-------+-------+-------+-------+
|    32 | true | hello | false |     5 |     7 |     5 |     7 |
+-------+------+-------+-------+-------+-------+-------+-------+
>>> table_result = table4.transmute(F=table4.A + 2 * table4.E.B)
>>> table_result.show()
+------+-------+-------+-------+-------+-------+
| B.B0 | B.B1  | C     | D.cat | D.dog |     F |
+------+-------+-------+-------+-------+-------+
| bool | str   | bool  | int32 | int32 | int32 |
+------+-------+-------+-------+-------+-------+
| true | hello | false |     5 |     7 |    46 |
+------+-------+-------+-------+-------+-------+
Notes
This method functions to create new row-indexed fields and consume fields found in the expressions in named_exprs.
All row-indexed top-level fields found in an expression are dropped after the new fields are created.
Note
transmute() will not drop key fields.

Warning
References to fields inside a top-level struct will remove the entire struct, as field E was removed in the example above since E.B was referenced.
Note
This method does not support aggregation.
Parameters: named_exprs (keyword args of Expression) – New field expressions.
Returns: Table – Table with transmuted fields.
transmute_globals(**named_exprs)

Similar to Table.annotate_globals(), but drops referenced fields.

Notes
This method adds new global fields according to named_exprs, and drops all global fields referenced in those expressions. See Table.transmute() for full documentation on how transmute methods work.

Parameters: named_exprs (keyword args of Expression) – Annotation expressions.
Returns: Table
union(*tables)

Union the rows of multiple tables.
Examples
Take the union of rows from two tables:
>>> union_table = table1.union(other_table)
Notes
If a row appears in both tables identically, it is duplicated in the result. The left and right tables must have the same schema and key.
Parameters: tables (varargs of Table) – Tables to union.
Returns: Table – Table with all rows from each component table.
unpersist()

Unpersists this table from memory/disk.
Notes
This function will have no effect on a table that was not previously persisted.
Returns: Table – Unpersisted table.
write(output, overwrite=False, stage_locally=False, _codec_spec=None)

Write to disk.
Examples
>>> table1.write('output/table1.ht')
Warning
Do not write to a path that is being read from in the same computation.
Parameters: - output (str) – Path at which to write.
- stage_locally (bool) – If
True
, major output will be written to temporary local storage before being copied tooutput
. - overwrite (bool) – If
True
, overwrite an existing file at the destination.
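Written tables are read back with read_table():

>>> table1.write('output/table1.ht', overwrite=True)
>>> t = hl.read_table('output/table1.ht')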