Expression Tutorial¶
This tutorial covers data representation with Hail’s expression classes. We will go over Hail’s data types and the expressions that represent them, as well as a few features of expressions, such as lazy evaluation and missingness. We will also cover how expressions can refer to fields in a table or matrix table.
As you are working through the tutorial, you can also check out the expression API for documentation on specific expressions and their methods, or the expression page in the Hailpedia for more information on expressions.
Start by importing the Hail module, which we typically abbreviate as hl, and initializing Hail and Spark with hl.init:
In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.56.40.9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-90a5cab4aab8
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181015-1343-devel-90a5cab4aab8.log
Hail’s Data Types¶
Each object in Python has a data type, which can be accessed with Python’s built-in type function. Here is a Python string, which has type str.
In [2]:
type("Python")
Out[2]:
str
Hail has its own data types for representing data. Here is a Hail string, which we construct with the hl.str function. We can access the string’s Hail type with the dtype attribute.
In [3]:
hl.str("Hail").dtype
Out[3]:
dtype('str')
Hail has primitive and container types, as well as a few types specific to the field of genetics.
- primitive types: int32, int64, float32, float64, bool, str
- container types: arrays, sets, dicts, tuples, structs, intervals
- genetics types: locus, call
Each of these types has its own constructor function, which returns an expression:
In [4]:
hl.str("Hail")
Out[4]:
<StringExpression of type str>
What is an Expression?¶
Data types in Hail are represented by expression classes. Each data type has its own expression class. For example, an integer of type tint32 is represented by an Int32Expression.
We can construct an integer expression in Hail with the int32 function.
In [5]:
hl.int32(3)
Out[5]:
<Int32Expression of type int32>
To automatically impute the type when converting a Python object to a Hail expression, use the hl.literal function. Let’s try it out on a Python list.
In [6]:
hl.literal(['a', 'b', 'c'])
Out[6]:
<ArrayExpression of type array<str>>
The Python list is converted to an ArrayExpression of type array<str>, that is, an array of strings.
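As a rough mental model of type imputation, the plain-Python sketch below maps a few Python objects to Hail-style type names. It is not Hail's implementation: impute_type is a hypothetical helper, integers are assumed to fit in 32 bits, and containers are assumed to be non-empty and homogeneous.

```python
def impute_type(obj):
    """Return a Hail-style type name for a small set of Python objects.

    Hypothetical helper for illustration only; it handles just a few
    types and inspects a single element of each container.
    """
    if isinstance(obj, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(obj, int):   # assumption: the value fits in 32 bits
        return "int32"
    if isinstance(obj, float):
        return "float64"
    if isinstance(obj, str):
        return "str"
    if isinstance(obj, list):
        return f"array<{impute_type(obj[0])}>"
    if isinstance(obj, set):
        return f"set<{impute_type(next(iter(obj)))}>"
    if isinstance(obj, dict):
        key, value = next(iter(obj.items()))
        return f"dict<{impute_type(key)}, {impute_type(value)}>"
    raise TypeError(f"unsupported object: {obj!r}")


list_type = impute_type(['a', 'b', 'c'])
dict_type = impute_type({'a': 1.5})
print(list_type)
print(dict_type)
```

The element type is found by recursing into the container, which is why a list of strings is reported as an array of strings rather than as a bare array.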
Expressions are Lazy¶
In languages like Python and R, expressions are evaluated and stored immediately. This is called eager evaluation.
In [7]:
1 + 2
Out[7]:
3
Eager evaluation won’t work on datasets that won’t fit in memory. Consider the UK Biobank BGEN file, which is ~2TB but decompresses to >100TB in memory.
In order to process datasets of this size, Hail uses lazy evaluation. When you enter an expression, Hail doesn’t execute the expression immediately; it only records what you asked to do.
In [8]:
one = hl.int32(1)
three = one + 2
three
Out[8]:
<Int32Expression of type int32>
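The idea of recording work instead of performing it can be illustrated in plain Python. The following is a toy sketch, not Hail's implementation: the class LazyInt and its evaluate method are hypothetical names chosen for illustration. Each addition builds a node that remembers its operands, and nothing is computed until evaluate is called.

```python
class LazyInt:
    """Toy lazy integer: records an addition instead of performing it.

    Hypothetical illustration only; this is not how Hail is implemented.
    """

    def __init__(self, value=None, left=None, right=None):
        self.value = value  # set for leaf nodes (plain constants)
        self.left = left    # set for addition nodes
        self.right = right

    def __add__(self, other):
        # Wrap plain Python ints so both operands are LazyInt nodes.
        if not isinstance(other, LazyInt):
            other = LazyInt(value=other)
        # Build a new node; no arithmetic happens here.
        return LazyInt(left=self, right=other)

    def evaluate(self):
        # Walk the recorded tree and compute the result on demand.
        if self.left is None:
            return self.value
        return self.left.evaluate() + self.right.evaluate()


one = LazyInt(value=1)
three = one + 2            # only records "1 + 2"
result = three.evaluate()  # the addition runs here
print(result)
```

Because the whole computation is available as a tree before anything runs, a system built this way can decide how and when to execute it, which is what makes very large datasets tractable.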
Hail evaluates an expression only when it must. For example:
- when performing an aggregation
- when calling the methods take, collect, and show
- when exporting or writing to disk
Hail evaluates expressions by streaming to accommodate very large datasets.
If you want to force the evaluation of an expression, you can do so by passing it to hl.eval. Note that this can only be done on an expression with no index, such as hl.int32(1) + 2. If the expression has an index, e.g. table.idx + 1, then hl.eval will fail. The section on indices below explains this concept further.
In [9]:
hl.eval(three)
Out[9]:
3
The show method can also be used to evaluate and display the expression.
In [10]:
three.show()
+--------+
| <expr> |
+--------+
| int32 |
+--------+
| 3 |
+--------+
Missing data¶
All expressions in Hail can represent missing data. Hail has a collection of primitive operations for dealing with missingness.
The null constructor can be used to create a missing expression of a specific type, such as a missing string:
In [11]:
missing_string = hl.null(hl.tstr)
Use is_defined or is_missing to test an expression for missingness.
In [12]:
hl.eval(hl.is_defined(missing_string))
Out[12]:
False
In [13]:
hl.eval(hl.is_missing(missing_string))
Out[13]:
True
Expressions handle missingness in the following ways:
- a missing value plus another value is always missing
- a conditional statement with a missing predicate is missing
- when aggregating a sum of values, the missing values are ignored
This is different from Python’s treatment of missingness, where None + 5 would produce an error. In Hail, hl.null(hl.tint32) + 5 produces a missing result, not an error.
In [14]:
hl.eval(hl.is_missing(hl.null(hl.tint32) + 5))
Out[14]:
True
Here are a few more examples to illustrate how missingness is treated in Hail:
Missingness is ignored in a summation:
In [15]:
hl.eval(hl.sum(hl.array([1, 2, hl.null(hl.tint32)])))
Out[15]:
3
or_missing takes a predicate and a value. If the predicate is True, it returns the value; otherwise, it returns a missing value.
In [16]:
x = hl.int32(5)
hl.eval(hl.or_missing(x>0, x))
Out[16]:
5
In [17]:
print(hl.eval(hl.or_missing(x>10, x)))
None
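The three rules listed above can be mimicked in plain Python by using None for a missing value. This is only a sketch of the semantics, not Hail code: add, or_missing, and sum_ignoring_missing are hypothetical stand-ins written for illustration.

```python
def add(a, b):
    """Addition where a missing operand (None) makes the result missing."""
    if a is None or b is None:
        return None
    return a + b


def or_missing(predicate, value):
    """Return value if predicate is True; otherwise return missing (None).

    A missing predicate (None) also yields a missing result.
    """
    return value if predicate is True else None


def sum_ignoring_missing(values):
    """Sum the defined values, skipping missing ones."""
    return sum(v for v in values if v is not None)


missing_plus_five = add(None, 5)
total = sum_ignoring_missing([1, 2, None])
kept = or_missing(5 > 0, 5)
dropped = or_missing(5 > 10, 5)
print(missing_plus_five, total, kept, dropped)
```

Note the asymmetry: element-wise arithmetic propagates missingness, while the summation skips it.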
Indices¶
Expressions carry another piece of information: indices. Indices record the Table or MatrixTable to which the expression refers, and the axes over which the expression can vary.
Let’s see some examples from the 1000 genomes dataset:
In [18]:
hl.utils.get_1kg('data/')
2018-10-15 13:43:28 Hail: INFO: 1KG files found
In [19]:
mt = hl.read_matrix_table('data/1kg.mt')
mt
Out[19]:
<hail.matrixtable.MatrixTable at 0x7f394d653710>
Let’s add a global field.
In [20]:
mt = mt.annotate_globals(dataset = '1kg')
We can examine any field of the matrix table with the describe method. If we examine the field we just added, notice that it has no indices, because it is a global field.
In [21]:
mt.dataset.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f394d5f17b8>
Index:
[]
--------------------------------------------------------
The locus field is a row field, so it will be indexed by row.
In [22]:
mt.locus.describe()
--------------------------------------------------------
Type:
locus<GRCh37>
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f394d5f17b8>
Index:
['row']
--------------------------------------------------------
Likewise, a column field s will be indexed by column.
In [23]:
mt.s.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f394d5f17b8>
Index:
['column']
--------------------------------------------------------
And finally, an entry field GT will be indexed by both the row and column.
In [24]:
mt.GT.describe()
--------------------------------------------------------
Type:
call
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f394d5f17b8>
Index:
['column', 'row']
--------------------------------------------------------
Expressions like locus, s, and GT above do not have a single value, but rather a value that varies across rows or columns of mt. Therefore, calling the hl.eval function with these expressions will lead to an error.
Global fields don’t vary across rows or columns, so they can be directly evaluated:
In [25]:
hl.eval(mt.dataset)
Out[25]:
'1kg'
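One way to picture indices is as a set of axes attached to each expression. The sketch below is a plain-Python toy model under that assumption, not Hail's internal representation: IndexedExpr, combine, and can_eval are hypothetical names. Combining two expressions yields one that varies over every axis of either operand, and only an expression with no axes has a single value that could be evaluated directly.

```python
class IndexedExpr:
    """Toy expression that tracks which axes it varies over.

    Hypothetical illustration only; not Hail's internal representation.
    """

    def __init__(self, name, axes=()):
        self.name = name
        self.axes = frozenset(axes)

    def combine(self, other):
        # The result varies over every axis that either operand varies over.
        return IndexedExpr(f"({self.name}, {other.name})", self.axes | other.axes)

    def can_eval(self):
        # Only an expression with no axes has a single value.
        return len(self.axes) == 0


dataset = IndexedExpr("dataset")       # global field: no axes
locus = IndexedExpr("locus", ["row"])  # row field
s = IndexedExpr("s", ["column"])       # column field
per_entry = locus.combine(s)           # varies by row and by column

print(sorted(per_entry.axes))
print(dataset.can_eval(), locus.can_eval())
```

In this model an entry field such as GT corresponds to an expression carrying both the row and column axes.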
show, take, and collect¶
Although expressions with indices do not have a single realizable value (calling hl.eval will fail), you can use show to print the first few values, take to localize the first few values into a Python list, or collect to localize all of them.
show and take grab the first 10 rows by default, but you can specify a number of rows to grab.
In [26]:
mt.s.show()
+-----------+
| s |
+-----------+
| str |
+-----------+
| "HG00096" |
| "HG00099" |
| "HG00105" |
| "HG00118" |
| "HG00129" |
| "HG00148" |
| "HG00177" |
| "HG00182" |
| "HG00242" |
| "HG00254" |
+-----------+
showing top 10 rows
In [27]:
mt.s.take(5)
Out[27]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']
You can collect an expression to localize all values, for example to get a list of all sample IDs in a dataset.
But be careful: don’t collect more data than can fit in memory!
In [28]:
all_sample_ids = mt.s.collect()
all_sample_ids[:5]
Out[28]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']
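The difference between take and collect can be pictured with a plain-Python generator. This is an analogy rather than Hail code: sample_ids is a hypothetical stream of made-up IDs, itertools.islice plays the role of take, and list plays the role of collect.

```python
from itertools import islice


def sample_ids():
    """Hypothetical stream of sample IDs, produced one at a time."""
    for i in range(100_000):
        yield f"S{i:07d}"


# take-style access: stop after the first five values.
first_five = list(islice(sample_ids(), 5))

# collect-style access: materialize every value in local memory.
everything = list(sample_ids())

print(first_five)
print(len(everything))
```

The take-style call touches only five values, while the collect-style call must hold every value in memory at once, which is why collecting a very large field can exhaust local memory.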
Learning more¶
Hail has a suite of functions to transform and build expressions.
For further documentation on expressions, see the expression API and the expression page.