Expressions¶
In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://172.31.30.135:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-1a29de719de9
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
Eager Evaluation¶
Python and R use eager evaluation.
When you enter an expression, the result is computed immediately and stored.
In [2]:
1 + 2
Out[2]:
3
Lazy Evaluation¶
Eager evaluation won’t work on datasets that won’t fit in memory.
Consider the UK Biobank BGEN file, which is ~2TB but decompresses to >100TB in memory.
In order to process datasets of this size, Hail uses lazy evaluation.
When you enter an expression, Hail doesn’t execute the expression immediately: it simply records what you asked to do.
In [3]:
one = hl.int32(1)
three = one + 2
three
Out[3]:
<Int32Expression of type int32>
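To make the idea of "recording what you asked to do" concrete, here is a minimal sketch in plain Python. It does not use Hail and is not how Hail is implemented; the LazyInt class and the literal helper are hypothetical names invented for this illustration.

```python
class LazyInt:
    """Toy stand-in for a lazy expression: stores a recipe, not a result."""

    def __init__(self, compute, description):
        self._compute = compute          # zero-argument function, run only on demand
        self.description = description   # human-readable record of the request

    def __add__(self, other):
        # Wrap plain ints so that both operands are LazyInt objects.
        other = other if isinstance(other, LazyInt) else literal(other)
        return LazyInt(
            lambda: self._compute() + other._compute(),
            f"({self.description} + {other.description})",
        )

    def __repr__(self):
        return "<LazyInt (not yet evaluated)>"

    def evaluate(self):
        # The arithmetic happens here, and only here.
        return self._compute()


def literal(n):
    """Hypothetical helper: lift a Python int into a LazyInt."""
    return LazyInt(lambda: n, str(n))


one = literal(1)
three = one + 2
print(repr(three))         # <LazyInt (not yet evaluated)>
print(three.description)   # (1 + 2)
print(three.evaluate())    # 3
```

As with the Hail expression above, building three performs no addition; the sum is computed only when evaluate is called.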
Hail evaluates an expression only when it must, for example:
- when performing an aggregation,
- when calling take, collect or show,
- when exporting or writing to disk.
Hail evaluates expressions by streaming to accommodate very large datasets.
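The general idea behind streaming can be sketched with a Python generator: records are produced and consumed one at a time, so memory use stays constant regardless of how many records there are. This is an illustration of the concept only, not Hail's execution engine; simulated_records and streaming_mean are hypothetical names, and the data is made up.

```python
def simulated_records(n):
    """Yield n made-up records one at a time instead of building a list."""
    for i in range(n):
        yield {"id": i, "depth": i % 7}


def streaming_mean(records, field):
    """Compute a mean in one pass, holding only a running sum and count."""
    total = 0
    count = 0
    for record in records:
        total += record[field]
        count += 1
    return total / count if count else float("nan")


# Only one record exists in memory at any moment during this call.
mean_depth = streaming_mean(simulated_records(70_000), "depth")
print(mean_depth)  # 3.0
```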
You can evaluate expressions with no index by calling value. The show method also prints the type.
In [4]:
three.value
Out[4]:
3
In [5]:
three.show()
+--------+
| <expr> |
+--------+
| int32 |
+--------+
| 3 |
+--------+
Indices¶
Expressions carry another piece of information: indices. Indices record the Table or MatrixTable to which the expression refers, and the axes over which the expression can vary.
Let’s see some examples from the 1000 Genomes dataset:
In [6]:
hl.utils.get_1kg('data/')
2018-07-05 15:23:49 Hail: INFO: 1KG files found
In [7]:
mt = hl.read_matrix_table('data/1kg.mt')
mt
Out[7]:
<hail.matrixtable.MatrixTable at 0x7fd9562761d0>
Let’s add a global field.
In [8]:
mt = mt.annotate_globals(dataset = '1kg')
And examine some fields.
In [9]:
mt.dataset.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
[]
--------------------------------------------------------
In [10]:
mt.locus.describe()
--------------------------------------------------------
Type:
locus<GRCh37>
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
['row']
--------------------------------------------------------
In [11]:
mt.s.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
['column']
--------------------------------------------------------
In [12]:
mt.GT.describe()
--------------------------------------------------------
Type:
call
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
['column', 'row']
--------------------------------------------------------
Expressions like locus, s, and GT above have no single value; instead, their values vary across the rows or columns of mt.
Global fields don’t vary across rows or columns, so they have a value:
In [13]:
mt.dataset.value
Out[13]:
'1kg'
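One way to picture indices is as a set of axes attached to each expression. The sketch below is a toy model in plain Python, not Hail's implementation: IndexedExpr, combine, and has_single_value are hypothetical names, and the rule that a combined expression varies over the union of its inputs' axes is an assumption made for illustration.

```python
class IndexedExpr:
    """Toy expression that tracks only the axes it varies over."""

    def __init__(self, name, axes=()):
        self.name = name
        self.axes = frozenset(axes)

    def combine(self, other):
        # Assumption: the result varies over every axis either input varies over.
        return IndexedExpr(f"f({self.name}, {other.name})", self.axes | other.axes)

    def has_single_value(self):
        # Only an expression with no axes can be reduced to one value.
        return not self.axes


dataset = IndexedExpr("dataset")            # like a global field: Index []
locus = IndexedExpr("locus", ["row"])       # Index ['row']
s = IndexedExpr("s", ["column"])            # Index ['column']

combined = locus.combine(s)
print(sorted(combined.axes))                # ['column', 'row']
print(dataset.has_single_value())           # True
print(combined.has_single_value())          # False
```

In this model, an empty axis set plays the role of the empty Index shown for mt.dataset, which is why only that expression can be reduced to one value.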
show, take, and collect¶
Although expressions with indices have no value, you can use show to print the first few values, or take and collect to localize values to Python.
In [14]:
mt.s.show()
+---------+
| s |
+---------+
| str |
+---------+
| HG00096 |
| HG00099 |
| HG00105 |
| HG00118 |
| HG00129 |
| HG00148 |
| HG00177 |
| HG00182 |
| HG00242 |
| HG00254 |
+---------+
showing top 10 rows
In [15]:
mt.s.take(5)
Out[15]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']
You can collect an expression to localize all values, like getting a list of all sample IDs of a dataset.
But be careful – don’t collect more data than can fit in memory!
In [16]:
all_sample_ids = mt.s.collect()
all_sample_ids[:5]
Out[16]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']
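The difference in memory cost between taking a few values and collecting all of them can be sketched in plain Python with a generator. This is an analogy rather than Hail code: sample_ids, take, and collect below are hypothetical stand-ins, and the IDs are made up.

```python
from itertools import islice


def sample_ids(n):
    """Lazily yield n made-up sample IDs (not real 1000 Genomes IDs)."""
    for i in range(n):
        yield f"SAMPLE{i:05d}"


def take(values, n):
    # Pull only the first n items; the rest of the stream is never produced.
    return list(islice(values, n))


def collect(values):
    # Materialize every item; memory use grows with the size of the stream.
    return list(values)


first_five = take(sample_ids(1_000_000), 5)
everything = collect(sample_ids(1_000))
print(first_five)
print(len(everything))  # 1000
```

Here take stops after five items even though the stream could produce a million, while collect holds every item in a Python list at once.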
Learning more¶
Hail has a suite of functions to transform and build expressions.
Also, see the documentation for the expressions themselves.