Expressions

In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://172.31.30.135:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-1a29de719de9
NOTE: This is a beta version. Interfaces may change
  during the beta period. We recommend pulling
  the latest changes weekly.

Eager Evaluation

Python and R use eager evaluation.

When you enter an expression, the result is computed immediately and stored.

In [2]:
1 + 2
Out[2]:
3

Lazy Evaluation

Eager evaluation won’t work on datasets that won’t fit in memory.

Consider the UK Biobank BGEN file, which is ~2TB but decompresses to >100TB in memory.

In order to process datasets of this size, Hail uses lazy evaluation.

When you enter an expression, Hail doesn’t execute it immediately: it simply records what you asked it to do.

In [3]:
one = hl.int32(1)
three = one + 2
three
Out[3]:
<Int32Expression of type int32>
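To make the idea of "recording what you asked" concrete, here is a minimal pure-Python sketch of lazy evaluation. The LazyExpr class and its evaluate method are hypothetical names invented for this illustration; this is not how Hail is implemented, only a toy model of the same idea.

```python
# Toy model of lazy evaluation. LazyExpr is a hypothetical class, not part of Hail.
class LazyExpr:
    def __init__(self, op, *args):
        self.op = op        # name of the operation: 'literal' or 'add'
        self.args = args    # a Python value, or the operand expressions

    def __add__(self, other):
        # Wrap plain Python values so that both operands are expressions.
        if not isinstance(other, LazyExpr):
            other = LazyExpr('literal', other)
        # Record the addition; nothing is computed here.
        return LazyExpr('add', self, other)

    def evaluate(self):
        # Walk the recorded tree and compute the result on demand.
        if self.op == 'literal':
            return self.args[0]
        left, right = self.args
        return left.evaluate() + right.evaluate()


one = LazyExpr('literal', 1)
three = one + 2             # builds a tree, does no arithmetic
result = three.evaluate()   # the addition happens only now
```

As with the Hail expression above, three is only a description of a computation until something asks for its value.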

Hail evaluates an expression only when it must, for example:

  • when performing an aggregation,
  • when calling take, collect or show,
  • when exporting or writing to disk.

Hail evaluates expressions by streaming to accommodate very large datasets.
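Streaming can be illustrated with plain Python generators. In the sketch below, read_records and squared are hypothetical stand-ins for a large data source and a transformation; they say nothing about Hail's internals, but they show how an aggregation can consume records one at a time without holding the whole dataset in memory.

```python
def read_records(n):
    # Stand-in for a large data source: yields one record at a time.
    for i in range(n):
        yield i


def squared(records):
    # Lazy transformation: each record is transformed as it streams past.
    for r in records:
        yield r * r


# The aggregation pulls records through the pipeline one by one,
# so no intermediate list of records is ever built.
total = sum(squared(read_records(1000)))
```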

You can evaluate an expression with no indices through its value attribute. The show method prints the type along with the value.

In [4]:
three.value
Out[4]:
3
In [5]:
three.show()
+--------+
| <expr> |
+--------+
|  int32 |
+--------+
|      3 |
+--------+

Indices

Expressions carry another piece of information: indices. Indices record the Table or MatrixTable to which the expression refers, and the axes over which the expression can vary.

Let’s see some examples from the 1000 Genomes dataset:

In [6]:
hl.utils.get_1kg('data/')
2018-07-05 15:23:49 Hail: INFO: 1KG files found
In [7]:
mt = hl.read_matrix_table('data/1kg.mt')
mt
Out[7]:
<hail.matrixtable.MatrixTable at 0x7fd9562761d0>

Let’s add a global field.

In [8]:
mt = mt.annotate_globals(dataset = '1kg')

And examine some fields.

In [9]:
mt.dataset.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
    []
--------------------------------------------------------
In [10]:
mt.locus.describe()
--------------------------------------------------------
Type:
    locus<GRCh37>
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
    ['row']
--------------------------------------------------------
In [11]:
mt.s.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
    ['column']
--------------------------------------------------------
In [12]:
mt.GT.describe()
--------------------------------------------------------
Type:
    call
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7fd9886e8278>
Index:
    ['column', 'row']
--------------------------------------------------------

Expressions like locus, s, and GT above do not have a single value; their values vary across the rows or columns of mt.

Global fields don’t vary across rows or columns, so they have a value:

In [13]:
mt.dataset.value
Out[13]:
'1kg'
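One way to picture indices is as the set of axes over which a value varies. The sketch below is a toy model only: IndexedExpr, combine, and has_single_value are hypothetical names, not Hail classes or methods. It assumes that combining two expressions yields a result that varies over every axis of either operand, and that only an expression with no axes has a single value.

```python
# Toy model of index tracking. IndexedExpr is hypothetical, not a Hail class.
class IndexedExpr:
    def __init__(self, name, axes):
        self.name = name
        self.axes = frozenset(axes)  # axes over which the value varies

    def combine(self, other):
        # The result varies over every axis that either operand varies over.
        return IndexedExpr(f'({self.name}, {other.name})', self.axes | other.axes)

    def has_single_value(self):
        # Only an expression with no axes has exactly one value.
        return not self.axes


dataset = IndexedExpr('dataset', [])    # like a global field
locus = IndexedExpr('locus', ['row'])   # like a row field
s = IndexedExpr('s', ['column'])        # like a column field

# Combining a row field with a column field gives one value per entry.
per_entry = locus.combine(s)
```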

show, take, and collect

Although expressions with indices have no value, you can use show to print the first few values, or take and collect to localize values to Python.

In [14]:
mt.s.show()
+---------+
| s       |
+---------+
| str     |
+---------+
| HG00096 |
| HG00099 |
| HG00105 |
| HG00118 |
| HG00129 |
| HG00148 |
| HG00177 |
| HG00182 |
| HG00242 |
| HG00254 |
+---------+
showing top 10 rows

In [15]:
mt.s.take(5)
Out[15]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']

You can collect an expression to localize all of its values, for example to get a list of every sample ID in a dataset.

But be careful – don’t collect more data than can fit in memory!

In [16]:
all_sample_ids = mt.s.collect()
all_sample_ids[:5]
Out[16]:
['HG00096', 'HG00099', 'HG00105', 'HG00118', 'HG00129']
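The difference in memory cost between taking a few values and collecting all of them can be sketched in plain Python. Here sample_ids is a hypothetical generator with made-up IDs that stands in for a large column of values; the sketch illustrates the trade-off and is not a description of how Hail implements take or collect.

```python
from itertools import islice


def sample_ids(n):
    # Stand-in for a large column of sample IDs (made-up values).
    for i in range(n):
        yield f'SAMPLE{i:05d}'


# take-like: stops after 5 values, whatever the size of the source.
first_five = list(islice(sample_ids(10**9), 5))

# collect-like: materializes every value, so memory grows with the source.
everything = list(sample_ids(2500))
```

The take-like call returns immediately even though the source is nominally a billion items long, while the collect-like call has to build the full list.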

Learning more

Hail has a suite of functions to transform and build expressions.

Also, see the documentation for the expressions themselves.