Filtering and Annotation Tutorial¶
Filter¶
You can filter the rows of a table with Table.filter. This returns a new table containing only those rows for which the expression evaluates to True.
In [1]:
import hail as hl
import seaborn
hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Initializing Spark and Hail with default parameters...
Running on Apache Spark version 2.2.0
SparkUI available at http://10.56.40.9:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version devel-90a5cab4aab8
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181015-1344-devel-90a5cab4aab8.log
2018-10-15 13:44:12 Hail: INFO: Movie Lens files found!
In [2]:
users.filter(users.occupation == 'programmer').count()
Out[2]:
66
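Filter expressions can also combine several conditions. As a quick sketch (not executed in this notebook), wrap each comparison in parentheses and join them with &:

# A sketch, not run above: keep only programmers under 30.
users.filter((users.occupation == 'programmer') & (users.age < 30)).count()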
Annotate¶
You can add new fields to a table with annotate. Below, we aggregate summary statistics of the age field (they will come in handy for the exercises) and add a cleaned_occupation field that replaces the placeholder occupations 'other' and 'none' with missing values.
In [3]:
stats = users.aggregate(hl.agg.stats(users.age))
missing_occupations = hl.set(['other', 'none'])
t = users.annotate(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null('str'),
                                 users.occupation))
t.show()
+-------+-------+-----+-----------------+---------+--------------------+
| id | age | sex | occupation | zipcode | cleaned_occupation |
+-------+-------+-----+-----------------+---------+--------------------+
| int32 | int32 | str | str | str | str |
+-------+-------+-----+-----------------+---------+--------------------+
| 1 | 24 | "M" | "technician" | "85711" | "technician" |
| 2 | 53 | "F" | "other" | "94043" | NA |
| 3 | 23 | "M" | "writer" | "32067" | "writer" |
| 4 | 24 | "M" | "technician" | "43537" | "technician" |
| 5 | 33 | "F" | "other" | "15213" | NA |
| 6 | 42 | "M" | "executive" | "98101" | "executive" |
| 7 | 57 | "M" | "administrator" | "91344" | "administrator" |
| 8 | 36 | "M" | "administrator" | "05201" | "administrator" |
| 9 | 29 | "M" | "student" | "01002" | "student" |
| 10 | 53 | "M" | "lawyer" | "90703" | "lawyer" |
+-------+-------+-----+-----------------+---------+--------------------+
showing top 10 rows
Note: annotate is functional: it doesn’t mutate users, but returns a new table. This is also true of filter. In fact, all operations in Hail are functional.
In [4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
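The describe output above shows that users still has all of its original fields. As a further sketch (not run in this notebook), you could also confirm that filtering leaves the original table's rows untouched:

# A sketch, not executed above: filter returns a new table,
# so `users` keeps all of its rows.
programmers = users.filter(users.occupation == 'programmer')
print(programmers.count(), users.count())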
There are two other annotate methods: select and transmute. select returns a table with the key and an entirely new set of value fields. transmute replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. transmute is useful for transforming data into a new form. How about some examples?
In [5]:
(users.select(len_occupation = hl.len(users.occupation))
 .describe())
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'len_occupation': int32
----------------------------------------
Key: ['id']
----------------------------------------
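As an additional sketch (not part of the original notebook), select can also keep existing fields by passing them positionally alongside new computed fields; the is_student name below is just illustrative:

# A sketch, not run above: keep `age` and add a computed boolean field.
(users.select(users.age, is_student = users.occupation == 'student')
 .describe())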
In [6]:
(users.transmute(
    cleaned_occupation = hl.cond(missing_occupations.contains(users.occupation),
                                 hl.null(hl.tstr),
                                 users.occupation))
 .describe())
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'zipcode': str
'cleaned_occupation': str
----------------------------------------
Key: ['id']
----------------------------------------
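To see how transmute drops the fields it mentions, here is a small sketch (not run in this notebook) that replaces age with a derived field while leaving everything else alone; age_months is just an illustrative name:

# A sketch, not executed above: `age` is mentioned on the right-hand side,
# so it is dropped and replaced by `age_months`.
users.transmute(age_months = users.age * 12).describe()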
Finally, you can add global fields with annotate_globals. Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps.
In [7]:
t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))
t.describe()
----------------------------------------
Global fields:
'cohort': int32
'cloudable': set<str>
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
In [8]:
t.cloudable
Out[8]:
<SetExpression of type set<str>>
In [9]:
hl.eval(t.cloudable)
Out[9]:
{'sample1', 'sample10', 'sample15'}
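Globals can also be referenced inside row-level expressions. As a sketch (not run in this notebook), you could keep only the rows whose id, converted to a string, appears in the global set; with this toy set, of course, the resulting count would be zero:

# A sketch, not executed above: use a global set inside a row filter.
t.filter(t.cloudable.contains(hl.str(t.id))).count()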
Exercises¶
- Z-score normalize the age field of users.
- Convert zipcode to an integer. Hint: not all zipcodes are US zipcodes! Use hl.int32 to convert a string to an integer, and StringExpression.matches to see if a string matches a regular expression.

A possible solution sketch follows.
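If you get stuck, here is one possible sketch (not part of the original tutorial); it reuses the stats aggregate computed earlier and leaves non-5-digit zipcodes missing:

# A possible sketch, assuming `stats` from In [3] is still in scope.

# 1. Z-score normalize the age field.
normalized = users.annotate(age_z = (users.age - stats.mean) / stats.stdev)

# 2. Convert zipcode to an integer, leaving non-numeric codes missing.
with_zip = users.annotate(
    zip_int = hl.cond(users.zipcode.matches('^[0-9]{5}$'),
                      hl.int32(users.zipcode),
                      hl.null(hl.tint32)))
with_zip.describe()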