Aggregation¶
In the last section, we inspected the structure of the data and displayed a few example values.
How do we get a deeper feel for the data? One of the most natural things to do is to create a summary of a large number of values. For example, you could ask:
- How many women are in the dataset? How many men?
- What is the average age? Youngest age? Oldest age?
- What are all the occupations that appear, and how many times does each appear?
We can answer these questions with aggregation. Aggregation combines many values together to create a summary.
To start, we’ll aggregate all the values in a table. (Later, we’ll learn how to aggregate over subsets.)
We can do this with the Table.aggregate method.
A call to aggregate
has two parts:
- The expression to aggregate over (e.g. a field of a
Table
). - The aggregator to combine the values into the summary.
Hail has a large suite of aggregators for summarizing data. Let’s see some in action!
count
¶
Aggregators live in the hl.agg
module. The simplest aggregator is
count.
It takes no arguments and returns the number of values aggregated.
In [1]:
import hail as hl
from bokeh.io import output_notebook,show
import seaborn
output_notebook()
seaborn.set()
hl.init()
hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Running on Apache Spark version 2.2.0
SparkUI available at http://172.31.30.135:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version devel-1a29de719de9
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
2018-07-05 15:24:39 Hail: INFO: Movie Lens files found!
In [2]:
users.aggregate(hl.agg.count())
Out[2]:
943
In [3]:
users.count()
Out[3]:
943
stats
¶
stats
computes useful statistics about a numeric expression at once. There are
also aggregators for mean
, min
, max
, sum
, product
and array_sum
.
In [4]:
users.show()
+-------+-------+-----+---------------+---------+
| id | age | sex | occupation | zipcode |
+-------+-------+-----+---------------+---------+
| int32 | int32 | str | str | str |
+-------+-------+-----+---------------+---------+
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 05201 |
| 9 | 29 | M | student | 01002 |
| 10 | 53 | M | lawyer | 90703 |
+-------+-------+-----+---------------+---------+
showing top 10 rows
In [5]:
users.aggregate(hl.agg.stats(users.age))
Out[5]:
Struct(mean=34.05196182396604, stdev=12.186273150937206, min=7.0, max=73.0, n=943, sum=32110.999999999975)
counter
¶
What about non-numeric data, like the occupation
field?
counter is modeled on the Python Counter object: it counts the number of times each distinct value occurs in the collection of values being aggregated.
In [6]:
users.aggregate(hl.agg.counter(users.occupation))
Out[6]:
{'administrator': 79,
'artist': 28,
'doctor': 7,
'educator': 95,
'engineer': 67,
'entertainment': 18,
'executive': 32,
'healthcare': 16,
'homemaker': 7,
'lawyer': 12,
'librarian': 51,
'marketing': 26,
'none': 9,
'other': 105,
'programmer': 66,
'retired': 14,
'salesman': 12,
'scientist': 31,
'student': 196,
'technician': 27,
'writer': 45}
filter
¶
You can filter elements of a collection before aggregation by using filter.
In [7]:
users.aggregate(hl.agg.count(hl.agg.filter(users.sex == 'M', users.sex)))
Out[7]:
670
The argument to filter
can be a Boolean expression (like above) or a
Python function that maps the values in the collection to a Boolean
(True
if the value should be kept). This mirrors the interface to
the builtin Python filter
function.
In [8]:
users.aggregate(
hl.agg.count(
hl.agg.filter(lambda sex: sex == 'M', users.sex)))
Out[8]:
670
However, to filter and count as above, it’s much simpler to use the count_where aggregator!
In [9]:
users.aggregate(hl.agg.count_where(users.sex == 'M'))
Out[9]:
670
hist
¶
As we saw in the first tutorial, hist can be used to build a histogram over numeric data.
In [10]:
hist = users.aggregate(hl.agg.hist(users.age, 10, 70, 60))
hist
Out[10]:
Struct(bin_edges=[10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0], bin_freq=[1, 1, 0, 5, 3, 6, 5, 14, 18, 23, 32, 27, 37, 28, 33, 38, 34, 35, 36, 32, 39, 25, 28, 26, 17, 27, 21, 19, 17, 22, 21, 10, 21, 13, 23, 15, 12, 14, 20, 19, 20, 20, 6, 12, 4, 11, 6, 9, 3, 3, 9, 3, 2, 3, 2, 3, 1, 0, 2, 5], n_smaller=1, n_larger=1)
In [11]:
p = hl.plot.histogram(hist, legend='Age')
show(p)
take
and collect
¶
There are a few aggregators for collecting values.
take
localizes a few values into an array. It has an optionalordering
.collect
localizes all values into an array.collect_as_set
localizes all unique values into a set.
In [12]:
users.aggregate(hl.agg.take(users.occupation, 5))
Out[12]:
['technician', 'other', 'writer', 'technician', 'other']
In [13]:
users.aggregate(hl.agg.take(users.age, 5, ordering=-users.age))
Out[13]:
[73, 70, 70, 70, 69]
Warning! Aggregators like collect
and counter
return Python
objects and can fail with out of memory errors if you apply them to
collections that are too large (e.g. all 50 trillion genotypes in the UK
Biobank dataset).