Table Tutorial¶
Table is Hail’s
distributed analogue of a data frame or SQL table. It will be familiar
if you’ve used R or pandas
, but Table
differs in 3 important
ways:
- It is distributed. Hail tables can store far more data than can fit on a single computer.
- It carries global fields.
- It is keyed.
A Table
has two different kinds of fields:
- global fields
- row fields
Importing and Reading¶
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
You can read a dataset with
hl.read_table.
It take a path and returns a Table
. ht
stands for Hail Table.
We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.
In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.56.40.9:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version devel-90a5cab4aab8
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181015-1343-devel-90a5cab4aab8.log
In [2]:
hl.utils.get_movie_lens('data/')
2018-10-15 13:43:41 Hail: INFO: downloading MovieLens-100k data ...
Source: http://files.grouplens.org/datasets/movielens/ml-100k.zip
2018-10-15 13:43:42 Hail: INFO: importing users table and writing to data/users.ht ...
2018-10-15 13:43:43 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:44 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'int32' (imputed)
Loading column 'f2' as type 'str' (imputed)
Loading column 'f3' as type 'str' (imputed)
Loading column 'f4' as type 'str' (imputed)
2018-10-15 13:43:44 Hail: INFO: Coerced sorted dataset
2018-10-15 13:43:45 Hail: INFO: wrote 943 items in 1 partition
2018-10-15 13:43:45 Hail: INFO: importing movies table and writing to data/movies.ht ...
2018-10-15 13:43:45 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:45 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'str' (imputed)
Loading column 'f2' as type 'str' (imputed)
Loading column 'f3' as type 'str' (imputed)
Loading column 'f4' as type 'str' (imputed)
Loading column 'f5' as type 'int32' (imputed)
Loading column 'f6' as type 'int32' (imputed)
Loading column 'f7' as type 'int32' (imputed)
Loading column 'f8' as type 'int32' (imputed)
Loading column 'f9' as type 'int32' (imputed)
Loading column 'f10' as type 'int32' (imputed)
Loading column 'f11' as type 'int32' (imputed)
Loading column 'f12' as type 'int32' (imputed)
Loading column 'f13' as type 'int32' (imputed)
Loading column 'f14' as type 'int32' (imputed)
Loading column 'f15' as type 'int32' (imputed)
Loading column 'f16' as type 'int32' (imputed)
Loading column 'f17' as type 'int32' (imputed)
Loading column 'f18' as type 'int32' (imputed)
Loading column 'f19' as type 'int32' (imputed)
Loading column 'f20' as type 'int32' (imputed)
Loading column 'f21' as type 'int32' (imputed)
Loading column 'f22' as type 'int32' (imputed)
Loading column 'f23' as type 'int32' (imputed)
2018-10-15 13:43:46 Hail: INFO: Coerced sorted dataset
2018-10-15 13:43:46 Hail: INFO: wrote 1682 items in 1 partition
2018-10-15 13:43:46 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2018-10-15 13:43:46 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:47 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'int32' (imputed)
Loading column 'f2' as type 'int32' (imputed)
Loading column 'f3' as type 'int32' (imputed)
2018-10-15 13:43:47 Hail: INFO: wrote 100000 items in 2 partitions
In [3]:
users = hl.read_table('data/users.ht')
users
Out[3]:
<hail.table.Table at 0x7f10851cf940>
Exploring Tables¶
The describe method prints the structure of a table: the fields and their types.
In [4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
You can view the first few rows of the table using
show.
10 rows are displayed by default. Try changing the code in the cell
below to users.show(5)
.
In [5]:
users.show()
+-------+-------+-----+-----------------+---------+
| id | age | sex | occupation | zipcode |
+-------+-------+-----+-----------------+---------+
| int32 | int32 | str | str | str |
+-------+-------+-----+-----------------+---------+
| 1 | 24 | "M" | "technician" | "85711" |
| 2 | 53 | "F" | "other" | "94043" |
| 3 | 23 | "M" | "writer" | "32067" |
| 4 | 24 | "M" | "technician" | "43537" |
| 5 | 33 | "F" | "other" | "15213" |
| 6 | 42 | "M" | "executive" | "98101" |
| 7 | 57 | "M" | "administrator" | "91344" |
| 8 | 36 | "M" | "administrator" | "05201" |
| 9 | 29 | "M" | "student" | "01002" |
| 10 | 53 | "M" | "lawyer" | "90703" |
+-------+-------+-----+-----------------+---------+
showing top 10 rows
You can count the rows of a table.
In [6]:
users.count()
Out[6]:
943
You can access fields of tables with the Python attribute notation
table.field
, or with index notation table['field']
. The latter
is useful when the field names are not valid Python identifiers (if a
field name includes a space, for example).
In [7]:
users.occupation.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f10851cf940>
Index:
['row']
--------------------------------------------------------
In [8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f10851cf940>
Index:
['row']
--------------------------------------------------------
The snippets users.occupation
and users['occupation']
are Hail
expressions. Each data type in Hail is represented by its own expression
class. Since the occupation field is a string type, the expression
users.occupation
will be a StringExpression.
In [9]:
users.occupation
Out[9]:
<StringExpression of type str>
You can also show
an Expression
. Notice that the key is shown as
well.
In [10]:
users.occupation.show()
+-------+-----------------+
| id | occupation |
+-------+-----------------+
| int32 | str |
+-------+-----------------+
| 1 | "technician" |
| 2 | "other" |
| 3 | "writer" |
| 4 | "technician" |
| 5 | "other" |
| 6 | "executive" |
| 7 | "administrator" |
| 8 | "administrator" |
| 9 | "student" |
| 10 | "lawyer" |
+-------+-----------------+
showing top 10 rows
Exercise¶
The movie dataset has two other tables: movies.ht
and
ratings.ht
. Load these tables and have a quick look around.