Table Tutorial

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

  • It is distributed. Hail tables can store far more data than can fit on a single computer.
  • It carries global fields.
  • It is keyed.

A Table has two different kinds of fields:

  • global fields
  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve provided a method to download and import the MovieLens dataset of movie ratings in the Hail native format. Let’s read it!

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.

In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.56.40.9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-90a5cab4aab8
NOTE: This is a beta version. Interfaces may change
  during the beta period. We recommend pulling
  the latest changes weekly.
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181015-1343-devel-90a5cab4aab8.log
In [2]:
hl.utils.get_movie_lens('data/')
2018-10-15 13:43:41 Hail: INFO: downloading MovieLens-100k data ...
  Source: http://files.grouplens.org/datasets/movielens/ml-100k.zip
2018-10-15 13:43:42 Hail: INFO: importing users table and writing to data/users.ht ...
2018-10-15 13:43:43 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:44 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
2018-10-15 13:43:44 Hail: INFO: Coerced sorted dataset
2018-10-15 13:43:45 Hail: INFO: wrote 943 items in 1 partition
2018-10-15 13:43:45 Hail: INFO: importing movies table and writing to data/movies.ht ...
2018-10-15 13:43:45 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:45 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'str' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
  Loading column 'f5' as type 'int32' (imputed)
  Loading column 'f6' as type 'int32' (imputed)
  Loading column 'f7' as type 'int32' (imputed)
  Loading column 'f8' as type 'int32' (imputed)
  Loading column 'f9' as type 'int32' (imputed)
  Loading column 'f10' as type 'int32' (imputed)
  Loading column 'f11' as type 'int32' (imputed)
  Loading column 'f12' as type 'int32' (imputed)
  Loading column 'f13' as type 'int32' (imputed)
  Loading column 'f14' as type 'int32' (imputed)
  Loading column 'f15' as type 'int32' (imputed)
  Loading column 'f16' as type 'int32' (imputed)
  Loading column 'f17' as type 'int32' (imputed)
  Loading column 'f18' as type 'int32' (imputed)
  Loading column 'f19' as type 'int32' (imputed)
  Loading column 'f20' as type 'int32' (imputed)
  Loading column 'f21' as type 'int32' (imputed)
  Loading column 'f22' as type 'int32' (imputed)
  Loading column 'f23' as type 'int32' (imputed)
2018-10-15 13:43:46 Hail: INFO: Coerced sorted dataset
2018-10-15 13:43:46 Hail: INFO: wrote 1682 items in 1 partition
2018-10-15 13:43:46 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2018-10-15 13:43:46 Hail: INFO: Reading table to impute column types
2018-10-15 13:43:47 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'int32' (imputed)
  Loading column 'f3' as type 'int32' (imputed)
2018-10-15 13:43:47 Hail: INFO: wrote 100000 items in 2 partitions
In [3]:
users = hl.read_table('data/users.ht')
users
Out[3]:
<hail.table.Table at 0x7f10851cf940>

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

In [4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

You can view the first few rows of the table using show. 10 rows are displayed by default. Try changing the code in the cell below to users.show(5).

In [5]:
users.show()
+-------+-------+-----+-----------------+---------+
|    id |   age | sex | occupation      | zipcode |
+-------+-------+-----+-----------------+---------+
| int32 | int32 | str | str             | str     |
+-------+-------+-----+-----------------+---------+
|     1 |    24 | "M" | "technician"    | "85711" |
|     2 |    53 | "F" | "other"         | "94043" |
|     3 |    23 | "M" | "writer"        | "32067" |
|     4 |    24 | "M" | "technician"    | "43537" |
|     5 |    33 | "F" | "other"         | "15213" |
|     6 |    42 | "M" | "executive"     | "98101" |
|     7 |    57 | "M" | "administrator" | "91344" |
|     8 |    36 | "M" | "administrator" | "05201" |
|     9 |    29 | "M" | "student"       | "01002" |
|    10 |    53 | "M" | "lawyer"        | "90703" |
+-------+-------+-----+-----------------+---------+
showing top 10 rows

You can count the rows of a table.

In [6]:
users.count()
Out[6]:
943

You can access fields of tables with the Python attribute notation table.field, or with index notation table['field']. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

In [7]:
users.occupation.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f10851cf940>
Index:
    ['row']
--------------------------------------------------------
In [8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f10851cf940>
Index:
    ['row']
--------------------------------------------------------

The snippets users.occupation and users['occupation'] are Hail expressions. Each data type in Hail is represented by its own expression class. Since the occupation field is a string type, the expression users.occupation will be a StringExpression.

In [9]:
users.occupation
Out[9]:
<StringExpression of type str>

You can also show an Expression. Notice that the key is shown as well.

In [10]:
users.occupation.show()
+-------+-----------------+
|    id | occupation      |
+-------+-----------------+
| int32 | str             |
+-------+-----------------+
|     1 | "technician"    |
|     2 | "other"         |
|     3 | "writer"        |
|     4 | "technician"    |
|     5 | "other"         |
|     6 | "executive"     |
|     7 | "administrator" |
|     8 | "administrator" |
|     9 | "student"       |
|    10 | "lawyer"        |
+-------+-----------------+
showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.