Table

Table is Hail’s distributed analogue of a data frame or SQL table. It will be familiar if you’ve used R or pandas, but Table differs in 3 important ways:

  • It is distributed. Hail Tables can store far more data than can fit on a single computer.
  • It carries global fields.
  • It is keyed.

A Table has two different kinds of fields:

  • global fields
  • row fields

Importing and Reading

Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.

You can read a dataset with hl.read_table. It take a path and returns a Table. ht stands for Hail Table.

We’ve prepared a dataset of movie ratings in the Hail native format. Let’s read it!

In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://172.31.20.142:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-f7631a0c96cd
NOTE: This is a beta version. Interfaces may change
  during the beta period. We recommend pulling
  the latest changes weekly.
In [2]:
hl.utils.get_movie_lens('data/')
2018-07-16 22:05:05 Hail: INFO: downloading MovieLens-100k data ...
  Source: http://files.grouplens.org/datasets/movielens/ml-100k.zip
2018-07-16 22:05:06 Hail: INFO: importing users table and writing to data/users.ht ...
2018-07-16 22:05:07 Hail: INFO: Reading table to impute column types
2018-07-16 22:05:07 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
2018-07-16 22:05:08 Hail: INFO: Coerced sorted dataset
2018-07-16 22:05:08 Hail: INFO: wrote 943 items in 1 partition
2018-07-16 22:05:08 Hail: INFO: importing movies table and writing to data/movies.ht ...
2018-07-16 22:05:08 Hail: INFO: Reading table to impute column types
2018-07-16 22:05:09 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'str' (imputed)
  Loading column 'f2' as type 'str' (imputed)
  Loading column 'f3' as type 'str' (imputed)
  Loading column 'f4' as type 'str' (imputed)
  Loading column 'f5' as type 'int32' (imputed)
  Loading column 'f6' as type 'int32' (imputed)
  Loading column 'f7' as type 'int32' (imputed)
  Loading column 'f8' as type 'int32' (imputed)
  Loading column 'f9' as type 'int32' (imputed)
  Loading column 'f10' as type 'int32' (imputed)
  Loading column 'f11' as type 'int32' (imputed)
  Loading column 'f12' as type 'int32' (imputed)
  Loading column 'f13' as type 'int32' (imputed)
  Loading column 'f14' as type 'int32' (imputed)
  Loading column 'f15' as type 'int32' (imputed)
  Loading column 'f16' as type 'int32' (imputed)
  Loading column 'f17' as type 'int32' (imputed)
  Loading column 'f18' as type 'int32' (imputed)
  Loading column 'f19' as type 'int32' (imputed)
  Loading column 'f20' as type 'int32' (imputed)
  Loading column 'f21' as type 'int32' (imputed)
  Loading column 'f22' as type 'int32' (imputed)
  Loading column 'f23' as type 'int32' (imputed)
2018-07-16 22:05:10 Hail: INFO: Coerced sorted dataset
2018-07-16 22:05:10 Hail: INFO: wrote 1682 items in 1 partition
2018-07-16 22:05:10 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2018-07-16 22:05:10 Hail: INFO: Reading table to impute column types
2018-07-16 22:05:11 Hail: INFO: Finished type imputation
  Loading column 'f0' as type 'int32' (imputed)
  Loading column 'f1' as type 'int32' (imputed)
  Loading column 'f2' as type 'int32' (imputed)
  Loading column 'f3' as type 'int32' (imputed)
2018-07-16 22:05:14 Hail: INFO: wrote 100000 items in 2 partitions
In [3]:
users = hl.read_table('data/users.ht')
users
Out[3]:
<hail.table.Table at 0x7f01189fb6d8>

Exploring Tables

The describe method prints the structure of a table: the fields and their types.

In [4]:
users.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'id': int32
    'age': int32
    'sex': str
    'occupation': str
    'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------

describe prints the structure of the table, but what about its contents? To show the first few rows of the table, use show.

In [5]:
users.show()
+-------+-------+-----+---------------+---------+
|    id |   age | sex | occupation    | zipcode |
+-------+-------+-----+---------------+---------+
| int32 | int32 | str | str           | str     |
+-------+-------+-----+---------------+---------+
|     1 |    24 | M   | technician    | 85711   |
|     2 |    53 | F   | other         | 94043   |
|     3 |    23 | M   | writer        | 32067   |
|     4 |    24 | M   | technician    | 43537   |
|     5 |    33 | F   | other         | 15213   |
|     6 |    42 | M   | executive     | 98101   |
|     7 |    57 | M   | administrator | 91344   |
|     8 |    36 | M   | administrator | 05201   |
|     9 |    29 | M   | student       | 01002   |
|    10 |    53 | M   | lawyer        | 90703   |
+-------+-------+-----+---------------+---------+
showing top 10 rows

You can count the rows of a table.

In [6]:
users.count()
Out[6]:
943

You can access fields of tables with the Python attribute notation, or index notation. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).

In [7]:
users.occupation.describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f01189fb6d8>
Index:
    ['row']
--------------------------------------------------------
In [8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
    str
--------------------------------------------------------
Source:
    <hail.table.Table object at 0x7f01189fb6d8>
Index:
    ['row']
--------------------------------------------------------

You can also show an Expression. Notice that the key is shown as well.

In [9]:
users.occupation.show()
+-------+---------------+
|    id | occupation    |
+-------+---------------+
| int32 | str           |
+-------+---------------+
|     1 | technician    |
|     2 | other         |
|     3 | writer        |
|     4 | technician    |
|     5 | other         |
|     6 | executive     |
|     7 | administrator |
|     8 | administrator |
|     9 | student       |
|    10 | lawyer        |
+-------+---------------+
showing top 10 rows

Exercise

The movie dataset has two other tables: movies.ht and ratings.ht. Load these tables and have a quick look around.