Table¶
Table is Hail’s
distributed analogue of a data frame or SQL table. It will be familiar
if you’ve used R or pandas
, but Table
differs in 3 important
ways:
- It is distributed. Hail
Table
s can store far more data than can fit on a single computer. - It carries global fields.
- It is keyed.
A Table
has two different kinds of fields:
- global fields
- row fields
Importing and Reading¶
Hail can import data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also read (and write) a native Hail format.
You can read a dataset with
hl.read_table.
It take a path and returns a Table
. ht
stands for Hail Table.
We’ve prepared a dataset of movie ratings in the Hail native format. Let’s read it!
In [1]:
import hail as hl
hl.init()
Running on Apache Spark version 2.2.0
SparkUI available at http://172.31.30.135:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version devel-1a29de719de9
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
In [2]:
hl.utils.get_movie_lens('data/')
2018-07-05 15:24:08 Hail: INFO: downloading MovieLens-100k data ...
Source: http://files.grouplens.org/datasets/movielens/ml-100k.zip
2018-07-05 15:24:09 Hail: INFO: importing users table and writing to data/users.ht ...
2018-07-05 15:24:10 Hail: INFO: Reading table to impute column types
2018-07-05 15:24:10 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'int32' (imputed)
Loading column 'f2' as type 'str' (imputed)
Loading column 'f3' as type 'str' (imputed)
Loading column 'f4' as type 'str' (imputed)
2018-07-05 15:24:11 Hail: INFO: Coerced sorted dataset
2018-07-05 15:24:11 Hail: INFO: wrote 943 items in 1 partition
2018-07-05 15:24:11 Hail: INFO: importing movies table and writing to data/movies.ht ...
2018-07-05 15:24:11 Hail: INFO: Reading table to impute column types
2018-07-05 15:24:11 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'str' (imputed)
Loading column 'f2' as type 'str' (imputed)
Loading column 'f3' as type 'str' (imputed)
Loading column 'f4' as type 'str' (imputed)
Loading column 'f5' as type 'int32' (imputed)
Loading column 'f6' as type 'int32' (imputed)
Loading column 'f7' as type 'int32' (imputed)
Loading column 'f8' as type 'int32' (imputed)
Loading column 'f9' as type 'int32' (imputed)
Loading column 'f10' as type 'int32' (imputed)
Loading column 'f11' as type 'int32' (imputed)
Loading column 'f12' as type 'int32' (imputed)
Loading column 'f13' as type 'int32' (imputed)
Loading column 'f14' as type 'int32' (imputed)
Loading column 'f15' as type 'int32' (imputed)
Loading column 'f16' as type 'int32' (imputed)
Loading column 'f17' as type 'int32' (imputed)
Loading column 'f18' as type 'int32' (imputed)
Loading column 'f19' as type 'int32' (imputed)
Loading column 'f20' as type 'int32' (imputed)
Loading column 'f21' as type 'int32' (imputed)
Loading column 'f22' as type 'int32' (imputed)
Loading column 'f23' as type 'int32' (imputed)
2018-07-05 15:24:13 Hail: INFO: Coerced sorted dataset
2018-07-05 15:24:13 Hail: INFO: wrote 1682 items in 1 partition
2018-07-05 15:24:13 Hail: INFO: importing ratings table and writing to data/ratings.ht ...
2018-07-05 15:24:13 Hail: INFO: Reading table to impute column types
2018-07-05 15:24:13 Hail: INFO: Finished type imputation
Loading column 'f0' as type 'int32' (imputed)
Loading column 'f1' as type 'int32' (imputed)
Loading column 'f2' as type 'int32' (imputed)
Loading column 'f3' as type 'int32' (imputed)
2018-07-05 15:24:16 Hail: INFO: wrote 100000 items in 2 partitions
In [3]:
users = hl.read_table('data/users.ht')
users
Out[3]:
<hail.table.Table at 0x7f24f92edfd0>
Exploring Tables¶
The describe method prints the structure of a table: the fields and their types.
In [4]:
users.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'id': int32
'age': int32
'sex': str
'occupation': str
'zipcode': str
----------------------------------------
Key: ['id']
----------------------------------------
describe
prints the structure of the table, but what about its
contents? To show the first few rows of the table, use
show.
In [5]:
users.show()
+-------+-------+-----+---------------+---------+
| id | age | sex | occupation | zipcode |
+-------+-------+-----+---------------+---------+
| int32 | int32 | str | str | str |
+-------+-------+-----+---------------+---------+
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 05201 |
| 9 | 29 | M | student | 01002 |
| 10 | 53 | M | lawyer | 90703 |
+-------+-------+-----+---------------+---------+
showing top 10 rows
You can count the rows of a table.
In [6]:
users.count()
Out[6]:
943
You can access fields of tables with the Python attribute notation, or index notation. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example).
In [7]:
users.occupation.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f24f92edfd0>
Index:
['row']
--------------------------------------------------------
In [8]:
users['occupation'].describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.table.Table object at 0x7f24f92edfd0>
Index:
['row']
--------------------------------------------------------
You can also show
an Expression
. Notice that the key is shown as
well.
In [9]:
users.occupation.show()
+-------+---------------+
| id | occupation |
+-------+---------------+
| int32 | str |
+-------+---------------+
| 1 | technician |
| 2 | other |
| 3 | writer |
| 4 | technician |
| 5 | other |
| 6 | executive |
| 7 | administrator |
| 8 | administrator |
| 9 | student |
| 10 | lawyer |
+-------+---------------+
showing top 10 rows
Exercise¶
The movie dataset has two other tables: movies.ht
and
ratings.ht
. Load these tables and have a quick look around.