utils

Interval(start, end[, includes_start, …]) – An object representing a range of values between start and end.
Struct(**kwargs) – Nested annotation structure.
hadoop_open(path, mode, buffer_size) – Open a file through the Hadoop filesystem API.
hadoop_copy(src, dest) – Copy a file through the Hadoop filesystem API.
hadoop_exists(path) – Returns True if path exists.
hadoop_is_file(path) – Returns True if path both exists and is a file.
hadoop_is_dir(path) – Returns True if path both exists and is a directory.
hadoop_stat(path) – Returns information about the file or directory at a given path.
hadoop_ls(path) – Returns information about files at path.
range_table(n[, n_partitions]) – Construct a table with the row index and no other fields.
range_matrix_table(n_rows, n_cols[, …]) – Construct a matrix table with row and column indices and no entry fields.
get_1kg(output_dir, overwrite) – Download a subset of the 1000 Genomes dataset and sample annotations.
get_movie_lens(output_dir, overwrite) – Download the public MovieLens dataset.
class hail.utils.Interval(start, end, includes_start=True, includes_end=False)

An object representing a range of values between start and end.

>>> interval2 = hl.Interval(3, 6)

Parameters:
- start (any type) – Object with type point_type.
- end (any type) – Object with type point_type.
- includes_start (bool) – Interval includes start.
- includes_end (bool) – Interval includes end.
contains(value)

True if value is contained within the interval.

Examples

>>> interval2.contains(5)
True

>>> interval2.contains(6)
False

Parameters: value – Object with type point_type().
Returns: bool
end

End point of the interval.

Examples

>>> interval2.end
6

Returns: Object with type point_type()
includes_end

True if interval is inclusive of end.

Examples

>>> interval2.includes_end
False

Returns: bool
includes_start

True if interval is inclusive of start.

Examples

>>> interval2.includes_start
True

Returns: bool
overlaps(interval)

True if the supplied interval contains any value in common with this one.

Parameters: interval (Interval) – Interval object with the same point type.
Returns: bool
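An illustrative sketch (not part of the original reference; it reuses the interval2 defined above):

>>> interval2.overlaps(hl.Interval(5, 9))
True

The value 5 lies inside interval2, which spans 3 (inclusive) to 6 (exclusive), so the two intervals overlap.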
point_type

Type of each element in the interval.

Examples

>>> interval2.point_type
tint32

Returns: Type
start

Start point of the interval.

Examples

>>> interval2.start
3

Returns: Object with type point_type()
class hail.utils.Struct(**kwargs)

Nested annotation structure.

>>> bar = hl.Struct(**{'foo': 5, '1kg': 10})

Struct elements are treated as both ‘items’ and ‘attributes’, which allows either syntax for accessing the element “foo” of struct “bar”:

>>> bar.foo
>>> bar['foo']

Field names that are not valid Python identifiers, such as fields that start with numbers or contain spaces, must be accessed with the latter syntax:

>>> bar['1kg']

The pprint module can be used to print nested Structs in a more human-readable fashion:

>>> from pprint import pprint
>>> pprint(bar)

Parameters: attributes – Field names and values.
annotate(**kwargs)

Add new fields or recompute existing fields.
Notes
If an expression in kwargs shares a name with a field of the struct, then that field will be replaced but keep its position in the struct. New fields will be appended to the end of the struct.
Parameters: kwargs (keyword args) – Fields to add.
Returns: Struct – Struct with new or updated fields.
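A short usage sketch (added for illustration; it reuses the bar defined above):

>>> s = bar.annotate(foo=6, baz='new')
>>> s.foo
6
>>> s.baz
'new'

Here foo is replaced in place, while baz is appended to the end of the struct.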
drop(*args)

Drop fields from the struct.

Parameters: fields (varargs of str) – Fields to drop.
Returns: Struct – Struct without certain fields.
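A short usage sketch (added for illustration; it reuses the bar defined above):

>>> s = bar.drop('1kg')
>>> s.foo
5

The resulting struct retains foo but no longer has the '1kg' field.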
select(*fields, **kwargs)

Select existing fields and compute new ones.

Notes

The fields argument is a list of field names to keep. These fields will appear in the resulting struct in the order they appear in fields.

The kwargs arguments are new fields to add.

Parameters:
- fields (varargs of str) – Field names to keep.
- named_exprs (keyword args) – New fields to add.
Returns: Struct – Struct containing specified existing fields and computed fields.
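A short usage sketch (added for illustration; it reuses the bar defined above):

>>> s = bar.select('foo', doubled=bar.foo * 2)
>>> s.foo
5
>>> s.doubled
10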
hail.utils.hadoop_open(path: str, mode: str = 'r', buffer_size: int = 8192)

Open a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
>>> with hadoop_open('gs://my-bucket/notes.txt') as f:
...     for line in f:
...         print(line.strip())

>>> with hadoop_open('gs://my-bucket/notes.txt', 'w') as f:
...     f.write('result1: %s\n' % result1)
...     f.write('result2: %s\n' % result2)

>>> from struct import unpack
>>> with hadoop_open('gs://my-bucket/notes.txt', 'rb') as f:
...     print(unpack('<f', bytearray(f.read())))
Notes
The supported modes are:

- 'r' – Readable text file (io.TextIOWrapper). Default behavior.
- 'w' – Writable text file (io.TextIOWrapper).
- 'x' – Exclusive writable text file (io.TextIOWrapper). Throws an error if a file already exists at the path.
- 'rb' – Readable binary file (io.BufferedReader).
- 'wb' – Writable binary file (io.BufferedWriter).
- 'xb' – Exclusive writable binary file (io.BufferedWriter). Throws an error if a file already exists at the path.

The provided destination file path must be a URI (uniform resource identifier).
Caution

These file handles are slower than standard Python file handles. If you are writing a large file (larger than ~50 MB), it will be faster to write to a local file using standard Python I/O and use hadoop_copy() to move your file to a distributed file system.

Parameters:
- path (str) – Path to file.
- mode (str) – File access mode.
- buffer_size (int) – Buffer size, in bytes.

Returns: Readable or writable file handle.
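A minimal sketch of the pattern described in the caution above (the paths are hypothetical):

>>> with open('/tmp/results.txt', 'w') as f:  # standard, fast local I/O
...     f.write('line 1\n')
>>> hadoop_copy('file:///tmp/results.txt',    # then move to the distributed file system
...             'gs://my-bucket/results.txt')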
hail.utils.hadoop_copy(src, dest)

Copy a file through the Hadoop filesystem API. Supports distributed file systems like hdfs, gs, and s3.
Examples
>>> hadoop_copy('gs://hail-common/LCR.interval_list', 'file:///mnt/data/LCR.interval_list')
Notes
The provided source and destination file paths must be URIs (uniform resource identifiers).
Parameters:
- src (str) – Source file URI.
- dest (str) – Destination file URI.
hail.utils.hadoop_exists(path: str) → bool

Returns True if path exists.

Parameters: path (str)
Returns: bool
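An illustrative sketch (the path is hypothetical):

>>> hadoop_exists('gs://my-bucket/notes.txt')  # True if the object exists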
hail.utils.hadoop_is_file(path: str) → bool

Returns True if path both exists and is a file.

Parameters: path (str)
Returns: bool
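A one-line sketch (hypothetical path):

>>> hadoop_is_file('gs://my-bucket/notes.txt')  # True for an existing regular file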
hail.utils.hadoop_is_dir(path) → bool

Returns True if path both exists and is a directory.

Parameters: path (str)
Returns: bool
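A one-line sketch (hypothetical path):

>>> hadoop_is_dir('gs://my-bucket/data/')  # True for an existing directory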
hail.utils.hadoop_stat(path: str) → Dict

Returns information about the file or directory at a given path.

Notes

Raises an error if path does not exist.

The resulting dictionary contains the following data:

- is_dir (bool) – Path is a directory.
- size_bytes (int) – Size in bytes.
- size (str) – Size as a readable string.
- modification_time (str) – Time of last file modification.
- owner (str) – Owner.
- path (str) – Path.

Parameters: path (str)
Returns: Dict
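A brief usage sketch (the path is hypothetical):

>>> stats = hadoop_stat('gs://my-bucket/notes.txt')
>>> stats['is_dir']      # False for a regular file
>>> stats['size_bytes']  # size of the file in bytes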
hail.utils.hadoop_ls(path: str) → List[Dict]

Returns information about files at path.

Notes

Raises an error if path does not exist.

If path is a file, returns a list with one element. If path is a directory, returns an element for each file contained in path (does not search recursively).

Each dict element of the result list contains the following data:

- is_dir (bool) – Path is a directory.
- size_bytes (int) – Size in bytes.
- size (str) – Size as a readable string.
- modification_time (str) – Time of last file modification.
- owner (str) – Owner.
- path (str) – Path.

Parameters: path (str)
Returns: List[Dict]
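A brief usage sketch (the path is hypothetical):

>>> for entry in hadoop_ls('gs://my-bucket/data/'):
...     print(entry['path'], entry['size'])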
hail.utils.range_table(n, n_partitions=None) → hail.Table

Construct a table with the row index and no other fields.

Examples

>>> df = hl.utils.range_table(100)

>>> df.count()
100
Notes
The resulting table contains one field:
- idx (tint32) – Row index (key).

This method is meant for testing and learning, and is not optimized for production performance.

Parameters:
- n (int) – Number of rows.
- n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns: Table
hail.utils.range_matrix_table(n_rows, n_cols, n_partitions=None) → hail.MatrixTable

Construct a matrix table with row and column indices and no entry fields.

Examples

>>> range_ds = hl.utils.range_matrix_table(n_rows=100, n_cols=10)

>>> range_ds.count_rows()
100

>>> range_ds.count_cols()
10
Notes
The resulting matrix table contains the following fields:

- row_idx (tint32) – Row index (row key).
- col_idx (tint32) – Column index (column key).

It contains no entry fields.
This method is meant for testing and learning, and is not optimized for production performance.
Parameters:
- n_rows (int) – Number of rows.
- n_cols (int) – Number of columns.
- n_partitions (int, optional) – Number of partitions (uses Spark default parallelism if None).

Returns: MatrixTable
hail.utils.get_1kg(output_dir, overwrite: bool = False)

Download a subset of the 1000 Genomes dataset and sample annotations.

Notes

The download is about 15 MB.

Parameters:
- output_dir – Directory in which to write data.
- overwrite – If True, overwrite any existing files/directories at output_dir.
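A usage sketch (added for illustration; the output directory and the 1kg.mt file name follow the Hail tutorials and are assumptions here):

>>> hl.utils.get_1kg('data/')
>>> mt = hl.read_matrix_table('data/1kg.mt')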
hail.utils.get_movie_lens(output_dir, overwrite: bool = False)

Download the public MovieLens dataset.

Notes

The download is about 6 MB.

See the MovieLens website for more information about this dataset.

Parameters:
- output_dir – Directory in which to write data.
- overwrite – If True, overwrite existing files/directories at those locations.
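A usage sketch (added for illustration; the output directory and table file name follow the Hail tutorials and are assumptions here):

>>> hl.utils.get_movie_lens('data/')
>>> movies = hl.read_table('data/movies.ht')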