Experimental
This module serves two functions: as a staging area for extensions of Hail not ready for inclusion in the main package, and as a library of lightly reviewed community submissions.
Contribution Guidelines
Submissions from the community are welcome! The criteria for inclusion in the experimental module are loose and subject to change:
- Function docstrings are required. Hail uses NumPy style docstrings.
- Tests are not required, but are encouraged. If you do include tests, they must run in no more than a few seconds. Place tests as a class method on Tests in python/tests/experimental/test_experimental.py
- Code style is not strictly enforced, aside from egregious violations. We do recommend using autopep8 though!
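As an illustration of the first two guidelines, the sketch below pairs a NumPy-style docstring with a fast test written as a method on a Tests class. The function allele_frequency and its test are hypothetical, chosen only to show the layout; they are not part of Hail.

```python
import unittest


def allele_frequency(ac, an):
    """Compute an allele frequency from an allele count and allele number.

    Hypothetical helper used only to illustrate NumPy-style docstrings.

    Parameters
    ----------
    ac : int
        Allele count.
    an : int
        Allele number (total number of called alleles).

    Returns
    -------
    float
        ``ac / an``, or ``0.0`` when `an` is zero.
    """
    return ac / an if an else 0.0


class Tests(unittest.TestCase):
    def test_allele_frequency(self):
        # Pure arithmetic, so this runs well within "a few seconds".
        self.assertEqual(allele_frequency(1, 4), 0.25)
        self.assertEqual(allele_frequency(0, 0), 0.0)
```

In a real submission the test method would be added to the existing Tests class in the file named above rather than to a new class.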
Genetics Methods
ld_score(entry_expr, locus_expr, radius[, …]) | Calculate LD scores.
filtering_allele_frequency(ac, an, ci) | Computes a filtering allele frequency (described below) for ac and an with confidence ci.
hail_metadata(t_path) | Create a metadata plot for a Hail Table or MatrixTable.
plot_roc_curve(ht, scores[, tp_label, …]) | Create ROC curve from Hail Table.
phase_by_transmission(locus, alleles, …) | Phases genotype calls in a trio based on allele transmission.
phase_trio_matrix_by_transmission(tm, …) | Adds a phased genotype entry to a trio MatrixTable based on allele transmission in the trio.
explode_trio_matrix(tm, col_keys) | Splits a trio MatrixTable back into a sample MatrixTable.
load_dataset(name, version, reference_genome) | Load a genetic dataset from Hail’s repository.
hail.experimental.ld_score(entry_expr, locus_expr, radius, coord_expr=None, annotation_exprs=None, block_size=None) → hail.table.Table
Calculate LD scores.
Example
>>> # Load genetic data into MatrixTable
>>> mt = hl.import_plink(bed='data/ldsc.bed',
...                      bim='data/ldsc.bim',
...                      fam='data/ldsc.fam')

>>> # Create locus-keyed Table with numeric variant annotations
>>> ht = hl.import_table('data/ldsc.annot',
...                      types={'BP': hl.tint,
...                             'binary': hl.tfloat,
...                             'continuous': hl.tfloat})
>>> ht = ht.annotate(locus=hl.locus(ht.CHR, ht.BP))
>>> ht = ht.key_by('locus')

>>> # Annotate MatrixTable with external annotations
>>> mt = mt.annotate_rows(binary_annotation=ht[mt.locus].binary,
...                       continuous_annotation=ht[mt.locus].continuous)

>>> # Calculate LD scores using centimorgan coordinates
>>> ht_scores = hl.experimental.ld_score(entry_expr=mt.GT.n_alt_alleles(),
...                                      locus_expr=mt.locus,
...                                      radius=1.0,
...                                      coord_expr=mt.cm_position,
...                                      annotation_exprs=[mt.binary_annotation,
...                                                        mt.continuous_annotation])

>>> # Show results
>>> ht_scores.show(3)
+---------------+-------------------+-----------------------+-------------+
| locus         | binary_annotation | continuous_annotation | univariate  |
+---------------+-------------------+-----------------------+-------------+
| locus<GRCh37> | float64           | float64               | float64     |
+---------------+-------------------+-----------------------+-------------+
| 20:82079      | 1.15183e+00       | 7.30145e+01           | 1.60117e+00 |
| 20:103517     | 2.04604e+00       | 2.75392e+02           | 4.69239e+00 |
| 20:108286     | 2.06585e+00       | 2.86453e+02           | 5.00124e+00 |
+---------------+-------------------+-----------------------+-------------+
Warning
ld_score() will fail if entry_expr results in any missing values. The special float value nan is not considered a missing value.
Further reading
For more in-depth discussion of LD scores, see:
- LD Score regression distinguishes confounding from polygenicity in genome-wide association studies (Bulik-Sullivan et al, 2015)
- Partitioning heritability by functional annotation using genome-wide association summary statistics (Finucane et al, 2015)
Notes
entry_expr, locus_expr, coord_expr (if specified), and annotation_exprs (if specified) must come from the same MatrixTable.
Parameters:
- entry_expr (NumericExpression) – Expression for entries of genotype matrix (e.g. mt.GT.n_alt_alleles()).
- locus_expr (LocusExpression) – Row-indexed locus expression.
- radius (int or float) – Radius of window for row values (in units of coord_expr if set, otherwise in base pairs).
- coord_expr (Float64Expression, optional) – Row-indexed numeric expression for the row value used to window variants. By default, the row value is given by the locus position.
- annotation_exprs (NumericExpression or list of NumericExpression, optional) – Annotation expression(s) to partition LD scores. Univariate annotation will always be included and does not need to be specified.
- block_size (int, optional) – Block size. Default given by BlockMatrix.default_block_size().
Returns: Table – Table keyed by locus_expr with LD scores for each variant and annotation_expr. The function will always return LD scores for the univariate (all SNPs) annotation.
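To make the windowing idea concrete, here is a minimal pure-Python sketch of a univariate LD score: for each variant, sum the squared correlation with every variant whose coordinate lies within radius. It is not Hail's implementation. The function names, the inclusive window, the use of plain (unadjusted) r², and the handling of monomorphic variants are all assumptions of this sketch, and annotation weighting is left out.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: a monomorphic variant contributes no correlation.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def univariate_ld_scores(genotypes, coords, radius):
    """Sum r^2 between each variant and all variants within `radius` of it.

    genotypes: one list of alternate-allele counts per variant (no missing values).
    coords: one coordinate per variant, in the same units as `radius`.
    """
    scores = []
    for j, cj in enumerate(coords):
        total = 0.0
        for k, ck in enumerate(coords):
            if abs(cj - ck) <= radius:  # assumption: window bounds are inclusive
                total += pearson_r2(genotypes[j], genotypes[k])
        scores.append(total)
    return scores


genotypes = [
    [0, 1, 2, 1],  # variant at position 100
    [0, 1, 2, 1],  # position 150, identical to the first variant
    [2, 0, 1, 1],  # position 900, outside the window of the other two
]
print(univariate_ld_scores(genotypes, coords=[100, 150, 900], radius=100))
```

The first two variants fall in each other's window and so score higher than the isolated third variant, whose only contribution is its correlation with itself.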
hail.experimental.hail_metadata(t_path)
Create a metadata plot for a Hail Table or MatrixTable.
Parameters: t_path (str) – Path to the Hail Table or MatrixTable files.
Returns: bokeh.plotting.figure.Figure or bokeh.models.widgets.panels.Tabs or bokeh.models.layouts.Column
hail.experimental.plot_roc_curve(ht, scores, tp_label='tp', fp_label='fp', colors=None, title='ROC Curve', hover_mode='mouse')
Create ROC curve from Hail Table.
One or more score fields must be provided, which are assessed against tp_label and fp_label as truth data.
High scores should correspond to true positives.
Parameters:
- ht (Table) – Table with required data
- scores (str or list of str) – Top-level location of scores in ht against which to generate ROC curves.
- tp_label (str) – Top-level location of true positives in ht.
- fp_label (str) – Top-level location of false positives in ht.
- colors (dict of str) – Optional colors to use (score -> desired color).
- title (str) – Title of plot.
- hover_mode (str) – Hover mode; one of ‘mouse’ (default), ‘vline’ or ‘hline’
Returns: tuple of Figure and list of str – Figure, and list of AUCs corresponding to scores.
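For readers who want to see what the curve and the AUC measure, the following is a small pure-Python sketch that ranks rows by score (high scores first), accumulates true and false positives, and integrates with the trapezoid rule. It is an illustration only and does not reflect how plot_roc_curve is implemented. The function name roc_points is hypothetical, and the sketch assumes every row is labelled as either a true positive or a false positive, with at least one of each.

```python
def roc_points(scores, is_tp):
    """Return ROC points as (false positive rate, true positive rate) and the AUC.

    scores: one numeric score per row; higher should mean "more likely true positive".
    is_tp: one boolean per row; True for a true positive, False for a false positive.
    Assumes at least one true positive and one false positive.
    """
    ranked = sorted(zip(scores, is_tp), key=lambda pair: -pair[0])
    n_tp = sum(1 for _, tp in ranked if tp)
    n_fp = len(ranked) - n_tp

    points = [(0.0, 0.0)]
    tp_seen = fp_seen = 0
    i = 0
    while i < len(ranked):
        # Rows with tied scores share one threshold, so they move the curve together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp_seen += 1
            else:
                fp_seen += 1
            j += 1
        points.append((fp_seen / n_fp, tp_seen / n_tp))
        i = j

    # Trapezoid rule over consecutive points.
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return points, auc


points, auc = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(points, auc)
```

A score that separates the two classes perfectly gives an AUC of 1.0, while a score unrelated to the labels tends toward 0.5.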
hail.experimental.filtering_allele_frequency(ac, an, ci) → hail.expr.expressions.typed_expressions.Float64Expression
Computes a filtering allele frequency (described below) for ac and an with confidence ci.
The filtering allele frequency is the highest true population allele frequency for which the upper bound of the ci (confidence interval) of allele count under a Poisson distribution is still less than the variant’s observed ac (allele count) in the reference sample, given an an (allele number).
This function defines a “filtering AF” that represents the threshold disease-specific “maximum credible AF” at or below which the disease could not plausibly be caused by that variant. A variant with a filtering AF >= the maximum credible AF for the disease under consideration should be filtered, while a variant with a filtering AF below the maximum credible AF remains a candidate. This filtering AF is not disease-specific: it can be applied to any disease of interest by comparing with a user-defined disease-specific maximum credible AF.
For more details, see: Whiffin et al., 2017
Parameters:
- ac (int or Expression of type tint32)
- an (int or Expression of type tint32)
- ci (float or Expression of type tfloat64)
Returns: Expression of type tfloat64
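The definition above can be read as a one-dimensional search, sketched below in plain Python under stated assumptions. The upper bound of the ci interval is taken to be the Poisson quantile at ci with mean an × AF, so "upper bound less than ac" is equivalent to P(X ≤ ac − 1) ≥ ci, and bisection finds the largest AF that satisfies it. This is one reading of the prose, not Hail's implementation: the function names, the bisection, and the treatment of ac ≤ 0 are assumptions, and Hail's rounding and edge-case behaviour may differ.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the probability mass in log space."""
    if k < 0:
        return 0.0
    if lam == 0.0:
        return 1.0
    log_lam = math.log(lam)
    total = sum(
        math.exp(i * log_lam - lam - math.lgamma(i + 1)) for i in range(k + 1)
    )
    return min(total, 1.0)


def filtering_af_sketch(ac, an, ci, iterations=100):
    """Largest allele frequency whose Poisson ci-quantile of allele count is < ac."""
    if ac <= 0 or an <= 0:
        return 0.0  # assumption: nothing observed means no filtering AF
    if poisson_cdf(ac - 1, float(an)) >= ci:
        return 1.0  # condition holds even at AF = 1
    lo, hi = 0.0, 1.0  # invariant: condition holds at lo and fails at hi
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if poisson_cdf(ac - 1, an * mid) >= ci:
            lo = mid
        else:
            hi = mid
    return lo


faf = filtering_af_sketch(ac=10, an=10000, ci=0.95)
max_credible_af = 1e-3  # hypothetical disease-specific threshold
print(faf, "filter" if faf >= max_credible_af else "keep as candidate")
```

The result is always below the observed frequency ac / an, because it asks how high the true frequency could be while the observed count would still be surprisingly large.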
hail.experimental.phase_by_transmission(locus: hail.expr.expressions.typed_expressions.LocusExpression, alleles: hail.expr.expressions.typed_expressions.ArrayExpression, proband_call: hail.expr.expressions.typed_expressions.CallExpression, father_call: hail.expr.expressions.typed_expressions.CallExpression, mother_call: hail.expr.expressions.typed_expressions.CallExpression) → hail.expr.expressions.typed_expressions.ArrayExpression
Phases genotype calls in a trio based on allele transmission.
Notes
In the phased calls returned, the order is as follows:
- Proband: father_allele | mother_allele
- Parents: transmitted_allele | untransmitted_allele
Phasing of sex chromosomes:
- Sex chromosomes of male individuals should be haploid to be phased correctly.
- If proband_call is diploid on non-PAR regions of the sex chromosomes, it is assumed to be female.
Returns NA when genotype calls cannot be phased. The following combinations of genotype calls cannot be phased by transmission:
1. One of the calls in the trio is missing
2. The proband genotype cannot be obtained from the parents’ alleles (Mendelian violation)
3. All individuals of the trio are heterozygous for the same two alleles
4. Father is diploid on non-PAR region of X or Y
5. Proband is diploid on non-PAR region of Y
In addition, individual phased genotype calls are returned as missing in the following situations:
1. All mother genotype calls on non-PAR region of Y
2. Diploid father genotype calls on non-PAR region of X for a male proband (proband and mother are still phased as father doesn’t participate in allele transmission)
Note
experimental.phase_trio_matrix_by_transmission() provides a convenience wrapper for phasing a trio matrix.
Parameters:
- locus (LocusExpression) – Expression for the locus in the trio matrix
- alleles (ArrayExpression) – Expression for the alleles in the trio matrix
- proband_call (CallExpression) – Expression for the proband call in the trio matrix
- father_call (CallExpression) – Expression for the father call in the trio matrix
- mother_call (CallExpression) – Expression for the mother call in the trio matrix
Returns: ArrayExpression – Array containing: [phased proband call, phased father call, phased mother call]
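The autosomal part of these rules can be sketched in a few lines of plain Python. The example below covers only diploid calls on autosomes (rules 1 to 3 of the unphasable cases); the sex-chromosome rules are deliberately left out. Calls are modelled as tuples of allele indices rather than Hail call expressions, and the function name phase_autosomal_trio is hypothetical.

```python
def phase_autosomal_trio(proband, father, mother):
    """Phase diploid autosomal calls by transmission.

    Each call is a pair of allele indices, or None when missing.
    Returns [proband, father, mother] phased calls, or None when the trio
    cannot be phased. Proband order is (father_allele, mother_allele);
    parent order is (transmitted_allele, untransmitted_allele).
    """
    if proband is None or father is None or mother is None:
        return None  # a call in the trio is missing

    a, b = proband
    # Orderings (from_father, from_mother) that the parents' alleles allow.
    candidates = {
        (from_father, from_mother)
        for from_father, from_mother in ((a, b), (b, a))
        if from_father in father and from_mother in mother
    }
    if len(candidates) != 1:
        # No candidate: Mendelian violation.
        # Two candidates: all three are heterozygous for the same two alleles.
        return None
    from_father, from_mother = candidates.pop()

    def transmitted_first(parent, transmitted):
        first, second = parent
        return (first, second) if first == transmitted else (second, first)

    return [
        (from_father, from_mother),
        transmitted_first(father, from_father),
        transmitted_first(mother, from_mother),
    ]


print(phase_autosomal_trio((0, 1), (0, 0), (0, 1)))
print(phase_autosomal_trio((0, 1), (0, 1), (0, 1)))
```

In the first call the alternate allele can only have come from the mother, so the trio is phased; in the second, either parent could have transmitted either allele, so nothing is returned.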
hail.experimental.phase_trio_matrix_by_transmission(tm: hail.matrixtable.MatrixTable, call_field: str = 'GT', phased_call_field: str = 'PBT_GT') → hail.matrixtable.MatrixTable
Adds a phased genotype entry to a trio MatrixTable based on allele transmission in the trio.
Example
>>> # Create a trio matrix
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam')
>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)

>>> # Phase trios by transmission
>>> phased_trio_dataset = hl.experimental.phase_trio_matrix_by_transmission(trio_dataset)
Notes
Uses only a Call field to phase and only phases when all 3 members of the trio are present and have a call.
In the phased genotypes, the order is as follows:
- Proband: father_allele | mother_allele
- Parents: transmitted_allele | untransmitted_allele
Phasing of sex chromosomes:
- Sex chromosomes of male individuals should be haploid to be phased correctly.
- If a proband is diploid on non-PAR regions of the sex chromosomes, it is assumed to be female.
Genotypes that cannot be phased are set to NA. The following combinations of genotype calls cannot be phased by transmission (the phased calls of all trio members are set to missing):
1. One of the calls in the trio is missing
2. The proband genotype cannot be obtained from the parents’ alleles (Mendelian violation)
3. All individuals of the trio are heterozygous for the same two alleles
4. Father is diploid on non-PAR region of X or Y
5. Proband is diploid on non-PAR region of Y
In addition, individual phased genotype calls are returned as missing in the following situations:
1. All mother genotype calls on non-PAR region of Y
2. Diploid father genotype calls on non-PAR region of X for a male proband (proband and mother are still phased as father doesn’t participate in allele transmission)
Parameters:
- tm (MatrixTable) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)
- call_field (str) – genotype field name in the matrix entries to use for phasing
- phased_call_field (str) – name for the phased genotype field in the matrix entries
Returns: MatrixTable – Trio MatrixTable entry with additional phased genotype field for each individual
hail.experimental.explode_trio_matrix(tm: hail.matrixtable.MatrixTable, col_keys: List[str] = ['s']) → hail.matrixtable.MatrixTable
Splits a trio MatrixTable back into a sample MatrixTable.
Example
>>> # Create a trio matrix from a sample matrix
>>> pedigree = hl.Pedigree.read('data/case_control_study.fam')
>>> trio_dataset = hl.trio_matrix(dataset, pedigree, complete_trios=True)

>>> # Explode trio matrix back into a sample matrix
>>> exploded_trio_dataset = hl.experimental.explode_trio_matrix(trio_dataset)
Notes
This assumes that the input MatrixTable is a trio MatrixTable (similar to the result of methods.trio_matrix()). In particular, it should have the following entry schema:
- proband_entry
- father_entry
- mother_entry
And the following column schema:
- proband
- father
- mother
Note
Only the proband_entry, father_entry and mother_entry entry fields are kept; all other entry fields are dropped. The only columns kept are proband, father and mother.
Parameters:
- tm (MatrixTable) – Trio MatrixTable (entries have to be a Struct with proband_entry, mother_entry and father_entry present)
- col_keys (list of str) – Column key(s) for the resulting sample MatrixTable
Returns: MatrixTable – Sample MatrixTable
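The reshaping can be pictured with ordinary Python containers. In the sketch below a trio matrix is modelled as a list of trio columns plus, for each row, a list of per-trio entry dicts; exploding it yields one column and one entry per trio member. This data model and the function name explode_trios are assumptions made for illustration and are unrelated to how Hail stores or transforms a MatrixTable. A sample that belongs to more than one trio would simply appear more than once here; how Hail treats that case is not covered.

```python
ROLES = ("proband", "father", "mother")


def explode_trios(trio_cols, trio_rows):
    """Turn trio-shaped columns and entries into sample-shaped ones.

    trio_cols: list of dicts with 'proband', 'father' and 'mother' column values.
    trio_rows: list of rows; each row is a list (one item per trio) of dicts with
        'proband_entry', 'father_entry' and 'mother_entry'.
    Returns (sample_cols, sample_rows), aligned by position.
    """
    sample_cols = [trio[role] for trio in trio_cols for role in ROLES]
    sample_rows = [
        [entry[role + "_entry"] for entry in row for role in ROLES]
        for row in trio_rows
    ]
    return sample_cols, sample_rows


trio_cols = [{"proband": {"s": "kid"}, "father": {"s": "dad"}, "mother": {"s": "mom"}}]
trio_rows = [[{"proband_entry": "0|1", "father_entry": "0|0", "mother_entry": "1|0"}]]
sample_cols, sample_rows = explode_trios(trio_cols, trio_rows)
print([col["s"] for col in sample_cols], sample_rows)
```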
hail.experimental.load_dataset(name, version, reference_genome, config_file='gs://hail-datasets/datasets.json')
Load a genetic dataset from Hail’s repository.
Example
>>> # Load 1000 Genomes MatrixTable with GRCh38 coordinates
>>> mt_1kg = hl.experimental.load_dataset(name='1000_genomes',
...                                       version='phase3',
...                                       reference_genome='GRCh38')
Parameters:
- name (str) – Name of the dataset to load.
- version (str) – Version of the named dataset to load (see available versions in documentation).
- reference_genome (GRCh37 or GRCh38) – Reference genome build.
Returns: Table or MatrixTable