Genetics functions
locus (contig, pos, reference_genome, …) |
Construct a locus expression from a chromosome and position. |
locus_from_global_position (global_pos, …) |
Constructs a locus expression from a global position and a reference genome. |
locus_interval (contig, start, end[, …]) |
Construct a locus interval expression. |
parse_locus (s, reference_genome, …) |
Construct a locus expression by parsing a string or string expression. |
parse_variant (s, reference_genome, …) |
Construct a struct with a locus and alleles by parsing a string. |
parse_locus_interval (s, reference_genome, …) |
Construct a locus interval expression by parsing a string or string expression. |
call (*alleles[, phased]) |
Construct a call expression. |
unphased_diploid_gt_index_call (gt_index) |
Construct an unphased, diploid call from a genotype index. |
parse_call (s) |
Construct a call expression by parsing a string or string expression. |
downcode (c, i) |
Create a new call by setting all alleles other than i to ref. |
triangle (n) |
Returns the triangle number of n. |
is_snp (ref, alt) |
Returns True if the alleles constitute a single nucleotide polymorphism. |
is_mnp (ref, alt) |
Returns True if the alleles constitute a multiple nucleotide polymorphism. |
is_transition (ref, alt) |
Returns True if the alleles constitute a transition. |
is_transversion (ref, alt) |
Returns True if the alleles constitute a transversion. |
is_insertion (ref, alt) |
Returns True if the alleles constitute an insertion. |
is_deletion (ref, alt) |
Returns True if the alleles constitute a deletion. |
is_indel (ref, alt) |
Returns True if the alleles constitute an insertion or deletion. |
is_star (ref, alt) |
Returns True if the alleles constitute an upstream deletion. |
is_complex (ref, alt) |
Returns True if the alleles constitute a complex polymorphism. |
is_strand_ambiguous (ref, alt) |
Returns True if the alleles are strand ambiguous. |
is_valid_contig (contig[, reference_genome]) |
Returns True if contig is a valid contig name in reference_genome. |
is_valid_locus (contig, position[, …]) |
Returns True if contig and position form a valid site in reference_genome. |
allele_type (ref, alt) |
Returns the type of the polymorphism as a string. |
pl_dosage (pl) |
Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. |
gp_dosage (gp) |
Return expected genotype dosage from array of genotype probabilities. |
get_sequence (contig, position[, before, …]) |
Return the reference sequence at a given locus. |
mendel_error_code (locus, is_female, father, …) |
Compute a Mendelian violation code for genotypes. |
liftover (x, dest_reference_genome[, min_match]) |
Lift over coordinates to a different reference genome. |
min_rep (locus, alleles) |
Computes the minimal representation of a (locus, alleles) polymorphism. |
-
hail.expr.functions.locus(contig, pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression
Construct a locus expression from a chromosome and position.
Examples
>>> hl.eval(hl.locus("1", 10000))
Locus(contig=1, position=10000, reference_genome=GRCh37)
Parameters:
- contig (str or StringExpression) – Chromosome.
- pos (int or Expression of type tint32) – Base position along the chromosome.
- reference_genome (str or ReferenceGenome) – Reference genome to use.
Returns: LocusExpression
-
hail.expr.functions.locus_from_global_position(global_pos, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression
Constructs a locus expression from a global position and a reference genome. The inverse of LocusExpression.global_position().
Examples
>>> hl.eval(hl.locus_from_global_position(0))
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054))
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> hl.eval(hl.locus_from_global_position(2824183054, 'GRCh38'))
Locus(contig=chr22, position=1, reference_genome=GRCh38)
Parameters:
- global_pos (int or Expression of type tint64) – Global base position along the reference genome.
- reference_genome (str or ReferenceGenome) – Reference genome to use for converting the global position to a contig and local position.
Returns: LocusExpression
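The mapping between a global position and a (contig, position) pair can be pictured as a lookup into cumulative contig lengths. The following is a minimal pure-Python sketch, not Hail's implementation. The contig names and lengths in TOY_CONTIGS are made up for illustration, and the sketch assumes a 0-based global position and a 1-based local position, which agrees with the first example above (global position 0 maps to position 1).

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy reference: these names and lengths are NOT real genome data.
TOY_CONTIGS = [("1", 1000), ("2", 800), ("3", 500)]

NAMES = [name for name, _ in TOY_CONTIGS]
LENGTHS = [length for _, length in TOY_CONTIGS]
# Global (0-based) offset at which each contig starts: [0, 1000, 1800].
STARTS = [0] + list(accumulate(LENGTHS))[:-1]
TOTAL = sum(LENGTHS)


def global_position(contig, pos):
    """Map a 1-based (contig, pos) pair to a 0-based global position."""
    idx = NAMES.index(contig)
    return STARTS[idx] + (pos - 1)


def locus_from_global_position(global_pos):
    """Inverse mapping: 0-based global position to 1-based (contig, pos)."""
    if not 0 <= global_pos < TOTAL:
        raise ValueError(f"global position {global_pos} is out of range")
    # Find the last contig whose start offset is <= global_pos.
    idx = bisect_right(STARTS, global_pos) - 1
    return NAMES[idx], global_pos - STARTS[idx] + 1
```

With this toy reference, locus_from_global_position(1000) lands on the first base of contig "2", and global_position undoes the mapping.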
-
hail.expr.functions.locus_interval(contig, start, end, includes_start=True, includes_end=False, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.IntervalExpression
Construct a locus interval expression.
Examples
>>> hl.eval(hl.locus_interval("1", 100, 1000))
Interval(start=Locus(contig=1, position=100, reference_genome=GRCh37), end=Locus(contig=1, position=1000, reference_genome=GRCh37))
Parameters:
- contig (StringExpression) – Contig name.
- start (Int32Expression) – Starting base position.
- end (Int32Expression) – End base position.
- includes_start (BooleanExpression) – If True, interval includes start point.
- includes_end (BooleanExpression) – If True, interval includes end point.
- reference_genome (str or hail.genetics.ReferenceGenome) – Reference genome to use.
Returns: IntervalExpression
-
hail.expr.functions.parse_locus(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.LocusExpression
Construct a locus expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_locus("1:10000"))
Locus(contig=1, position=10000, reference_genome=GRCh37)
Notes
This method expects strings of the form contig:position, e.g. 16:29500000 or X:123456.
Parameters:
- s (str or StringExpression) – String to parse.
- reference_genome (str or ReferenceGenome) – Reference genome to use.
Returns: LocusExpression
-
hail.expr.functions.parse_variant(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.StructExpression
Construct a struct with a locus and alleles by parsing a string.
Examples
>>> hl.eval(hl.parse_variant('1:100000:A:T,C'))
Struct(locus=Locus('1', 100000), alleles=['A', 'T', 'C'])
Notes
This method returns an expression of type tstruct with the following fields: locus and alleles.
Parameters:
- s (StringExpression) – String to parse.
- reference_genome (str or ReferenceGenome) – Reference genome to use.
Returns: StructExpression – Struct with fields locus and alleles.
-
hail.expr.functions.parse_locus_interval(s, reference_genome: Union[str, hail.genetics.reference_genome.ReferenceGenome] = 'default') → hail.expr.expressions.typed_expressions.IntervalExpression
Construct a locus interval expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_locus_interval('1:1000-2000'))
Interval(start=Locus(contig=1, position=1000, reference_genome=GRCh37), end=Locus(contig=1, position=2000, reference_genome=GRCh37))
>>> hl.eval(hl.parse_locus_interval('1:start-10M'))
Interval(start=Locus(contig=1, position=0, reference_genome=GRCh37), end=Locus(contig=1, position=10000000, reference_genome=GRCh37))
Notes
The start locus must precede the end locus. The default bounds of the interval are left-inclusive and right-exclusive. To change this, add one of [ or ( at the beginning of the string for left-inclusive or left-exclusive respectively. Likewise, add one of ] or ) at the end of the string for right-inclusive or right-exclusive respectively.
There are several acceptable representations for s.
CHR1:POS1-CHR2:POS2 is the fully specified representation, and we use this to define the various shortcut representations.
- In a POS field, start (Start, START) stands for 1.
- In a POS field, end (End, END) stands for the contig length.
- In a POS field, the qualifiers m (M) and k (K) multiply the given number by 1,000,000 and 1,000, respectively. 1.6K is short for 1600, and 29M is short for 29000000.
- CHR:POS1-POS2 stands for CHR:POS1-CHR:POS2
- CHR1-CHR2 stands for CHR1:START-CHR2:END
- CHR stands for CHR:START-CHR:END
Note
The bounds of the interval must be valid loci for the reference genome (the contig is in the reference genome and the position is within the range [1, END]), with two exceptions. A position of 0 with a left-exclusive bound is normalized to position 1 with a left-inclusive bound. Likewise, a position of END + 1 with a right-exclusive bound is normalized to END with a right-inclusive bound.
Parameters:
- s (str or StringExpression) – String to parse.
- reference_genome (str or hail.genetics.ReferenceGenome) – Reference genome to use.
Returns: IntervalExpression
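The POS shorthand rules in the notes above (start, end, and the k/m qualifiers) can be illustrated with a small pure-Python helper. This is only a sketch under stated assumptions, not Hail's parser: parse_pos and its contig_length argument are hypothetical names introduced here, and the helper handles a single POS field rather than a full interval string.

```python
from decimal import Decimal


def parse_pos(token, contig_length):
    """Resolve one POS field to an integer position (illustrative only).

    `contig_length` is supplied by the caller; a real parser would look it
    up in the reference genome.
    """
    if token in ("start", "Start", "START"):
        return 1
    if token in ("end", "End", "END"):
        return contig_length
    multiplier = 1
    if token[-1] in "mM":
        multiplier, token = 1_000_000, token[:-1]
    elif token[-1] in "kK":
        multiplier, token = 1_000, token[:-1]
    # Decimal avoids floating-point surprises for inputs such as "1.6K".
    return int(Decimal(token) * multiplier)
```

For example, parse_pos("1.6K", 0) evaluates to 1600 and parse_pos("29M", 0) to 29000000, matching the shorthand described in the notes.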
-
hail.expr.functions.call(*alleles, phased=False) → hail.expr.expressions.typed_expressions.CallExpression
Construct a call expression.
Examples
>>> hl.eval(hl.call(1, 0))
Call(alleles=[1, 0], phased=False)
Parameters:
- alleles (variable-length args of int or Expression of type tint32) – List of allele indices.
- phased (bool) – If True, preserve the order of alleles.
Returns: CallExpression
-
hail.expr.functions.unphased_diploid_gt_index_call(gt_index) → hail.expr.expressions.typed_expressions.CallExpression
Construct an unphased, diploid call from a genotype index.
Examples
>>> hl.eval(hl.unphased_diploid_gt_index_call(4))
Call(alleles=[1, 2], phased=False)
Parameters: gt_index (int or Expression of type tint32) – Unphased, diploid genotype index.
Returns: CallExpression
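To make the genotype index concrete, the sketch below converts between an index and an unphased allele pair in pure Python. It assumes the VCF-style ordering in which the pair (j, k) with j <= k has index k * (k + 1) / 2 + j; that assumption is consistent with the example above (index 4 gives alleles [1, 2]) but the code is an illustration, not Hail's implementation. The helper names are hypothetical.

```python
def triangle(n):
    """Triangle number of n: n * (n + 1) / 2."""
    return n * (n + 1) // 2


def alleles_to_gt_index(j, k):
    """Index of the unphased diploid genotype with alleles j and k."""
    j, k = sorted((j, k))
    return triangle(k) + j


def gt_index_to_alleles(gt_index):
    """Inverse of alleles_to_gt_index; returns (j, k) with j <= k."""
    if gt_index < 0:
        raise ValueError("genotype index must be non-negative")
    k = 0
    # Advance k while the next triangle number still fits in gt_index.
    while triangle(k + 1) <= gt_index:
        k += 1
    return gt_index - triangle(k), k
```

Under this ordering the first six indices enumerate 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.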
-
hail.expr.functions.parse_call(s) → hail.expr.expressions.typed_expressions.CallExpression
Construct a call expression by parsing a string or string expression.
Examples
>>> hl.eval(hl.parse_call('0|2'))
Call([0, 2], phased=True)
Notes
This method expects strings in the following format:
ploidy  Phased         Unphased
0       |-             -
1       |i             i
2       i|j            i/j
3       i|j|k          i/j/k
N       i|j|k|...|N    i/j/k/.../N
Parameters: s (str or StringExpression) – String to parse.
Returns: CallExpression
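The string format in the table above can be read as "a leading or separating | marks a phased call, / an unphased one". A minimal pure-Python sketch of that reading follows. It is not Hail's parser; parse_call_string is a hypothetical helper that returns a plain (alleles, phased) tuple, and it does no validation beyond what int() provides.

```python
def parse_call_string(s):
    """Parse a call string into (alleles, phased), following the table above."""
    # Ploidy 0: "-" is unphased, "|-" is phased.
    if s in ("-", "|-"):
        return [], s.startswith("|")
    if "|" in s:
        # Ploidy 1 phased calls are written with a leading bar, e.g. "|1".
        body = s[1:] if s.startswith("|") else s
        return [int(a) for a in body.split("|")], True
    # No bar at all: unphased, alleles separated by "/" (or a single allele).
    return [int(a) for a in s.split("/")], False
```

A string that mixes the two separators, such as "0|1/2", raises ValueError in this sketch because "1/2" is not an integer.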
-
hail.expr.functions.downcode(c, i) → hail.expr.expressions.typed_expressions.CallExpression
Create a new call by setting all alleles other than i to ref.
Examples
Preserve the third allele and downcode all other alleles to reference.
>>> hl.eval(hl.downcode(hl.call(1, 2), 2))
Call(alleles=[0, 2], phased=False)
Parameters:
- c (CallExpression) – A call.
- i (Expression of type tint32) – The index of the allele that will be sent to the alternate allele. All other alleles will be downcoded to reference.
Returns: CallExpression
-
hail.expr.functions.triangle(n) → hail.expr.expressions.typed_expressions.Int32Expression
Returns the triangle number of n.
Examples
>>> hl.eval(hl.triangle(3))
6
Notes
The calculation is n * (n + 1) / 2.
Parameters: n (Expression of type tint32)
Returns: Expression of type tint32
-
hail.expr.functions.is_snp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a single nucleotide polymorphism.
Examples
>>> hl.eval(hl.is_snp('A', 'T'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_mnp(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a multiple nucleotide polymorphism.
Examples
>>> hl.eval(hl.is_mnp('AA', 'GT'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_transition(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a transition.
Examples
>>> hl.eval(hl.is_transition('A', 'T'))
False
>>> hl.eval(hl.is_transition('AAA', 'AGA'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_transversion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a transversion.
Examples
>>> hl.eval(hl.is_transversion('A', 'T'))
True
>>> hl.eval(hl.is_transversion('AAA', 'AGA'))
False
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
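The transition and transversion examples can be reproduced with a few lines of plain Python. The sketch below is an illustration, not Hail's implementation: it assumes both alleles contain only A, C, G and T, treats a pair as a single-base substitution when the alleles have equal length and differ at exactly one position, and calls the change a transition when both bases are purines (A, G) or both are pyrimidines (C, T). The helper names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def single_mismatch(ref, alt):
    """Return the (ref_base, alt_base) pair if exactly one position differs."""
    if len(ref) != len(alt):
        return None
    diffs = [(r, a) for r, a in zip(ref, alt) if r != a]
    return diffs[0] if len(diffs) == 1 else None


def is_transition(ref, alt):
    """Purine <-> purine or pyrimidine <-> pyrimidine substitution."""
    pair = single_mismatch(ref, alt)
    if pair is None:
        return False
    return set(pair) <= PURINES or set(pair) <= PYRIMIDINES


def is_transversion(ref, alt):
    """Single-base substitution that is not a transition."""
    return single_mismatch(ref, alt) is not None and not is_transition(ref, alt)
```

In the 'AAA' to 'AGA' case the only mismatch is A to G, both purines, so the sketch reports a transition, in line with the documented output.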
-
hail.expr.functions.is_insertion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute an insertion.
Examples
>>> hl.eval(hl.is_insertion('A', 'ATT'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_deletion(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a deletion.
Examples
>>> hl.eval(hl.is_deletion('ATT', 'A'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_indel(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute an insertion or deletion.
Examples
>>> hl.eval(hl.is_indel('ATT', 'A'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_star(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute an upstream deletion.
Examples
>>> hl.eval(hl.is_star('A', '*'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_complex(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles constitute a complex polymorphism.
Examples
>>> hl.eval(hl.is_complex('ATT', 'GCAC'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
-
hail.expr.functions.is_strand_ambiguous(ref, alt) → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if the alleles are strand ambiguous.
Strand ambiguous allele pairs are A/T, T/A, C/G, and G/C where the first allele is ref and the second allele is alt.
Examples
>>> hl.eval(hl.is_strand_ambiguous('A', 'T'))
True
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: BooleanExpression
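The four strand-ambiguous pairs listed above are exactly the single-base pairs in which alt is the Watson-Crick complement of ref, so the same pair of bases appears (with roles swapped) when the site is read from the opposite strand. A minimal pure-Python sketch of that check, offered as an illustration rather than Hail's implementation:

```python
# Watson-Crick complements of the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref, alt):
    """True when alt is the complement of a single-base ref allele."""
    # Alleles that are not a single A/C/G/T base never match and give False.
    return COMPLEMENT.get(ref) == alt
```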
-
hail.expr.functions.is_valid_contig(contig, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if contig is a valid contig name in reference_genome.
Examples
>>> hl.eval(hl.is_valid_contig('1', 'GRCh37'))
True
>>> hl.eval(hl.is_valid_contig('chr1', 'GRCh37'))
False
Parameters:
- contig (Expression of type tstr)
- reference_genome (str or ReferenceGenome)
Returns: BooleanExpression
-
hail.expr.functions.is_valid_locus(contig, position, reference_genome='default') → hail.expr.expressions.typed_expressions.BooleanExpression
Returns True if contig and position form a valid site in reference_genome.
Examples
>>> hl.eval(hl.is_valid_locus('1', 324254, 'GRCh37'))
True
>>> hl.eval(hl.is_valid_locus('chr1', 324254, 'GRCh37'))
False
Parameters:
- contig (Expression of type tstr)
- position (Expression of type tint)
- reference_genome (str or ReferenceGenome)
Returns: BooleanExpression
-
hail.expr.functions.allele_type(ref, alt) → hail.expr.expressions.typed_expressions.StringExpression
Returns the type of the polymorphism as a string.
Examples
>>> hl.eval(hl.allele_type('A', 'T'))
'SNP'
>>> hl.eval(hl.allele_type('ATT', 'A'))
'Deletion'
Notes
The possible return values are:
- "SNP"
- "MNP"
- "Insertion"
- "Deletion"
- "Complex"
- "Star"
- "Symbolic"
- "Unknown"
Parameters:
- ref (StringExpression) – Reference allele.
- alt (StringExpression) – Alternate allele.
Returns: StringExpression
-
hail.expr.functions.pl_dosage(pl) → hail.expr.expressions.typed_expressions.Float64Expression
Return expected genotype dosage from array of Phred-scaled genotype likelihoods with uniform prior. Only defined for bi-allelic variants. The pl argument must be length 3.
For a PL array [a, b, c], let:
\[a^\prime = 10^{-a/10}, \quad b^\prime = 10^{-b/10}, \quad c^\prime = 10^{-c/10}\]
The genotype dosage is given by:
\[\frac{b^\prime + 2 c^\prime}{a^\prime + b^\prime + c^\prime}\]
Examples
>>> hl.eval(hl.pl_dosage([5, 10, 100]))
0.24025307377482674
Parameters: pl (ArrayNumericExpression of type tint32) – Length 3 array of bi-allelic Phred-scaled genotype likelihoods.
Returns: Expression of type tfloat64
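The dosage formula above can be checked by hand with a direct pure-Python transcription. This is a sketch of the arithmetic only, not Hail's implementation; pl_dosage_py is a hypothetical name chosen to avoid confusion with hl.pl_dosage.

```python
def pl_dosage_py(pl):
    """Expected dosage from a length-3 Phred-scaled likelihood array."""
    if len(pl) != 3:
        raise ValueError("pl must have exactly three entries (bi-allelic)")
    # Convert each Phred-scaled value x to a linear-scale likelihood 10^(-x/10).
    a, b, c = (10 ** (-x / 10) for x in pl)
    # Weight the het likelihood by 1 and the hom-var likelihood by 2.
    return (b + 2 * c) / (a + b + c)
```

For the documented input [5, 10, 100], the three linear-scale values are roughly 0.316, 0.1 and 1e-10, so the dosage is close to 0.1 / 0.416, in agreement with the example output.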
-
hail.expr.functions.gp_dosage(gp) → hail.expr.expressions.typed_expressions.Float64Expression
Return expected genotype dosage from array of genotype probabilities.
Examples
>>> hl.eval(hl.gp_dosage([0.0, 0.5, 0.5]))
1.5
Notes
This function is only defined for bi-allelic variants. The gp argument must be length 3. The value is gp[1] + 2 * gp[2].
Parameters: gp (ArrayFloat64Expression) – Length 3 array of bi-allelic genotype probabilities.
Returns: Expression of type tfloat64
-
hail.expr.functions.get_sequence(contig, position, before=0, after=0, reference_genome='default') → hail.expr.expressions.typed_expressions.StringExpression
Return the reference sequence at a given locus.
Examples
Return the reference allele for 'GRCh37' at the locus '1:45323':
>>> hl.eval(hl.get_sequence('1', 45323, reference_genome='GRCh37'))
"T"
Notes
This function requires that the reference genome has an attached reference sequence. Use ReferenceGenome.add_sequence() to load and attach a reference sequence to a reference genome.
Returns None if contig and position are not valid coordinates in reference_genome.
Parameters:
- contig (Expression of type tstr) – Locus contig.
- position (Expression of type tint32) – Locus position.
- before (Expression of type tint32, optional) – Number of bases to include before the locus of interest. Truncates at contig boundary.
- after (Expression of type tint32, optional) – Number of bases to include after the locus of interest. Truncates at contig boundary.
- reference_genome (str or ReferenceGenome) – Reference genome to use. Must have a reference sequence available.
Returns: StringExpression
-
hail.expr.functions.mendel_error_code(locus, is_female, father, mother, child)
Compute a Mendelian violation code for genotypes.
>>> father = hl.call(0, 0)
>>> mother = hl.call(1, 1)
>>> child1 = hl.call(0, 1)  # consistent
>>> child2 = hl.call(0, 0)  # Mendel error
>>> locus = hl.locus('2', 2000000)
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child1))
None
>>> hl.eval(hl.mendel_error_code(locus, True, father, mother, child2))
7
Note
Ignores call phasing, and assumes diploid and biallelic. Haploid calls for hemiploid samples on sex chromosomes are also acceptable input.
Notes
In the table below, the copy state of a locus with respect to a trio is defined as follows, where PAR is the pseudoautosomal region of X and Y defined by the reference genome and the autosome is defined by LocusExpression.in_autosome():
- Auto – in autosome or in PAR, or in non-PAR of X and female child
- HemiX – in non-PAR of X and male child
- HemiY – in non-PAR of Y and male child
Any refers to the set { HomRef, Het, HomVar, NoCall } and ~ denotes complement in this set.
Code  Dad      Mom      Kid     Copy State  Implicated
1     HomVar   HomVar   Het     Auto        Dad, Mom, Kid
2     HomRef   HomRef   Het     Auto        Dad, Mom, Kid
3     HomRef   ~HomRef  HomVar  Auto        Dad, Kid
4     ~HomRef  HomRef   HomVar  Auto        Mom, Kid
5     HomRef   HomRef   HomVar  Auto        Kid
6     HomVar   ~HomVar  HomRef  Auto        Dad, Kid
7     ~HomVar  HomVar   HomRef  Auto        Mom, Kid
8     HomVar   HomVar   HomRef  Auto        Kid
9     Any      HomVar   HomRef  HemiX       Mom, Kid
10    Any      HomRef   HomVar  HemiX       Mom, Kid
11    HomVar   Any      HomRef  HemiY       Dad, Kid
12    HomRef   Any      HomVar  HemiY       Dad, Kid
Parameters:
- locus (LocusExpression)
- is_female (BooleanExpression)
- father (CallExpression)
- mother (CallExpression)
- child (CallExpression)
Returns:
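The Auto rows of the table above (codes 1 through 8) can be read as a small decision procedure on the three genotype classes. The pure-Python sketch below transcribes only those rows, as an illustration and not as Hail's implementation: it takes genotype class names as plain strings, ignores the locus and child sex, and does not cover the HemiX and HemiY rows. The function name is hypothetical.

```python
HOM_REF, HET, HOM_VAR, NO_CALL = "HomRef", "Het", "HomVar", "NoCall"


def autosomal_mendel_code(dad, mom, kid):
    """Return the table's code (1-8) for the Auto copy state, or None."""
    if kid == HET:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 1
        if dad == HOM_REF and mom == HOM_REF:
            return 2
    elif kid == HOM_VAR:
        if dad == HOM_REF and mom == HOM_REF:
            return 5
        if dad == HOM_REF:  # mom is ~HomRef here
            return 3
        if mom == HOM_REF:  # dad is ~HomRef here
            return 4
    elif kid == HOM_REF:
        if dad == HOM_VAR and mom == HOM_VAR:
            return 8
        if dad == HOM_VAR:  # mom is ~HomVar here
            return 6
        if mom == HOM_VAR:  # dad is ~HomVar here
            return 7
    # No row of the Auto part of the table matches: no violation reported.
    return None
```

With a HomRef father and HomVar mother, a Het child matches no row while a HomRef child matches code 7, mirroring the two documented examples.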
-
hail.expr.functions.liftover(x, dest_reference_genome, min_match=0.95)
Lift over coordinates to a different reference genome.
Examples
Lift over the locus coordinates from reference genome 'GRCh37' to 'GRCh38':
>>> hl.eval(hl.liftover(hl.locus('1', 1034245, 'GRCh37'), 'GRCh38'))
Locus(contig='chr1', position=1098865, reference_genome='GRCh38')
Lift over the locus interval coordinates from reference genome 'GRCh37' to 'GRCh38':
>>> hl.eval(hl.liftover(hl.locus_interval('20', 60001, 82456, True, True, 'GRCh37'), 'GRCh38'))
Interval(Locus(contig='chr20', position=79360, reference_genome='GRCh38'), Locus(contig='chr20', position=101815, reference_genome='GRCh38'), True, True)
Notes
This function requires that the reference genome of x has a chain file loaded for dest_reference_genome. Use ReferenceGenome.add_liftover() to load and attach a chain file to a reference genome.
Returns None if x could not be converted.
Warning
Before using the result of liftover() as a new row key or column key, be sure to filter out missing values.
Parameters:
- x (Expression of type tlocus or tinterval of tlocus) – Locus or locus interval to lift over.
- dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
- min_match (Expression of type tfloat64) – Minimum ratio of bases that must remap.
Returns: Expression – A locus or locus interval converted to dest_reference_genome.
-
hail.expr.functions.min_rep(locus, alleles)
Computes the minimal representation of a (locus, alleles) polymorphism.
Examples
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['TAA', 'TA']))
Struct(locus=Locus(contig=1, position=100000, reference_genome=GRCh37), alleles=['TA', 'T'])
>>> hl.eval(hl.min_rep(hl.locus('1', 100000), ['AATAA', 'AACAA']))
Struct(locus=Locus(contig=1, position=100002, reference_genome=GRCh37), alleles=['T', 'C'])
Notes
Computing the minimal representation can cause the locus to shift right (the position can increase).
Parameters:
- locus (LocusExpression)
- alleles (ArrayExpression of type tstr)
Returns: StructExpression – A tstruct expression with two fields, locus (LocusExpression) and alleles (ArrayExpression of type tstr).