Genetics¶
Formatting¶
Convert variants in string format to separate locus and allele fields¶
code: | >>> ht = ht.key_by(**hl.parse_variant(ht.variant))
|
---|---|
dependencies: | |
understanding: | If your variants are strings of the format ‘chr:pos:ref:alt’, you may want to convert them to separate locus and allele fields. This is useful if you have imported a table with variants in string format and you would like to join this table with other Hail tables that are keyed by locus and alleles.
|
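To make the string layout concrete, the following plain-Python sketch splits a 'chr:pos:ref:alt' string into a locus and a list of alleles. It only illustrates the format and is not how Hail implements parse_variant; the function name split_variant and the returned dictionary layout are assumptions made for this example, as is the convention that multiple alternate alleles are comma-separated.

```python
def split_variant(variant):
    """Split 'chr:pos:ref:alt' into a locus tuple and a list of alleles."""
    # Split from the right so that a contig name containing ':' stays intact.
    contig, pos, ref, alts = variant.rsplit(':', 3)
    return {
        'locus': (contig, int(pos)),
        # The reference allele comes first, followed by each alternate allele.
        'alleles': [ref] + alts.split(','),
    }


parsed = split_variant('1:100000:A:T,C')
print(parsed['locus'], parsed['alleles'])
```

Keying by both fields, as the Hail snippet does, is what allows the table to be joined with other tables keyed by locus and alleles.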
Filtering and Pruning¶
Filter loci by a list of locus intervals¶
From a table of intervals¶
description: | Import a text file of locus intervals as a table, then use this table to filter the loci in a matrix table. |
---|---|
code: | >>> interval_table = hl.import_locus_intervals('data/gene.interval_list')
>>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))
|
dependencies: | |
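understanding: | We have a matrix table mt containing the loci we would like to filter, and a list of locus intervals stored in a file. We can import the intervals into a table with import_locus_intervals(). Hail supports implicit joins between locus intervals and loci, so we can filter our dataset to the rows defined in the join between the interval table and our matrix table.
The expression interval_table[mt.locus] joins each locus in the matrix table against the table of intervals. The result is a struct expression that is defined if the locus falls in any interval and missing if the locus is outside all intervals. To do our filtering, we can filter to the rows of our matrix table where the struct expression interval_table[mt.locus] is defined.
This method will also work to filter a table of loci, instead of a matrix table. |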
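The defined-or-missing logic of the interval join can be pictured with a small plain-Python sketch. This is not Hail code: the helper in_any_interval, the tuple layout (contig, start, end), and the choice of inclusive bounds are all assumptions made for illustration.

```python
def in_any_interval(locus, intervals):
    """Return True if the locus lies inside at least one interval."""
    contig, pos = locus
    # Assumption for this sketch: both interval endpoints are inclusive.
    return any(c == contig and start <= pos <= end
               for c, start, end in intervals)


intervals = [('1', 100, 200), ('2', 50, 60)]
loci = [('1', 150), ('1', 250), ('2', 50), ('3', 55)]

# Keep only loci for which the "join" against the intervals is defined.
kept = [locus for locus in loci if in_any_interval(locus, intervals)]
print(kept)
```

A linear scan like this is only reasonable for a handful of intervals; it is meant to show the filtering condition, not an efficient lookup.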
From a Python list¶
description: | Filter loci in a matrix table using a list of intervals. Suitable for a small list of intervals. |
---|---|
dependencies: | |
code: | >>> interval_table = hl.import_locus_intervals('data/gene.interval_list')
>>> interval_list = [x.interval for x in interval_table.collect()]
>>> filtered_mt = hl.filter_intervals(mt, interval_list)
|
Pruning Variants in Linkage Disequilibrium¶
tags: | LD Prune |
---|---|
description: | Remove correlated variants from a matrix table. |
code: | >>> biallelic_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> pruned_variant_table = hl.ld_prune(biallelic_mt.GT, r2=0.2, bp_window_size=500000)
>>> filtered_mt = mt.filter_rows(
...     hl.is_defined(pruned_variant_table[mt.row_key]))
|
dependencies: | |
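understanding: | Hail’s ld_prune() method takes the genotype calls of a matrix table and returns a table containing a subset of variants that are uncorrelated with each other. The method requires a biallelic dataset, so we first filter our dataset to biallelic variants. Next, we get a table of independent variants using ld_prune(), which we use to filter the rows of our original dataset. Note that it is more efficient to do the final filtering step on the original dataset, rather than on the biallelic dataset, so that the biallelic dataset does not need to be recomputed. |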
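As a rough illustration of what the r2 and bp_window_size parameters control, the sketch below computes the squared Pearson correlation between alternate-allele counts and applies a naive greedy prune on a single contig. It is deliberately simplistic and is not the algorithm Hail uses; the function names r_squared and greedy_prune and the input layout are assumptions made for this example.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two lists of alt-allele counts."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant is treated as uncorrelated
    return sxy * sxy / (sxx * syy)


def greedy_prune(variants, r2=0.2, bp_window_size=500000):
    """Keep a variant unless it is correlated with a kept variant nearby.

    variants: list of (position, alt_allele_counts), sorted by position,
    all on the same contig.
    """
    kept = []
    for pos, counts in variants:
        if all(pos - kept_pos > bp_window_size
               or r_squared(counts, kept_counts) <= r2
               for kept_pos, kept_counts in kept):
            kept.append((pos, counts))
    return [pos for pos, _ in kept]


variants = [
    (100, [0, 1, 2, 1]),
    (200, [0, 1, 2, 1]),        # identical to the variant at 100
    (300, [1, 0, 1, 2]),        # uncorrelated with the variant at 100
    (1_000_000, [0, 1, 2, 1]),  # correlated, but outside the window
]
print(greedy_prune(variants))
```

The last variant survives even though it matches the first one exactly, because the two are further apart than bp_window_size.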
Analysis¶
Linear Regression¶
Single Phenotype¶
tags: | Linear Regression |
---|---|
description: | Compute linear regression statistics for a single phenotype. |
code: | Approach #1: Use the linear_regression() method
>>> mt_linreg = hl.linear_regression(y=mt.pheno.height,
...                                  x=mt.GT.n_alt_alleles(),
...                                  covariates=[1])
Approach #2: Use the linreg() aggregator
>>> mt_linreg = mt.annotate_rows(linreg=hl.agg.linreg(y=mt.pheno.height,
...                                                   x=[1, mt.GT.n_alt_alleles()]))
|
dependencies: | |
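understanding: | The linear_regression() method is more efficient than the linreg() aggregator. However, the linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. |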
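Both approaches fit, for each variant, a model of the phenotype on an intercept (the constant 1) and the number of alternate alleles. The plain-Python sketch below shows that fit for a single variant using ordinary least squares. It is an illustration only: simple_linreg is a hypothetical helper, it returns just the intercept and slope rather than the full statistics Hail reports, and skipping samples with a missing value is an assumption of this sketch.

```python
def simple_linreg(y, x):
    """Least-squares fit of y = intercept + beta * x, skipping missing pairs."""
    pairs = [(yi, xi) for yi, xi in zip(y, x)
             if yi is not None and xi is not None]
    n = len(pairs)
    mean_y = sum(yi for yi, _ in pairs) / n
    mean_x = sum(xi for _, xi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for _, xi in pairs)
    if sxx == 0:
        return mean_y, None  # no genotype variation: slope is undefined
    sxy = sum((xi - mean_x) * (yi - mean_y) for yi, xi in pairs)
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


height = [1.0, None, 5.0, 3.0]   # phenotype, one value per sample
n_alt = [0, 1, 2, 1]             # alt-allele count, one value per sample
intercept, beta = simple_linreg(height, n_alt)
print(intercept, beta)
```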
Multiple Phenotypes¶
tags: | Linear Regression |
---|---|
description: | Compute linear regression statistics for multiple phenotypes. |
code: | Approach #1: Use the linear_regression() method for all phenotypes simultaneously
>>> mt_linreg = hl.linear_regression(y=[mt.pheno.height, mt.pheno.blood_pressure],
...                                  x=mt.GT.n_alt_alleles(),
...                                  covariates=[1])
Approach #2: Use the linear_regression() method for each phenotype sequentially
>>> mt_linreg = hl.linear_regression(y=mt.pheno.height,
...                                  x=mt.GT.n_alt_alleles(),
...                                  covariates=[1])
>>> mt_linreg = hl.linear_regression(y=mt_linreg.pheno.blood_pressure,
...                                  x=mt_linreg.GT.n_alt_alleles(),
...                                  covariates=[1])
Approach #3: Use the linreg() aggregator
>>> mt_linreg = mt.annotate_rows(
...     linreg_height=hl.agg.linreg(y=mt.pheno.height,
...                                 x=[1, mt.GT.n_alt_alleles()]),
...     linreg_bp=hl.agg.linreg(y=mt.pheno.blood_pressure,
...                             x=[1, mt.GT.n_alt_alleles()]))
|
dependencies: | |
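understanding: | The linear_regression() method is more efficient than the linreg() aggregator, especially when analyzing many phenotypes. However, the linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. The linear_regression() method drops samples that have a missing value for any of the phenotypes, so Approach #1 may not be suitable for phenotypes with differing patterns of missingness. Approach #2 makes two passes over the data, while Approaches #1 and #3 make one pass and compute the regression statistics for each phenotype simultaneously. |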
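When several phenotypes are fitted in one joint call, a sample can only contribute if every phenotype is defined for it. The sketch below illustrates that complete-case selection in plain Python; complete_case_indices is a hypothetical helper, and representing missing values as None is an assumption of this example.

```python
def complete_case_indices(phenotypes):
    """Indices of samples with a non-missing value for every phenotype."""
    columns = list(phenotypes.values())
    n_samples = len(columns[0])
    return [i for i in range(n_samples)
            if all(col[i] is not None for col in columns)]


height = [170.0, None, 165.0, 180.0]
blood_pressure = [120.0, 110.0, None, 130.0]

# Joint fit: only samples with both phenotypes defined are usable.
joint = complete_case_indices({'height': height,
                               'blood_pressure': blood_pressure})
# Separate fits: each phenotype keeps its own set of samples.
height_only = complete_case_indices({'height': height})
bp_only = complete_case_indices({'blood_pressure': blood_pressure})
print(joint, height_only, bp_only)
```

In this toy data the joint fit uses two samples while each separate fit uses three, which is why the one-call and per-phenotype approaches can give different statistics when missingness differs between phenotypes.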
Stratified by Group¶
tags: | Linear Regression |
---|---|
description: | Compute linear regression statistics for a single phenotype stratified by group. |
code: | Approach #1: Use the linear_regression() method for each group
>>> female_pheno = (hl.case()
...                 .when(mt.pheno.is_female, mt.pheno.height)
...                 .or_missing())
>>> mt_linreg = hl.linear_regression(y=female_pheno,
...                                  x=mt.GT.n_alt_alleles(),
...                                  covariates=[1],
...                                  root='linreg_female')
>>> male_pheno = (hl.case()
...               .when(~mt_linreg.pheno.is_female, mt_linreg.pheno.height)
...               .or_missing())
>>> mt_linreg = hl.linear_regression(y=male_pheno,
...                                  x=mt_linreg.GT.n_alt_alleles(),
...                                  covariates=[1],
...                                  root='linreg_male')
Approach #2: Use the group_by() and linreg() aggregators
>>> mt_linreg = mt.annotate_rows(
...     linreg=hl.agg.group_by(mt.pheno.is_female,
...                            hl.agg.linreg(y=mt.pheno.height,
...                                          x=[1, mt.GT.n_alt_alleles()])))
|
dependencies: |
|
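understanding: | We have presented two ways to compute linear regression statistics for each value of a grouping
variable. The first approach utilizes the linear_regression() method, which must be called separately for each group even though it can compute statistics for multiple phenotypes simultaneously. This is because linear_regression() drops samples that have a missing value for any of the phenotypes; when the groups are mutually exclusive, such as male and female, no samples would remain. Note that male_pheno is built from mt_linreg rather than mt, because the second call needs an expression on the matrix table returned by the first call. The root argument must also differ between the two calls, otherwise the male output would overwrite the female output.
The second approach uses the group_by() and linreg() aggregators. The aggregation produces a dictionary whose keys are the values of the grouping variable and whose values are the linear regression statistics for the samples in that group; this dictionary is then used to annotate the rows of the matrix table.
The linear_regression() method is more efficient than the linreg() aggregator and can be extended to multiple phenotypes, but the linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. |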
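The shape of the grouped result can be sketched in plain Python: samples are bucketed by the grouping variable, and one regression is fitted per bucket, giving a dictionary keyed by group. This is only an illustration for a single variant; slope and grouped_slopes are hypothetical helpers, and the sketch returns just the slope rather than the full set of statistics.

```python
from collections import defaultdict


def slope(y, x):
    """Least-squares slope of y on x with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None  # no variation in x: slope is undefined
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def grouped_slopes(groups, y, x):
    """Fit one regression per group and return {group: slope}."""
    buckets = defaultdict(lambda: ([], []))
    for group, yi, xi in zip(groups, y, x):
        if yi is None or xi is None:
            continue  # skip samples with a missing phenotype or genotype
        buckets[group][0].append(yi)
        buckets[group][1].append(xi)
    return {group: slope(ys, xs) for group, (ys, xs) in buckets.items()}


is_female = [True, True, True, False, False, False]
height = [1.0, 3.0, 5.0, 10.0, 9.0, 8.0]
n_alt = [0, 1, 2, 0, 1, 2]
print(grouped_slopes(is_female, height, n_alt))
```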
PLINK Conversions¶
Polygenic Risk Score Calculation¶
plink: | plink --bfile data --score scores.txt sum
|
---|---|
tags: | PRS |
description: | This command is analogous to PLINK’s --score command with the sum option. Biallelic variants are required. |
code: | >>> mt = hl.import_plink(
...     bed="data/ldsc.bed", bim="data/ldsc.bim", fam="data/ldsc.fam",
...     quant_pheno=True, missing='-9')
>>> mt = hl.variant_qc(mt)
>>> scores = hl.import_table('data/scores.txt', delimiter=' ', key='rsid',
...                          types={'score': hl.tfloat32})
>>> mt = mt.annotate_rows(**scores[mt.rsid])
>>> flip = hl.case().when(mt.allele == mt.alleles[0], True).when(
...     mt.allele == mt.alleles[1], False).or_missing()
>>> mt = mt.annotate_rows(flip=flip)
>>> mt = mt.annotate_rows(
...     prior=2 * hl.cond(mt.flip, mt.variant_qc.AF[0], mt.variant_qc.AF[1]))
>>> mt = mt.annotate_cols(
...     prs=hl.agg.sum(
...         mt.score * hl.coalesce(
...             hl.cond(mt.flip, 2 - mt.GT.n_alt_alleles(),
...                     mt.GT.n_alt_alleles()), mt.prior)))
|
dependencies: |
|
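The arithmetic behind the Hail code above can be followed in a plain-Python sketch for one sample. It mirrors the same steps: count the scored allele (flipping when the scored allele is the reference), substitute twice the allele frequency when the genotype is missing, and sum score times dosage. The function polygenic_score and the dictionary layout of each variant are assumptions made for this illustration, and variants whose scored allele matches neither allele are simply skipped.

```python
def polygenic_score(n_alt_calls, variants):
    """Sum of score * dosage of the scored allele for one sample.

    n_alt_calls: alt-allele count per variant (None if the call is missing).
    variants: dicts with 'score', 'flip' (True if the scored allele is the
    reference allele, None if it matches neither allele) and 'af', a
    [ref_frequency, alt_frequency] pair.
    """
    total = 0.0
    for n_alt, variant in zip(n_alt_calls, variants):
        flip = variant['flip']
        if flip is None or variant['score'] is None:
            continue  # nothing to add for an unmatched or unscored variant
        ref_af, alt_af = variant['af']
        # Expected dosage of the scored allele, used when the call is missing.
        prior = 2 * (ref_af if flip else alt_af)
        if n_alt is None:
            dosage = prior
        else:
            dosage = 2 - n_alt if flip else n_alt
        total += variant['score'] * dosage
    return total


variants = [
    {'score': 0.5, 'flip': False, 'af': [0.75, 0.25]},
    {'score': -1.0, 'flip': True, 'af': [0.9, 0.1]},
    {'score': 3.0, 'flip': None, 'af': [0.5, 0.5]},
]
print(polygenic_score([1, 2, 1], variants))
print(polygenic_score([None, 0, 1], variants))
```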