Plotting Tutorial¶
The Hail plot module allows for easy plotting of data. This notebook contains examples of how to use the plotting functions in this module, many of which can also be found in the first tutorial.
In [1]:
import hail as hl
hl.init()
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()
Running on Apache Spark version 2.2.0
SparkUI available at http://10.56.40.9:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  /  _ `/ / /
  /_/ /_/\_,_/_/_/   version devel-90a5cab4aab8
NOTE: This is a beta version. Interfaces may change
during the beta period. We recommend pulling
the latest changes weekly.
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20181015-1344-devel-90a5cab4aab8.log
In [2]:
hl.utils.get_1kg('data/')
mt = hl.read_matrix_table('data/1kg.mt')
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
         .key_by('Sample'))
mt = mt.annotate_cols(**table[mt.s])
mt = hl.sample_qc(mt)
mt.describe()
2018-10-15 13:45:01 Hail: INFO: 1KG files found
2018-10-15 13:45:02 Hail: INFO: Reading table to impute column types
2018-10-15 13:45:02 Hail: INFO: Finished type imputation
Loading column 'Sample' as type 'str' (imputed)
Loading column 'Population' as type 'str' (imputed)
Loading column 'SuperPopulation' as type 'str' (imputed)
Loading column 'isFemale' as type 'bool' (imputed)
Loading column 'PurpleHair' as type 'bool' (imputed)
Loading column 'CaffeineConsumption' as type 'float64' (imputed)
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'Population': str
    'SuperPopulation': str
    'isFemale': bool
    'PurpleHair': bool
    'CaffeineConsumption': float64
    'sample_qc': struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        },
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_hom_ref: int64,
        n_het: int64,
        n_hom_var: int64,
        n_non_ref: int64,
        n_singleton: int64,
        n_snp: int64,
        n_insertion: int64,
        n_deletion: int64,
        n_transition: int64,
        n_transversion: int64,
        n_star: int64,
        r_ti_tv: float64,
        r_het_hom_var: float64,
        r_insertion_deletion: float64
    }
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
Histogram¶
The histogram() method takes an aggregated hist expression as its argument, as well as optional legend and title arguments for the plot.
In [3]:
dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')
show(p)
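For intuition, the aggregated hist expression produces a struct of bin edges, per-bin counts, and tallies of out-of-range values. The following is a minimal pure-Python sketch of that binning logic (illustrative only, not Hail's implementation):

```python
def hist(values, start, end, bins):
    """Fixed-width histogram over [start, end]; values outside the range
    are tallied separately, mirroring the shape of Hail's hist struct."""
    width = (end - start) / bins
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_freq = [0] * bins
    n_smaller = n_larger = 0
    for v in values:
        if v < start:
            n_smaller += 1
        elif v > end:
            n_larger += 1
        else:
            # a value exactly at `end` falls into the last bin
            idx = min(int((v - start) / width), bins - 1)
            bin_freq[idx] += 1
    return {'bin_edges': bin_edges, 'bin_freq': bin_freq,
            'n_smaller': n_smaller, 'n_larger': n_larger}

h = hist([1, 5, 5, 29, 31, -2], 0, 30, 30)
```

Here the value 31 lands in `n_larger` and -2 in `n_smaller`, so the plotted bars cover only the in-range values.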
This method, like all Hail plotting methods, also accepts fields of our dataset directly. If the range and bins arguments are not specified, the range is computed from the smallest and largest values in the dataset, and a default of 50 bins is used.
In [4]:
p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)
show(p)
Cumulative Histogram¶
The cumulative_histogram() method works similarly to histogram().
In [5]:
p = hl.plot.cumulative_histogram(mt.DP, range=(0,30), bins=30)
show(p)
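A cumulative histogram plots the running total of the bin counts rather than the counts themselves; as a sketch of that transformation (not Hail's implementation):

```python
from itertools import accumulate

def cumulative_counts(bin_freq):
    """Running total of histogram bin counts: each entry includes
    all earlier bins, which is what a cumulative histogram plots."""
    return list(accumulate(bin_freq))

cumulative_counts([4, 0, 3, 1])  # monotonically non-decreasing
```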
Scatter¶
The scatter() method can likewise take either Python types or Hail fields as its x and y arguments.
In [6]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.call_rate, xlabel='Mean DP', ylabel='Call Rate')
show(p)
We can also pass a Hail field as the label argument, which determines how the data points are colored.
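Conceptually, a categorical label is turned into point colors by assigning each distinct category one entry from a palette. A hedged sketch of that mapping (the helper and palette here are illustrative, not Hail's internals):

```python
def color_map(labels, palette=('navy', 'orange', 'green', 'red', 'purple')):
    """Map each distinct label to a palette color, cycling through the
    palette if there are more categories than colors (sketch only)."""
    categories = sorted(set(labels))
    lut = {c: palette[i % len(palette)] for i, c in enumerate(categories)}
    return [lut[label] for label in labels]

colors = color_map(['EUR', 'AFR', 'EUR', 'EAS'])
```

Points sharing a label (here the two 'EUR' samples) get the same color, which is what makes population structure visible in the PCA plot below.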
In [7]:
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
                       (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                       (mt.GT.is_hom_var() & (ab >= 0.9)))
mt = mt.filter_entries(filter_condition_ab)
mt = hl.variant_qc(mt).cache()
common_mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
gwas = hl.linear_regression(y=common_mt.CaffeineConsumption, x=common_mt.GT.n_alt_alleles(), covariates=[1.0])
pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(common_mt.GT)
2018-10-15 13:45:12 Hail: INFO: Coerced sorted dataset
2018-10-15 13:45:13 Hail: INFO: Coerced sorted dataset
2018-10-15 13:45:15 Hail: INFO: linear_regression: running on 250 samples for 1 response variable y,
with input variable x, and 1 additional covariate...
2018-10-15 13:45:19 Hail: INFO: hwe_normalized_pca: running PCA using 9169 variants.
2018-10-15 13:45:20 Hail: INFO: pca: running PCA with 10 components...
2018-10-15 13:45:31 Hail: INFO: Coerced sorted dataset
In [8]:
p = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
                    label=common_mt.cols()[pca_scores.s].SuperPopulation,
                    title='PCA', xlabel='PC1', ylabel='PC2', collect_all=True)
show(p)
2018-10-15 13:45:31 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2018-10-15 13:45:31 Hail: INFO: Coerced sorted dataset
Hail’s downsample aggregator is incorporated into the scatter(), qq(), and manhattan() functions. The collect_all parameter tells the plotting function whether to collect all values or to downsample; if it is not set, the data are downsampled by default.
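One common way to downsample a scatter plot, sketched below under assumed parameters, is to grid the plane and keep only a few points per cell, so dense regions are thinned while isolated outliers survive. This is a simplification for intuition, not Hail's exact downsampling algorithm:

```python
def downsample(xs, ys, n_divisions=50, max_per_cell=3):
    """Keep at most `max_per_cell` points in each cell of an
    n_divisions x n_divisions grid over the data's bounding box."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    x_w = (x_hi - x_lo) / n_divisions or 1.0  # guard degenerate ranges
    y_w = (y_hi - y_lo) / n_divisions or 1.0
    seen, kept = {}, []
    for x, y in zip(xs, ys):
        cell = (min(int((x - x_lo) / x_w), n_divisions - 1),
                min(int((y - y_lo) / y_w), n_divisions - 1))
        if seen.get(cell, 0) < max_per_cell:
            seen[cell] = seen.get(cell, 0) + 1
            kept.append((x, y))
    return kept

# a dense cluster at the origin is thinned; the lone outlier is kept
thinned = downsample([0.0] * 10 + [100.0], [0.0] * 10 + [100.0])
```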
In [9]:
p2 = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
                     label=common_mt.cols()[pca_scores.s].SuperPopulation,
                     title='PCA (downsampled)', xlabel='PC1', ylabel='PC2',
                     collect_all=False, n_divisions=50)
show(gridplot([p, p2], ncols=2, plot_width=400, plot_height=400))
2018-10-15 13:45:32 Hail: INFO: Coerced sorted dataset
Q-Q (Quantile-Quantile)¶
The qq() function requires either a Python type or a Hail field containing the p-values to be plotted. It also supports downsampling.
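For intuition, a Q-Q plot pairs the sorted observed -log10 p-values with the quantiles expected under a uniform null; a pure-Python sketch of that computation (not Hail's implementation):

```python
import math

def qq_points(p_values):
    """(expected, observed) -log10 p-value pairs: observed p-values
    sorted ascending, matched against uniform-null quantiles."""
    n = len(p_values)
    observed = sorted(p_values)
    # the expected i-th order statistic of Uniform(0, 1) is ~ (i + 0.5) / n
    expected = [(i + 0.5) / n for i in range(n)]
    return [(-math.log10(e), -math.log10(o))
            for e, o in zip(expected, observed)]

pts = qq_points([0.5, 0.01, 0.9, 0.2])
```

Points falling on the diagonal are consistent with the null; points rising above it at the left edge suggest true signal (or confounding).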
In [10]:
p = hl.plot.qq(gwas.linreg.p_value, collect_all=True)
p2 = hl.plot.qq(gwas.linreg.p_value, n_divisions=75)
show(gridplot([p, p2], ncols=2, plot_width=400, plot_height=400))
2018-10-15 13:45:35 Hail: INFO: Ordering unsorted dataset with network shuffle
Manhattan¶
The manhattan() function requires a Hail field containing p-values.
In [11]:
p = hl.plot.manhattan(gwas.linreg.p_value)
show(p)
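The genome-wide x-axis of a Manhattan plot can be thought of as each contig offset by the cumulative length of the contigs before it. A simplified sketch of that coordinate mapping (helper names and toy contig lengths are illustrative):

```python
def global_positions(loci, contig_lengths):
    """Map (contig, position) pairs to a single genome-wide coordinate
    by stacking contigs end to end in the given order (sketch only)."""
    offsets, running = {}, 0
    for contig, length in contig_lengths:
        offsets[contig] = running
        running += length
    return [offsets[contig] + pos for contig, pos in loci]

# position 50 on contig '2' lands after all 1000 bases of contig '1'
xs = global_positions([('1', 100), ('2', 50)],
                      [('1', 1000), ('2', 2000)])
```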
We can also pass in a dictionary of fields that we would like to show up as we hover over a data point, and choose not to downsample if the dataset is relatively small.
In [12]:
hover_fields = {'alleles': gwas.alleles}
p = hl.plot.manhattan(gwas.linreg.p_value, hover_fields=hover_fields, collect_all=True)
show(p)