(document version 1)
This document describes the data structure and the quality control performed on the WGSPD WGS data.
All data are available in this google bucket:
gs://wgspd-wgs-v1
The bucket has the following structure:
gs://wgspd-wgs-v1/raw/: contains the original raw data and plink files created from the raw data.
vcf/: original vcf split into 10,000 chunks. Fields AN, AF and AC are up to date.
vds/: raw .vds (hail format). Includes a _split.vds version in which multi-allelic variants have been split, see: https://hail.is/commands.html#splitmulti. We prefer to use this version since not all hail commands support multi-allelic variants.
wgspd_wgs_v1_HQ_common.{bed,bim,fam}: plink file with only high-quality SNPs. See the Sample QC section.
wgspd_wgs_v1_HQ_common_pruned.{bed,bim,fam}: plink file with only high-quality pruned SNPs. See the Sample QC section.
wgspd_wgs_v1_idependent_SNPs.prune.{in,out}: list of pruned high-quality SNPs to keep (e.g. for PCs).
gs://wgspd-wgs-v1/qced/: contains the variant- and sample-QCed data.
vcf/: QCed vcf split into 5,000 chunks; fields AN, AF and AC are up to date. This is the main file to download for those who just want QCed data and don’t intend to use hail.
vds/: QCed .vds. The multi-allelic variants in the .vds have been split. We also created a _split_hard_anno.vds version that contains hard calls (only genotypes, no info fields like DP, GQ etc.) and several annotations. This is the main file to use for hail users.
wgspd_wgs_v1_QCED_common.{bed,bim,fam}: plink file obtained from the QCed .vds, only variants with MAF > 1%.
gs://wgspd-wgs-v1/scripts/: contains the scripts used for processing the data.
Plot_sample_qc.R: R script containing the code used for sample QC, PCA and sample-level annotations.
qc_script.sh: script with the hail and plink code used to QC the dataset.
gs://wgspd-wgs-v1/qc_measures/: all QC measures and plots displayed in this document.
gs://wgspd-wgs-v1/test/: small test set to use for development.
We created a google account which will allow you to read and download the data. This can be done by installing gsutil, as described here: https://cloud.google.com/storage/docs/gsutil_install.
Briefly, type this in your terminal:
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
You will be asked to authenticate your account and directed to a web page where you should use the following credentials.
Once authentication is complete you can, for example, download the raw vcf with:
gsutil -m cp -r gs://wgspd-wgs-v1/raw/vcf/wgspd_wgs_v1.vcf.bgz /your/local/folder/
The procedure above allows you to read and download the data. However, to run Hail on the cloud on the shared .vds I need to add your Google Compute Engine service account, which can be obtained only after creating a Google Cloud project.
After registering with Google Cloud, you can find the Google Compute Engine service account here:

Email aganna@broadinstitute.org and I will add your Google Compute Engine service account; you will then be able to run analyses directly on the shared .vds.
The complete guide on how to use hail on the cloud can be found here: http://discuss.hail.is/t/using-hail-on-the-google-cloud-platform/80. Below is a short introduction.
After you have set up your Google Cloud project and I have added your Google Compute Engine service account, you should be able to see the bucket and start running hail.
Step 1: Create a dataproc cluster
You need to set up a dataproc spark cluster to be able to run hail. This is an example of the configuration you can use:
gcloud dataproc clusters create mycluster \
--zone us-central1-f \
--master-machine-type n1-highmem-16 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-highmem-8 \
--worker-boot-disk-size 40 \
--num-worker-local-ssds 1 \
--num-preemptible-workers 4 \
--image-version 1.0 \
--project myproject \
--properties "spark:spark.driver.memory=90g,spark:spark.driver.maxResultSize=50g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"
This command creates a cluster called mycluster with an n1-highmem-16 master VM with 100 GB of disk, two non-preemptible workers (--num-workers 2) and 4 preemptible workers (--num-preemptible-workers 4), all with 40 GB of disk. You will need to replace myproject with your actual project name. Preemptible workers are much cheaper than non-preemptible workers, so use them!
Notice that the appropriate compute zone might be different for your project and you will need to set --zone accordingly.
You can interactively monitor your cluster by selecting Dataproc in the left-side menu:

Step 2: Run hail on the cluster
Now that the cluster is created, you can proceed to run hail. This is an example of how to do that:
## 1. Copy the configuration script
wget https://storage.googleapis.com/hail-common/pyhail-submit
chmod +x pyhail-submit
## 2. Create a script with your code
echo 'from pyspark import SparkContext
from pyhail import HailContext
sc = SparkContext()
hc = HailContext(sc)
wgspd_wgs_v1_raw_split = hc.read("gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds")
print(wgspd_wgs_v1_raw_split.count())' > test_script.py
## 3. Run the script
./pyhail-submit mycluster test_script.py
You can monitor your job interactively (be aware that the hail output on your terminal might stop after a while):

Killing the hail command from the terminal will not kill the job. The easiest way to kill a job is to do it interactively, by selecting delete.
Sample QC
The raw dataset contains 10,570 individuals.
We created a file containing all the information used for sample QC:
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.sample_qc()
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt','ID=s.id, CHIMERAS=sa.CHIMERAS, CONTAMINATION=sa.CONTAMINATION, COVERAGE=sa.Sample_Coverage, PROJECTG=sa.PROJECT, PROJECT1=sa.Collection, PROJECT2=sa.Title, PCR_FREE=sa.pcr_free, POP=sa.predicted_pop, GENDER=sa.Gender,callRate=sa.qc.callRate,nCalled=sa.qc.nCalled,nSingleton=sa.qc.nSingleton,dpMean=sa.qc.dpMean,gqMean=sa.qc.gqMean,rTiTv=sa.qc.rTiTv,rHetHomVar=sa.qc.rHetHomVar,rInsertionDeletion=sa.qc.rInsertionDeletion'))
The file containing info for sample QC is available here: gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt.
The R script used for sample QC is here: gs://wgspd-wgs-v1/scripts/plot_sample_qc.R
Notice: spaces in the sample IDs have been replaced with underscores using the following command: .rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
Step 1
We removed 55 individuals with:
The removed individuals are represented by a *.

Step 2
We removed 50 samples that deviate more than 4 median absolute deviations from the median, within each project, for the following metrics:
The removed individuals are represented by a *. Different colors represent different ethnicities (check the principal components section for color annotations).



Notice that, for the GPC cohort, the subproject definition is based on a non-optimal categorization of the studies. Bob Handsaker suggests a better categorization, which can be found here: gs://wgspd-wgs-v1/qc_measures/GPC_sample_groups.txt.
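For illustration only, this is a minimal sketch in python/pandas (not the actual plot_sample_qc.R code) of a 4-MAD filter computed within project on the sample_qc_info.txt file above; the metric columns and the PROJECTG grouping column are just examples taken from the export, adjust them to the metrics actually used:
import pandas as pd

# sample_qc_info.txt copied locally with gsutil; tab-delimited with a header line
qc = pd.read_csv('sample_qc_info.txt', sep='\t')

metrics = ['callRate', 'dpMean', 'gqMean']   # example metrics from the export above
outlier = pd.Series(False, index=qc.index)

for metric in metrics:
    grouped = qc.groupby('PROJECTG')[metric]
    median = grouped.transform('median')
    # median absolute deviation from the within-project median
    mad = grouped.transform(lambda x: (x - x.median()).abs().median())
    outlier |= (qc[metric] - median).abs() > 4 * mad

# write the flagged samples to a local file for inspection
qc[outlier].to_csv('mad_outliers.txt', sep='\t', index=False)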
Step 3
We identified related individuals and performed a sex check using the code below:
## Save a high-quality variants dataset in plink and perform sex-check ##
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.filter_genotypes('(g.ad[0] + g.ad[1]) / g.dp < 0.9 || (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) || (g.isHet && (g.ad[1] / g.dp < 0.20 || g.pl[0] < 20)) || (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) || g.dp > 200', keep=False)
.variant_qc()
.filter_variants_expr('va.info.QD > 4 && va.qc.callRate > 0.99 && va.qc.dpMean > 20 && va.qc.AF > 0.05 && va.pass && va.qc.AF < 0.95', keep=True)
.export_plink('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_HQ_common')
.impute_sex(maf_threshold=0.05)
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_sex_fstat.txt','ID=s.id,sexFstat=sa.imputesex.Fstat,isFemale=sa.imputesex.isFemale'))
## Create a set of independent SNPs in plink ##
## This is done on a VM, that's why the paths are different ##
plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--snps-only \
--indep-pairwise 100 50 0.2 \
--out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs
plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--extract /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs.prune.in \
--hwe 0.000001 \
--make-bed --out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned
## Calculate kinship matrix with King ##
king -b /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed --kinship --related --ibs --prefix /mnt/data/aganna/bucket/qc_measures/relatedness
## Save file for the IBD plot below ##
awk '{if($16 > 0.04) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/related_for_plot
## Save file with > 2nd degree relatives for tagging and exclusions ##
awk '{if($16 > 0.177) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/1_2_degree_rel
The plot below shows relationships in the data. We excluded 55 duplicated samples and tagged 1st and 2nd-degree related individuals.

Step 4
We excluded 80 individuals where the reported gender did not match the genetically derived sex.
The final number of samples was: 10,330
Sample QC annotations
Sample-level annotations are already included in the .vds. You can use this command to see which annotations are available:
hail read -i gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds \
printschema
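If you prefer to inspect the annotations from python rather than from the hail command line, a minimal sketch is below (run it through pyhail-submit as shown above); the output path is a placeholder, and 'ID=s.id' is the only field guaranteed to exist: add further fields from whatever printschema lists (e.g. POP=sa.predicted_pop, if it is present in the schema).
from pyspark import SparkContext
from pyhail import HailContext

sc = SparkContext()
hc = HailContext(sc)

# Read the QCed split .vds and export sample-level annotations to a text file.
# Add fields from printschema to the export expression as needed.
wgspd_wgs_v1_qced_split = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')
wgspd_wgs_v1_qced_split.export_samples('gs://your-bucket/sample_annotations.txt', 'ID=s.id')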
Principal components
We calculated principal components using the code below. We used pruned, autosomal, high-quality variants and excluded long-LD blocks, the MHC region and a common inversion on chr 8. We used admixture to annotate ethnicities. We didn't include related individuals (> 2nd degree) in the PCA and excluded samples that didn't pass the sample QC above.
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc', keep=True)
.annotate_samples_expr('sa = drop(sa, ID)')
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot','ID',impute = True,code = 'sa=merge(sa, table)')
.filter_variants_list('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_idependent_SNPs.prune.in', keep=True)
.filter_variants_intervals('gs://hail-common/MHC_invchr8_longLDreg.txt', keep=False)
.filter_variants_expr('v.contig == "X" ', keep=False)
.filter_samples_expr('sa.first_2nd_related==1', keep=False)
.pca('sa.pca')
.export_samples('gs://wgspd-wgs-v1/qc_measures/pca.tsv','id=s.id, sa.pca.*'))
cd /mnt/data/aganna/bucket/qc_measures/
admixture /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed 4 -j10
This is what the first three PCs look like.

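As a starting point for your own plots, this is a minimal sketch that reads the exported pca.tsv (copied locally with gsutil) and plots the first two PCs; the PC column names below are assumptions, check the header of pca.tsv and adjust them:
import pandas as pd
import matplotlib.pyplot as plt

# pca.tsv is tab-delimited; the first column is the sample id, the remaining
# columns are the principal components exported with 'sa.pca.*' above.
pca = pd.read_csv('pca.tsv', sep='\t')
plt.scatter(pca['PC1'], pca['PC2'], s=2)   # adjust to the actual column names
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig('pc1_vs_pc2.png', dpi=200)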
Variant QC
The raw multi-allelic-split dataset contains 196,862,507 variants.
We used the following exclusion criteria to perform variant QC:
and the following script:
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.annotate_samples_expr(
"""
sa = drop(sa, ID),
sa.projectcat=if (sa.projectdGPC_afr_W1==1 || sa.projectdGPC_afr_W2==1 || sa.projectdGPC_afr_W3==1 || sa.projectdGPC_afr_W4==1 || sa.projectdGPC_aframr==1) "GPC_afr" else if (sa.projectdPalotie_MESTA==1 || sa.projectdPalotie_migraine==1 || sa.projectdPalotie_Finrisk_W2==1 || sa.projectdPalotie_Finrisk_W1==1) "Fin" else if (sa.projectdGPC_nfe_W1==1 || sa.projectdGPC_nfe_W2==1 || sa.projectdGPC_nfe_W3==1 || sa.projectdGPC_nfe_WFREE==1) "GPC_nfe" else if(sa.Estonia_SCZControls==1) "Est" else "other"
""")
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot', 'ID', impute = True,code = 'sa=merge(sa, table)')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.annotate_variants_expr(
"""
va.step0 = va.qc.AC > 0,
va.step1 = va.qc.AC > 0 && va.pass,
va.step2 = va.qc.AC > 0 && va.pass && va.info.QD > 4
""")
.annotate_global_expr(
"""
global.numvar_step0 = variants.filter(v => va.step0).count(),
global.numvar_step1 = variants.filter(v => va.step1).count(),
global.numvar_step2 = variants.filter(v => va.step2).count()
""")
.filter_variants_expr('va.qc.AC > 0 && va.pass && va.info.QD > 4', keep=True)
.annotate_variants_expr(
"""
va.numgsdpo400=gs.filter(g => g.dp > 400).count(),
va.numgsabhomref=gs.filter(g => (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20))).count(),
va.numgsabhomvar=gs.filter(g => (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20))).count(),
va.numgsabhet=gs.filter(g => (g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")).count(),
va.allgs=gs.count(),
va.allgshomref=gs.filter(g => g.isHomRef).count(),
va.allgshomvar=gs.filter(g => g.isHomVar).count(),
va.allgshet=gs.filter(g => g.isHet).count()
""")
.annotate_global_expr(
"""
global.numgsdpo400=variants.map(v => va.numgsdpo400).sum()/variants.map(v => va.allgs).sum(),
global.fracgsabhomref=variants.map(v => va.numgsabhomref).sum()/variants.map(v => va.allgshomref).sum(),
global.fracgsabhomvar=variants.map(v => va.numgsabhomvar).sum()/variants.map(v => va.allgshomvar).sum(),
global.fracgsabhet=variants.map(v => va.numgsabhet).sum()/variants.map(v => va.allgshet).sum()
""")
.filter_genotypes(
"""
g.dp > 400 ||
(g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) ||
(g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) ||
(g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")
""", keep = False)
.variant_qc()
.annotate_variants_expr(
"""
va.callRate_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").fraction(g => g.isCalled),
va.callRate_Fin = gs.filter(g => sa.projectcat == "Fin").fraction(g => g.isCalled),
va.callRate_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").fraction(g => g.isCalled),
va.callRate_Est = gs.filter(g => sa.projectcat == "Est").fraction(g => g.isCalled),
va.callRate_other = gs.filter(g => sa.projectcat == "other").fraction(g => g.isCalled),
va.pHWE_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").hardyWeinberg().pHWE,
va.pHWE_Fin = gs.filter(g => sa.projectcat == "Fin").hardyWeinberg().pHWE,
va.pHWE_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").hardyWeinberg().pHWE,
va.pHWE_Est = gs.filter(g => sa.projectcat == "Est").hardyWeinberg().pHWE,
va.pHWE_other = gs.filter(g => sa.projectcat == "other").hardyWeinberg().pHWE
""")
.annotate_variants_expr(
"""
va.step3 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98,
va.step4 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98 && va.qc.pHWE > 0.000001 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001
""")
.annotate_global_expr(
"""
global.numvar_step3 = variants.filter(v => va.step3).count(),
global.numvar_step4 = variants.filter(v => va.step4).count(),
global.callRate098 = variants.filter(v => va.qc.callRate < 0.98).count(),
global.MAC = variants.filter(v => va.qc.AC == 0).count()
""")
.filter_variants_expr('va.qc.pHWE > 0.000001 && va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001 && v.contig != "X" ', keep=True)
.write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds'))
Resulting in this number of variants at the different filtering stages:
| Filter | Number of variants |
|---|---|
| LCR and AC=0 removed | 169,691,764 |
| Keep PASS variants | 159,667,836 |
| Keep QD > 4 | 153,236,903 |
| Apply genotype-level filters | 0.74% het, 1.21% homref, 0.12% homvar genotypes set to missing |
| Keep call rate > 98 % and AC > 0 | 132,803,241 |
| Keep HWE P-value > 1x10^-6 and autosomes | 129,010,459 (120,006,849 SNPs, 9,003,610 INDELs) |
gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds has been annotated with a large number of functional annotations. We are working on a well-documented annotation pipeline; at the moment you can find information about the annotations used here: https://storage.googleapis.com/hail-common/README_annotation_v0.2.txt. The .vds containing annotations for all possible SNVs in the genome can be found here: gs://hail-common/annotation_v0.2.vds.
This is the code we used to produce a hard-call version of the dataset which includes annotations. This is the main .vds used for analysis.
wgspd_wgs_v1_qced_split = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')
annotations_v02 = hc.read('gs://hail-common/annotation_v0.2.vds')
(wgspd_wgs_v1_qced_split
.hardcalls()
.annotate_variants_vds(annotations_v02,root='va.ann02')
.write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds'))
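For example, to check that the hard-call annotated .vds reads correctly and to see how many samples and variants it contains (a minimal sketch, assuming a running HailContext as in the script above):
# Read the hard-call, annotated .vds and print the sample/variant counts.
wgspd_wgs_v1_hard_anno = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds')
print(wgspd_wgs_v1_hard_anno.count())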
The most relevant annotations available are:
All .bam files for this project are available in a separate google bucket. If you need access to these files, email aganna@broadinstitute.org and I will point you to the right path.