(document version 1)
This document describes the data structure and the quality control performed on the WGSPD WGS data.
All data are available in this google bucket:
gs://wgspd-wgs-v1
The bucket has the following structure:
gs://wgspd-wgs-v1/raw/: contains the original raw data and plink files created from the raw data.
vcf/: original vcf split into 10,000 chunks. Fields AN, AF and AC are up to date.
vds/: raw .vds (hail format). Includes a _split.vds version in which multi-allelic variants have been split, see: https://hail.is/commands.html#splitmulti. We prefer to use this version since not all hail commands support multi-allelic variants.
wgspd_wgs_v1_HQ_common.{bed,bim,fam}: plink file with only high-quality SNPs. See the Sample QC section.
wgspd_wgs_v1_HQ_common_pruned.{bed,bim,fam}: plink file with only high-quality pruned SNPs. See the Sample QC section.
wgspd_wgs_v1_idependent_SNPs.prune.{in,out}: list of pruned high-quality SNPs to keep (e.g. for PCs).
gs://wgspd-wgs-v1/qced/: contains the variant- and sample-QCed data.
vcf/: QCed vcf split into 5,000 chunks; fields AN, AF and AC are up to date. This is the main file to download for those who just want QCed data and don’t intend to use hail.
vds/: QCed .vds. The multi-allelic variants in the .vds have been split. We also created a _split_hard_anno.vds version that contains hard calls (only genotypes, no info fields like DP, GQ etc.) and several annotations. This is the main file to use for hail users.
wgspd_wgs_v1_QCED_common.{bed,bim,fam}: plink file obtained from the QCed .vds, only variants with MAF > 1%.
gs://wgspd-wgs-v1/scripts/: contains the scripts used for processing the data.
Plot_sample_qc.R: R script containing the code used for sample QC, PCA and sample-level annotations.
qc_script.sh: script with the hail and plink code used to QC the dataset.
gs://wgspd-wgs-v1/qc_measures/: all QC measures and plots displayed in this document.
gs://wgspd-wgs-v1/test/: small test set to use for development.
We created a google account which will allow you to read and download the data. This can be done by installing gsutil, as described here: https://cloud.google.com/storage/docs/gsutil_install.
Briefly, type this in your terminal:
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
You will be asked to authenticate your account and directed to a web page where you should use the following credentials.
Once authentication is complete you can, for example, download the raw vcf with:
gsutil -m cp -r gs://wgspd-wgs-v1/raw/vcf/wgspd_wgs_v1.vcf.bgz /your/local/folder/
The procedure above allows you to read and download the data. However, to run Hail on the cloud on the shared .vds I need to add your Google Compute Engine service account, which can be obtained only after creating a Google Cloud project.
After registering with Google Cloud, you can find the Google Compute Engine service account here:

Email aganna@broadinstitute.org and I will add your Google Compute Engine service account; you will then be able to run analyses directly on the shared .vds.
The complete guide on how to use hail on the cloud can be found here: http://discuss.hail.is/t/using-hail-on-the-google-cloud-platform/80. Below is a short introduction.
After you have set up your Google Cloud project and I have added your Google Compute Engine service account, you should be able to see the bucket and start running hail.
Step 1: Create a dataproc cluster
You need to set up a dataproc spark cluster to be able to run hail. This is an example of the configuration you can use:
gcloud dataproc clusters create mycluster \
--zone us-central1-f \
--master-machine-type n1-highmem-16 \
--master-boot-disk-size 100 \
--num-workers 2 \
--worker-machine-type n1-highmem-8 \
--worker-boot-disk-size 40 \
--num-worker-local-ssds 1 \
--num-preemptible-workers 4 \
--image-version 1.0 \
--project myproject \
--properties "spark:spark.driver.memory=90g,spark:spark.driver.maxResultSize=50g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"
This command creates a cluster called mycluster with an n1-highmem-16 master VM with 100 GB of disk, two non-preemptible workers (--num-workers 2) and 4 preemptible workers (--num-preemptible-workers 4), all with 40 GB of disk. You will need to replace myproject with your actual project name. Preemptible workers are much cheaper than non-preemptible workers, so use them!
Notice that the appropriate compute zone might be different for your project and you will need to set --zone accordingly.
You can interactively monitor your cluster by selecting Dataproc in the left-side menu:

Step 2: Run hail on the cluster
Now that the cluster is created, you can proceed to run hail. This is an example of how to do that:
## 1. Copy the configuration script
wget https://storage.googleapis.com/hail-common/pyhail-submit
chmod +x pyhail-submit
## 2. Create a script with your code
echo 'from pyspark import SparkContext
from pyhail import HailContext
sc = SparkContext()
hc = HailContext(sc)
wgspd_wgs_v1_raw_split = hc.read("gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds")
print(wgspd_wgs_v1_raw_split.count())' > test_script.py
## 3. Run the script
./pyhail-submit mycluster test_script.py
You can monitor your job interactively (be aware that the hail output on your terminal might stop after a while):

Killing the hail command from the terminal will not kill the job. The easiest way to kill a job is to do it interactively, by selecting delete.
Sample QC
The raw dataset contains 10,570 individuals.
We created a file containing all the information used for sample QC:
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.sample_qc()
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt','ID=s.id, CHIMERAS=sa.CHIMERAS, CONTAMINATION=sa.CONTAMINATION, COVERAGE=sa.Sample_Coverage, PROJECTG=sa.PROJECT, PROJECT1=sa.Collection, PROJECT2=sa.Title, PCR_FREE=sa.pcr_free, POP=sa.predicted_pop, GENDER=sa.Gender,callRate=sa.qc.callRate,nCalled=sa.qc.nCalled,nSingleton=sa.qc.nSingleton,dpMean=sa.qc.dpMean,gqMean=sa.qc.gqMean,rTiTv=sa.qc.rTiTv,rHetHomVar=sa.qc.rHetHomVar,rInsertionDeletion=sa.qc.rInsertionDeletion'))
The file containing info for sample QC is available here: gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt.
The R script used for sample QC is here: gs://wgspd-wgs-v1/scripts/plot_sample_qc.R
Notice: spaces in the sample IDs have been replaced with underscores using the following command: .rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
Step 1
We removed 55 individuals with:
The removed individuals are represented by a *.

Step 2
We removed 50 samples that deviate more than 4 median absolute deviations from the median, within each project, for the following metrics:
The removed individuals are represented by a *. Different colors represent different ethnicities (check the principal components section for color annotations).



Notice that, for the GPC cohort, the subproject definition is based on a non-optimal categorization of the studies. Bob Handsaker suggests a better categorization, which can be found here: gs://wgspd-wgs-v1/qc_measures/GPC_sample_groups.txt.
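For illustration only, this is a minimal sketch in python/pandas (not the actual plot_sample_qc.R code) of a 4-MAD filter computed within project on the sample_qc_info.txt file above; the metric columns and the PROJECTG grouping column are just examples taken from the export, adjust them to the metrics actually used:
import pandas as pd

# sample_qc_info.txt copied locally with gsutil; tab-delimited with a header line
qc = pd.read_csv('sample_qc_info.txt', sep='\t')

metrics = ['callRate', 'dpMean', 'gqMean']   # example metrics from the export above
outlier = pd.Series(False, index=qc.index)

for metric in metrics:
    grouped = qc.groupby('PROJECTG')[metric]
    median = grouped.transform('median')
    # median absolute deviation from the within-project median
    mad = grouped.transform(lambda x: (x - x.median()).abs().median())
    outlier |= (qc[metric] - median).abs() > 4 * mad

# write the flagged samples to a local file for inspection
qc[outlier].to_csv('mad_outliers.txt', sep='\t', index=False)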
Step 3
We identified related individuals and performed a sex check using the code below:
## Save a high-quality variants dataset in plink and perform sex-check ##
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.filter_genotypes('(g.ad[0] + g.ad[1]) / g.dp < 0.9 || (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) || (g.isHet && (g.ad[1] / g.dp < 0.20 || g.pl[0] < 20)) || (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) || g.dp > 200', keep=False)
.variant_qc()
.filter_variants_expr('va.info.QD > 4 && va.qc.callRate > 0.99 && va.qc.dpMean > 20 && va.qc.AF > 0.05 && va.pass && va.qc.AF < 0.95', keep=True)
.export_plink('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_HQ_common')
.impute_sex(maf_threshold=0.05)
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_sex_fstat.txt','ID=s.id,sexFstat=sa.imputesex.Fstat,isFemale=sa.imputesex.isFemale'))
## Create a set of independent SNPs in plink ##
## This is done on a VM, that's why the paths are different ##
plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--snps-only \
--indep-pairwise 100 50 0.2 \
--out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs
plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--extract /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs.prune.in \
--hwe 0.000001 \
--make-bed --out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned
## Calculate kinship matrix with King ##
king -b /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed --kinship --related --ibs --prefix /mnt/data/aganna/bucket/qc_measures/relatedness
## Save file for the IBD plot below ##
awk '{if($16 > 0.04) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/related_for_plot
## Save file with > 2nd degree relatives for tagging and exclusions ##
awk '{if($16 > 0.177) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/1_2_degree_rel
The plot below shows relationships in the data. We excluded 55 duplicated samples and tagged 1st and 2nd-degree related individuals.

Step 4
We excluded 80 individuals where the reported gender did not match the genetically derived sex.
The final number of samples was: 10,330
Sample QC annotations
Sample-level annotations are already included in the .vds. You can use this command to see which annotations are available:
hail read -i gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds \
printschema
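If you prefer to inspect the annotations from python rather than from the hail command line, a minimal sketch is below (run it through pyhail-submit as shown above); the output path is a placeholder, and 'ID=s.id' is the only field guaranteed to exist: add further fields from whatever printschema lists (e.g. POP=sa.predicted_pop, if it is present in the schema).
from pyspark import SparkContext
from pyhail import HailContext

sc = SparkContext()
hc = HailContext(sc)

# Read the QCed split .vds and export sample-level annotations to a text file.
# Add fields from printschema to the export expression as needed.
wgspd_wgs_v1_qced_split = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')
wgspd_wgs_v1_qced_split.export_samples('gs://your-bucket/sample_annotations.txt', 'ID=s.id')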
Principal components
We calculated principal components using the code below. We used pruned, autosomal, high-quality variants and excluded long-LD blocks, the MHC region and a common inversion on chr 8. We used admixture to annotate ethnicities. We didn't include related individuals (> 2nd degree) in the PCA and excluded samples that didn't pass the sample QC above.
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc', keep=True)
.annotate_samples_expr('sa = drop(sa, ID)')
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot','ID',impute = True,code = 'sa=merge(sa, table)')
.filter_variants_list('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_idependent_SNPs.prune.in', keep=True)
.filter_variants_intervals('gs://hail-common/MHC_invchr8_longLDreg.txt', keep=False)
.filter_variants_expr('v.contig == "X" ', keep=False)
.filter_samples_expr('sa.first_2nd_related==1', keep=False)
.pca('sa.pca')
.export_samples('gs://wgspd-wgs-v1/qc_measures/pca.tsv','id=s.id, sa.pca.*'))
cd /mnt/data/aganna/bucket/qc_measures/
admixture /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed 4 -j10
This is what the first three PCs look like.

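As a starting point for your own plots, this is a minimal sketch that reads the exported pca.tsv (copied locally with gsutil) and plots the first two PCs; the PC column names below are assumptions, check the header of pca.tsv and adjust them:
import pandas as pd
import matplotlib.pyplot as plt

# pca.tsv is tab-delimited; the first column is the sample id, the remaining
# columns are the principal components exported with 'sa.pca.*' above.
pca = pd.read_csv('pca.tsv', sep='\t')
plt.scatter(pca['PC1'], pca['PC2'], s=2)   # adjust to the actual column names
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig('pc1_vs_pc2.png', dpi=200)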
Variant QC
The raw multi-allelic-split dataset contains 196,862,507 variants.
We used the following exclusion criteria to perform variant QC:
and the following script:
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.annotate_samples_expr(
"""
sa = drop(sa, ID),
sa.projectcat=if (sa.projectdGPC_afr_W1==1 || sa.projectdGPC_afr_W2==1 || sa.projectdGPC_afr_W3==1 || sa.projectdGPC_afr_W4==1 || sa.projectdGPC_aframr==1) "GPC_afr" else if (sa.projectdPalotie_MESTA==1 || sa.projectdPalotie_migraine==1 || sa.projectdPalotie_Finrisk_W2==1 || sa.projectdPalotie_Finrisk_W1==1) "Fin" else if (sa.projectdGPC_nfe_W1==1 || sa.projectdGPC_nfe_W2==1 || sa.projectdGPC_nfe_W3==1 || sa.projectdGPC_nfe_WFREE==1) "GPC_nfe" else if(sa.Estonia_SCZControls==1) "Est" else "other"
""")
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot', 'ID', impute = True,code = 'sa=merge(sa, table)')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.annotate_variants_expr(
"""
va.step0 = va.qc.AC > 0,
va.step1 = va.qc.AC > 0 && va.pass,
va.step2 = va.qc.AC > 0 && va.pass && va.info.QD > 4
""")
.annotate_global_expr(
"""
global.numvar_step0 = variants.filter(v => va.step0).count(),
global.numvar_step1 = variants.filter(v => va.step1).count(),
global.numvar_step2 = variants.filter(v => va.step2).count()
""")
.filter_variants_expr('va.qc.AC > 0 && va.pass && va.info.QD > 4', keep=True)
.annotate_variants_expr(
"""
va.numgsdpo400=gs.filter(g => g.dp > 400).count(),
va.numgsabhomref=gs.filter(g => (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20))).count(),
va.numgsabhomvar=gs.filter(g => (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20))).count(),
va.numgsabhet=gs.filter(g => (g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")).count(),
va.allgs=gs.count(),
va.allgshomref=gs.filter(g => g.isHomRef).count(),
va.allgshomvar=gs.filter(g => g.isHomVar).count(),
va.allgshet=gs.filter(g => g.isHet).count()
""")
.annotate_global_expr(
"""
global.numgsdpo400=variants.map(v => va.numgsdpo400).sum()/variants.map(v => va.allgs).sum(),
global.fracgsabhomref=variants.map(v => va.numgsabhomref).sum()/variants.map(v => va.allgshomref).sum(),
global.fracgsabhomvar=variants.map(v => va.numgsabhomvar).sum()/variants.map(v => va.allgshomvar).sum(),
global.fracgsabhet=variants.map(v => va.numgsabhet).sum()/variants.map(v => va.allgshet).sum()
""")
.filter_genotypes(
"""
g.dp > 400 ||
(g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) ||
(g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) ||
(g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")
""", keep = False)
.variant_qc()
.annotate_variants_expr(
"""
va.callRate_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").fraction(g => g.isCalled),
va.callRate_Fin = gs.filter(g => sa.projectcat == "Fin").fraction(g => g.isCalled),
va.callRate_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").fraction(g => g.isCalled),
va.callRate_Est = gs.filter(g => sa.projectcat == "Est").fraction(g => g.isCalled),
va.callRate_other = gs.filter(g => sa.projectcat == "other").fraction(g => g.isCalled),
va.pHWE_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").hardyWeinberg().pHWE,
va.pHWE_Fin = gs.filter(g => sa.projectcat == "Fin").hardyWeinberg().pHWE,
va.pHWE_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").hardyWeinberg().pHWE,
va.pHWE_Est = gs.filter(g => sa.projectcat == "Est").hardyWeinberg().pHWE,
va.pHWE_other = gs.filter(g => sa.projectcat == "other").hardyWeinberg().pHWE
""")
.annotate_variants_expr(
"""
va.step3 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98,
va.step4 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98 && va.qc.pHWE > 0.000001 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001
""")
.annotate_global_expr(
"""
global.numvar_step3 = variants.filter(v => va.step3).count(),
global.numvar_step4 = variants.filter(v => va.step4).count(),
global.callRate098 = variants.filter(v => va.qc.callRate < 0.98).count(),
global.MAC = variants.filter(v => va.qc.AC == 0).count()
""")
.filter_variants_expr('va.qc.pHWE > 0.000001 && va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001 && v.contig != "X" ', keep=True)
.write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds'))
Resulting in this number of variants at the different filtering stages:
| Filter | Number of variants |
|---|---|
| LCR and AC=0 removed | 169,691,764 |
| Keep PASS variants | 159,667,836 |
| Keep QD > 4 | 153,236,903 |
| Apply genotype-level filters | 0.74% het, 1.21% homref, 0.12% homvar genotypes set to missing |
| Keep call rate > 98 % and AC > 0 | 132,803,241 |
| Keep HWE P-value > 1x10^-6 and autosomes | 129,010,459 (120,006,849 SNPs, 9,003,610 INDELs) |
gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds has been annotated with a large number of functional annotations. We are working on a well-documented annotation pipeline; at the moment you can find information about the annotations used here: https://storage.googleapis.com/hail-common/README_annotation_v0.2.txt. The .vds containing annotations for all possible SNVs in the genome can be found here: gs://hail-common/annotation_v0.2.vds.
This is the code we used to produce a hard-call version of the dataset which includes annotations. This is the main .vds used for analysis.
wgspd_wgs_v1_qced_split = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')
annotations_v02 = hc.read('gs://hail-common/annotation_v0.2.vds')
(wgspd_wgs_v1_qced_split
.hardcalls()
.annotate_variants_vds(annotations_v02,root='va.ann02')
.write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds'))
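For example, to check that the hard-call annotated .vds reads correctly and to see how many samples and variants it contains (a minimal sketch, assuming a running HailContext as in the script above):
# Read the hard-call, annotated .vds and print the sample/variant counts.
wgspd_wgs_v1_hard_anno = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds')
print(wgspd_wgs_v1_hard_anno.count())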
The most relevant annotations available are:
All .bam files for this project are available in a separate google bucket. If you need access to these files, email aganna@broadinstitute.org and I will point you to the right path.