WGSPD data release v.1

(document version 1)

Introduction

This document describes the data structure and the quality control performed on the WGSPD WGS data.

Data structure

All data are available in this google bucket:
gs://wgspd-wgs-v1

The bucket has the following structure:

Data access and how to download the data

We created a Google account that allows you to read and download the data. To use it, install gsutil as described here: https://cloud.google.com/storage/docs/gsutil_install.

Briefly, type this in your terminal:

curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

You will be asked to authenticate your account and directed to a web page, where you should use the following credentials.

Once authentication is complete, you can download the data. For example, to download the raw VCF:

gsutil -m cp -r gs://wgspd-wgs-v1/raw/vcf/wgspd_wgs_v1.vcf.bgz /your/local/folder/

The procedure above allows you to read and download the data. However, to run Hail on the cloud against the shared .vds, I need to add your Google Compute Engine service account, which you can obtain only after creating a Google Cloud project.
After registering with Google Cloud, you can find your Google Compute Engine service account here:

[Screenshot: location of the Google Compute Engine service account in the Google Cloud console]

Email aganna@broadinstitute.org and I will add your Google Compute Engine service account; you will then be able to run analyses directly on the shared .vds.

Examples of how to run Hail on the cloud

The complete guide on how to use Hail on the cloud can be found here: http://discuss.hail.is/t/using-hail-on-the-google-cloud-platform/80. A short introduction follows below.

After you have set up your Google Cloud project and I have added your Google Compute Engine service account, you should be able to see the bucket and start running Hail.

Step 1: Create a dataproc cluster

You need to set up a Dataproc Spark cluster to be able to run Hail. This is an example of the configuration you can use:

gcloud dataproc clusters create mycluster \
  --zone us-central1-f \
  --master-machine-type n1-highmem-16 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type n1-highmem-8 \
  --worker-boot-disk-size 40 \
  --num-worker-local-ssds 1 \
  --num-preemptible-workers 4 \
  --image-version 1.0 \
  --project myproject \
  --properties "spark:spark.driver.memory=90g,spark:spark.driver.maxResultSize=50g,spark:spark.task.maxFailures=20,spark:spark.kryoserializer.buffer.max=1g,hdfs:dfs.replication=1"

This command creates a cluster called mycluster with an n1-highmem-16 master VM with 100 GB of disk, two non-preemptible workers (--num-workers 2), and four preemptible workers (--num-preemptible-workers 4), all with 40 GB of disk. You will need to replace myproject with your actual project name. Preemptible workers are much cheaper than non-preemptible workers, so use them!

Note that the default compute zone might be different in your project; set --zone accordingly.

You can monitor your cluster interactively by selecting Dataproc in the left-side menu:

[Screenshot: Dataproc cluster monitoring page in the Google Cloud console]

Step 2: Run hail on the cluster

Now that the cluster is created, you can proceed to run Hail. This is an example of how to do that:

## 1. Copy the job submission script
wget https://storage.googleapis.com/hail-common/pyhail-submit
chmod +x pyhail-submit

## 2. Create a script with your code

cat > test_script.py <<'EOF'
from pyspark import SparkContext
from pyhail import HailContext

sc = SparkContext()
hc = HailContext(sc)

wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')
print(wgspd_wgs_v1_raw_split.count())
EOF

## 3. Run the script
./pyhail-submit mycluster test_script.py

You can monitor your job interactively (be aware that the Hail output on your terminal might stop after a while):

[Screenshot: Dataproc job monitoring page]

Killing the Hail command from the terminal will not kill the job. The easiest way to kill a job is to do it interactively, by selecting Delete.

Quality control

Sample QC

The raw dataset contains 10,570 individuals.

We created a file containing all the information used for sample QC:

wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')

(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.sample_qc()
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt','ID=s.id, CHIMERAS=sa.CHIMERAS, CONTAMINATION=sa.CONTAMINATION, COVERAGE=sa.Sample_Coverage, PROJECTG=sa.PROJECT, PROJECT1=sa.Collection, PROJECT2=sa.Title, PCR_FREE=sa.pcr_free, POP=sa.predicted_pop, GENDER=sa.Gender,callRate=sa.qc.callRate,nCalled=sa.qc.nCalled,nSingleton=sa.qc.nSingleton,dpMean=sa.qc.dpMean,gqMean=sa.qc.gqMean,rTiTv=sa.qc.rTiTv,rHetHomVar=sa.qc.rHetHomVar,rInsertionDeletion=sa.qc.rInsertionDeletion'))

The file containing info for sample QC is available here: gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt.
The R script used for sample QC is here: gs://wgspd-wgs-v1/scripts/plot_sample_qc.R

Note: spaces in the sample IDs have been replaced with underscores using the following command: .rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')

Step 1

We removed 55 individuals with:

The removed individuals are represented by a *.

[Plot: sample QC metrics, removed individuals marked with *]

Step 2

We removed 50 samples that deviate more than 4 median absolute deviations from the median, within each project, for the following metrics:

The removed individuals are represented by a *. Different colors represent different ethnicities (see the principal components section for the color annotations).

[Plots: per-project sample QC metrics, removed individuals marked with *]

Notice that, for the GPC cohort, the subproject definition is based on a non-optimal categorization of the studies. Bob Handsaker suggests a better categorization, which can be found here: gs://wgspd-wgs-v1/qc_measures/GPC_sample_groups.txt.
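
For reference, below is a minimal sketch of the 4-MAD rule described in this step, assuming pandas is available locally. The column names follow the sample QC export above, but the metric list is illustrative only (the exact set of metrics used is not reproduced here), and PROJECT1 is taken as the sub-project grouping.

import pandas as pd

# Local copy of gs://wgspd-wgs-v1/qc_measures/sample_qc_info.txt
qc = pd.read_csv('sample_qc_info.txt', sep='\t')

def mad_outlier(x, n_mads=4):
    # Flag values more than n_mads median absolute deviations from the median.
    med = x.median()
    mad = (x - med).abs().median()
    return (x - med).abs() > n_mads * mad

# Illustrative metrics only, not the exact set used for QC.
metrics = ['rTiTv', 'rHetHomVar', 'rInsertionDeletion']
for m in metrics:
    qc[m + '_outlier'] = qc.groupby('PROJECT1')[m].transform(mad_outlier)

qc['any_outlier'] = qc[[m + '_outlier' for m in metrics]].any(axis=1)
print('{} samples flagged as outliers'.format(qc['any_outlier'].sum()))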

Step 3

We identified related individuals and performed a sex check using the code below:

## Save a high-quality variants dataset in plink and perform sex-check ##
wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')

(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.filter_genotypes('(g.ad[0] + g.ad[1]) / g.dp < 0.9 || (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) || (g.isHet && (g.ad[1] / g.dp < 0.20 || g.pl[0] < 20)) || (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) || g.dp > 200', keep=False)
.variant_qc()
.filter_variants_expr('va.info.QD > 4 && va.qc.callRate > 0.99 && va.qc.dpMean > 20 && va.qc.AF > 0.05 && va.pass && va.qc.AF < 0.95', keep=True)
.export_plink('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_HQ_common')
.impute_sex(maf_threshold=0.05)
.export_samples('gs://wgspd-wgs-v1/qc_measures/sample_sex_fstat.txt','ID=s.id,sexFstat=sa.imputesex.Fstat,isFemale=sa.imputesex.isFemale'))

## Create a set of independent SNPs in plink ##
## This is done on a VM, that's why the paths are different ##
plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--snps-only \
--indep-pairwise 100 50 0.2 \
--out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs

plink -bfile /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common \
--extract /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_idependent_SNPs.prune.in \
--hwe 0.000001 \
--make-bed --out /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned

## Calculate kinship matrix with King ##
king -b /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed --kinship --related --ibs --prefix /mnt/data/aganna/bucket/qc_measures/relatedness

## Save file for the IBD plot below ##
awk '{if($16 > 0.04) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/related_for_plot

## Save file with > 2nd degree relatives for tagging and exclusions ##
awk '{if($16 > 0.177) print $0}' /mnt/data/aganna/bucket/qc_measures/relatedness.ibs > /mnt/data/aganna/bucket/qc_measures/1_2_degree_rel

The plot below shows relationships in the data. We excluded 55 duplicated samples and tagged 1st and 2nd-degree related individuals.

[Plot: pairwise kinship/IBD in the data]
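
For a quick local look at the kinship values behind this plot, here is a minimal sketch assuming pandas and matplotlib. related_for_plot is the awk output saved above (no header row written by awk, although one may have slipped through the filter), and kinship is taken from column 16, matching the $16 used in the awk commands.

import pandas as pd
import matplotlib.pyplot as plt

# Local copy, e.g.: gsutil cp gs://wgspd-wgs-v1/qc_measures/related_for_plot .
pairs = pd.read_csv('related_for_plot', delim_whitespace=True, header=None)

# Column $16 in the awk commands is index 15 here; coerce in case a header row slipped through.
kinship = pd.to_numeric(pairs[15], errors='coerce').dropna()

plt.hist(kinship, bins=100)
plt.axvline(0.177, linestyle='--')  # threshold used above for tagging close relatives
plt.xlabel('KING kinship coefficient')
plt.ylabel('Number of sample pairs')
plt.savefig('kinship_hist.png', dpi=150)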

Step 4

We excluded 80 individuals where the reported gender did not match the genetically derived sex.

The final number of samples was: 10,330
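
A minimal sketch of how such a concordance check can be done locally from the two files exported above, assuming pandas. The coding of the GENDER column is an assumption and should be adjusted to how gender is actually recorded in sample_qc_info.txt.

import pandas as pd

# Local copies of the files exported above.
qc = pd.read_csv('sample_qc_info.txt', sep='\t')     # contains ID and GENDER
sex = pd.read_csv('sample_sex_fstat.txt', sep='\t')  # contains ID, sexFstat, isFemale

merged = qc.merge(sex, on='ID')

# isFemale is exported as true/false text; the GENDER coding below is an assumption.
genetic_female = merged['isFemale'].astype(str).str.lower().eq('true')
reported_female = merged['GENDER'].astype(str).str.lower().eq('female')

mismatch = merged[genetic_female != reported_female]
print('{} samples with discordant reported gender and genetic sex'.format(len(mismatch)))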

Sample QC annotations

Sample-level annotations are already included in the .vds. You can use this command to see which annotations are available:

hail read -i gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds \
printschema

Principal components

We calculated principal components using the code below. We used pruned, autosomal, high-quality variants and excluded long-LD blocks, the MHC region, and a common inversion on chromosome 8. We used ADMIXTURE to annotate ethnicities. We did not include related individuals (1st/2nd degree) in the PCA and excluded samples that did not pass the sample QC above.

wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')

(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc', keep=True)
.annotate_samples_expr('sa = drop(sa, ID)')
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot','ID',impute = True,code = 'sa=merge(sa, table)')
.filter_variants_list('gs://wgspd-wgs-v1/raw/wgspd_wgs_v1_idependent_SNPs.prune.in', keep=True)
.filter_variants_intervals('gs://hail-common/MHC_invchr8_longLDreg.txt', keep=False)
.filter_variants_expr('v.contig == "X" ', keep=False)
.filter_samples_expr('sa.first_2nd_related==1', keep=False)
.pca('sa.pca')
.export_samples('gs://wgspd-wgs-v1/qc_measures/pca.tsv','id=s.id, sa.pca.*'))

cd /mnt/data/aganna/bucket/qc_measures/
admixture /mnt/data/aganna/bucket/raw/wgspd_wgs_v1_HQ_common_pruned.bed 4 -j10 

This is what the first three PCs look like:

[Plot: first three principal components colored by ethnicity]
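
To reproduce a quick local version of this plot from the exported scores, here is a minimal sketch assuming pandas and matplotlib. The PC column names are an assumption (check pca.tsv first, since the exact names depend on how 'sa.pca.*' is expanded on export), and the ethnicity coloring from ADMIXTURE is omitted.

import pandas as pd
import matplotlib.pyplot as plt

# Local copy, e.g.: gsutil cp gs://wgspd-wgs-v1/qc_measures/pca.tsv .
pca = pd.read_csv('pca.tsv', sep='\t')

# Column names assumed to be PC1, PC2, PC3; check pca.columns first.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca['PC1'], pca['PC2'], s=3)
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[1].scatter(pca['PC2'], pca['PC3'], s=3)
axes[1].set_xlabel('PC2')
axes[1].set_ylabel('PC3')
plt.tight_layout()
plt.savefig('pca_first_three_pcs.png', dpi=150)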

Variant QC

The raw multi-allelic-split dataset contains 196,862,507 variants.

We used the following exclusion criteria to perform variant QC:

and the following script:

wgspd_wgs_v1_raw_split = hc.read('gs://wgspd-wgs-v1/raw/vds/wgspd_wgs_v1_split.vds')


(wgspd_wgs_v1_raw_split
.rename_samples('gs://wgspd-wgs-v1/qc_measures/sample_ids_all_spaces_to_underscores.txt')
.annotate_samples_expr(
    """
    sa = drop(sa, ID),
    sa.projectcat=if (sa.projectdGPC_afr_W1==1 || sa.projectdGPC_afr_W2==1 || sa.projectdGPC_afr_W3==1 || sa.projectdGPC_afr_W4==1 || sa.projectdGPC_aframr==1) "GPC_afr" else if (sa.projectdPalotie_MESTA==1 || sa.projectdPalotie_migraine==1 || sa.projectdPalotie_Finrisk_W2==1 || sa.projectdPalotie_Finrisk_W1==1) "Fin" else if (sa.projectdGPC_nfe_W1==1 || sa.projectdGPC_nfe_W2==1 || sa.projectdGPC_nfe_W3==1 || sa.projectdGPC_nfe_WFREE==1) "GPC_nfe" else if(sa.Estonia_SCZControls==1) "Est" else "other"
    """)
.annotate_samples_table('gs://wgspd-wgs-v1/qc_measures/sex_related_annot', 'ID', impute = True,code = 'sa=merge(sa, table)')
.filter_samples_list('gs://wgspd-wgs-v1/qc_measures/samples_to_keep_after_qc')
.filter_variants_intervals('gs://hail-common/LCR.interval_list', keep = False)
.variant_qc()
.annotate_variants_expr(
     """
     va.step0 = va.qc.AC > 0,
     va.step1 = va.qc.AC > 0 && va.pass,
     va.step2 = va.qc.AC > 0 && va.pass && va.info.QD > 4
     """)
 .annotate_global_expr(
     """
     global.numvar_step0 = variants.filter(v => va.step0).count(),
     global.numvar_step1 = variants.filter(v => va.step1).count(),
     global.numvar_step2 = variants.filter(v => va.step2).count()
     """)
  .filter_variants_expr('va.qc.AC > 0 && va.pass && va.info.QD > 4', keep=True)   
  .annotate_variants_expr(
     """
     va.numgsdpo400=gs.filter(g => g.dp > 400).count(),
     va.numgsabhomref=gs.filter(g => (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20))).count(),
     va.numgsabhomvar=gs.filter(g => (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20))).count(),
     va.numgsabhet=gs.filter(g => (g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")).count(),
     va.allgs=gs.count(),
     va.allgshomref=gs.filter(g => g.isHomRef).count(),
     va.allgshomvar=gs.filter(g => g.isHomVar).count(),
     va.allgshet=gs.filter(g => g.isHet).count()
     """)
 .annotate_global_expr(
     """
     global.numgsdpo400=variants.map(v => va.numgsdpo400).sum()/variants.map(v => va.allgs).sum(), 
     global.fracgsabhomref=variants.map(v => va.numgsabhomref).sum()/variants.map(v => va.allgshomref).sum(), 
     global.fracgsabhomvar=variants.map(v => va.numgsabhomvar).sum()/variants.map(v => va.allgshomvar).sum(), 
     global.fracgsabhet=variants.map(v => va.numgsabhet).sum()/variants.map(v => va.allgshet).sum()
     """)
 .filter_genotypes(
     """
     g.dp > 400 ||
     (g.isHomRef && (g.ad[0] / g.dp < 0.9 || g.gq < 20)) ||
     (g.isHomVar && (g.ad[1] / g.dp < 0.9 || g.pl[0] < 20)) ||
     (g.isHet && ( (g.ad[0] + g.ad[1]) / g.dp < 0.9 || g.ad[1] / g.dp < 0.20 || g.pl[0] < 20 || (sa.SEX_FROM_FSTAT == "Male" && ( v.inXNonPar || v.inYNonPar )) )) || (v.inYNonPar && sa.SEX_FROM_FSTAT == "Female")
     """, keep = False)
 .variant_qc()
 .annotate_variants_expr(
     """    
     va.callRate_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").fraction(g => g.isCalled),
     va.callRate_Fin = gs.filter(g => sa.projectcat == "Fin").fraction(g => g.isCalled),
     va.callRate_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").fraction(g => g.isCalled),
     va.callRate_Est = gs.filter(g => sa.projectcat == "Est").fraction(g => g.isCalled),
     va.callRate_other = gs.filter(g => sa.projectcat == "other").fraction(g => g.isCalled),

     va.pHWE_GPC_afr = gs.filter(g => sa.projectcat == "GPC_afr").hardyWeinberg().pHWE,
     va.pHWE_Fin = gs.filter(g => sa.projectcat == "Fin").hardyWeinberg().pHWE,
     va.pHWE_GPC_nfe = gs.filter(g => sa.projectcat == "GPC_nfe").hardyWeinberg().pHWE,
     va.pHWE_Est = gs.filter(g => sa.projectcat == "Est").hardyWeinberg().pHWE,
     va.pHWE_other = gs.filter(g => sa.projectcat == "other").hardyWeinberg().pHWE
     """)
 .annotate_variants_expr(
     """    
     va.step3 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98,

     va.step4 = va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98 && va.callRate_GPC_nfe > 0.98 && va.callRate_Est > 0.98 && va.callRate_other > 0.98 && va.qc.pHWE > 0.000001 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001
     """)
 .annotate_global_expr(
     """
     global.numvar_step3 = variants.filter(v => va.step3).count(),
     global.numvar_step4 = variants.filter(v => va.step4).count(),
     global.callRate098 = variants.filter(v => va.qc.callRate < 0.98).count(),
     global.MAC = variants.filter(v => va.qc.AC == 0).count()
     """)
 .filter_variants_expr('va.qc.pHWE > 0.000001 && va.qc.AC > 0 && va.qc.callRate > 0.98 && va.callRate_GPC_afr > 0.98 && va.callRate_Fin > 0.98  && va.callRate_GPC_nfe > 0.98  && va.callRate_Est > 0.98  && va.callRate_other > 0.98 && va.pHWE_GPC_afr > 0.000001 && va.pHWE_Fin > 0.000001 && va.pHWE_GPC_nfe > 0.000001 && va.pHWE_Est > 0.000001 && va.pHWE_other > 0.000001 && v.contig != "X" ', keep=True)
 .write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')) 

This resulted in the following numbers of variants at the different filtering stages:

Filter                                      Number of variants
LCR and AC=0 removed                        169,691,764
Keep PASS variants                          159,667,836
Keep QD > 4                                 153,236,903
Apply genotype-level filters                0.74% het, 1.21% homref, 0.12% homvar genotypes set to missing
Keep call rate > 98% and AC > 0             132,803,241
Keep HWE P-value > 1x10^-6 and autosomes    129,010,459 (120,006,849 SNPs, 9,003,610 INDELs)
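
As a quick sanity check, the retention at each stage relative to the 196,862,507 raw split variants can be computed directly from the counts in the table above (nothing new is read from the data):

# Variant counts copied from the table above.
raw = 196862507
stages = [
    ('LCR and AC=0 removed', 169691764),
    ('Keep PASS variants', 159667836),
    ('Keep QD > 4', 153236903),
    ('Keep call rate > 98% and AC > 0', 132803241),
    ('Keep HWE P-value > 1e-6 and autosomes', 129010459),
]
for name, n in stages:
    print('{}: {:,} variants ({:.1%} of raw)'.format(name, n, float(n) / raw))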

Functional Annotations (beta)

gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds has been annotated with a large number of functional annotations. We are working on a well-documented annotation pipeline; for now, you can find information about the annotations used here: https://storage.googleapis.com/hail-common/README_annotation_v0.2.txt. The .vds containing annotations for all possible SNVs in the genome can be found here: gs://hail-common/annotation_v0.2.vds.

This is the code we used to produce a hard-call version of the dataset which includes annotations. This is the main .vds used for analysis.

wgspd_wgs_v1_qced_split = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split.vds')

annotations_v02 = hc.read('gs://hail-common/annotation_v0.2.vds')

(wgspd_wgs_v1_qced_split
.hardcalls()
.annotate_variants_vds(annotations_v02,root='va.ann02')
.write('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds'))
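
If you just want to peek at the annotations, a minimal sketch along these lines should work from a Hail session like the one above. The restriction to chromosome 22 is an arbitrary choice for illustration, and the output path is a placeholder for a bucket you can write to.

wgspd_wgs_v1_anno = hc.read('gs://wgspd-wgs-v1/qced/vds/wgspd_wgs_v1_split_hard_anno.vds')

# Restrict to chromosome 22 and export each variant together with its annotation struct.
(wgspd_wgs_v1_anno
.filter_variants_expr('v.contig == "22"', keep=True)
.export_variants('gs://your-bucket/wgspd_anno_chr22_preview.tsv', 'variant = v, ann = va.ann02'))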

The most relevant annotations available are:

.bam files access

All .bam files for this project are available in a separate Google bucket. If you need access to these files, email aganna@broadinstitute.org and I will point you to the right path.