The authors have declared that no competing interests exist.

Expression quantitative trait loci (eQTL) provide insight into transcription regulation and illuminate the molecular basis of phenotypic outcomes. High-throughput RNA sequencing (RNA-seq) is becoming a popular technique to measure gene expression abundance. Traditional eQTL mapping methods for microarray expression data often assume the expression data follow a normal distribution. As a result, for RNA-seq data, total read count measurements can be normalized by normal quantile transformation in order to fit the data using a linear regression. Other approaches model the total read counts using a negative binomial regression. While these methods work well for common variants (minor allele frequencies > 5% or 1%), an extension of existing methodology is needed to accommodate a collection of rare variants in RNA-seq data. Here, we examine 2 approaches that are direct applications of existing methodology and apply them to RNA-seq studies: 1) collapsing the rare variants in the region and using either negative binomial regression or Poisson regression and 2) using the normalized read counts with the Sequence Kernel Association Test (SKAT), the burden test for SKAT (SKAT-Burden), or an optimal combination of these two tests (SKAT-O). We evaluated these approaches via simulation studies under numerous scenarios and applied them to data from the 1000 Genomes Project.

Expression quantitative trait loci (eQTL) studies provide insight into transcription regulation and have the potential to illuminate the molecular basis of phenotypic outcomes.

Traditional eQTL mapping methods for microarray expression data often assume the expression data follow a normal distribution and involve application of linear regression or equivalent approaches for eQTL mapping.

These methods for eQTL mapping in RNA-seq data readily accommodate common variants (minor allele frequency (MAF) > 5%), but were not specifically designed for a collection of rare variants. Several methods have been proposed for testing the association between sequence data with rare variants and a normally distributed or dichotomous outcome. Most methods for analyzing rare variants fall into two categories: burden tests and variance-based tests. Burden tests collapse rare variants within a region and use an indicator function, sum, or proportion as the genetic variable in an association test.

While these methods are used in a wide variety of settings, most of the association methods for rare variants are for normally distributed or dichotomous phenotypes rather than count data like RNA-seq data. Here, we examine 2 extensions of existing methodology to analyze rare variants in RNA-seq data: 1) collapsing the rare variants in the region and using either negative binomial regression or Poisson regression and 2) using the normalized read counts with the Sequence Kernel Association Test (SKAT), the burden test for SKAT (SKAT-Burden), or an optimal combination of these two tests (SKAT-O). We evaluated these 2 approaches via simulation studies under numerous scenarios. We then applied these approaches to the 1000 Genomes Project Consortium data to determine if the genes that show strong differentiation between closely related populations are acting on expression of nearby genes.

Consider a collection of $J$ rare variants in a region. Let $X_{ij} = 0, 1, 2$ for 0, 1, and 2 copies of the disease allele, respectively, for subject $i$ at variant $j$. Let $T_i$ be the total number of reads mapped to this gene for subject $i$, and let $Y_i$ be the normalized read count. Let $Z_i$ be a vector of $K$ covariates for subject $i$.

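We consider modeling the total number of reads $T_i$ by a Poisson distribution or a negative binomial distribution, depending on whether there is significant overdispersion relative to the Poisson distribution. Let $T_i$ have mean $\mu_i$, dispersion parameter $\phi$, and covariate vector $Z_i = (Z_{i1}, \ldots, Z_{iK})^{T}$. When applying either a Poisson or negative binomial regression, one can employ a log link function to acknowledge the fact that $\mu_i > 0$:

$$\log(\mu_i) = \beta_0 + \beta_c\, f(X_i) + Z_i^{T} \gamma,$$

where $T_i$ is the total number of reads mapped to a given gene for subject $i$, $X_i = (X_{i1}, \ldots, X_{iJ})^{T}$ is the genotype vector for subject $i$, and $f(X_i)$ is a function of the rare variants in the region, either the sum of the rare variants, $\sum_{j} X_{ij}$, or an indicator for at least one rare variant, $I\left(\sum_{j} X_{ij} > 0\right)$.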

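The null and alternative hypotheses to test for an association between the read counts and the collapsed rare variants in the region can be written as follows:

$$H_0: \beta_c = 0 \quad \text{versus} \quad H_1: \beta_c \neq 0,$$

where $\beta_c$ is the regression coefficient for the collapsed rare-variant term.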

Parameter estimation can be obtained through maximum likelihood estimation (MLE). Since there is not a closed form for the MLEs, iterative techniques such as the Newton-Raphson algorithm can be used. Hypothesis testing can be done using Wald, score, or likelihood ratio tests.
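The iterative estimation described above can be illustrated for the simplest case: a Poisson regression with a log link, an intercept, and a single collapsed rare-variant covariate. This is a minimal sketch under those assumptions, not the software used in this study. The function names and data are hypothetical, covariates are omitted, and a negative binomial fit would additionally have to estimate the dispersion parameter.

```python
import math
from statistics import NormalDist

def _score_and_information(b0, b1, x, y):
    """Score vector and Fisher information for log(mu_i) = b0 + b1 * x_i."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    u0 = sum(yi - mi for yi, mi in zip(y, mu))
    u1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
    i00 = sum(mu)
    i01 = sum(mi * xi for mi, xi in zip(mu, x))
    i11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
    return (u0, u1), (i00, i01, i11)

def fit_poisson_newton(x, y, max_iter=50, tol=1e-10):
    """Return (b0, b1, se_b1, wald_p) for a Poisson regression with a log link."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(max_iter):
        (u0, u1), (i00, i01, i11) = _score_and_information(b0, b1, x, y)
        det = i00 * i11 - i01 * i01
        if det <= 0:
            raise ValueError("singular information matrix (is x constant?)")
        # Newton-Raphson step: beta <- beta + I^{-1} U, with the 2x2 inverse written out.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard error of b1 from the inverse information at the final estimate.
    _, (i00, i01, i11) = _score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    wald_p = 2.0 * (1.0 - NormalDist().cdf(abs(b1 / se_b1)))  # two-sided Wald test
    return b0, b1, se_b1, wald_p

# Hypothetical data: rare-variant indicator and total read counts for 8 subjects.
burden = [0, 0, 0, 0, 1, 1, 1, 1]
reads = [10, 12, 9, 11, 20, 22, 19, 21]
b0, b1, se_b1, wald_p = fit_poisson_newton(burden, reads)
print(f"b0={b0:.4f} b1={b1:.4f} se={se_b1:.4f} p={wald_p:.4g}")
```

With a single binary covariate the fit can be checked by hand: the intercept should equal the log of the mean count among non-carriers, and the slope should equal the log of the ratio of the two group means.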

The advantage of Approach 1 is that it is easy to implement. Poisson and negative binomial regressions are usually faster to run than SKAT. The disadvantage of Approach 1 is that taking the sum of the rare variants in the region or using an indicator function for at least one rare variant in the region assumes that all of the rare variants influence the phenotype in the same direction.
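The two collapsing functions, and the directionality assumption they carry, can be sketched in a few lines. The genotype matrix and the per-variant effects below are hypothetical values chosen for illustration; they are not taken from the study data.

```python
def burden_sum(genotypes):
    """Total number of rare alleles a subject carries across the region."""
    return sum(genotypes)

def burden_indicator(genotypes):
    """1 if the subject carries at least one rare allele in the region, else 0."""
    return 1 if any(g > 0 for g in genotypes) else 0

# Hypothetical genotypes: rows are subjects, columns are rare variants (0/1/2 copies).
G = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 2],
]
sums = [burden_sum(row) for row in G]
indicators = [burden_indicator(row) for row in G]

# Hypothetical per-variant effects with opposite signs for the first two variants.
effects = [0.5, -0.5, 0.0]
net_effect = [sum(e * g for e, g in zip(effects, row)) for row in G]

print(sums)        # [0, 1, 1, 2, 2]
print(indicators)  # [0, 1, 1, 1, 1]
print(net_effect)
```

Subjects 2 and 3 receive the same collapsed value even though their variants act in opposite directions, and subject 4 carries two rare alleles whose effects cancel. A single regression coefficient on the collapsed variable cannot represent either situation.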

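For the normalized read count $Y_i$, a genotype vector $X_i = (X_{i1}, \ldots, X_{iJ})^{T}$, and a covariate vector $Z_i = (Z_{i1}, \ldots, Z_{iK})^{T}$, the linear model is

$$Y_i = \beta_0 + X_i^{T} \beta_x + Z_i^{T} \gamma + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^{2})$ and $\beta_x = (\beta_1, \ldots, \beta_J)^{T}$ is the vector of regression coefficients for the rare variants.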

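The null and alternative hypotheses to test for an association between the transformed read counts and the rare variants in the region can be written as follows:

$$H_0: \beta_x = 0 \quad \text{versus} \quad H_1: \beta_x \neq 0,$$

where $\beta_x$ is the vector of regression coefficients for the rare variants in the region.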

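In order to increase the power to test the null hypothesis, SKAT assumes that each element $\beta_j$ of the rare-variant coefficient vector $\beta_x$ follows an arbitrary distribution with mean 0 and variance $w_j \tau$, where $\tau$ is a variance component and $w_j$ is a pre-specified weight for variant $j$. Testing $H_0: \beta_x = 0$ is then equivalent to testing the null hypothesis $H_0: \tau = 0$, which can be done with a variance-component score test based on the kernel matrix $K = X W X^{T}$, where $X$ is the genotype matrix for all subjects and $W = \mathrm{diag}(w_1, \ldots, w_p)$ is a weight matrix for the $p$ variants in the region.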

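The choice of the weights plays an important role in SKAT; good choices of the weights can improve power. If the weight for variant $j$, $w_j$, is close to zero, then variant $j$ makes only a small contribution to the test statistic. The weights can be set as $\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; a_1, a_2)$, the beta density function evaluated at $\mathrm{MAF}_j$, where $\mathrm{MAF}_j$ is the sample minor-allele frequency for variant $j$ and the parameters $a_1$ and $a_2$ can vary. For example, to allow rare variants to have a larger effect, one can set $0 < a_1 \le 1$ and $a_2 \ge 1$. The default for the SKAT software in R is to set $a_1 = 1$ and $a_2 = 25$ because this increases the weight of rare variants while still putting nonzero weights on uncommon variants with MAF 1%–5%.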
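To see how strongly the default parameters favor rare variants, the beta-density weights can be evaluated directly. The sketch below uses only the Python standard library and does not call the SKAT R package; the function names are hypothetical.

```python
import math

def beta_density(x, a, b):
    """Beta(a, b) probability density evaluated at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def skat_weight(maf, a1=1.0, a2=25.0):
    """Variant weight w_j, defined through sqrt(w_j) = Beta(MAF_j; a1, a2)."""
    return beta_density(maf, a1, a2) ** 2

for maf in (0.001, 0.01, 0.05, 0.30):
    print(f"MAF={maf:<6} sqrt(w)={beta_density(maf, 1.0, 25.0):8.4f} w={skat_weight(maf):10.4f}")
```

With $a_1 = 1$ and $a_2 = 25$ the density reduces to $25(1 - \mathrm{MAF})^{24}$, so the weight falls steadily as the MAF increases and is close to zero for common variants.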

While the above describes the basis of SKAT, the method has been extended to include a burden test (SKAT-burden) and an optimal test (SKAT-O) that combines the burden test and the traditional version of SKAT.
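The relationship among the three statistics can be sketched from the per-variant score statistics. The example below assumes no covariates, so the null-model residuals are simply deviations from the mean outcome, and it computes only the statistics themselves. P-values are not computed, since they require the null distribution of each statistic (a mixture of chi-square distributions for SKAT), and SKAT-O additionally selects the mixing parameter rho over a grid. All names and data are hypothetical.

```python
def skat_family_statistics(G, y, sqrt_w, rho=0.5):
    """Return (Q_SKAT, Q_burden, Q_rho) for genotype rows G and outcome y.

    sqrt_w holds sqrt(w_j) for each variant; Q_rho = (1 - rho) * Q_SKAT + rho * Q_burden.
    """
    n = len(y)
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]  # residuals under the null (no covariates)
    # Per-variant score statistic S_j = sum_i G_ij * (y_i - y_bar).
    scores = [sum(G[i][j] * resid[i] for i in range(n)) for j in range(len(sqrt_w))]
    q_skat = sum((w * s) ** 2 for w, s in zip(sqrt_w, scores))   # sum of squared weighted scores
    q_burden = sum(w * s for w, s in zip(sqrt_w, scores)) ** 2   # square of the weighted sum
    q_rho = (1.0 - rho) * q_skat + rho * q_burden
    return q_skat, q_burden, q_rho

# Hypothetical example: two rare variants with effects in opposite directions.
G = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
y = [2.0, 2.0, -2.0, -2.0, 0.0, 0.0]
q_skat, q_burden, q_rho = skat_family_statistics(G, y, sqrt_w=[1.0, 1.0], rho=0.5)
print(q_skat, q_burden, q_rho)
```

In this constructed example the two score statistics cancel in the burden statistic but not in the SKAT statistic, which is why the variance-component form retains power when effects differ in direction.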

The advantage of Approach 2 is that it avoids introducing noise by collapsing the rare variants in the region. Approach 2 is more likely to correctly model the rare variants than Approach 1. The disadvantage of Approach 2 is that normalizing the read counts may discard information contained in the raw counts.
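The normalization step that Approach 2 relies on can be illustrated with a rank-based normal quantile transformation. This is a sketch of one common variant of the transformation; the rank offset of 0.5 and the use of average ranks for ties are assumptions, and the exact normalization used in this study may differ.

```python
from statistics import NormalDist

def normal_quantile_transform(counts):
    """Map counts to standard normal quantiles by rank (ties share the average rank)."""
    n = len(counts)
    order = sorted(range(n), key=lambda i: counts[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < n and counts[order[end + 1]] == counts[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]

# Hypothetical total read counts for five subjects.
counts = [5, 100, 20, 20, 3]
print([round(z, 3) for z in normal_quantile_transform(counts)])
```

Only the ordering of the counts survives the transformation: a count of 100 and a count of 1,000 in the top position would map to the same value, which is the loss of information noted above.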

Using the software package SKAT, 1,000 rare-variant datasets were generated from a 3 kb region for each of 5 sample sizes (15, 30, 50, 100, and 500 subjects), with 10% of the markers with MAF < 0.005 designated as causal. There were 58 rare and common variants in the region. We filtered out all common variants (MAF > 5%), which resulted in 54 variants in the region. We considered the following 2 scenarios to simulate the total read counts for each subject for a given gene.
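The filtering step can be sketched as follows, assuming genotypes are coded as 0, 1, or 2 copies of an allele. The function names and the small genotype matrix are hypothetical and are not the simulated data described above.

```python
def minor_allele_frequency(column):
    """MAF for one variant from genotypes coded as 0, 1, or 2 copies of an allele."""
    freq = sum(column) / (2 * len(column))
    return min(freq, 1.0 - freq)

def rare_variant_indices(G, threshold=0.05):
    """Indices of variants kept after removing common variants (MAF > threshold)."""
    n_variants = len(G[0])
    return [
        j for j in range(n_variants)
        if minor_allele_frequency([row[j] for row in G]) <= threshold
    ]

# Hypothetical genotypes for 20 subjects and 3 variants.
G = [[0, 0, 0] for _ in range(20)]
G[0][0] = 1                  # variant 0: one heterozygote, MAF = 1/40
for i in range(10):
    G[i][1] = 1              # variant 1: ten heterozygotes, MAF = 10/40
G[3][2] = 1                  # variant 2: one heterozygote, MAF = 1/40

print(rare_variant_indices(G))  # [0, 2]
```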

In the first scenario (Scenario A), total read counts were generated from a Poisson distribution or a negative binomial distribution such that

$$T_i \sim \mathrm{Poisson}(\mu_i) \quad \text{or} \quad T_i \sim \mathrm{NB}(\mu_i, \phi),$$

where the mean $\mu_i$ depends on $I_{RV}$, an indicator that equals one if the subject has any causal variants within the region, and on the average total read count.

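In the second scenario (Scenario B), we also considered the number of causal variants within the region such that

$$T_i \sim \mathrm{Poisson}(\mu_i) \quad \text{or} \quad T_i \sim \mathrm{NB}(\mu_i, \phi),$$

where the mean $\mu_i$ depends on $n_{cv}$, the number of causal variants in the region.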
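A simulation of this general kind can be sketched with the standard library alone. The mean structure below is an assumption made for illustration: carriers of a causal variant are given a mean equal to the baseline mean times a fold change, which may not match the exact formulas used in the study. The negative binomial is drawn as a gamma-Poisson mixture with variance mu + phi * mu^2, and all function names and parameter values are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method applied to chunks of the mean."""
    total = 0
    while lam > 0:
        chunk = min(lam, 30.0)       # small chunks keep exp(-chunk) well above zero
        limit = math.exp(-chunk)
        k, prod = 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        total += k                   # independent Poisson draws add up to Poisson(lam)
        lam -= chunk
    return total

def sample_negative_binomial(mu, phi, rng):
    """Negative binomial with mean mu and variance mu + phi * mu**2."""
    lam = rng.gammavariate(1.0 / phi, mu * phi)  # shape, scale: gamma mean equals mu
    return sample_poisson(lam, rng)

def simulate_read_counts(carrier, mu, fold_change, phi=None, seed=0):
    """Poisson counts when phi is None, otherwise negative binomial counts."""
    rng = random.Random(seed)
    counts = []
    for is_carrier in carrier:
        mean_i = mu * fold_change if is_carrier else mu  # assumed mean structure
        if phi is None:
            counts.append(sample_poisson(mean_i, rng))
        else:
            counts.append(sample_negative_binomial(mean_i, phi, rng))
    return counts

# Hypothetical setup: 10 non-carriers followed by 10 carriers.
carrier = [0] * 10 + [1] * 10
print(simulate_read_counts(carrier, mu=20.0, fold_change=2.0, phi=0.2, seed=1))
```

Setting the fold change to 1 gives data under the null hypothesis, which is the setting needed to estimate the type 1 error rate of each method.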

Results for all 60 plots of the 540 simulation scenarios considered (i.e. 9 fold changes for all combinations of

The read counts were generated from a Poisson distribution for row 1 and a negative binomial distribution for row 2. All plots were generated from Scenario A with the average number of reads

We applied these proposed approaches to the 1000 Genomes Project Consortium publicly available data.

LCT [chromosome 2] associated with lactose tolerance

FADS cluster [chromosome 11] that may be associated with dietary fat

SLC24A5 [chromosome 15] associated with skin pigmentation

HERC2 [chromosome 15] associated with eye color

The study also found several potentially novel selection signals including:

TRBV9 [chromosome 7]

PRICKLE4 [chromosome 6]

Given these results, we wanted to determine if rare variants within these 6 genes demonstrated association with expression of the respective gene. There was very sparse coverage of the

Below are the p-values for all approaches and 5 genes.

Population | Method | Chr 2 | Chr 6 | Chr 11 | Chr 15 | Chr 15 |
---|---|---|---|---|---|---|
African Ancestry | SKAT-Burden | 0.30 | 0.04 | 0.68 | 0.22 | |
African Ancestry | SKAT-O | 0.46 | 0.07 | 0.48 | 0.34 | |
African Ancestry | SKAT | 0.04 | 0.74 | 0.15 | 0.30 | 0.27 |
African Ancestry | Negative Binomial: Sum | 0.04 | 0.55 | 0.01 | 0.52 | 0.25 |
African Ancestry | Negative Binomial: Indicator | 0.52 | 0.37 | 0.08 | 0.52 | 0.57 |
African Ancestry | Poisson: Sum | 2.2E-19 | 1.0E-42 | 0 | 3.5E-08 | 2.2E-198 |
African Ancestry | Poisson: Indicator | 0.01 | 2.1E-100 | 0 | 3.0E-07 | 1.4E-34 |
European Ancestry | SKAT-Burden | 0.51 | 0.51 | 0.77 | 0.50 | |
European Ancestry | SKAT-O | 0.56 | 0.60 | 1.00 | 0.73 | |
European Ancestry | SKAT | 0.18 | 0.43 | 0.39 | 0.97 | 0.57 |
European Ancestry | Negative Binomial: Sum | 0.12 | 0.62 | 0.38 | 0.62 | 0.63 |
European Ancestry | Negative Binomial: Indicator | 0.41 | 0.62 | 0.41 | 0.21 | 0.70 |
European Ancestry | Poisson: Sum | 1.5E-19 | 1.4E-38 | 1.9E-164 | 8.5E-05 | 1.2E-40 |
European Ancestry | Poisson: Indicator | 4.7E-05 | 1.4E-38 | 1.5E-141 | 4.5E-19 | 1.9E-23 |

None of the other regions (

It has been previously shown that an increase in power can be achieved for eQTL mapping with RNA-seq data by using a negative binomial regression with the total read count instead of normalizing the read count and using a linear regression.

Based on the simulation studies that were performed, the SKAT (SKAT, SKAT-O, and SKAT-Burden) methods using the normalized read counts were the only methods that maintained the type 1 error rate in all scenarios for all sample sizes. Therefore, we recommend using SKAT with normalized read counts over the other approaches considered here. The data analysis further supported this recommendation. The 2 methods that found a significant association of rare variants in the

For our simulation studies, we generated the rare variant data such that rare variants in the region acted in the same direction. If rare variants in the region were not directionally consistent, this could be an issue for the negative binomial regressions which collapsed rare variants in the region. However, this would not be an issue for the SKAT approaches, which further supports our recommendation.

We considered 2 approaches: (1) collapsing the rare variants in the region and using negative binomial regression or Poisson regression and (2) using the normalized read counts with SKAT, SKAT-O, or SKAT-Burden. Given that SKAT is based on generalized linear mixed models, another approach would be to extend SKAT for negative binomial traits.

The file contains the results for all 60 plots of the 540 simulation scenarios considered (i.e. 9 fold changes for all combinations of

(PDF)

Research reported in this publication was supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health under Award Numbers K01HL125858 and R21HL113543. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We would also like to thank Pamela H. Russell at National Jewish Health for her help with the data set.