
The authors have declared that no competing interests exist.

Conceived and designed the experiments: ES NB. Analyzed the data: ES. Contributed reagents/materials/analysis tools: ES. Wrote the paper: ES NB.

In large collections of tumor samples, it has been observed that sets of genes that are commonly involved in the same cancer pathways tend not to occur mutated together in the same patient. Such gene sets form mutually exclusive patterns of gene alterations in cancer genomic data. Computational approaches that detect mutually exclusive gene sets rank and test candidate alteration patterns by rewarding the number of samples the pattern covers and by penalizing its impurity, i.e., additional alterations that violate strict mutual exclusivity. However, existing approaches do not account for possible observation errors. In practice, false negatives and especially false positives can severely bias the evaluation and ranking of alteration patterns. To address these limitations, we develop a fully probabilistic, generative model of mutual exclusivity that explicitly takes coverage, impurity, and error rates into account, and we devise efficient algorithms for parameter estimation and pattern ranking. Based on this model, we derive a statistical test of mutual exclusivity by comparing its likelihood to that of a null model that assumes independent gene alterations. In extensive simulations, the new test is more powerful than a previously applied permutation test. When applied to detect mutual exclusivity patterns in glioblastoma and in pan-cancer data from twelve tumor types, we identify several significant patterns that are biologically relevant, most of which would not be detected by previous approaches. Our statistical modeling framework of mutual exclusivity provides increased flexibility and power to detect cancer pathways from genomic alteration data in the presence of noise. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.

Tumor DNA carries multiple alterations, including somatic point mutations, amplifications, and deletions. It is challenging to distinguish the disease-causing alterations from the plethora of random ones, and to delineate their functional relations and involvement in common pathways. One solution to this task is inspired by the observation that genes from the same cancer pathway tend not to be altered together in the same patient, and thus form patterns of mutually exclusive alterations across patients. Mutual exclusivity may arise because alteration of only one pathway component is sufficient to deregulate the entire process. Detecting such patterns is an important step in

Recent years in cancer research are characterized by both accumulation of data and growing awareness of its overwhelming complexity. While consortia like The Cancer Genome Atlas (TCGA)

One systematic approach to address the latter task is to search for mutually exclusive patterns in cancer genomic data

Previous studies identified mutually exclusive patterns either via integrated analysis of known cellular interactions and genomic alteration data

To our knowledge, there exists no approach that explicitly models the generative process of mutual exclusivity patterns. In the absence of a statistical model of the data, the definition of the weight, although intuitively reasonable, remains arbitrary. In previous studies, the weight also served as the statistic for a column-wise permutation test that assesses the significance of patterns. We show that the power of this test decreases with the number of genes, likely because the weight does not scale with the number of genes, so that the same impurity level affects it more strongly when the pattern contains more genes. Most importantly, none of the existing approaches deals with the problem of errors in the data. Despite advanced methodologies on both experimental and computational side

Here, we develop two alternative models for cancer alteration data (

First, we evaluate performance of our approach in the case when, as it is done in the literature, the data is assumed to record no false positive or negative alterations. On simulated patterns our mutual exclusivity test proves more powerful than the weight-based permutation test. In glioblastoma multiforme data

A mutual exclusivity pattern can be detected in a given cancer alteration dataset, with

We assume that the mutual exclusivity patterns are the result of the following generative process (

We propose a generative model of mutual exclusivity that describes the process illustrated in

Encouraged by this result, we propose an expectation maximization algorithm (Methods) to estimate the maximum likelihood parameter values and evaluate its performance in practice (

In the case when the dataset does not carry the mutual exclusivity pattern, we assume that the corresponding genes are mutated independently with their individual alteration frequencies. This is modeled with a set of independent, observed binary random variables

We evaluate our mutual exclusivity model and statistical test in three different scenarios. First, we make an assumption prevalent in the literature, namely that the data is generated without errors. In the second scenario, we assume that the data contains errors, and the error rates are given. Finally, we consider the scenario where the data is generated with errors, and the error rates are unknown.
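The three scenarios differ only in which error rates are applied to the data and whether they are treated as known. The following is a minimal simulation sketch, not the authors' implementation. It assumes a generative process in which each sample is covered with probability `coverage`, a covered sample carries one exclusive alteration in a uniformly chosen gene, every other gene of a covered sample is additionally altered with probability `impurity`, and each entry is then observed with false positive rate `fp_rate` and false negative rate `fn_rate`. The function name and all parameter names are hypothetical.

```python
import random

def simulate_me_data(n_samples, n_genes, coverage, impurity,
                     fp_rate=0.0, fn_rate=0.0, seed=0):
    """Simulate true and observed binary alteration rows (hypothetical helper).

    The generative steps below are an assumed reading of the model,
    labeled as such; they are not quoted from the paper.
    """
    rng = random.Random(seed)
    true_rows, observed_rows = [], []
    for _ in range(n_samples):
        row = [0] * n_genes
        # A sample is covered by the pattern with probability `coverage`.
        if rng.random() < coverage:
            # The exclusive alteration hits one gene, chosen uniformly.
            exclusive = rng.randrange(n_genes)
            for g in range(n_genes):
                # Other genes of a covered sample are altered with
                # probability `impurity`.
                if g == exclusive or rng.random() < impurity:
                    row[g] = 1
        # Observation step: flip entries according to the error rates.
        observed = []
        for value in row:
            if value == 1:
                observed.append(0 if rng.random() < fn_rate else 1)
            else:
                observed.append(1 if rng.random() < fp_rate else 0)
        true_rows.append(row)
        observed_rows.append(observed)
    return true_rows, observed_rows

# Scenario 1: error-free data (both error rates left at zero).
true_rows, observed_rows = simulate_me_data(200, 4, coverage=0.6, impurity=0.05)

# Scenarios 2 and 3: data with errors; the rates are either passed on to
# the estimation step (known) or estimated from the data (unknown).
_, noisy_rows = simulate_me_data(200, 4, coverage=0.6, impurity=0.05,
                                 fp_rate=0.02, fn_rate=0.02)
```

With both error rates at zero, the observed rows equal the true rows, which corresponds to the error-free assumption of the first scenario.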

First, we evaluate the performance of our mutual exclusivity model on simulated data, assuming that the data is free of errors. In this case, the model is reduced, since it is parametrized only by the coverage

For datasets with only three genes and low coverage, neither our ME test nor the permutation test always detects mutual exclusivity (

Both tests correctly do not support mutual exclusivity for datasets generated from the independence model (

We further use our model to identify significant mutual exclusivity patterns with high coverage and low impurity in glioblastoma multiforme samples from The Cancer Genome Atlas (TCGA

To obtain a comprehensive picture of the types of patterns that can be found in this dataset, we restricted the gene set size to four, and evaluated all 1,837,620 possible gene subsets of this size.
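The number of candidate gene sets is a binomial coefficient, so the size of the exhaustive search can be checked directly. The short sketch below assumes the 83 genes stated in the data description for glioblastoma and the 20 pre-selected genes with subsets of size five used in the pan-cancer analysis; the helper name is hypothetical.

```python
from math import comb

def n_gene_subsets(n_genes, subset_size):
    """Number of distinct gene subsets of a given size (hypothetical helper)."""
    return comb(n_genes, subset_size)

# Glioblastoma: all subsets of size four out of 83 genes.
print(n_gene_subsets(83, 4))  # 1837620

# Pan-cancer pre-filtering: all subsets of size five out of 20 genes.
print(n_gene_subsets(20, 5))  # 15504
```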

| Gene set | Coverage | Impurity | ME statistic | ME p-value | Weight | Perm. p-value | Imbalance |
|  | 0.97 | 0.01 | −14.72 | 1 | 221 | 0 | 0.93 |
|  | 0.44 | 0.11 | 2.63 | 0.04 | 69 | 1 | 0.28 |
|  | 0.44 | 0.11 | 2.84 | 0.02 | 68 | 1 | 0.27 |
|  | 0.5 | 0.13 | 2.63 | 0.04 | 72 | 1 | 0.28 |

Our analysis did not rediscover four mutually exclusive gene sets (

In this section, we consider the scenario where the data are erroneous, and the error rates are known and can be used for pattern evaluation.

Fixing the parameters

The observed weight, which was applied in previous studies, ignores errors and scores only the observed coverage and impurity. In contrast, our approach uses the known error rates to estimate the true parameters and ranks the patterns correctly. The data was simulated from the mutual exclusivity model with parameter values fixed to

Finally, we consider the scenario where the observed data contains errors that occur at unknown rates. In this case we need to estimate all four model parameters, and we proved the model to be identifiable from the data (

The ME test p-values for

We applied our approach accounting for false positives to pan-cancer genomic alteration data

We aimed to collect universal, low-impurity mutual exclusivity patterns for gene sets of size five that cover multiple cancer samples, accounting for possible false positives. We first pre-filtered the immense set of all possible subsets, starting with fitting the reduced model (assuming no errors in the data) for all 15,504 subsets of 20 measured genes that were selected by their large individual alteration frequency (

Findings reported for pairs of genes in various cancers support the view that the top patterns indicate coexistence in a common cancer pathway. For instance, for the pattern in

This work makes two main contributions. The first is a probabilistic, generative model of mutual exclusivity, with readily interpretable parameters that represent pattern coverage and impurity, as well as parameters that account for false positive and false negative rates. When the data is free of errors, we give closed-form expressions for the maximum likelihood estimates of coverage and impurity. For erroneous data, we propose an EM algorithm for parameter estimation. We prove analytically that the model parameters are identifiable, and show the limits of parameter estimation in practice, where sample sizes are small. Within these limits, the most troublesome parameter, the false positive rate, can be estimated accurately, as can the coverage and impurity parameters, which are most useful for pattern ranking. The second contribution is the ME test, which assesses the significance of mutual exclusivity patterns by comparing the likelihood of the dataset under the mutual exclusivity model to its likelihood under the null model of independent gene alterations. The proposed test proves to be more powerful than a previously applied permutation test.
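To make the error-free case concrete, the sketch below fits a reduced model and compares it to the independence model. It is an illustration under stated assumptions, not the authors' code. It assumes that an uncovered sample has probability 1 − γ, and that a covered sample with k altered genes out of n has probability γ (k/n) δ^(k−1) (1−δ)^(n−k), where γ is the coverage and δ the impurity. Under that assumed form, the maximum likelihood estimates are the fraction of samples with at least one alteration and the fraction of additional alterations among covered samples. The comparison uses a Vuong-style statistic for non-nested models, which is likewise an assumption here; the text states only that the two models are not nested. All function names are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def fit_reduced_model(rows):
    """Closed-form coverage and impurity estimates under the assumed model."""
    n = len(rows[0])
    covered = [sum(r) for r in rows if sum(r) > 0]
    coverage = len(covered) / len(rows)
    if not covered or n == 1:
        return coverage, 0.0
    # Additional alterations divided by the number of slots they could fill.
    impurity = sum(k - 1 for k in covered) / (len(covered) * (n - 1))
    return coverage, impurity

def me_loglik(row, coverage, impurity):
    """Log-probability of one sample under the assumed reduced model."""
    n, k = len(row), sum(row)
    if k == 0:
        return math.log(1.0 - coverage)
    return (math.log(coverage) + math.log(k / n)
            + xlogy(k - 1, impurity) + xlogy(n - k, 1.0 - impurity))

def indep_loglik(row, freqs):
    """Log-probability of one sample if genes are altered independently."""
    return sum(xlogy(x, f) + xlogy(1 - x, 1.0 - f) for x, f in zip(row, freqs))

def vuong_statistic(rows):
    """Vuong-style statistic; positive values favor mutual exclusivity."""
    n_samples = len(rows)
    coverage, impurity = fit_reduced_model(rows)
    freqs = [sum(col) / n_samples for col in zip(*rows)]
    diffs = [me_loglik(r, coverage, impurity) - indep_loglik(r, freqs)
             for r in rows]
    mean = sum(diffs) / n_samples
    var = sum((d - mean) ** 2 for d in diffs) / n_samples
    # Raises ZeroDivisionError if all per-sample differences are equal.
    return sum(diffs) / math.sqrt(n_samples * var)

# A perfectly exclusive toy pattern in three genes plus uncovered samples.
rows = [[1, 0, 0]] * 10 + [[0, 1, 0]] * 10 + [[0, 0, 1]] * 10 + [[0, 0, 0]] * 10
print(fit_reduced_model(rows))
print(vuong_statistic(rows))
```

In this toy dataset the estimated coverage is 0.75 and the estimated impurity is 0, and the statistic is positive. A one-sided p-value could then be taken from the standard normal distribution, again under the assumption that a Vuong-type test is appropriate.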

Our approach was first applied to identify mutually exclusive patterns that are specific for glioblastoma, with the assumption prevalent in the literature that the data does not contain errors. The genes that show the top identified patterns are involved in canonical glioblastoma signaling pathways, with addition of two novel genes,

The proposed mutual exclusivity model could be extended in several ways. For instance, the current model explicitly assumes that a mutually exclusive mutation is equally likely to occur in any gene of the dataset. This assumption has two important advantages. First, the ME test finds the most evidence for mutual exclusivity in balanced patterns, where the genes contribute similarly to the coverage. Second, with this assumption our EM algorithm is very efficient (Methods), and dropping it would increase its time complexity. The model may be extended to allow gene-specific rates of mutually exclusive mutations as parameters, which would be estimated from the data. Another possible extension would allow for multiple gene sets, each with its own coverage and impurity parameters, and the same error rates. Such a model, in contrast to previous work in this direction

The TCGA provisional glioblastoma data for 236 patients in 83 genes includes somatic point mutations (identified as significant by MutSig

Let

In the reduced model we know

By Proposition (1), we have that for

iteration parameters

Estimate

Draw at random


The independence model assumes all genes are mutated independently. Each gene

The mutual exclusivity and independence models are not nested. To compare their likelihoods for a given dataset

For a given set of genes

A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5 [26].
