OMR developed the statistical model, did most of the programming, and conducted analysis. RDU conceived the model, participated in model development and programming, and conducted simulations. Both authors wrote the paper.

The authors have declared that no competing interests exist.

Genomic DNA copy-number alterations (CNAs) are associated with complex diseases, including cancer: CNAs are indeed related to tumoral grade, metastasis, and patient survival. CNAs discovered from array-based comparative genomic hybridization (aCGH) data have been instrumental in identifying disease-related genes and potential therapeutic targets. To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question, “What is the probability that this gene/region has CNAs?” Current approaches fail, however, to meet these requirements. Here, we introduce reversible jump aCGH (RJaCGH), a new method for identifying CNAs from aCGH; we use a nonhomogeneous hidden Markov model fitted via reversible jump Markov chain Monte Carlo; and we incorporate model uncertainty through Bayesian model averaging. RJaCGH provides an estimate of the probability that a gene/region has CNAs while incorporating interprobe distance and the capability to analyze data on a chromosome or genome-wide basis. RJaCGH outperforms alternative methods, and the performance difference is even larger with noisy data and highly variable interprobe distance, both commonly found features in aCGH data. Furthermore, our probabilistic method allows us to identify minimal common regions of CNAs among samples and can be extended to incorporate expression data. In summary, we provide a rigorous statistical framework for locating genes and chromosomal regions with CNAs with potential applications to cancer and other complex human diseases.

Alterations in the number of copies (gains, losses) of genomic DNA have been associated with several hereditary anomalies and are involved in human cancers [

A widely used technique to identify copy-number changes in genomic DNA is array-based comparative genomic hybridization (aCGH). Two DNA samples (e.g., problem and control) are differentially labeled (often with fluorescent dyes) and competitively hybridized to chromosomal DNA targets. After hybridization, emission from each of the two fluorescent dyes is measured, and the signal intensity ratios are indicative of the relative copy number of the two samples [

The main biomedical problem, both for the study of the CNAs per se and for downstream analysis (e.g., relationship with gene expression changes or patient classification), is the accurate identification of the genes/chromosomal regions that have an altered copy number. Satisfactorily dealing with this problem requires a method that (1) provides direct answers that can be used in different settings (e.g., clinical versus basic research), (2) reflects the underlying biology and accounts for key features of the technological platform, and (3) can accommodate the different levels of analysis (types of questions) addressed with these data.

First, estimates of the probabilities of alteration (instead of

Second, the analysis should incorporate distance between probes [

Third, depending on the focus of the study, the analysis should be conducted either chromosome by chromosome, or genome-wide [

Available methods for the analysis of aCGH fail some or most of these requirements. Smoothing techniques [

We have developed a method, reversible jump aCGH (RJaCGH), that fulfills the three requirements above, and does not suffer from the limitations discussed for other methods. Our method is applicable to aCGH from platforms including ROMA, oligonucleotide aCGH (oaCGH; including Agilent, NimbleGen, and many noncommercial, in-house oligonucleotide arrays), bacterial artificial chromosome (BAC), and cDNA arrays [_{2} value, with some random noise. We want to use the observed log-ratios to identify regions with altered copy number.
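The data model described above (each probe carries the value of its hidden state plus random noise) can be illustrated with a small simulation. This is only a sketch: the three states, their mean log2 ratios, the noise standard deviation, and the function name `simulate_log_ratios` are illustrative assumptions, not values or code from RJaCGH.

```python
import random


def simulate_log_ratios(state_path, state_means, noise_sd, seed=0):
    """Return one noisy log2 ratio per probe, given each probe's hidden state."""
    rng = random.Random(seed)
    return [rng.gauss(state_means[s], noise_sd) for s in state_path]


# Hypothetical states: 0 = loss, 1 = no change, 2 = gain (means are assumed).
state_means = {0: -0.6, 1: 0.0, 2: 0.58}
# Probes ordered along a chromosome: normal, gained, normal, lost.
state_path = [1] * 5 + [2] * 4 + [1] * 3 + [0] * 3
log_ratios = simulate_log_ratios(state_path, state_means, noise_sd=0.15)
```

The inference problem is the reverse of this simulation: given only `log_ratios` (and probe positions), recover the probability that each probe belongs to an altered state.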

The biological features of this model (a finite number of unknown or hidden states that are indirectly measured, with states of close elements likely to be similar, and variable distances between probes) can be modeled with a nonhomogeneous HMM [

We applied RJaCGH and the best performing alternative methods (based on two recent reviews [

Results in

Shown are the mean and 95% confidence interval around the mean of the correct classification error rate. Each mean and confidence interval is computed from 500 datasets [

The same data are shown as those in

This paper focuses on the statistical performance of the methods compared. In terms of speed, however, our approach is clearly the slowest. We are currently working on improving execution speed, both by using more efficient algorithms and by using parallel computing.

Similar results are obtained when applying these methods to a real dataset of nine cell lines [

The excellent performance of RJaCGH is a result of the statistical method used, which is essentially a careful and rigorous development from first principles. We set out to obtain a method that allows us to seamlessly incorporate interprobe distances (to allow usage over varied technological platforms), that makes no untenable assumptions about the true number of copy levels (since this is likely to vary between datasets), that permits analysis at the chromosome and the genome level, and, finally, that returns posterior probabilities of alteration, because these posterior probabilities constitute the direct answer to the basic biomedical question (“Is this gene likely to have an altered copy number?”).

Based simply on our usage of interprobe distance, we should expect RJaCGH to perform better than all alternative approaches, with the possible exception of BioHMM [

In addition, we use Bayesian model averaging, which has been repeatedly shown [

In addition to features that can be compared with other methods, RJaCGH has two features that set it apart from most alternative approaches. First, the user can analyze data at either the genome or the chromosome level, thus addressing different types of questions. Some approaches (e.g., BioHMM, HMM, GLAD, DNAcopy) allow genome-wide inferences, but they essentially rely on ad hoc postprocessing of results from analyses conducted at the chromosome level. Second, one of the main features of RJaCGH, the posterior probabilities of CNAs that it returns, simply cannot be compared with the output of most alternative methods, because they do not provide this type of output. What most alternative approaches return are smoothed means,

Directly returning probabilities of alterations has profound consequences, both for current practices and for future developments. As argued above, these probabilities are the direct answer to the question “Does this gene have an altered copy number?”;

For currently active research areas, the availability of rigorously obtained probabilities of alterations has far-reaching consequences, both in terms of the biological phenomena that can be exposed and as an avenue of further research. First, the availability of probabilities of alteration should improve the identification of regions with consistent alterations across samples [

Second, posterior probabilities of being in a specific state, together with the estimated posterior mean of each state, can be used as the basis for a statistically rigorous and biologically sound approach for identifying breakpoints. At present, the identification of breakpoints depends completely on the resolution of the method and does not allow us to combine the probability of membership in different states with the biological relevance of an estimated mean difference; however, the precise definition of boundaries and amplification maxima is important not only for the study of genomic copy numbers, but also for understanding the relationship between aCGH and expression data [

Third, the model of RJaCGH can be extended to provide rigorous downstream analysis of aCGH, including patient classification [_{2} aCGH ratios that reflect underlying copy number variation; as such, these are highly relevant to the recent studies on the relationship between copy-number variation and complex phenotypes [

We use a nonhomogeneous HMM with Gaussian emissions. We can either fit one model to all the chromosomes of an array, or we can fit a different model for each chromosome of an array. Let _{i}_{i}_{i}_{=1,...,n}. Let _{i}_{i}

We assume that {_{i}_{i}_{i}_{i}_{−1} = _{i}_{−1}, _{i}_{−1} = _{i}_{−1}) = _{i}_{−1}, so the dependence of the state of a probe onto the next one is lower the farther the probes are. We also expect that when the distance between two probes is maximal, the state of a probe should be independent from the state of its predecessor. Thus, we model the transition probabilities as:
_{ij}_{i}_{i}
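A minimal sketch of a distance-dependent transition matrix with the two properties stated above: dependence on the previous state weakens as the normalized interprobe distance grows, and the next state is independent of the previous one at the maximal distance. The functional form used here (weights proportional to exp(-beta * (1 - x)), with nonnegative `beta` and a zero diagonal) is an assumption chosen because it satisfies those properties; the names `transition_matrix` and `beta` are hypothetical.

```python
import math


def transition_matrix(beta, x):
    """Row-stochastic transition matrix for a normalized distance x in [0, 1].

    Assumed form: Q[i][j] is proportional to exp(-beta[i][j] * (1 - x)),
    with beta[i][j] >= 0 and beta[i][i] == 0.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a normalized distance in [0, 1]")
    rows = []
    for beta_row in beta:
        weights = [math.exp(-b * (1.0 - x)) for b in beta_row]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


# Illustrative parameters for three hidden states (assumed values).
beta = [[0.0, 3.0, 4.0],
        [3.0, 0.0, 3.0],
        [4.0, 3.0, 0.0]]
adjacent = transition_matrix(beta, 0.0)  # close probes: staying is favored
distant = transition_matrix(beta, 1.0)   # maximal distance: every row is uniform
```

With this form, every row of `distant` is uniform, so the state of a probe carries no information about the next one, whereas in `adjacent` the diagonal entries dominate.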

Similar approaches have been used before with nonhomogeneous HMM [

For computational reasons and modeling flexibility, we opted for Bayesian methods using MCMC. To fit models with varying number of hidden states, we used reversible jump. Suppose that we have a collection of ^{2}(^{2}), where _{i}^{2}(^{2}(_{i}

We can draw samples from the posterior distribution through a reversible jump MCMC (RJMCMC) algorithm [

(1) Update HMM of a model using a series of Metropolis-Hastings moves. (We do not use a Gibbs sampler, so that the hidden state sequence does not become part of the state space of the sampler; this reduces dimensionality and makes convergence easier to reach.)

(2) Update model (birth/death). When we have _{birth}_{death}

(3) Update model (split/combine). A split/combine move is attempted with probabilities _{split}_{combine}_{0} is split into two, _{1}, _{2}:

The combination of birth and split moves makes it possible not only to visit models with a different number of parameters, but also to explore the posterior distribution more thoroughly when a parameter has a multimodal density.
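The within-model update of step (1) can be sketched as follows. One standard way to keep the hidden state sequence out of the sampler is to evaluate the likelihood with the forward algorithm, which sums over all state paths, and to use that likelihood in a Metropolis-Hastings acceptance ratio. The sketch below shows a single random-walk update of one state mean; the function names, the proposal scale `step_sd`, and the user-supplied `log_prior` are hypothetical, and the birth/death and split/combine moves are not shown.

```python
import math
import random


def gaussian_logpdf(y, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((y - mean) / sd) ** 2


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def hmm_loglik(ys, means, sds, trans_for_step, init):
    """Forward algorithm: log p(ys | parameters), hidden states summed out.

    trans_for_step(i) returns the transition matrix used to move from
    probe i - 1 to probe i, so it may depend on the interprobe distance.
    """
    k = len(means)
    alpha = [math.log(init[s]) + gaussian_logpdf(ys[0], means[s], sds[s])
             for s in range(k)]
    for i in range(1, len(ys)):
        q = trans_for_step(i)
        alpha = [
            logsumexp([alpha[r] + math.log(q[r][s]) for r in range(k)])
            + gaussian_logpdf(ys[i], means[s], sds[s])
            for s in range(k)
        ]
    return logsumexp(alpha)


def mh_update_mean(ys, means, sds, trans_for_step, init, which, step_sd, rng, log_prior):
    """One random-walk Metropolis-Hastings update of means[which].

    The proposal is symmetric, so the acceptance ratio is the posterior ratio.
    """
    proposal = list(means)
    proposal[which] = means[which] + rng.gauss(0.0, step_sd)
    log_ratio = (
        hmm_loglik(ys, proposal, sds, trans_for_step, init) + log_prior(proposal)
        - hmm_loglik(ys, means, sds, trans_for_step, init) - log_prior(means)
    )
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return list(means), False
```

Because the hidden states never enter the sampler, only the means, variances, and transition parameters need to be updated within a model of fixed size.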

These moves are common ones [

We run the preceding algorithm for a large number of iterations (e.g., 50,000) and, after discarding the first iterations as burn-in, keep the last samples (e.g., 10,000) as observations from the joint posterior distribution, from which we make inferences. For every model that has been visited, we obtain the posterior distributions of the mean copy number of every state, the variance of the copy number of every state, and the function of transitions between hidden states. By counting the number of times that each model has been visited, we obtain an estimate of the posterior probability of each model (i.e., we avoid using the Bayesian information criterion [BIC] or the Akaike information criterion [AIC]). Then, applying the Viterbi algorithm [

When obtaining posterior probabilities of copy-number change, we use Bayesian model averaging [_{i}_{k}_{i}_{k}
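The model-averaging step can be sketched under simple assumptions: the posterior probability of each model is estimated by the fraction of retained sweeps spent in it, and the per-probe probability of alteration is the average of the within-model probabilities weighted by those model probabilities. The function names and the numbers in the example are hypothetical and are not results from the paper.

```python
from collections import Counter


def model_posterior(visited_k):
    """Estimate P(k | data) as the fraction of kept sweeps spent in model k."""
    counts = Counter(visited_k)
    total = len(visited_k)
    return {k: n / total for k, n in counts.items()}


def averaged_alteration_probability(visited_k, altered_prob_by_model):
    """Bayesian model averaging of per-probe alteration probabilities.

    altered_prob_by_model[k][i] is P(probe i altered | model k, data).
    """
    weights = model_posterior(visited_k)
    n_probes = len(next(iter(altered_prob_by_model.values())))
    return [
        sum(weights[k] * altered_prob_by_model[k][i] for k in weights)
        for i in range(n_probes)
    ]


# Illustrative input: number of hidden states visited in 10,000 kept sweeps.
visited_k = [2] * 2500 + [3] * 6000 + [4] * 1500
# Illustrative within-model probabilities of alteration for two probes.
altered_prob_by_model = {2: [0.1, 0.9], 3: [0.2, 0.8], 4: [0.0, 1.0]}
averaged = averaged_alteration_probability(visited_k, altered_prob_by_model)
```

Averaging in this way means that no single number of hidden states has to be selected before reporting the probability that a probe is altered.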

As in any MCMC approach, it is crucial to assess convergence of the sampler. We follow common practice [_{μ}_{β}

We have implemented RJaCGH using C (for the sweep algorithm) and R [

C. Lázaro-Perea, A. Alibés, L. Hsu, D. Grove, two anonymous reviewers, and J. F. Poyatos especially provided discussion and comments on the paper. RDU is partially supported by the Ramón y Cajal programme of the Spanish Ministry of Education and Science (MEC).

aCGH, array-based comparative genomic hybridization

CNA, copy-number alteration

HMM, hidden Markov model

MCMC, Markov chain Monte Carlo

RJaCGH, reversible jump aCGH

RJMCMC, reversible jump MCMC