EB and RT conceived and designed the experiments. EB performed the experiments and analyzed the data. EB and RT contributed reagents/materials/analysis tools. EB wrote the paper.

The authors have declared that no conflicts of interest exist.

An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.

Procedures that utilize both gene expression data and clinical data to identify subtypes of cancer can provide more accurate prognoses.

When a patient is diagnosed with cancer, various clinical parameters are used to assess the patient's risk profile. However, patients with a similar prognosis frequently respond very differently to the same treatment. This may occur because two apparently similar tumors are actually completely different diseases at the molecular level (

The main example discussed in this paper concerns diffuse large B-cell lymphoma (DLBCL). This is the most common type of lymphoma in adults, and it can be treated by chemotherapy in only approximately 40% of patients (

If different subtypes of cancer are known to exist, there are a variety of existing techniques that can be used to identify which subtype is present in a given patient (

There are two main approaches in the literature to identify such subtypes. One approach uses unsupervised learning techniques, such as hierarchical clustering, to identify patient subgroups. This type of procedure is called “unsupervised” since it does not use any of the clinical information about the patient. The subgroups are identified using only the gene expression data. (In contrast, “supervised learning” would use the clinical data to build the model.) For an overview of unsupervised learning techniques, see

Hierarchical clustering (

The second approach to identifying subtypes of cancer is based exclusively on the clinical data. For example, patients can be assigned to a “low-risk” or a “high-risk” subgroup based on whether they were still alive or whether their tumor had metastasized after a certain amount of time. This approach has also been used successfully to develop procedures to diagnose patients (

However, by dividing the patients into subgroups based on their survival times, the resulting subgroups may not be biologically meaningful. Suppose, for example, that there are two tumor cell types. Suppose further that patients with cell type 2 live slightly longer than patients with cell type 1 but that there is considerable overlap between the two groups (

To overcome these difficulties, we propose a novel procedure that combines both the gene expression data and the clinical data to identify cancer subtypes. The crux of the idea is to use the clinical data to identify a list of genes that correlate with the clinical variable of interest and then apply unsupervised clustering techniques to this subset of the genes.

For instance, in many studies, the survival times of the patients are known even though no tumor subtypes have been identified (

Once such a list of significant genes is compiled, there are several methods we can use to identify clinical subgroups. We can apply clustering techniques to identify subgroups of patients with similar expression profiles. Once such subgroups are identified, we can apply existing supervised learning techniques to classify future patients into the appropriate subgroup. In this study, we will use the “nearest shrunken centroids” procedure of

Sometimes, however, a continuous predictor of survival is desired. We also describe a supervised version of principal components analysis that can be used to calculate a continuous risk score for a given patient and identify subtypes of cancer. The resulting predictor performs very well when applied to several published datasets.

These two methods will produce satisfactory results in most datasets. However, we will describe some variations of these methods that can sometimes improve their performance. When we cluster a dataset using only a subset of the genes, it is important that we choose the correct subset of genes. Choosing the genes with the largest Cox scores is generally a good strategy, but this procedure sometimes selects some spurious genes. We will show that one can use partial least squares (PLS) to compute a “corrected” Cox score. Selecting the genes with the largest “corrected” Cox scores can produce better clusters than selecting genes with largest raw Cox scores. Additionally, we will describe two other continuous predictors of survival that we will call β̃ and γ̂. For some problems, they are better predictors than the continuous predictor based on supervised principal components (

There are also related methods for predicting the survival of cancer patients using gene expression data.

Moreover, in many applications, we would like to identify which genes are the best predictors of survival. These genes could be analyzed in the laboratory to attempt to discover how they influence survival. They could also be used to develop a diagnostic test based on immunostaining or reverse transcriptase PCR. For these applications, it is important to have a predictor of survival that is based on a small subset of the genes. This is another important advantage of our methods over existing methods.

Our goal is to identify subtypes of cancer that are both clinically relevant and biologically meaningful. Suppose that we have 𝓃 patients, and we measure the expression level of

As noted in the Introduction, we needed to assign each patient to a subgroup before we could apply nearest shrunken centroids. First, we applied an unsupervised 2-means clustering procedure to the DLBCL data of

We compared the survival times of the two subgroups using a log-rank test. The log-rank test statistic was 0.7, with a corresponding

We assigned each patient in the training data to either a “low-risk” or “high-risk” subgroup based on their survival time (see “

These were obtained by applying nearest shrunken centroids to the DLBCL test data. Patients in the training data were assigned to either the “low-risk” or “high-risk” group depending on whether or not their survival time was greater than the median survival time of all the patients.

In order to identify tumor subclasses that were both biologically meaningful and clinically relevant, we applied a novel, supervised clustering procedure to the DLBCL data. We ranked all of the genes based on their univariate Cox proportional hazards scores, and performed clustering using only the “most significant” genes.
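The gene-ranking step can be illustrated with a minimal sketch. It assumes that the "Cox score" is the standardized score statistic for testing a zero coefficient in a univariate proportional hazards model, and that tied death times are handled one event at a time; both are assumptions of this sketch, not details confirmed by the text. The function names `cox_score` and `top_genes` are hypothetical.

```python
import math

def cox_score(x, time, event):
    """Score statistic for a zero coefficient in a univariate Cox model.

    x: expression values of one gene (one per patient)
    time: follow-up times; event: 1 if death was observed, 0 if censored.
    """
    u = 0.0  # sum over deaths of (covariate - risk-set mean)
    v = 0.0  # sum over deaths of the risk-set variance
    for i in range(len(x)):
        if not event[i]:
            continue
        # Risk set: patients still under observation at this death time.
        risk = [x[j] for j in range(len(x)) if time[j] >= time[i]]
        mean = sum(risk) / len(risk)
        u += x[i] - mean
        v += sum((r - mean) ** 2 for r in risk) / len(risk)
    return u / math.sqrt(v) if v > 0 else 0.0

def top_genes(expr, time, event, k):
    """Indices of the k genes with the largest absolute Cox scores.

    expr[g][j] is the expression of gene g in patient j.
    """
    scores = [abs(cox_score(row, time, event)) for row in expr]
    return sorted(range(len(expr)), key=lambda g: -scores[g])[:k]
```

A positive score here means that higher expression is associated with earlier death; ranking uses the absolute value so that genes associated with either longer or shorter survival are retained.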

Recall that when we performed 2-means clustering on the patients in the test data using all 7,399 genes and used a log-rank test to compare the survival times of the patients in the two resulting clusters, the result was not significant. To test our new clustering method, we calculated the Cox scores of all 7,399 genes based on the 160 training observations and ranked the genes from largest to smallest based on their absolute Cox scores. We then clustered the 80 test observations using only the 25 top-scoring genes. This time, the log-rank statistic comparing the survival times of the two clusters was highly significant (

Both

Although these methods do a better job of identifying biologically meaningful clusters than clustering based on all of the genes, there is no guarantee that the clusters thus identified are associated with the clinical outcome of interest. Indeed, both

We applied the clustering procedure of

We showed that the cancer subgroups identified using this supervised clustering method can be used to predict survival in future patients. The idea is straightforward. First, we identified subgroups of patients using supervised clustering. Then we trained a nearest shrunken centroid classifier to predict the subgroup to which each patient belonged. Details are given in “

We tested this procedure on the DLBCL data. A clustering based on 343 genes produced the smallest crossvalidation error rate, so we used a classifier based on this clustering to assign each of the 80 test patients to one of the two subgroups. The survival curves of the two predicted subgroups are shown in

We used a form of the principal components of the expression matrix to predict survival. Principal components analysis is an unsupervised learning technique that is used to reduce the dimensionality of a dataset by calculating a series of “principal components.” The hope is that the first few principal components will summarize a large percentage of the variability in the entire dataset. See

Unfortunately, principal components analysis suffers from the same limitations as purely unsupervised clustering. If we perform principal components analysis using all of the genes in a dataset, there is no guarantee that the resulting principal components will be associated with survival. Thus, we propose a semi-supervised form of principal components analysis that we call “supervised principal components.” Rather than using all of the genes when we perform principal components analysis, we use only a subset of the genes that are correlated with survival.
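As a rough illustration of the idea, the sketch below computes the first principal component direction from a chosen subset of genes in the training data and projects both training and test patients onto it. The gene subset is taken as given (for example, genes with large Cox scores). The power-iteration solver, the centering of the test data with the training means, and the names `center`, `first_direction`, and `supervised_pc` are assumptions of this sketch rather than the authors' implementation, and no rescaling by singular values is applied.

```python
import math

def center(expr, means=None):
    """Subtract per-gene means (computed from expr unless supplied)."""
    if means is None:
        means = [sum(row) / len(row) for row in expr]
    return [[v - m for v in row] for row, m in zip(expr, means)], means

def first_direction(xc, iters=500):
    """Leading eigenvector of xc * xc^T, found by power iteration."""
    p, n = len(xc), len(xc[0])
    u = [float(g + 1) for g in range(p)]  # arbitrary non-constant start
    for _ in range(iters):
        s = [sum(xc[g][j] * u[g] for g in range(p)) for j in range(n)]
        w = [sum(xc[g][j] * s[j] for j in range(n)) for g in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break
        u = [v / norm for v in w]
    return u

def supervised_pc(train, test, gene_idx):
    """Scores of training and test patients on the first component
    computed from the selected genes only (rows are genes)."""
    tr, means = center([train[g] for g in gene_idx])
    te, _ = center([test[g] for g in gene_idx], means)
    u = first_direction(tr)

    def project(xc):
        return [sum(xc[g][j] * u[g] for g in range(len(u)))
                for j in range(len(xc[0]))]

    return project(tr), project(te)
```

The resulting scores can then be used as a continuous covariate in a survival model. Power iteration can stall if the starting vector happens to be orthogonal to the leading direction, so a production implementation would more likely rely on a library singular value decomposition.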

Using the 160 training observations, we computed the Cox scores for each gene. We kept the 17 genes with Cox scores of 2.39 or greater. We calculated the principal components of the training data using only these 17 genes. Then we approximated the principal components of the test data using

[...] υ̂_{1}, and patient survival. To confirm this observation, we fit a Cox proportional hazards model to a linear combination of υ̂_{1} and υ̂_{2}, the estimated first and second principal components of the test data, respectively. (See “[...] R^{2} = 0.113, likelihood ratio test statistic = 9.58, 1 d.f., [...] R^{2} = 0.08, likelihood ratio test statistic = 6.7, 1 d.f.,

Thus far, all of our examples have been based on the DLBCL data of

Unfortunately, the expression levels of only 70 genes were available for the 292 patient dataset, making it difficult to test our methodology. However, we were able to apply our supervised principal components method. The expression levels of approximately 25,000 genes were available for the earlier study (consisting of 78 patients). After applying crossvalidation, we selected a model consisting of eight genes, five of which were included among the 70 genes in the larger dataset. Thus, we fit a supervised principal components model using these five genes and applied it to the dataset of 292 patients.

The results are shown in [...] R^{2} statistic for each model. R^{2} measures the percentage of the variation in survival time that is explained by the model. Thus, when comparing models, one would prefer the model with the larger R^{2} statistic.) We see that our supervised principal components method produced a stronger predictor of metastasis than the procedure described in

Comparison of the values of the R^{2} statistic of the Cox proportional hazards model (and the

(Of the 78 patients used to build the model in the original study, 61 were included in the larger dataset of 292 patients. Thus, the values of R^{2} calculated using all 292 patients are inflated, since part of the dataset used to validate the model was also used to train the model. We include these results merely to demonstrate the greater predictive power of our methodology. Moreover, we repeated these calculations using only the 234 patients that were not included in the earlier study to ensure that our results were still valid.)

We compared each of our proposed methods to several previously published methods for predicting survival based on microarray data. In particular, we examined three previously published procedures: a method based on SVMs (

We compared these methods on four different datasets (See

Comparison of the different methods applied to the DLBCL data of

We compared each of the methods we proposed above on two simulated datasets. (See

We then generated survival times. The survival times of samples 1–50 were generated as normal random numbers with a mean of 10.0 and a standard deviation of 2.0, and the survival times of samples 51–100 were generated as normal random numbers with a mean of 8.0 and a standard deviation of 3.0. For each sample, a censoring time was generated as a normal random number with a mean of 10.0 and a standard deviation of 3.0. If the censoring time turned out to be less than the survival time, the observation was considered to be censored. Finally, we generated another 5000 × 100 matrix of test data X̃, which was generated the same way

We defined samples 1–50 as belonging to “tumor type 1” and samples 51–100 as belonging to “tumor type 2.” Thus, a successful subgroup discovery procedure should assign samples 1–50 to one subgroup, and samples 51–100 to the other subgroup.
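The survival and censoring scheme described above can be sketched as follows. Only the survival side of the simulation is reproduced; the expression matrix is not generated here. Recording the smaller of the survival and censoring times as the observed follow-up time is a standard convention but an assumption of this sketch, as are the function name `simulate_survival` and the seed handling.

```python
import random

def simulate_survival(seed=0):
    """Return 100 records of (tumor_type, observed_time, censored)."""
    rng = random.Random(seed)
    records = []
    for j in range(100):
        tumor_type = 1 if j < 50 else 2
        # Samples 1-50: N(10, 2); samples 51-100: N(8, 3).
        mean, sd = (10.0, 2.0) if tumor_type == 1 else (8.0, 3.0)
        survival = rng.gauss(mean, sd)
        censoring = rng.gauss(10.0, 3.0)
        # Censored when the censoring time precedes the survival time.
        censored = censoring < survival
        observed = min(survival, censoring)
        records.append((tumor_type, observed, censored))
    return records
```

Because the times are drawn from normal distributions, an occasional negative value is possible; the sketch does not truncate them.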

We applied the methods discussed above to identify these subgroups (and predict survival) for the simulated dataset. This simulation was repeated ten times. The results are shown in [...] R^{2} obtained by fitting a Cox proportional hazards model to the predicted class labels for the test data (or by fitting a Cox model to γ̂ in the case of methods 4 and 6).

The methods are (1) assigning samples to a “low-risk” or “high-risk” group based on their median survival time; (2) using 2-means clustering to identify two subgroups; (3) using 2-means clustering based on the genes with the largest Cox scores; (4) using the supervised principal components method; (5) using 2-means clustering based on the genes with the largest PLS-corrected Cox scores; (6) using the continuous predictor γ̂. Each entry in the table represents the mean over 10 simulations; the standard error is given in parentheses.

In the first simulation, we found that the fully supervised and the fully unsupervised methods produced much worse results than the semi-supervised methods. (For each iteration of the “median cut” method, the crossvalidation error was minimized when all of the observations were assigned to the same class. Hence, each such model had no predictive power, and the value of R^{2} was zero for each iteration. If we had chosen a smaller value of the tuning parameter Δ, the procedure would have performed better, although not significantly better.) The continuous predictor based on supervised principal components performed nearly as well as the methods based on semi-supervised clustering.

Next, we performed a second simulation. The second simulated dataset

Under this model, there are actually five “tumor subgroups.” However, we still used 2-means clustering on this simulated dataset in order to evaluate the performance of our methods when the number of clusters is chosen incorrectly. Thus, in this simulation, it does not make sense to talk about the number of “misclassification errors;” we can only compare the methods on the basis of their predictive ability.

We applied the six different methods to this new simulated dataset and repeated this simulation ten times; the results are shown in

The methods are (1) assigning samples to a “low-risk” or “high-risk” group based on their median survival time; (2) using 2-means clustering to identify two subgroups; (3) using 2-means clustering based on the genes with the largest Cox scores; (4) using the supervised principal components method; (5) using 2-means clustering based on the genes with the largest PLS-corrected Cox scores; (6) using the continuous predictor γ̂. Each entry in the table represents the mean over 10 simulations; the standard error is given in parentheses.

One important goal of microarray research is to develop more powerful diagnostic tools for cancer and other diseases. Consider a hypothetical cancer that has two subtypes. One subtype is known to spread much more rapidly than the other subtype, and hence must be treated much more aggressively. We would like to be able to diagnose which type of cancer patients have and give them the appropriate treatment.

If it is known that two such subtypes of a certain cancer exist, and if we have a training set where it is known which patients have which subtype, then we can use nearest shrunken centroids or other classification methods to build a model to diagnose this cancer in future patients. However, in many cases, we do not know how many subtypes are present, nor do we know which patients belong to which subgroup. Thus, it is important to develop methods to identify such subgroups.

Unsupervised methods, such as hierarchical clustering, are popular techniques for identifying such subgroups. However, there is no guarantee that subgroups discovered using unsupervised methods will have clinical significance.

An alternative is to generate class labels using clinical data. The simplicity of the approach of dividing the patients into two subclasses based on their survival time is attractive, and there is evidence that this procedure can successfully predict survival. Indeed, this procedure produced a significant predictor of survival in four different datasets, suggesting that this approach has some utility. However, as noted in the Introduction, subgroups identified in this manner may not be biologically meaningful. When we applied this model to the DLBCL data described earlier, the misclassification error rate for the shrunken centroids model was very high (around 40%), so a diagnosis based on this procedure is likely to be inaccurate.

Supervised clustering methods can overcome these problems. We have seen that if we selected significant genes prior to clustering the data, we could identify clusters that were clinically relevant. We have also seen how knowledge of these clusters could be used to diagnose future patients.

This supervised clustering methodology is a useful prognostic tool. It is also easy to interpret. However, it has certain shortcomings as well. Recall our conceptual model shown in

One possible such predictor is our supervised principal components procedure. This procedure used the principal components of a subset of the expression matrix

We compared our methods to several previously published methods for predicting survival based on microarray data. In general, our methods performed significantly better than these existing methods. In particular, our supervised principal components method gave the best results on three of the four datasets. (It performed slightly worse than our γ̂ method on the DLBCL data, but it still outperformed almost all of the other methods.) Furthermore, each of our proposed methods was a significant predictor of survival (at

Another important advantage of our methods is that they select a subset of the genes to use as predictors. The methods of

Throughout this study, we have used survival data to help us identify genes of interest. However, other clinical variables could also be used, such as the stage of the tumor, or whether or not it has metastasized. Rather than ranking genes based on their Cox scores, one would use a different metric to measure the association between a given gene and the clinical variable of interest. For example, suppose we wished to identify a subgroup of cancer that was associated with a high risk of metastasis. For each gene, we could compute a t-statistic comparing the expression levels in the patients whose cancer metastasized to those in the patients with no metastasis.
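For a binary clinical variable such as metastasis, the ranking step might look like the sketch below. The text does not say which form of the t-statistic would be used; the unequal-variance (Welch) form is assumed here, and the names `t_statistic` and `rank_genes_by_t` are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch two-sample t-statistic comparing the means of a and b."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_genes_by_t(expr, metastasized):
    """Gene indices ordered by decreasing absolute t-statistic.

    expr[g][j] is the expression of gene g in patient j;
    metastasized[j] is True if patient j's cancer metastasized.
    """
    scores = []
    for row in expr:
        pos = [v for v, m in zip(row, metastasized) if m]
        neg = [v for v, m in zip(row, metastasized) if not m]
        scores.append(abs(t_statistic(pos, neg)))
    return sorted(range(len(expr)), key=lambda g: -scores[g])
```

The top-ranked genes would then replace the Cox-score gene list in the clustering or principal components steps.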

Information about the risk of metastasis (and death) for a given patient is essential to treat cancer successfully. If the risk of metastasis is high, the cancer must be treated aggressively; if the risk is low, milder forms of treatment can be used. Using DNA microarrays, researchers have successfully identified subtypes of cancer that can be used to assess a patient's risk profile. Our results show that semi-supervised learning methods can identify these subtypes of cancer and predict patient survival better than existing methods. Thus, we believe they can be a powerful tool for diagnosing and treating cancer and other genetic diseases.

The nearest shrunken centroids procedure calculates the mean expression of each gene within each class. Then it shrinks these centroids toward the overall mean for that gene by a fixed quantity, Δ. Diagonal linear discriminant analysis (LDA) is then applied to the genes that survive the thresholding. Details are given in
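The shrinkage step can be illustrated with a simplified sketch. It soft-thresholds the raw difference between each class centroid and the overall mean by Δ and then classifies a new sample by squared distance over the genes that survive. The published procedure standardizes these differences by a within-class standard deviation before thresholding; that standardization and the class priors are omitted here, so this is an illustration of the idea under those simplifying assumptions rather than the full method. The function names are hypothetical.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta, stopping at zero."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0

def shrunken_centroids(expr, labels, delta):
    """expr[g][j]: gene g, sample j. Returns (overall means, centroids)."""
    overall = [sum(row) / len(row) for row in expr]
    centroids = {}
    for c in sorted(set(labels)):
        idx = [j for j, lab in enumerate(labels) if lab == c]
        centroids[c] = []
        for g, row in enumerate(expr):
            class_mean = sum(row[j] for j in idx) / len(idx)
            shrunk = soft_threshold(class_mean - overall[g], delta)
            centroids[c].append(overall[g] + shrunk)
    return overall, centroids

def classify(sample, overall, centroids):
    """Assign a sample to the nearest shrunken centroid, using only
    genes for which at least one centroid differs from the overall mean."""
    active = [g for g in range(len(overall))
              if any(cen[g] != overall[g] for cen in centroids.values())]

    def dist(c):
        return sum((sample[g] - centroids[c][g]) ** 2 for g in active)

    return min(centroids, key=dist)
```

Genes whose class means all lie within Δ of the overall mean drop out of the classifier entirely, which is how the method selects a small subset of genes.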

We created two classes by cutting the survival times at the median survival time (2.8 y). Any patient who lived longer than 2.8 y was considered to be a “low-risk” patient, and any patient who lived less than 2.8 y was considered to be a “high-risk” patient. In this manner, we assigned a class label to each observation in the training data.

Unfortunately, many of the patients' survival times were censored, meaning that the individual left the study before the study was completed. When this occurs, we do not know how long the patient survived; we only know how long the patient remained in the study prior to being lost to follow-up.

If an observation is censored, we may not know to which class it belongs. For example, suppose that the median survival time is 2.8 y, but that a patient left the study after 1.7 y. If the patient died in the interval between 1.7 y and 2.8 y, then the patient should be assigned to the “high-risk” group. Otherwise, the patient should be assigned to the “low-risk” group. However, there is no way to determine which possibility is correct.

Based on the Kaplan-Meier survival curve for all the patients, we can estimate the probability that a censored case survives a specified length of time (

and, of course,

In this manner, we can estimate the probability that each censored observation belongs to the “low-risk” and “high-risk” classes, respectively.
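A minimal sketch of this calculation follows. It assumes the usual conditional form, in which the probability that a patient censored at time c survives past the cutoff t is S(t)/S(c), with S the Kaplan-Meier estimate; the text's own displayed expressions are not reproduced, so this form and the function names are assumptions of the sketch.

```python
def kaplan_meier(time, event):
    """Return [(t, S(t))] at each distinct death time, in time order."""
    at_risk = len(time)
    surv = 1.0
    curve = []
    for t in sorted(set(time)):
        deaths = sum(1 for ti, ei in zip(time, event) if ti == t and ei)
        leaving = sum(1 for ti in time if ti == t)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

def surv_at(curve, t):
    """Value of the step function S at time t."""
    s = 1.0
    for tk, sk in curve:
        if tk > t:
            break
        s = sk
    return s

def prob_low_risk(curve, censor_time, cutoff):
    """Estimated P(T > cutoff | T > censor_time) for a censored patient."""
    if censor_time >= cutoff:
        return 1.0  # already known to have survived past the cutoff
    s_c = surv_at(curve, censor_time)
    return surv_at(curve, cutoff) / s_c if s_c > 0 else 0.0
```

The probability of belonging to the “high-risk” class is one minus the returned value. Patients censored after the cutoff are unambiguously “low-risk” and receive probability one.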

However, it is still unclear how we would train our classifier based on this information. Nearest shrunken centroids is a modified version of LDA. It is described in detail in

Let {_{i}}_{i}}_{i}s may belong. (If we are dividing the training data into “low-risk” and “high-risk” patients, then _{i}s are known, the problem is to fit the mixture model

(Generally, each 𝒻_{i} is a Gaussian density function, and the θ_{i}s correspond to the mean of the observations in each class. The π_{i}s correspond to “prior” probabilities that an observation belongs to class 𝒾.) In this case, we must fit this model on the basis of the classified (uncensored) training data, which we denote by _{𝒿} ( 𝒿 = 𝓃+1, …,𝓃+𝓂), which we denote by _{𝓊}. (Also, note that Φ = (π′,θ′)′ denotes the vector of all unknown parameters.)

We define the latent variables 𝓏_{ij} to be equal to one if the 𝒿th observation belongs to the 𝒾th class, and zero otherwise. Then the complete-data log likelihood is

The EM algorithm is applied to this model by treating _{j}(𝒿 = 𝓃+1,…, 𝓃+𝓂) as missing data. It turns out to be very simple in the case of LDA. The E-step is effected here simply by replacing each unobserved indicator variable 𝓏_{ij} by its expectation conditional on _{j}. That is, 𝓏_{ij} is replaced by the estimate of the posterior probability that the 𝒿th entity with feature vector _{j} belongs to _{i}(𝒾 = 1, …,_{ij} to be the earlier estimate that the 𝒾th censored observation belongs to class 𝒿 based on the Kaplan-Meier curve.

The estimates of π_{i} and μ_{i} in the M-step are equally simple:

and

where

In these expressions, τ_{i}(_{j} belongs to _{i}, or, in other words,

We continue these imputations until the algorithm converges. In practice, we found a single imputation to be sufficient for most problems: each imputation is computationally intensive, and additional imputations did not seem to change the results significantly.
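For orientation, the quantities involved can be written out in the standard form for a partially classified normal mixture. The notation is an assumption of this summary: $x_j$ denotes the feature vector of the $j$th observation, $g$ the number of classes, observations $1, \ldots, n$ are classified (uncensored), and observations $n+1, \ldots, n+m$ are unclassified (censored). The classified data are assumed to arise by mixture sampling. These expressions are a sketch consistent with the description above, not a transcription of the original displays.

```latex
\begin{align}
f(x;\Phi) &= \sum_{i=1}^{g} \pi_i\, f_i(x;\theta_i)
  && \text{(mixture model)} \\
\log L_C(\Phi) &= \sum_{i=1}^{g} \sum_{j=1}^{n+m}
  z_{ij}\,\bigl\{\log \pi_i + \log f_i(x_j;\theta_i)\bigr\}
  && \text{(complete-data log likelihood)} \\
\tau_i(x_j;\Phi) &= \frac{\pi_i\, f_i(x_j;\theta_i)}
  {\sum_{t=1}^{g} \pi_t\, f_t(x_j;\theta_t)}
  && \text{(E-step, } j = n+1,\ldots,n+m\text{)} \\
\hat{\pi}_i &= \frac{n_i + \sum_{j=n+1}^{n+m} \tau_i(x_j;\hat{\Phi})}{n+m},
  \qquad n_i = \sum_{j=1}^{n} z_{ij}
  && \text{(M-step)} \\
\hat{\mu}_i &= \frac{\sum_{j=1}^{n} z_{ij}\, x_j
  + \sum_{j=n+1}^{n+m} \tau_i(x_j;\hat{\Phi})\, x_j}
  {n_i + \sum_{j=n+1}^{n+m} \tau_i(x_j;\hat{\Phi})}
  && \text{(M-step)}
\end{align}
```

In the procedure described here, the first pass uses the Kaplan-Meier class probabilities in place of $\tau_i(x_j;\Phi)$ for the censored observations, so each censored patient contributes fractionally to both class centroids.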

We calculated the Cox scores of each gene based on the 160 training observations, and obtained a list of the most significant genes. Then we performed 2-means clustering on these 160 observations using the genes with the largest absolute Cox scores and obtained two subgroups. We repeated this procedure multiple times with different numbers of genes. For each such clustering, we trained a nearest shrunken centroid classifier to assign future patients to one subgroup or the other and examined the crossvalidation error rate.
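The clustering step can be sketched as follows, taking the list of selected genes as given. This is a plain Lloyd-style 2-means with a deterministic start (the first sample and the sample farthest from it); the initialization and the function names are assumptions of the sketch, and a library k-means with multiple random starts would be the more usual choice in practice.

```python
def two_means(samples, iters=100):
    """Cluster feature vectors into two groups; returns 0/1 labels."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    c0 = samples[0]
    c1 = max(samples, key=lambda s: d2(s, c0))  # farthest from c0
    centers = [list(c0), list(c1)]
    labels = None
    for _ in range(iters):
        new = [0 if d2(s, centers[0]) <= d2(s, centers[1]) else 1
               for s in samples]
        if new == labels:
            break  # assignments are stable
        labels = new
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_on_genes(expr, gene_idx):
    """2-means on patients, using only the genes in gene_idx.

    expr[g][j] is the expression of gene g in patient j.
    """
    n = len(expr[0])
    samples = [[expr[g][j] for g in gene_idx] for j in range(n)]
    return two_means(samples)
```

With a toy matrix such as `[[0.1, 0.2, 5.0, 5.1], [9, 1, 8, 2]]`, clustering on gene 0 alone separates the first two patients from the last two, whereas clustering on both genes is dominated by the second, noisier gene. This is the behavior that motivates restricting the clustering to genes associated with survival.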

The problem of choosing the number of genes on which to perform the clustering is more complicated than it appears. The obvious way to choose the optimal number of genes on which to cluster is to simply minimize the crossvalidation error rate of the nearest shrunken centroids model based on the clustering. This works up to a certain point. It is possible that the clustering procedure will identify a cluster that is unrelated to survival. (Since we are clustering on the genes with the highest Cox scores, this is unlikely to occur. However, it is still possible, especially if the number of genes on which we are clustering is large.) Thus, we needed to build a safeguard against this possibility into our procedure. After performing clustering based on a given set of high-scoring genes, we performed a log-rank test to determine if the resulting clusters differed with respect to survival. If they did not, the clustering was discarded without further analysis. An outline of the procedure follows: (1) Choose a set _{min} = 1 and _{min} = 1. (3) For each Γ in _{min}, then return to step 3. (7) Fit a nearest shrunken centroids model based on the clusters obtained in step 3. Calculate the minimum crossvalidation error rate across all values of the shrinkage parameter, and call it _{min}, then let Γ_{best} = Γ, and return to step 3. Otherwise return to step 3 without changing the value of Γ_{best}. The optimal value of Γ is taken to be the value of Γ_{best} when this procedure terminates.
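The safeguard described in this paragraph can be sketched as a loop over candidate thresholds Γ. The sketch discards any clustering whose log-rank p-value is not below a significance level and keeps the surviving threshold with the smallest crossvalidation error. The fixed level `alpha`, the callables `cluster_fn` and `cv_error_fn`, and the function names are hypothetical; the exact acceptance rule in the outlined procedure may differ.

```python
import math

def log_rank(time, event, group):
    """Two-group log-rank test; group holds 0/1 labels.
    Returns (chi-square statistic, p-value on 1 d.f.)."""
    observed = expected = var = 0.0
    for t in sorted({ti for ti, ei in zip(time, event) if ei}):
        n = sum(1 for ti in time if ti >= t)
        n1 = sum(1 for ti, gi in zip(time, group) if ti >= t and gi == 1)
        d = sum(1 for ti, ei in zip(time, event) if ti == t and ei)
        d1 = sum(1 for ti, ei, gi in zip(time, event, group)
                 if ti == t and ei and gi == 1)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0.0:
        return 0.0, 1.0  # e.g. all patients in one cluster
    chi2 = (observed - expected) ** 2 / var
    # Upper tail of a chi-square distribution with 1 d.f.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

def choose_threshold(thresholds, cluster_fn, cv_error_fn,
                     time, event, alpha=0.05):
    """Return the threshold whose clusters differ in survival and whose
    classifier has the lowest crossvalidation error (None if none pass)."""
    best, best_err = None, float("inf")
    for gamma in thresholds:
        labels = cluster_fn(gamma)
        _, p = log_rank(time, event, labels)
        if p >= alpha:
            continue  # clusters unrelated to survival: discard
        err = cv_error_fn(gamma, labels)
        if err < best_err:
            best, best_err = gamma, err
    return best
```

Here `cluster_fn(gamma)` would cluster the patients on the genes whose absolute Cox scores exceed `gamma`, and `cv_error_fn(gamma, labels)` would return the minimum crossvalidation error of a nearest shrunken centroids classifier trained on those labels.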

Several comments about this procedure are in order. First, note that we did not recalculate the Cox scores at each fold of the crossvalidation procedure. We calculated them only once, using all of the patients in the dataset. There are several reasons for doing this. Recalculating the Cox scores at each fold would be extremely expensive computationally. Moreover, we found that the Cox score of a given gene varied depending on the number of patients (and which patients) we included in the model. Thus, if a given value of Γ produced a low crossvalidation error rate, there was no guarantee that a model based on the full dataset using this value of Γ would produce good results, since the model based on the full dataset may use a different list of genes. Other studies have found that using the entire dataset to produce a “significant gene list” prior to performing crossvalidation can produce more accurate predictions (

Also, the set _{best} varies greatly from dataset to dataset, so we recommend trying several different forms of

Furthermore, note that when we calculated the

Finally, the number of clusters

As above, let _{ij} denote the expression level of the 𝒾th gene in the 𝒿th patient. Assume that each patient has one of two possible underlying tumor types. Without loss of generality, assume that patients 1, …,_{1j}, …,_{pj}) is different for 1 ≤ 𝒿 ≤ 𝓂 than it is for 𝓂 + 1 ≤ 𝒿 ≤ 𝓃. Thus, if we choose constants {_{i}}^{p}_{i=1}^{p}_{j=1}_{j}x_{ij}_{i}^{p}_{i}=1_{i}_{i}^{p}_{i}=1

In particular, consider the singular value decomposition of

where

In other words, for a given column of

Moreover, suppose that we have an independent test set X̃. Then let

where

The reason for choosing the first few columns of ^{T}u_{1} has the largest sample variance amongst all normalized linear combinations of the rows of _{1} represents the first column of ^{T}_{u1} captures a large percentage of the variation in survival. (Indeed, in some simple models, it can be proven that

In theory, we could calculate

To choose the optimal value of Γ, we employ the following procedure: (1) Choose a set _{i}. (6) Average the 𝓌_{i}s over the 10 crossvalidation folds. Call this average 𝓌_{Γ}. (7) If 𝓌_{Γ} is greater than the value of 𝓌_{Γ∗}, then let Γ∗ = Γ and 𝓌_{Γ∗} = 𝓌_{Γ}. (8) Return to step 2. The set

In some cases, we can improve the predictive power of our model by taking a linear combination of several columns of ^{T}

All computations in this study were performed using the R statistical package, which is available on the Internet at

This file contains a brief description of the functions contained in the semi-super.R file.

(1 KB TXT).

This file contains R functions for implementing the procedures we have described in our study.

(6 KB TXT).

This file contains the R source code that we used to perform the first simulation study in our paper.

(31 KB TXT).

This file contains the R source code that we used to perform the second simulation study in our paper.

(39 KB TXT).

The gene expression data for the breast cancer study of

(2.9 MB CSV).

The names of each of the 4,751 genes in the study of

(74 KB CSV).

The clinical data for the study of

(1 KB CSV).

The gene expression data for the 70 genes in the breast cancer study of

(141 KB CSV).

The names of the 70 genes in the study of

(1 KB CSV).

A single column that is 1 if the patient was included in the earlier study (that of

(1 KB CSV).

The clinical data for the study of

(5 KB CSV).

The gene expression data for the DLBCL study of

(24.38 MB CSV).

The clinical data for the study of

(2 KB CSV).

This is the gene expression data for the lung cancer dataset of

(5.39 MB TXT).

This is the clinical data for the lung cancer dataset of

(1 KB TXT).

This is the gene expression data for the AML dataset of

(9.96 MB TXT).

This is the clinical data for the AML dataset of

(1 KB TXT).

Eric Bair was supported by an NSF graduate research fellowship. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. We thank the academic editor and three reviewers for their helpful comments.

DLBCL: diffuse large B-cell lymphoma

LDA: linear discriminant analysis

PLS: partial least squares

SVM: support vector machine