
Conceived and designed the experiments: JS AD. Performed the experiments: JS AD. Analyzed the data: AD. Contributed reagents/materials/analysis tools: JS AD. Wrote the paper: JS AD.

The authors have declared that no competing interests exist.

Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.

Linear Discriminant Analysis (LDA) is a long-standing prediction method that has been well characterized when the number of features used for prediction is small

In high-dimensional applications, it is often desirable to build a classifier using only a subset of features because (i) many of the features are not informative for classification and (ii) the number of training samples available for building the classifier is substantially smaller than the number of possible features. It can also be argued that a classifier built with a smaller number of features is preferable to an equally accurate classifier built with the complete set of features. This problem is analogous to, but in general distinct from, that of selecting variables in a regression model by, say, least angle regression (LARS)

Several approaches have been recently proposed for nearest centroid classifiers that rely on univariate statistics for feature selection

In this paper, we provide a theoretical result showing how to determine the subset of features of a given size that minimizes the misclassification rate for a nearest-centroid classifier. For example, if 800 features are available, but one wants to build a classifier consisting of 12 features, we show which 12 provide the lowest misclassification rate. This optimal feature set takes into account the joint behavior of the features in two ways. First, it explicitly incorporates information about correlation between features. Second, it assesses how a group of features as a whole is capable of distinguishing between multiple classes. While we show how to define the theoretically optimal subset, we must estimate it in practice from training data.

Two existing papers

Nearest centroid classifiers have been shown to perform well with gene-expression microarrays

An LDA classifier is a canonical nearest centroid classifier. The problem it addresses is to classify unknown samples into one of K classes, where each class k = 1, …, K is characterized by a centroid μ_k, a common covariance matrix Σ, and a prior probability π_k.

Bayes' Theorem states that the probability that an observed sample comes from class k, given its feature values, is proportional to the class prior probability times the class-conditional density evaluated at the sample.
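Written out (a standard form of the theorem, with π_k the class prior and f_k the class-conditional density; not necessarily the paper's exact display):

```latex
P(C = k \mid x) \;=\; \frac{\pi_k \, f_k(x)}{\sum_{m=1}^{K} \pi_m \, f_m(x)}
```

The LDA rule assigns a sample to the class maximizing this posterior probability.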

A misclassification occurs when a sample is assigned to the incorrect class. The probability of making a classification error is:

The misclassification rate of a nearest-centroid (LDA) classifier can be shown to be a function of the class priors and the pairwise Mahalanobis distances ||μ_j − μ_i||_Σ between class centroids. When the classifier is restricted to a subset of p_0 ≤ p features, the rate is computed from the centroids and covariance matrix restricted to that subset; features not included in the subset are not involved in the calculation. The subset of size p_0 minimizing this misclassification rate defines the optimal subset of size p_0.

Equation (4) can be interpreted as measuring the collective distance between all of the class centroids. In general, the misclassification rate will be small when all of the class centroids are far away from each other. Note, however, that the score in (4) is actually a complicated combination of the pairwise differences between the centroids and the class priors. Furthermore, correlations between features are explicitly incorporated through the distance functions ||μ_j − μ_i||_Σ. Further intuition into (4) can be attained by considering the following simple example.
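For intuition, in the familiar two-class case with equal priors, the LDA misclassification rate reduces to a function of the single Mahalanobis distance between the two centroids (a standard result, stated here for orientation rather than quoted from the paper):

```latex
P(\text{error}) \;=\; \Phi\!\left(-\tfrac{\Delta}{2}\right), \qquad
\Delta \;=\; \lVert \mu_1 - \mu_2 \rVert_{\Sigma}
\;=\; \sqrt{(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)},
```

where Φ is the standard normal distribution function. With more than two classes or unequal priors, the rate becomes the more complicated combination of pairwise distances and priors described above.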

The data in

Feature | μ_1  | μ_2  | μ_3  | Score
--------|------|------|------|------
1       | 3.00 | 0.00 | 0.00 | 2.00
2       | 2.00 | 0.00 | 0.00 | 0.89
3       | 1.50 | 0.00 | 0.00 | 0.50
4       | 1.25 | 0.00 | 0.00 | 0.35
5       | 0.00 | 1.10 | 0.00 | 0.27
6       | 0.00 | 1.00 | 0.00 | 0.22
7       | 0.00 | 0.90 | 0.00 | 0.18
8       | 0.00 | 0.00 | 0.85 | 0.16
9       | 0.00 | 0.00 | 0.75 | 0.12
10      | 0.00 | 0.00 | 0.65 | 0.09

An alternative approach to using univariate scores to select features is to consider all 252 possible quintuplets and choose the set with the lowest overall misclassification rate. Note that, to do this, we must be able to assign misclassification probabilities to arbitrary feature subsets. This highlights the utility of the multivariate score (4). Using (4), we find that the set of features chosen by the univariate scores has an overall misclassification rate of 20%. Similarly, we find that the optimal set in this example contains features 1, 5, 6, 7, and 8, with an associated error rate of 13%. The most obvious difference between this subset and that chosen by univariate scores is the exclusion of features 2, 3, and 4. Apparently, class one can be sufficiently characterized by feature 1 alone; the other features do not contain sufficient additional information to merit inclusion. Interestingly, in this example the optimal subset of size p_0 + 1 contains the optimal subset of size p_0 for each p_0 = 1, 2, …, 9, although this need not be true in general.
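This exhaustive search can be reproduced numerically. The sketch below (not the paper's software) scores all C(10, 5) = 252 quintuplets by a Monte Carlo estimate of the nearest-centroid error rate, standing in for the closed-form rate (4); identity covariance and equal class priors are assumed.

```python
import itertools
import numpy as np

# Centroids for the 10-feature, 3-class example (rows = features,
# columns = classes); values taken from the table above.
MU = np.array([
    [3.00, 0.00, 0.00],
    [2.00, 0.00, 0.00],
    [1.50, 0.00, 0.00],
    [1.25, 0.00, 0.00],
    [0.00, 1.10, 0.00],
    [0.00, 1.00, 0.00],
    [0.00, 0.90, 0.00],
    [0.00, 0.00, 0.85],
    [0.00, 0.00, 0.75],
    [0.00, 0.00, 0.65],
])

def mc_error(features, n_per_class=2000):
    """Monte Carlo estimate of the nearest-centroid misclassification
    rate using only the given feature subset (identity covariance and
    equal priors assumed). The same seed is reused for every subset so
    that subsets are compared against identical noise."""
    rng = np.random.default_rng(0)
    mu = MU[list(features), :]                      # restrict to the subset
    errors, total = 0, 0
    for k in range(3):
        x = rng.normal(size=(n_per_class, len(features))) + mu[:, k]
        # squared distance of each sample to each class centroid
        d = ((x[:, :, None] - mu[None, :, :]) ** 2).sum(axis=1)
        errors += (d.argmin(axis=1) != k).sum()
        total += n_per_class
    return errors / total

# Score all 252 quintuplets and report the best one (0-based indices).
subsets = list(itertools.combinations(range(10), 5))
scores = {s: mc_error(s) for s in subsets}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With a closed-form expression for the error rate in place of `mc_error`, the same enumeration recovers the optimal subset exactly.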

An important aspect of the optimal feature-selection procedure is its explicit incorporation of correlation between features. It is not necessarily clear what effect correlation between features should have on a classifier. Intuitively, many weakly informative, correlated genes might be expected to collectively be highly informative. However, it has been shown

We investigate the effect of different correlation patterns in the example above, using an autoregressive structure in which the correlation between features i and j is ρ^{|i−j|}; no qualitative differences were found when considering negative correlation. For each correlation pattern we report the best subset and its error rate (Error_O), along with the error rate (Error_I) and rank of the subset selected when correlation is ignored.

Covariance | Selected Features | Error_O | Error_I | Rank_I
-----------|-------------------|---------|---------|-------
None       | 1, 5, 6, 7, 8     | 13.1%   | 13.1%   | 1
Block 1    | 1, 5, 6, 7, 8     | 13.1%   | 13.1%   | 1
Block 2    | 1, 5, 8, 9, 10    | 14.9%   | 18.2%   | 64
Block 3    | 1, 5, 6, 7, 8     | 13.1%   | 13.1%   | 1
Block 1∼2  | 1, 4, 5, 8, 9     | 6.1%    | 14.0%   | 143
Block 1∼3  | 1, 5, 6, 7, 8     | 11.6%   | 11.6%   | 1
Block 2∼3  | 1, 2, 3, 7, 8     | 2.2%    | 3.5%    | 18

In this example, correlations that affect features 5–7 change the entries of the best subset, as well as the associated error rates. For example, when there is correlation within features 5–7, the optimal subset includes features 1, 5, 8, 9, and 10, with an error rate of 14.9%. The set chosen ignoring correlation ranks 64th (out of 252 possible subsets), with an error rate of 18.2%. These results suggest that correlated features can be useful together. While there are many possible scenarios in which correlation could play a role, the main point is that the feature-selection procedure guided by (4) automatically identifies the optimal combination of features, even in the presence of correlation. Of course, in practice, there is the added challenge of estimating the class centroids and covariance matrix. In particular, when there are many more features than samples, it is not clear that covariances can be estimated well enough to make them worth the effort. We consider this further in the next section.

In practice, unknown model parameters and the general impracticality of exhaustive searches with genomic data preclude our finding theoretically optimal subsets. Instead, we must estimate the optimal subset from training data using a practical search strategy.

We consider several variations of our basic algorithm, employing different methods for estimating class centroids and covariance matrices. We compare the proposals with existing alternatives on both simulated and real datasets. Details of the proposed algorithms are given in what follows. Based on our comparisons, the final algorithm that we propose for nearest-centroid classification from genomic data uses shrunken centroids and a diagonal covariance matrix. The following is the proposed algorithm to build a classifier with p_0 features:

Estimate the class centroids

Estimate the pooled variances using equation (8) to form the diagonal covariance estimate

Using the estimated centroids and variances from Steps 1 and 2, find the single feature with the smallest estimated misclassification rate from equation (4).

For j = 2, …, p_0, consider each remaining feature separately:

Combine one remaining feature with the already-selected feature(s) 1, 2, …, j − 1.

Find the single feature with the smallest estimated misclassification rate according to equation (4), when included with the already-selected feature(s) 1, 2, …, j − 1.

The final nearest centroid classifier is composed of the p_0 selected features.
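The greedy loop above can be sketched as follows. This is a minimal illustration, not the published software: known centroids, unit variances, and equal priors are assumed, and a shared-noise Monte Carlo error estimate stands in for the closed-form rate (4). The function name `greedy_select` is ours.

```python
import numpy as np

def greedy_select(mu, n_select, n_mc=2000, seed=0):
    """Greedy forward selection for a nearest-centroid classifier.

    mu : (p, K) array of class centroids (unit diagonal covariance and
         equal priors are assumed here; in practice the estimated
         variances and priors from Steps 1-2 would be used).
    Returns the indices of the selected features, in selection order.
    """
    rng = np.random.default_rng(seed)
    p, K = mu.shape
    # One shared noise array so every candidate subset is scored
    # against identical randomness.
    noise = rng.normal(size=(n_mc, p, K))

    def error(features):
        m = mu[features, :]                          # (f, K)
        x = noise[:, features, :] + m[None, :, :]    # samples from each class
        # d[n, true_class, candidate_class]: squared distance of sample
        # n (drawn from true_class) to each candidate centroid.
        d = ((x[:, :, :, None] - m[None, :, None, :]) ** 2).sum(axis=1)
        pred = d.argmin(axis=2)                      # (n_mc, K)
        truth = np.arange(K)[None, :]
        return (pred != truth).mean()

    selected, remaining = [], list(range(p))
    for _ in range(n_select):
        # Pick the feature that, added to the current set, gives the
        # smallest estimated misclassification rate.
        best = min(remaining, key=lambda f: error(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example use on a tiny 4-feature, 3-class problem (0-based indices).
mu_example = np.array([[3.0, 0.0, 0.0],
                       [0.0, 1.1, 0.0],
                       [0.0, 0.0, 0.85],
                       [0.5, 0.0, 0.0]])
print(greedy_select(mu_example, 2))
```

Each iteration scores at most p candidate subsets, so the search costs O(p · p_0) evaluations of the error rate rather than the combinatorial cost of an exhaustive search.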

We call this method “Clanc”, as it is an extension of the Clanc procedure proposed earlier

To estimate (4), we estimate the class centroids μ_k and the common covariance matrix Σ from the training data and plug these estimates into the formula.

While there are many possible approaches to shrinking the centroids, we take the following simple approach. We begin with the usual centroid estimate, the sample mean of feature i over the training samples in class k, and shrink each class mean toward the overall mean of that feature, so that features whose class means differ little contribute a common value to every centroid.
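As a rough illustration only (a convex-combination form chosen for simplicity, not the paper's exact estimator), centroid shrinkage toward the overall mean can be written as:

```python
import numpy as np

def shrink_centroids(X, y, lam=0.5):
    """Shrink per-class feature means toward the overall feature means.

    X : (n, p) data matrix; y : (n,) integer class labels in 0..K-1.
    lam = 0 gives the usual (unshrunken) class centroids; lam = 1
    collapses every class centroid onto the overall mean. This convex
    combination is an illustration; the paper's estimator is defined
    in the text.
    """
    classes = np.unique(y)
    overall = X.mean(axis=0)                                  # (p,)
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])
    return (1 - lam) * centroids + lam * overall              # (K, p)
```

Note that this shrinks each feature's class means toward each other (across classes), in line with the remark accompanying the simulation tables that shrinkage takes place across classes rather than across features.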

Drawing from earlier versions of our work

While we can form an unbiased estimate of the covariance matrix Σ, such an estimate will tend to be singular in the microarray setting, with many more features than samples. Furthermore, theoretical justification for shrinking the off-diagonal components of Σ to zero in such settings has recently been published

This is a pooled version of the class-specific estimates, reflecting the model assumption of a common covariance matrix. In what follows, we define

The shrinkage parameter ω can be estimated using cross-validation on the training data. The resulting estimate retains the diagonal entries s_ii and replaces each off-diagonal entry s_ij with ω s_ij, so that ω = 0 yields the diagonal estimate and ω = 1 the unrestricted estimate.
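The effect of ω can be sketched in a few lines; `shrink_covariance` is an illustrative helper of ours, not part of the published software.

```python
import numpy as np

def shrink_covariance(S, omega):
    """Shrink the off-diagonal entries of a covariance estimate toward
    zero. omega = 0 returns the diagonal estimate; omega = 1 returns
    the unrestricted estimate S unchanged. (The paper selects omega by
    cross-validation on the training data.)"""
    D = np.diag(np.diag(S))          # keep the variances as-is
    return D + omega * (S - D)       # scale only the covariances
```

Between the two extremes, intermediate ω values trade off the bias of ignoring correlation against the variance of estimating many covariances from few samples.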

We have characterized the theoretically optimal subset of given size by computing the misclassification rate (4) for nearest-centroid classifiers. Ideally, having estimated the decision rule, we would evaluate all subsets of given size and choose the one corresponding to the lowest estimated error rate. In high-dimensional settings, such an exhaustive search is not feasible. We now consider practical strategies for searching for the optimal subset. These are analogous to existing routines for selecting subsets in discriminant analysis

A simple approach to this problem is to rank each feature individually on its ability to discriminate between the classes and choose the top features from this list. Many previous publications use versions of univariate scores for this purpose

As an alternative to univariate scoring, we might consider more computationally-intensive search algorithms. For example, a greedy forward-selection algorithm would (i) select the one feature that scores best individually according to some criterion, (ii) select the one feature that scores best together with the already chosen feature(s), (iii) repeat until the desired number of features have been chosen. The misclassification rate (4) itself is the ideal score to use for guiding the selection of a subset. As such, we propose the greedy forward-selection algorithm that proceeds as above, using an estimate of (4) to score each proposed subset. For reference, the greedy algorithm identifies the same subset as the exhaustive search algorithm in the example of

We evaluate different methods for estimating the optimal subset of a given size with the following sets of simulations. There are 3 classes and 1000 genes, from which a subset of size 30 is desired. For each class, 15 training samples and 15 test samples are generated. In simulation set one (

                                          Absolute Value of Correlation
Algorithm | Centroids  | Covariance   | 0.00        | 0.40        | 0.65        | 0.90
----------|------------|--------------|-------------|-------------|-------------|------------
PAM       | shrunken   | diagonal     | 0.15 (0.07) | 0.15 (0.06) | 0.18 (0.07) | 0.29 (0.08)
Clanc     | unshrunken | unrestricted | 0.30 (0.08) | 0.28 (0.08) | 0.28 (0.07) | 0.11 (0.07)
Clanc     | shrunken   | unrestricted | 0.27 (0.09) | 0.27 (0.08) | 0.26 (0.10) | 0.09 (0.07)
Clanc     | unshrunken | diagonal     | 0.07 (0.04) | 0.08 (0.04) | 0.10 (0.04) | 0.19 (0.07)
Clanc     | shrunken   | diagonal     | 0.06 (0.04) | 0.08 (0.04) | 0.10 (0.05) | 0.19 (0.07)
Clanc     | unshrunken | shrunken     | 0.30 (0.08) | 0.28 (0.08) | 0.28 (0.07) | 0.06 (0.04)
Clanc     | shrunken   | shrunken     | 0.27 (0.09) | 0.27 (0.08) | 0.26 (0.10) | 0.06 (0.04)

Shrinkage in this case takes place across classes rather than across features.

                                          Absolute Value of Correlation
Algorithm | Centroids  | Covariance   | 0.00        | 0.40        | 0.65        | 0.90
----------|------------|--------------|-------------|-------------|-------------|------------
PAM       | shrunken   | diagonal     | 0.33 (0.02) | 0.33 (0.02) | 0.34 (0.03) | 0.37 (0.03)
Clanc     | unshrunken | unrestricted | 0.25 (0.07) | 0.26 (0.08) | 0.25 (0.08) | 0.12 (0.08)
Clanc     | shrunken   | unrestricted | 0.25 (0.08) | 0.28 (0.09) | 0.22 (0.07) | 0.13 (0.09)
Clanc     | unshrunken | diagonal     | 0.04 (0.03) | 0.04 (0.03) | 0.06 (0.04) | 0.14 (0.06)
Clanc     | shrunken   | diagonal     | 0.03 (0.03) | 0.03 (0.03) | 0.06 (0.03) | 0.13 (0.06)
Clanc     | unshrunken | shrunken     | 0.25 (0.07) | 0.26 (0.08) | 0.25 (0.08) | 0.12 (0.08)
Clanc     | shrunken   | shrunken     | 0.25 (0.08) | 0.28 (0.09) | 0.22 (0.07) | 0.13 (0.09)

Shrinkage in this case takes place across classes rather than across features.

The details of the simulations are as follows. Twenty-five percent of the genes are noise, with centroid components μ_{i1} = μ_{i2} = μ_{i3} = 0. Another 25% of the genes characterize class one, with centroid components μ_{i1} = 0.5 or 1 in simulation sets one and two, respectively, and μ_{i2} = μ_{i3} = 0. Another 25% of the genes characterize class two, with μ_{i2} = 0.5 and μ_{i1} = μ_{i3} = 0. The remaining genes characterize class three, with μ_{i3} = 0.5 or 0.25 in simulation sets one and two, respectively, and μ_{i1} = μ_{i2} = 0. In each simulation, the genes are randomly broken into 50 blocks of 20 genes. Within each block, an autoregressive covariance structure is used. Correlation (ρ) is positive in half of the blocks and negative in the other half. In each simulation set, four scenarios are presented: (i) independence (ρ = 0), (ii) low correlation (ρ = 0.4), (iii) medium correlation (ρ = 0.65), and (iv) high correlation (ρ = 0.9). Samples for class k are generated from a multivariate normal distribution with mean μ_k and the block covariance matrix.
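The block-correlation design can be sketched as follows. For brevity this sketch uses a single positive ρ for every block, whereas the simulations above alternate positive and negative ρ across blocks; unit variances are assumed.

```python
import numpy as np

def ar1_block_cov(p, rho, block_size=20):
    """Block-diagonal covariance with AR(1) blocks: within a block,
    cov(i, j) = rho ** |i - j|; features in different blocks are
    independent. Assumes p is a multiple of block_size."""
    cov = np.zeros((p, p))
    idx = np.arange(block_size)
    block = rho ** np.abs(idx[:, None] - idx[None, :])
    for start in range(0, p, block_size):
        cov[start:start + block_size, start:start + block_size] = block
    return cov

# Example: 1000 genes in 50 blocks of 20, medium correlation, and 15
# samples drawn for one class (centroid set to zero for illustration).
rng = np.random.default_rng(1)
cov = ar1_block_cov(1000, rho=0.65)
mu = np.zeros(1000)
samples = rng.multivariate_normal(mu, cov, size=15)
```

Repeating the draw for each class centroid, and again for test samples, reproduces one simulated dataset of the kind used in the tables above.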

For each of 50 simulations, we applied the following nearest centroid classification methods. We report the PAM method

We now illustrate our methods on three previously published gene-expression microarray experiments. We compare the methods on the basis of error rates from five-fold cross-validation. We avoid gene-selection bias by completely rebuilding classifiers to identical specifications in each cross-validation iteration.
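The bias-free protocol amounts to re-running the entire fitting pipeline, feature selection included, inside each fold. A minimal sketch, in which `build_classifier` is a hypothetical stand-in for the full Clanc fitting procedure:

```python
import numpy as np

def cv_error(X, y, build_classifier, n_folds=5, seed=0):
    """K-fold cross-validation in which the entire classifier --
    including gene selection -- is rebuilt from scratch on each
    training fold, so no test sample influences feature selection.

    build_classifier(X_train, y_train) must return a function
    predict(X_test) -> predicted labels.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds   # random balanced folds
    errors = 0
    for f in range(n_folds):
        train, test = folds != f, folds == f
        # Fit everything (selection, centroids, variances) inside the fold.
        predict = build_classifier(X[train], y[train])
        errors += (predict(X[test]) != y[test]).sum()
    return errors / len(y)
```

Selecting genes on the full dataset before cross-validating would let test samples leak into the selection step and understate the true error rate, which is precisely the bias this protocol avoids.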

The first example involves small round blue cell tumors (SRBCT) of childhood

The results for the SRBCT data are shown in

Classifiers are identical to those in



We have characterized the theoretically optimal subset of a given size for a nearest centroid classifier. We have also considered the estimation of this optimal subset. Although an exhaustive search would be ideal, it is not generally practical in the genomic setting. We have thus proposed a greedy algorithm for estimating optimal subsets and demonstrated that the resulting approach can produce more accurate classifiers in both simulated and real applications. Our results indicate that some improvement in accuracy can be had by shrinking class centroids, for which we have proposed a novel procedure. Although the theoretically optimal subset explicitly incorporates correlation between features, our results concur with those of others in suggesting that correlations should be shrunken to zero in settings with many more features than samples.

We note that our approach to estimating the optimal decision rule could likely be improved upon. In particular, while maximum likelihood estimators of the class centroids and common covariance matrix have good properties themselves, the resulting estimator of the decision rule may not. An alternative in the two-class case would be to directly estimate the decision rule using a variant of logistic regression. The multiclass case would be more complicated. We intend to investigate these issues in future work.