
ALT, VJC, XwC, RR, and SD wrote various sections of the paper. VJC and ALT wrote the sample R code.

Adi L. Tarca and Roberto Romero are with the Perinatology Research Branch, NICHD/NIH/DHHS, Detroit, Michigan, United States of America. Adi L. Tarca and Sorin Drăghici are with the Department of Computer Science, Wayne State University, Detroit, Michigan, United States of America. Vincent J. Carey is with the Harvard Medical School, Channing Laboratory, Boston, Massachusetts, United States of America. Xue-wen Chen is with the Bioinformatics and Computational Life Sciences Laboratory, Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, Kansas, United States of America.

The authors have declared that no competing interests exist.

The term machine learning refers to a set of topics dealing with the creation and evaluation of algorithms that facilitate pattern recognition, classification, and prediction, based on models derived from existing data.

The history of relations between biology and the field of machine learning is long and complex. An early technique for machine learning, the perceptron, constituted an attempt to model actual neuronal behavior, and the field of artificial neural network design emerged from this attempt.

This tutorial is structured in four main components. Firstly, a brief section reviews definitions and mathematical prerequisites. Secondly, the field of supervised learning is described. Thirdly, methods of unsupervised learning are reviewed. Finally, a section reviews methods and examples as implemented in the open source data analysis and visualization language R (www.r-project.org).

Two main paradigms exist in the field of machine learning: supervised learning and unsupervised learning.

In supervised learning, objects in a given collection are classified using a set of attributes, or features. The result of the classification process is a set of rules that prescribe assignments of objects to classes based solely on values of features. In a biological context, examples of such object-to-class mappings include tissue gene expression profiles assigned to disease groups, and protein sequences assigned to their secondary structures.

In contrast to the supervised framework, in unsupervised learning, no predefined class labels are available for the objects under study. In this case, the goal is to explore the data and discover similarities between objects. Similarities are used to define groups of objects, referred to as clusters.

In some applications, such as protein structure classification, only a few labeled samples (protein sequences with known structure class) are available, while many other samples (sequences) with unknown class are available as well. In such cases, semi-supervised learning techniques, which exploit both the labeled and the unlabeled samples, can yield better classifiers than the labeled samples alone.

Life science applications of unsupervised and/or supervised machine learning techniques abound in the literature. For instance, gene expression data have successfully been used to classify patients into clinical groups and to identify new disease groups.

To support precise characterization of both supervised and unsupervised machine learning methods, we have adopted certain mathematical notations and concepts. In the next sections, we employ vector notation for individual objects (each object is described by a feature vector x) and matrix notation for collections of objects (a data matrix X with elements x_{ij}).

Let us consider the general case in which we want to classify a collection of n objects, each described by p features. The data form a matrix X with elements x_{ij}, the value of feature j measured on object i; the i-th object is thus represented by a feature vector x_{i}, and its class membership is recorded in a label y_{i}. For each class c, a discriminant function g_{c}(x) is constructed from the training data; a new object x is then assigned to the class c for which g_{c}(x) takes the largest value.

There are two main approaches to the identification of the discriminant functions g_{c}. The first relies on estimating class-conditional probability densities, from which the discriminants are derived; the second constructs the decision boundaries directly from the data, without an explicit probability model.

Suppose the classifier assigns to each object x_{i} a predicted class label. The performance of the classifier can then be summarized by how often the predicted label agrees with the true label y_{i}.

The

The goal behind developing classification models is to use them to predict the class membership of new samples.

Assume that the samples in each class c occur with prior probability p_{c} and are drawn from a multivariate normal distribution with mean μ_{c} and covariance matrix Σ_{c}.

Using the multivariate-normal probability density function and replacing the true class means and covariance matrices with sample-derived estimates (μ̂_{c} and Σ̂_{c}, respectively), the discriminant function for each class can be computed as:

g_{c}(x) = -(1/2) ln|Σ̂_{c}| - (1/2) (x - μ̂_{c})^{T} Σ̂_{c}^{-1} (x - μ̂_{c}) + ln p_{c}

where p_{c} is the prior probability of class c.

The discriminant functions are monotonically related to the class-conditional densities, so assigning a sample to the class with the largest discriminant value is equivalent to assigning it to the class under which it is most probable.

An alternative to this quadratic classifier is to assume that the class covariance matrices Σ_{c} are identical across classes and equal to a pooled estimate. Under this assumption the quadratic terms cancel, the discriminant functions become linear in x, and the resulting method is known as linear discriminant analysis.

To cope with situations when the number of features is comparable with the number of samples, a further simplification can be made to the normal-based linear discriminant, by setting all off-diagonal elements in the covariance matrix to zero. This implies that covariation between features is disregarded. Such a diagonal linear discriminant classifier treats the features as independent, yet it has been found to perform well on high-dimensional data such as microarray measurements.
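As a concrete illustration, normal-based discriminants are available in the recommended R package MASS; the sketch below fits a linear discriminant on the built-in iris data, which stands in for the article's examples (the training-set size is an arbitrary choice):

```r
# Sketch: normal-based linear discriminant analysis with MASS::lda
# on the built-in iris data (illustration only, not the article's example).
library(MASS)

set.seed(1)
train <- sample(nrow(iris), 100)            # random training subset
fit <- lda(Species ~ ., data = iris, subset = train)

# Predict the class of the held-out samples
pred <- predict(fit, iris[-train, ])$class
mean(pred == iris$Species[-train])          # proportion correctly classified
```

Replacing `lda` with `qda` in the same sketch fits the quadratic (class-specific covariance) variant.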

The above-presented classifiers work optimally when their underlying assumptions are met, such as the normality assumption. In many cases, some of the assumptions may not be met. However, (as pointed out by one of the anonymous reviewers) what matters in the end for a practical application is how close the estimated class boundaries are to the true class boundaries. This can be assessed through a cross-validation process.

In very recent work, Guo and colleagues proposed a regularized form of discriminant analysis for high-dimensional data, in which the sample covariance estimate is shrunken toward a simpler, better-conditioned matrix.

One of the simplest classifiers is the k-nearest neighbor (k-NN) classifier, in which a new sample is classified by computing its distance to every training sample x_{i}.

The distances are ordered and the k closest training samples are retained; the new sample is then assigned to the class c that is most frequent among these k nearest neighbors.
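A minimal sketch of this procedure uses the knn function from the recommended class package on the built-in iris data (k = 3 is an arbitrary choice):

```r
# Sketch: k-nearest-neighbour classification with class::knn
# on the built-in iris data (illustration only; k = 3 is arbitrary).
library(class)

set.seed(3)
train <- sample(nrow(iris), 100)
pred <- knn(train = iris[train, 1:4],      # training feature vectors
            test  = iris[-train, 1:4],     # samples to classify
            cl    = iris$Species[train],   # training class labels
            k     = 3)
mean(pred == iris$Species[-train])         # proportion correctly classified
```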

The most commonly used decision tree classifiers are binary. They use a single feature at each node, resulting in decision boundaries that are parallel to the feature axes (see the accompanying figure).

The left panel shows the data for a two-class decision problem, with dimensionality p = 2; the right panel shows the corresponding binary decision tree.

Two-dimensional data points (circles and squares) belonging to two different classes.

The output layer, in turn, processes the outputs of the hidden layer. Usually there is one output unit for each class. The discriminant function implemented by the k-th output unit can be written as:

g_{k}(x) = f( Σ_{j} v_{j,k} · f( Σ_{i} w_{i,j} · x_{i} ) )

In this equation, w_{i,j} is the weight from the i-th input to the j-th hidden unit, v_{j,k} is the weight from the j-th hidden unit to the k-th output unit, and f denotes the activation function (typically sigmoidal).

Consider that n_{T} training samples are available. Training the network consists of finding the weights that minimize the total squared error between the actual network outputs o_{s} and the desired (target) outputs t_{s} over all training samples:

E = Σ_{s} Σ_{k} (o_{s,k} - t_{s,k})^{2}

Here, o_{s,k} represents the actual output of the output unit k for sample s, while t_{s,k} is the desired (target) output value for the same sample. When a sample s belongs to the class associated with output unit k, the target t_{s,k} is set to 1; otherwise it is set to 0.

For more details on theory and practical use of neural networks, please see Duda et al. and the references therein.

Support vector machines (SVMs) construct a separating hyperplane w^{T}x + b = 0 in the p-dimensional feature space, chosen to maximize the margin, i.e., the distance between the hyperplane and the closest training samples x_{i} of either class. Formally, one minimizes ||w||^{2} subject to the constraint that every training sample (x_{i}, y_{i}) lies on the correct side of the margin; the samples that lie exactly on the margin are the support vectors (SVs).

Two-dimensional data points belonging to two different classes (circles and squares) are shown in the left panel. The right panel shows the maximum-margin decision boundary implemented by the SVMs. Samples along the dashed lines are called SVs.

The linear SVMs can be readily extended to nonlinear SVMs where more sophisticated decision boundaries are needed. This is done by applying a kernel transformation, i.e., simply replacing every inner product x_{i}^{T}x_{j} in the linear SVM formulation with a kernel function K(x_{i}, x_{j}).

The kernel functions return larger values for arguments that are closer together in feature space.

In constructing linear SVMs for classification, the only parameter to be selected is the penalty parameter C, which controls the trade-off between maximizing the margin and minimizing the number of training errors.

To conclude, the key points with the SVMs are: a) one believes there is a representation of features in which classes can be discriminated by a single hyperplane (perhaps with only a few errors); b) one chooses the hyperplane that lies at the largest distance between sentinel cases near the class boundary (large margin); and c) one can use kernel transformations when data is not linearly separable in the original feature space, but it may be so in the transformed space.
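These points can be illustrated with the svm function from the add-on e1071 package (not part of base R); the radial kernel and cost value below are arbitrary choices, and iris stands in for a real application:

```r
# Sketch: a support vector machine with a radial kernel via the e1071
# add-on package (illustration only; kernel and cost are arbitrary).
library(e1071)

set.seed(4)
train <- sample(nrow(iris), 100)
fit <- svm(Species ~ ., data = iris[train, ],
           kernel = "radial",   # kernel transformation of the inner products
           cost   = 1)          # the penalty parameter C
pred <- predict(fit, iris[-train, ])
mean(pred == iris$Species[-train])   # proportion correctly classified
```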

An important aspect of the classifier design is that in some applications, the dimensionality of the data, i.e., the number of features describing each object, is very large compared with the number of available samples.

A serious difficulty arises when the number of features approaches or exceeds the number of samples: the classifier then has enough freedom to fit the peculiarities and noise of the training data, and consequently generalizes poorly to new samples. Reducing the number of features is therefore often necessary.

The statistical pattern recognition literature classifies the approaches to feature selection into filter and wrapper methods. Filter methods evaluate each feature individually, independently of any classifier, using a criterion such as a t-statistic comparing the two classes.

Although fast and easy to implement, such filter methods cannot take into account the joint contribution of the features. Wrapper methods instead use the accuracy of the resulting classifier to evaluate either each feature independently or multiple features at the same time. For instance, the accuracy of a classifier trained on a candidate subset of features can be used as the criterion for keeping or discarding that subset.
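A minimal sketch of a filter approach ranks simulated features by two-sample t-test p-values; all sizes and effect magnitudes below are arbitrary:

```r
# Sketch: a t-test "filter" that ranks features individually
# (simulated two-class data; sizes and effect magnitude are arbitrary).
set.seed(5)
X <- matrix(rnorm(100 * 50), nrow = 100)   # 100 samples x 50 features
y <- rep(c("A", "B"), each = 50)
X[y == "B", 1:5] <- X[y == "B", 1:5] + 2   # features 1-5 truly differ

# One p-value per feature, ignoring any joint effects
pvals <- apply(X, 2, function(f) t.test(f ~ y)$p.value)
order(pvals)[1:5]   # indices of the top-ranked features
```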

Clustering is a popular exploratory technique, especially with high-dimensional data such as microarray gene expression measurements.

Clustering aims at dividing objects into groups (clusters) using measures of dissimilarity, such as one minus the Pearson correlation or the Euclidean distance. For instance, in a microarray experiment the objects can be different tissue samples that can be clustered based on their gene expression profiles.

Another approach to clustering is called partitioning around medoids (PAM). Instead of cluster centroids, PAM uses medoids, actual objects of the dataset chosen so that the total dissimilarity of cluster members to their medoid is minimized; as a consequence, the method operates directly on a dissimilarity matrix.

Any distance measure can therefore be used in conjunction with PAM. The algorithm maps the resulting distance matrix into a specified number of clusters. The medoids are representations of the cluster centers that are robust with respect to outliers. This robustness is particularly important in the common situation in which many elements do not have a clear-cut membership to any specific cluster.
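A sketch of PAM via the recommended cluster package, applied to a precomputed Euclidean distance matrix on simulated data (any dissimilarity could be substituted):

```r
# Sketch: partitioning around medoids with cluster::pam applied to a
# precomputed distance matrix (simulated data; illustration only).
library(cluster)

set.seed(6)
X <- rbind(matrix(rnorm(250), nrow = 25),            # 25 objects around 0
           matrix(rnorm(250, mean = 3), nrow = 25))  # 25 objects around 3
d <- dist(X)              # Euclidean here, but any dissimilarity works
fit <- pam(d, k = 2)      # map the distances into 2 clusters
fit$medoids               # the medoid objects (robust cluster centers)
table(fit$clustering)     # cluster sizes
```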

In many biological applications, it is desirable to cluster both the features and the samples, i.e., both the rows and the columns of the data matrix X. For instance, with gene expression data one may be interested in clustering both the tissue samples and the genes at the same time. A common way to achieve this is to cluster the rows and the columns separately and to reorder the data matrix accordingly, as in the heatmap display.

Class membership is indicated by a magenta (NEG) or blue (BCR/ABL) stripe at the top of the plot region. Rows correspond to data features (genes), while columns correspond to data points (samples). Hierarchical clustering is applied simultaneously to both rows (genes) and columns (samples) of the expression matrix to organize the display.

In addition to the type of clustering algorithm (e.g., hierarchical or partitioning), the analyst must choose a measure of dissimilarity between objects and, for hierarchical methods, a linkage rule.

In

The linkage defines the desired notion of similarity between two groups of measurements. For instance, average linkage defines the distance between two clusters as the average of all pairwise distances between members of the two clusters, complete linkage uses the largest such pairwise distance, and single linkage uses the smallest.
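The effect of the linkage choice can be sketched with the base function hclust on simulated data (the cluster separation below is an arbitrary choice):

```r
# Sketch: hierarchical clustering with two different linkage rules
# (simulated data; illustration only).
set.seed(7)
X <- rbind(matrix(rnorm(100), nrow = 10),            # 10 objects around 0
           matrix(rnorm(100, mean = 4), nrow = 10))  # 10 objects around 4
d <- dist(X)

hc_avg  <- hclust(d, method = "average")    # average pairwise distance
hc_comp <- hclust(d, method = "complete")   # farthest-pair distance
cutree(hc_avg, k = 2)    # cluster labels after cutting the tree at 2 groups
```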

The R language and environment for statistical computing (www.r-project.org) is free and open source, and provides implementations of many of the machine learning methods described above through its base distribution and contributed packages.

The Bioconductor project (www.bioconductor.org) builds on R to provide packages for the analysis of genomic data, including curated experimental datasets and machine learning interfaces.

Within R, the MLInterfaces package can be obtained using Bioconductor's package installation facility, which installs a brokering interface to a substantial collection of machine learning functions, tailored to analysis of expression microarray datasets.
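A sketch of the installation step; the commands below use the BiocManager idiom of current Bioconductor releases (at the time this article was written, the biocLite() script played the same role):

```r
## Installing MLInterfaces from Bioconductor (run once, with network access).
install.packages("BiocManager")
BiocManager::install("MLInterfaces")
```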

After obtaining the MLInterfaces package, installing and loading the companion ALL experimental data package makes available a data structure representing samples on 128 individuals with acute lymphocytic leukemia.

There are 79 samples present, 37 of which present BCR/ABL fusion.
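A sketch of how such a subset can be obtained, assuming the Bioconductor ALL data package is installed; the grep/subset idiom below is one common way to restrict the data to B-cell samples with BCR/ABL fusion or no cytogenetic abnormality (NEG):

```r
## Sketch: loading the ALL expression data and restricting it to the
## B-cell BCR/ABL and NEG samples (assumes the ALL package is installed).
library(ALL)
data(ALL)
bcell <- grep("^B", as.character(ALL$BT))        # B-cell samples
types <- ALL$mol.biol %in% c("BCR/ABL", "NEG")   # the two phenotypes
all2  <- ALL[, intersect(bcell, which(types))]
all2$mol.biol <- factor(all2$mol.biol)           # drop unused levels
table(all2$mol.biol)                             # 37 BCR/ABL, 42 NEG
```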

To illustrate simple approaches to unsupervised learning, we will filter the data severely, by focusing on the 50 genes that have the largest variability over all samples as measured by the median absolute deviation (MAD). The threshold 1.43 used in the filtering step was determined by checking the data: it retains exactly the 50 most variable genes. We then invoke the R clustering functions on the reduced data matrix.
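The variability filter can be sketched as follows; a simulated matrix stands in for the ALL expression matrix so the example is self-contained, and the top-50 rule replaces the data-specific threshold 1.43:

```r
# Sketch of the variability filter: keep the features (rows) whose median
# absolute deviation across samples is largest (simulated data; in the text
# the filter is applied to the ALL expression matrix with threshold 1.43).
set.seed(8)
X <- matrix(rnorm(200 * 20), nrow = 200)   # 200 features x 20 samples
X[1:50, ] <- X[1:50, ] * 5                 # 50 features with high spread

mads <- apply(X, 1, mad)                   # per-feature variability
Xsub <- X[rank(-mads) <= 50, ]             # keep the 50 most variable
dim(Xsub)                                  # 50 features x 20 samples remain
```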

The PAM algorithm can be applied to the reduced dataset, requesting a given number of clusters; with two phenotypes present, two clusters are a natural choice.

The graphical output, shown in the accompanying figure, combines a projection of the clustered samples onto the first two PCs with a silhouette display.

Left, PC display; right, silhouette display. The ellipses plotted on the left are cluster-specific minimum volume ellipsoids for the data projected into the PCs plane. These should be regarded as two-dimensional representations of the robust approximate variance–covariance matrix for the projected clusters. The silhouette display comprises a single horizontal segment for each observation, ordered by clusters and by object-specific silhouette value within a cluster. Large average silhouette values for a cluster indicate good separation of most cluster members from members of other clusters; negative silhouette values for objects indicate instances of indecisiveness or error of the given partition.

On the left panel of

A useful data visualization method, not necessarily related to machine learning, is to project the multidimensional data points onto two or three PCs, which are the directions in the feature space showing the largest variability. R provides several functions and packages for computing and displaying PCs.

The 79 samples of the ALL dataset are projected on the first three PCs derived from the 50 original features. The blue and magenta colors are used to denote the known membership of the samples in the two classes, NEG and BCR/ABL, respectively. Note that PCA is an unsupervised data projection method, since the class membership is not required to compute the PCs.
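The projection itself can be sketched with the base function prcomp; simulated two-class data stands in for the ALL samples so the example is self-contained:

```r
# Sketch: projecting samples onto the first principal components with
# prcomp (simulated two-class data; illustration only).
set.seed(9)
X <- rbind(matrix(rnorm(30 * 50), nrow = 30),            # class 1
           matrix(rnorm(30 * 50, mean = 1), nrow = 30))  # class 2, shifted
pc <- prcomp(X, scale. = TRUE)
summary(pc)$importance[2, 1:3]     # variance explained by PC1-PC3

# Class membership is NOT used to compute the PCs, only to color the plot
plot(pc$x[, 1], pc$x[, 2],
     col = rep(c("magenta", "blue"), each = 30),
     xlab = "PC1", ylab = "PC2")
```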

Supervised methods of learning such as trees, neural networks, and SVMs will be illustrated in this section.

The following example uses 50 random samples from the ALL dataset to train a neural network classifier with one hidden layer; the remaining samples are used to test its predictions.
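A sketch of this analysis pattern with the recommended nnet package; iris stands in for the ALL data so the example is self-contained, and the network size and weight decay are arbitrary choices:

```r
# Sketch: a single-hidden-layer feed-forward network with nnet, trained on
# random samples and evaluated on the rest (iris stands in for the ALL data).
library(nnet)

set.seed(10)
train <- sample(nrow(iris), 100)
fit <- nnet(Species ~ ., data = iris[train, ],
            size  = 5,          # 5 hidden units (arbitrary)
            decay = 0.01,       # weight decay regularization (arbitrary)
            maxit = 200, trace = FALSE)
pred <- predict(fit, iris[-train, ], type = "class")
table(predicted = pred, true = iris$Species[-train])   # confusion matrix
```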

The last step in this example displays the confusion matrix achieved by the neural network classifier on the test samples.

The

A tree-structured classifier derived from the 50-gene extract from the ALL data is shown in the accompanying figure.
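A sketch of fitting and displaying such a tree with the recommended rpart package (iris stands in for the ALL extract):

```r
# Sketch: fitting and plotting a classification tree with rpart
# (iris stands in for the ALL data; default settings used).
library(rpart)

fit <- rpart(Species ~ ., data = iris)
print(fit)                # text summary of the splits
plot(fit, margin = 0.1)   # tree layout
text(fit, use.n = TRUE)   # annotate nodes with class counts
```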

The figure is obtained with the standard R plotting methods for fitted trees.

Top left: CART with

Several machine learning procedures include facilities for measuring the relative contribution of features in successful classification events. The random forest, for example, estimates the importance of each feature by randomly permuting its values and recording the resulting decrease in prediction accuracy.
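A sketch of extracting such variable importances with the add-on randomForest package (not part of base R; iris data and default settings used for illustration):

```r
# Sketch: feature importance from a random forest, via the add-on
# randomForest package (illustration on the built-in iris data).
library(randomForest)

set.seed(11)
fit <- randomForest(Species ~ ., data = iris,
                    importance = TRUE)   # compute permutation importance
importance(fit, type = 1)   # mean decrease in accuracy per feature
varImpPlot(fit)             # graphical display of the importances
```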

The R system includes a large number of machine learning methods in easily installed and well-documented packages; the Bioconductor MLInterfaces package additionally provides a unified interface to many of them for expression data.

Modern biology can benefit from the advancements made in the area of machine learning. Caution should be taken, however, when judging the superiority of some machine learning approaches over other categories of methods. It is argued that no single type of classifier performs best on all problems; the most suitable method depends on how well its assumptions match the structure of the data at hand, and should be established by fair comparison on the application in question.

We express our gratitude to the two anonymous reviewers whose specific comments were very useful in improving this manuscript. ALT and RR were supported in part by the Division of Intramural Research of the National Institute of Child Health and Human Development. VJC was supported in part by National Institutes of Health (NIH) grant 1 P41 HG004059. XwC was supported in part by National Science Foundation (NSF) award IIS-0644366 and by NIH Grant P20 RR17708 from the IDeA Program of the National Center for Research Resources. SD is partially supported by the following grants: NSF DBI-0234806, CCF-0438970, 1R01HG003491-01A1, 1U01CA117478–01, 1R21CA100740–01, 1R01NS045207–01, 5R21EB000990–03, and 2P30 CA022453–24. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, the NIH, or any other funding agency.

Abbreviations: PAM, partitioning around medoids; PC, principal component; PCA, principal component analysis; SV, support vector; SVM, support vector machine.

Notation: x (bold lowercase), vector; x (italic lowercase), scalar; X (bold uppercase), matrix; R^{p}, feature space.