
The authors have declared that no competing interests exist.

Conceived and designed the experiments: ZH GT. Performed the experiments: ZH GT. Analyzed the data: ZH GT. Contributed reagents/materials/analysis tools: ZH GT. Wrote the paper: ZH GT DG.

Microarray databases are a large source of genetic data which, upon proper analysis, could enhance our understanding of biology and medicine. Many microarray experiments have been designed to investigate the genetic mechanisms of cancer, and analytical approaches have been applied to classify different types of cancer or to distinguish between cancerous and non-cancerous tissue. However, microarrays are high-dimensional datasets with high levels of noise, and this causes problems when using machine learning methods. A popular approach to this problem is to search for a set of features that will simplify the structure and to some degree remove the noise from the data. The most widely used approach to feature extraction is principal component analysis (PCA), which assumes a multivariate Gaussian model of the data. More recently, non-linear methods have been investigated. Among these, manifold learning algorithms, for example Isomap, aim to project the data from a higher-dimensional space onto a lower-dimensional one. We have proposed a modification of manifold learning that incorporates prior knowledge from KEGG pathways into the construction of the manifold.

In machine learning, as the dimensionality of the data rises, the amount of data required to provide a reliable analysis grows exponentially. Richard E. Bellman referred to this phenomenon as the “curse of dimensionality” when considering problems in dynamic optimisation

In the last ten years, machine learning techniques have been investigated in microarray data analysis. Several approaches have been tried in order to: (i) distinguish between cancerous and non-cancerous samples; (ii) classify different types of cancer; and (iii) identify subtypes of cancer that may progress aggressively. All these investigations seek to generate biologically meaningful interpretations of complex datasets that are sufficiently interesting to drive follow-up experimentation.

Many methods have been implemented for extracting only the important information from the microarrays, thus reducing their size. The simplest is feature selection, in which the number of gene probes in an experiment is reduced by selecting only the most significant according to some criterion, such as high levels of activity. A number of investigations of this kind have been used to examine breast cancer

Feature extraction methods have also been widely explored. The most widely used method is principal component analysis (PCA) and many variations of it have been applied as a way of reducing the dimensionality of the data in microarrays

An approach to dimensionality reduction that can take into account potential non-linearity is based on the assumption that the data (genes of interest) lie on an embedded non-linear manifold which has lower dimension than the raw data space and lies within it. Algorithms based on manifold learning work well when the dimensionality of the data is artificially high: although each point is defined by thousands of variables, it can be accurately characterised by just a few. Samples are drawn from a low-dimensional manifold that is embedded in a high-dimensional space

We have been investigating a novel way of constructing the manifold which makes use of prior knowledge. Prior knowledge has previously been used in microarray studies

Our method of building the manifold is as follows. In common with all previous methods, we first build an affinity matrix from a set of microarrays. A gene-by-gene affinity matrix is a square matrix whose dimension is the same as the number of gene probes in the microarray data. The matrix is symmetric, and each entry is a similarity measure (for example, covariance) of the expression levels of the two genes that index it. We then fuse in information from the KEGG pathways, increasing the values in the affinity matrix for gene pairs with a strong relationship in KEGG. Next, we apply a conventional manifold learning method to the fused affinity matrix to find the manifold. Having found the manifold of the gene probes, we project the raw data onto it so that we can carry out classification experiments. This means that the KEGG pathway data is only involved in building the manifold. In contrast to previous data fusion approaches
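The pipeline above can be sketched in a few lines. This is our own minimal illustration, not the paper's exact code: the function name `fused_affinity` and the multiplicative `boost` factor are assumptions standing in for whatever fusion rule is used.

```python
import numpy as np

def fused_affinity(X, kegg_pairs, boost=2.0):
    """Build a gene-by-gene affinity matrix from expression data X
    (samples x genes) and strengthen the entries for gene pairs with a
    strong relationship in KEGG. `boost` is a hypothetical fusion weight."""
    A = np.abs(np.cov(X, rowvar=False))   # genes x genes similarity
    for i, j in kegg_pairs:               # pairs sharing KEGG pathways
        A[i, j] *= boost
        A[j, i] *= boost                  # keep the matrix symmetric
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))              # 20 microarrays, 6 gene probes
A = fused_affinity(X, [(0, 1), (2, 3)])
print(A.shape)                            # (6, 6)
```

A conventional manifold learning step would then be applied to `A` rather than to the raw covariance.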

To verify the effectiveness of our method we tested

The

Type Of Cancer | Number Of Samples | Number Of Genes |

Breast cancer | 344 cancer samples vs 1201 Other | 10935 |

Colon cancer | 286 cancer samples vs 1259 Other | 10935 |

Kidney cancer | 260 cancer samples vs 1285 Other | 10935 |

Ovary cancer | 198 cancer samples vs 1347 Other | 10935 |

Lung cancer | 126 cancer samples vs 1419 Other | 10935 |

Uterus cancer | 124 cancer samples vs 1421 Other | 10935 |

Omentum cancer | 77 cancer samples vs 1468 Other | 10935 |

Prostate cancer | 69 cancer samples vs 1476 Other | 10935 |

Endometrium cancer | 61 cancer samples vs 1484 Other | 10935 |

Acute lymphoblastic leukaemia | 19 B-Cell vs 8 T-Cell vs 10 Normal | 5000 |

The first metric we used to evaluate the density of the clusters is the Dunn Index, which is the ratio of the smallest between-cluster distance to the largest within-cluster diameter. The higher the index value, the denser and better separated the clusters. For our experiments the Dunn Index can indicate how well the resulting embedding separates the samples according to their label, since it uses the labels of the samples as the cluster indicators. In practice manifold learning does not create any clusters, but if the embedding is successful many points with the same label will end up next to each other, since the embedding is just a mapping from the original dataset to a different space. We ran this experiment for embeddings of different dimensionality (2 to 50 components), as the number of components needed depends heavily on the complexity of the data. We applied it on both sample-by-sample affinity matrices, shown in
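A minimal implementation of the Dunn Index as described (our own sketch, assuming Euclidean distances; `dunn_index` is our own helper name):

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn Index: smallest between-cluster distance divided by the
    largest within-cluster diameter. Higher values indicate denser,
    better-separated clusters."""
    labels = np.asarray(labels)
    clusters = [points[labels == c] for c in np.unique(labels)]
    # largest intra-cluster diameter
    diam = max(
        np.linalg.norm(a - b)
        for c in clusters for a in c for b in c
    )
    # smallest inter-cluster distance
    sep = min(
        np.linalg.norm(a - b)
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
        for a in ci for b in cj
    )
    return sep / diam

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
print(dunn_index(pts, [0, 0, 1, 1]))   # 10.0: well-separated clusters
```

Here the class labels play the role of cluster assignments, exactly as in the experiments above.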

The Dunn Index found using

The Dunn Index found using

The results for the Dunn Index in sample-by-sample experiments in

To evaluate the accuracy of the embeddings we used the

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 0.806 | 0.863 | 0.879 |

Colon cancer | 0.868 | 0.897 | 0.906 |

Kidney cancer | 0.931 | 0.932 | |

Ovary cancer | 0.841 | 0.842 | 0.851 |

Lung cancer | 0.902 | 0.911 | 0.917 |

Uterus cancer | 0.891 | 0.890 | 0.891 |

Omentum cancer | 0.912 | 0.912 | |

Prostate cancer | 0.954 | 0.954 | |

Endometrium cancer | 0.923 | 0.924 | 0.926 |

Type Of Cancer | Isomap | PCA |

Breast cancer | 0.782 | 0.792 | |

Colon cancer | 0.834 | 0.834 | |

Kidney cancer | 0.900 | 0.903 | |

Ovary cancer | 0.834 | 0.838 | |

Lung cancer | 0.883 | 0.886 | |

Uterus cancer | 0.882 | 0.881 | |

Omentum cancer | 0.912 | 0.912 | |

Prostate cancer | 0.943 | 0.945 | |

Endometrium cancer | 0.922 | 0.922 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 0.890 | 0.901 | 0.912 |

Colon cancer | 0.906 | 0.914 | 0.925 |

Kidney cancer | 0.952 | 0.953 | |

Ovary cancer | 0.867 | 0.870 | |

Lung cancer | 0.935 | 0.938 | 0.941 |

Uterus cancer | 0.900 | 0.905 | |

Omentum cancer | 0.923 | 0.924 | |

Prostate cancer | 0.972 | 0.972 | |

Endometrium cancer | 0.934 | 0.930 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 0.890 | 0.888 | 0.910 |

Colon cancer | 0.906 | 0.914 | 0.924 |

Kidney cancer | 0.911 | 0.954 | |

Ovary cancer | 0.945 | 0.870 | |

Lung cancer | 0.935 | 0.924 | 0.940 |

Uterus cancer | 0.901 | 0.905 | |

Omentum cancer | 0.926 | 0.923 | |

Prostate cancer | 0.970 | 0.972 | |

Endometrium cancer | 0.932 | 0.930 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 32.09034e-5 | 37.52164e-5 | 35.38524e-5 |

Colon cancer | 29.24537e-5 | 29.91476e-5 | 28.95183e-5 |

Kidney cancer | 6.72999e-5 | 11.64989e-5 | 12.68591e-5 |

Ovary cancer | 21.39207e-5 | 11.13463e-5 | 12.88114e-5 |

Lung cancer | 14.09877e-5 | 5.26385e-5 | 3.13050e-5 |

Uterus cancer | 13.01978e-5 | 3.44257e-5 | 5.51030e-5 |

Omentum cancer | 2.54772e-5 | 0.80620e-5 | 0.80620e-5 |

Prostate cancer | 2.34272e-5 | 6.79816e-5 | 4.34986e-5 |

Endometrium cancer | 1.58922e-5 | 1.92059e-5 | 1.10440e-5 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 32.09034e-5 | 27.91171e-5 | 18.32800e-5 |

Colon cancer | 29.24537e-5 | 26.86585e-5 | 16.95718e-5 |

Kidney cancer | 6.72999e-5 | 10.34294e-5 | 9.40982e-5 |

Ovary cancer | 21.39207e-5 | 24.88867e-5 | 14.85025e-5 |

Lung cancer | 14.09877e-5 | 14.62143e-5 | 12.39355e-5 |

Uterus cancer | 13.01978e-5 | 16.97889e-5 | 18.60610e-5 |

Omentum cancer | 2.54772e-5 | 2.79939e-5 | 2.12314e-5 |

Prostate cancer | 2.34272e-5 | 2.07739e-5 | 2.17724e-5 |

Endometrium cancer | 1.58922e-5 | 5.77868e-5 | 6.19262e-5 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 4.43639e-5 | 3.18558e-5 | 1.64494e-5 |

Colon cancer | 3.97728e-5 | 1.79713e-5 | 5.60684e-5 |

Kidney cancer | 6.28824e-5 | 2.24769e-5 | 2.73758e-5 |

Ovary cancer | 2.94021e-5 | 3.21893e-5 | 3.50449e-5 |

Lung cancer | 1.58082e-5 | 2.14339e-5 | 1.26192e-5 |

Uterus cancer | 1.04442e-5 | 7.45783e-5 | 7.01667e-5 |

Omentum cancer | 1.21062e-5 | 3.76439e-5 | 2.42125e-5 |

Prostate cancer | 4.12092e-5 | 1.40641e-5 | 4.41161e-5 |

Endometrium cancer | 1.14222e-5 | 1.62528e-5 | 8.67444e-5 |

Type Of Cancer | A Priori Manifold Learning | Isomap | PCA |

Breast cancer | 4.43639e-5 | 2.70937e-5 | 1.62271e-5 |

Colon cancer | 3.97728e-5 | 3.55322e-5 | 5.32299e-5 |

Kidney cancer | 6.28824e-5 | 4.56036e-5 | 3.06760e-5 |

Ovary cancer | 2.94021e-5 | 2.80513e-5 | 4.17136e-5 |

Lung cancer | 1.58082e-5 | 1.97068e-5 | 1.47822e-5 |

Uterus cancer | 1.04442e-5 | 4.53130e-5 | 7.25349e-5 |

Omentum cancer | 1.21062e-5 | 9.24128e-5 | 2.14339e-5 |

Prostate cancer | 4.12092e-5 | 1.25679e-5 | 2.42809e-5 |

Endometrium cancer | 1.14222e-5 | 8.63512e-5 | 8.75652e-5 |

The sample-by-sample affinity matrix cannot be computed directly using

In addition, we created Receiver Operating Characteristic (ROC) curves to illustrate the trade-off between true-positive and false-positive results. We used three different classification methods to illustrate the effectiveness of

For the

ROC curves found for

ROC curves found for

Using SVMs

ROC curves found for

ROC curves found for

For the same purpose we also used LDA where for gene-by-gene experiments (

ROC curves found for

ROC curves found for

If we compare the ROC curves of the three different classifiers we can see that the
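An ROC curve can be computed directly from any classifier's scores, independently of which of the three classifiers produced them. This is our own classifier-agnostic sketch in plain NumPy (`roc_points` and `auc` are our own helper names, not code from the paper):

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR pairs swept over every score threshold (descending)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)          # true positives at each cut-off
    fps = np.cumsum(1 - labels)      # false positives at each cut-off
    tpr = tps / labels.sum()
    fpr = fps / (len(labels) - labels.sum())
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc(fpr, tpr):
    """Area under the ROC curve by trapezoidal integration."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

scores = np.array([0.9, 0.8, 0.3, 0.2])   # classifier confidence scores
labels = np.array([1, 1, 0, 0])           # true classes
fpr, tpr = roc_points(scores, labels)
print(auc(fpr, tpr))                      # 1.0 for a perfect ranking
```

The same routine applied to each classifier's scores yields the curves being compared here.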

We used the Acute Lymphoblastic Leukaemia (ALL) dataset for leukaemia to demonstrate how the different cells were clustered. We have chosen the ALL dataset as it is simple enough to visualise and has been used before in

Two-dimensional manifold of the three different leukaemia cell types. Clusters of the different cell types are formed and are easily distinguished in the lower-dimensional space.

Conventional manifold learning algorithms, such as Isomap, aim to project the microarray data to a lower-dimensional space in which functionally different clusters are better separated. The lower-dimensional space is a manifold (hypersurface) contained in the original data space and found from the local distribution of the data. A large representative dataset is used to compute the manifold. Our method improves the way Isomap finds the

The results were similar across the different datasets. In the first set of results we showed, using the Dunn Index, that our algorithm creates denser clusters, whose objects lie closer to the cluster mean with smaller variance.

Incorporating prior knowledge using KEGG pathways is not limited to cancer data; it can be applied to any disease that has KEGG signatures. This, along with the fact that the method does not require any other information, makes it easy to adapt to any kind of biological problem. Other studies

When performing cross-validation experiments, both PCA and Isomap features can be computed using either the gene-by-gene affinity matrix or the sample-by-sample affinity matrix. The latter is a square matrix with dimension equal to the number of microarrays used in the experiment. Each entry represents the similarity (or distance) between the corresponding pair of microarrays. It is considerably smaller than the gene-by-gene matrix and consequently more robust to noise.
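The size difference is easy to see in a small NumPy sketch (correlation is used here as a stand-in for whichever similarity measure is chosen; the array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500))     # 30 microarrays x 500 gene probes

gene_affinity = np.corrcoef(X, rowvar=False)   # 500 x 500: gene-by-gene
sample_affinity = np.corrcoef(X)               # 30 x 30: sample-by-sample
print(gene_affinity.shape, sample_affinity.shape)   # (500, 500) (30, 30)
```

With real microarrays (thousands of probes, far fewer samples), the sample-by-sample matrix is smaller by several orders of magnitude.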

Overall we see that

In this paper we present a method which incorporates manifold learning along with a novel approach for estimating the neighbourhood graph. The cluster validation and accuracy measures, along with the original Isomap algorithm and PCA, were implemented using the

The manifold learning algorithm is used for non-linear dimensionality reduction

Determine the neighbours: for each point, find its k nearest neighbours (or all points within a fixed radius).

Construct the neighbourhood graph: points are connected to each other if they are neighbours, with edge weights equal to the distance between them.

Find the shortest path between all the nodes on the graph using a shortest-path algorithm such as Dijkstra's or Floyd-Warshall.

Construct the lower-dimensionality mapping. This is the same procedure as classical MDS: from the element-wise square of the geodesic distance matrix, D^2, form the doubly centred matrix B = -(1/2) J D^2 J, where J = I - (1/n) 1 1^T is the centring matrix, and calculate the eigenvalues and eigenvectors of B. The embedding coordinates are given by the leading eigenvectors scaled by the square roots of the corresponding eigenvalues.
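The steps above can be sketched end to end. This is a minimal NumPy illustration of standard Isomap (Floyd-Warshall is used for the shortest paths for brevity; the function name `isomap` is our own), not the implementation used in the paper:

```python
import numpy as np

def isomap(X, n_neighbors=4, n_components=2):
    """Minimal Isomap sketch: k-NN graph, all-pairs shortest paths,
    then classical MDS on the geodesic distances."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # Steps 1-2: keep only each point's k nearest neighbours (symmetrised)
    G = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:n_neighbors + 1]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]
    np.fill_diagonal(G, 0.0)
    # Step 3: geodesic distances via Floyd-Warshall
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Step 4: classical MDS -- double-centre the squared distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

t = np.linspace(0.0, 3.0, 20)
curve = np.c_[np.cos(t), np.sin(t), t]   # points along a 1-D curve in 3-D
Y = isomap(curve, n_neighbors=4, n_components=2)
print(Y.shape)                            # (20, 2)
```

Note that this sketch assumes the neighbourhood graph is connected; a disconnected graph would leave infinite geodesic distances.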

Biological pathways are usually directed graphs with labelled nodes and edges representing associations of genes participating in a biological process. These interactions can help in understanding the underlying processes in different organisms as well as their contribution to diseases. Some of the interactions include regulation of gene expression, transmission of signals and metabolic processes. It is not yet clear why and how these interactions came to exist and what other external factors, if any, contribute to them. When it comes to machine learning, information from the pathways can be used as prior knowledge for either feature selection or dimensionality reduction of the original data set. In our implementation, KEGG pathways are used to weight the distances between gene-to-gene interactions. Genes that share a greater number of common pathways should have a higher probability of being placed close together when it comes to clustering. The metric we used for weighting the distances was based on the method for feature selection

Given a pair of probes, the Jaccard coefficient is used to evaluate the similarity of the sets of pathways they share. This index, coined by Paul Jaccard
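For two probes' pathway sets the coefficient is the size of the intersection over the size of the union. A small sketch (`jaccard` is our own helper name; the KEGG identifiers are illustrative):

```python
def jaccard(pathways_a, pathways_b):
    """Jaccard coefficient between the sets of KEGG pathways that two
    gene probes belong to: |A & B| / |A | B|."""
    a, b = set(pathways_a), set(pathways_b)
    if not a | b:
        return 0.0          # neither probe has any pathway annotation
    return len(a & b) / len(a | b)

# one shared pathway out of three distinct ones -> 1/3
print(jaccard({"hsa04110", "hsa05200"}, {"hsa05200", "hsa04151"}))
```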

The distance metric selected to calculate the gene-to-gene distance was the Mahalanobis distance, which takes the covariance structure of the data into account rather than treating each dimension independently.
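A minimal sketch of the distance (our own helper, using a pseudo-inverse of the covariance matrix for numerical stability; the data sizes are illustrative):

```python
import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between expression profiles x and y, using
    the covariance estimated from `data` (observations in rows)."""
    cov = np.cov(data, rowvar=False)
    vi = np.linalg.pinv(cov)              # pseudo-inverse for stability
    d = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(d @ vi @ d))

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))           # 50 observations of 3 genes
print(mahalanobis(data[0], data[0], data))   # 0.0
```

With an identity covariance matrix this reduces to the ordinary Euclidean distance.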

The weights equation is shown in

The weights along with the Mahalanobis distance are expressed as:

First the Jaccard coefficients are calculated, then the Mahalanobis distances among the genes, and finally the weights.

The shortest paths

To evaluate the results

A Support Vector Machine

Linear discriminant analysis

The Dunn Index is an internal evaluation metric for clusters

We demonstrate the robustness and effectiveness of using pathways by removing pathways uniformly at random with different probabilities. By removing a percentage of the KEGG pathways in different runs of the algorithm, we show how the number of available pathways affects performance. We show how the Dunn Index is affected in the Endometrium (
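The removal step can be sketched as follows (our own helper `drop_pathways`; the pathway identifiers are illustrative):

```python
import random

def drop_pathways(pathways, remove_prob, seed=0):
    """Discard each KEGG pathway independently with probability
    `remove_prob`, as in the robustness experiment."""
    rng = random.Random(seed)
    return [p for p in pathways if rng.random() >= remove_prob]

paths = [f"hsa{i:05d}" for i in range(100)]
kept = drop_pathways(paths, remove_prob=0.3)
print(len(kept))   # roughly 70 of the 100 pathways survive
```

Repeating the whole pipeline with the reduced pathway set at several removal rates gives the curves of Dunn Index and ROC performance versus pathway coverage.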

A plot of the Dunn Index with different percentages of pathways.

A plot of the Dunn Index with different percentages of pathways.

A plot of the Dunn Index with different percentages of pathways.

A plot of ROC curves with different percentages of pathways.

A plot of ROC curves with different percentages of pathways.

A plot of ROC curves with different percentages of pathways.

A plot of ROC curves with different percentages of pathways.

A plot of ROC curves with different percentages of pathways.

To test our

GEMLeR provides a collection of gene expression datasets that can be used for benchmarking gene-expression-oriented machine learning algorithms. Each of the gene expression samples in GEMLeR comes from a large publicly available repository. GEMLeR was preferred mainly because:

The processing procedure of tissue samples is consistent

The same Affymetrix microarray assay platform is used (Affymetrix GeneChip U133 Plus 2.0)

There is large number of samples for different tumour types

Additional information is available for combined genotype-phenotype studies

Acute lymphoblastic leukaemia (ALL) is a form of leukaemia characterised by excess lymphoblasts. There are two main types: T-cell ALL and B-cell ALL. T-cell ALL is aggressive and progresses quickly, and is more common in older children and teenagers. B-cell ALL leukaemia

Information on the contents of the datasets is shown in

Our algorithm takes approximately 45 minutes for each embedding, which is the same as the original Isomap algorithm. PCA is, however, a lot faster, since it takes only ten minutes to fit the data and create an embedding. This is because PCA is linear while
