A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters

Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models on the Pima Indians Diabetes and the Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequential fitted parameters are then extracted, and their respective probability density estimates are used to track their variability using an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity, and minimises over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance.


INTRODUCTION
Choosing from a range of competing models is a common practice in predictive modelling, in which the selection of the optimal model and its performance depend on a combination of factors. In classification, for instance, the consequences of misclassification largely depend on the true class of the object being classified and the way it is ultimately labelled. Generally, the performance of both parametric and non-parametric models depends exclusively on the chosen model, the sampled data, and the available knowledge for the underlying problem. For instance, the accuracy and reliability of, say, a medical test will depend not only on the diagnostic tools but also on the definition of the state of the condition being tested. Such variations make model complexity a natural challenge to data modelling (Mwitondi, 2010). Thus, when data sources, repositories, and modelling tools are shared, it is imperative to work out a unifying environment with the potential to yield consistent results across applications. Achieving this goal requires striking a balance between model accuracy and reliability across applications (Mwitondi & Said, 2011). This paper examines how multiple model performances can be used to devise a generalised strategy for attaining the foregoing balance in predictive modelling. Using a generic two-class scenario, it addresses the underlying issues relating to prediction errors and combines model-generated numerals and graphics to decipher data patterns for optimality. More specifically, the paper sets off from conventional approaches for group separation, ROC curves (Egan, 1975) and the Youden Index (Youden, 1950), to propose an iterative algorithm for detecting separation levels based on estimated data densities. The paper is organised as follows. Section 2 provides an overview of the methods and simulations. Section 3 outlines the modelling strategy, implementation, results, and discussions. The concluding remarks and potential future applications are outlined in Section 4.

METHODS AND SIMULATIONS
The ultimate goals are to illustrate the nature of variation in the performance of various models given the random nature of the data used to train and test them and to devise a modelling strategy for optimising the model selection process. The illustrations and the strategy derive from the Bayesian rule as outlined in Berger (1985), the decision trees (DT) domain partitioning technique as described in Breiman et al. (1984), and the receiver operating characteristics (ROC) analysis as outlined in Egan (1975).

Allocation rule errors due to data randomness
As shown in Table 1, the total empirical error is typically associated with randomness due to the allocation region and randomness due to assessing the rule by random training and validation data (Mwitondi, 2003). Thus, given that there are $N$ data points in $K$ different classes, the overall misclassification error is computed as the sum of the weighted probabilities of observing data belonging to a particular class given that we are not in that class. For instance,
$$P(\mathrm{error}) = \sum_{i=1}^{K} \sum_{j \neq i} P(C_j)\, P(x \in R_i \mid C_j) \tag{1}$$
where $R_i$ and $P(C_j)$ represent the partition region and the class priors respectively in a typical Bayesian context. Various approaches for minimising this error have been proposed; see, for instance, Reilly and Patino-Leal (1981), Wan (1990), Freund and Schapire (1997), and Mwitondi et al. (2002). A commonly accepted practice is to vary the allocation rule in order to address specific requirements of an application. We illustrate the scenario based on decision trees as in Breiman et al. (1984) and ROC curves (Egan, 1975).
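The weighted-sum description of the overall misclassification error can be illustrated with a small numerical sketch. The function name, the two-class priors, and the allocation probabilities below are hypothetical values chosen for illustration only; they are not taken from Table 1 or from either dataset.

```python
def overall_misclassification_error(priors, alloc_probs):
    """Sum of prior-weighted probabilities of landing in the wrong region.

    priors[j]         : assumed prior probability of true class j
    alloc_probs[j][i] : assumed probability that an observation from
                        class j falls in the region allocated to class i
    """
    total = 0.0
    for j, prior in enumerate(priors):
        for i, prob in enumerate(alloc_probs[j]):
            if i != j:  # only off-diagonal (wrong-region) terms contribute
                total += prior * prob
    return total


# Hypothetical two-class illustration (not data from the paper).
priors = [0.65, 0.35]
alloc_probs = [
    [0.90, 0.10],  # true class 0: 10% allocated to region 1
    [0.30, 0.70],  # true class 1: 30% allocated to region 0
]
error = overall_misclassification_error(priors, alloc_probs)
print(f"overall misclassification error: {error:.3f}")
```

Varying the allocation rule changes the off-diagonal entries of `alloc_probs`, which is how the error responds to application-specific requirements in this toy setting.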

Decision trees modelling
Growing a tree amounts to sequentially splitting the data into, typically, two subsets, A and B, based on a single predictor at a time. The observations in A and B lie on either side of the hyper-plane chosen in such a way that a given measure of impurity is minimised. The splitting continues until an adopted stopping criterion is reached. Selecting an optimal model is one of the major challenges data scientists face. Breiman et al. (1984) propose an automated cost-complexity measure described as follows. Let the complexity of any sub-tree $T$ be defined by its number of terminal nodes, $|\tilde{T}|$. Then, if we define the cost-complexity parameter $\alpha \geq 0$, the cost-complexity measure can be defined as
$$R_\alpha(T) = R(T) + \alpha |\tilde{T}| \tag{2}$$
where $R(T)$ is the misclassification cost of $T$. Let $T_t$ be any branch of the sub-tree $T(\alpha)$, and define $R(T_t) = \sum_{t' \in \tilde{T}_t} R(t')$, where $\tilde{T}_t$ represents the set of all terminal nodes in $T_t$. They further show that given $t$ any non-terminal node in $T(\alpha)$, the inequality $R(t) > R(T_t)$ holds. It can be shown that for any sub-tree $T_t$ we can define a measure of impurity as a function of $\alpha$ as
$$R_\alpha(T_t) = R(T_t) + \alpha |\tilde{T}_t| \tag{3}$$
Typically, growing a large tree yields high accuracy but risks over-fitting while growing a small tree does the opposite. The measure of impurity will typically return different estimates for different values of $\alpha$, directly impinging on accuracy and reliability. One way of assessing model performance is to use ROC curves.
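The trade-off between tree size and accuracy in the cost-complexity measure of Breiman et al. (1984) can be sketched numerically: the penalised cost is the misclassification cost plus the complexity parameter times the number of terminal nodes. The three candidate sub-trees and their costs below are hypothetical and serve only to show how the preferred sub-tree changes with the parameter.

```python
def cost_complexity(misclassification_cost, n_terminal_nodes, alpha):
    """Penalised cost: misclassification cost plus alpha per terminal node."""
    return misclassification_cost + alpha * n_terminal_nodes


# Hypothetical candidates: name -> (misclassification cost, terminal nodes).
subtrees = {
    "large": (0.10, 20),
    "medium": (0.16, 8),
    "small": (0.25, 3),
}


def best_subtree(alpha):
    """Return the candidate with the smallest penalised cost at this alpha."""
    return min(
        subtrees,
        key=lambda name: cost_complexity(*subtrees[name], alpha),
    )


for alpha in (0.0, 0.01, 0.05):
    print(f"alpha={alpha}: preferred sub-tree is {best_subtree(alpha)}")
```

With no penalty the largest tree is preferred; as the penalty grows, smaller trees take over, which is the sense in which different parameter values return different estimates.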

ROC curves analysis, optimality, and Youden Indexing
Without loss of generality, consider a binary medical diagnostic test scenario in which patients are tested for a particular disease and there are four possible outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). In this case, the ROC curve is constructed based on the proportions
$$\mathrm{SST} = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad \mathrm{SPT} = \frac{N_{TN}}{N_{TN} + N_{FP}} \tag{4}$$
where SST and SPT denote the sensitivity and specificity respectively, and $N_{TP}$ and $N_{FN}$ denote the number of those with the disease who are diagnosed with it and those having the disease but cleared by the test respectively. Similarly, $N_{TN}$ and $N_{FP}$ are the number of those without the disease who test negative and those testing positive without having the disease, respectively. As with type I and II errors, the usefulness of a test cannot be determined by SST/SPT alone, and so a ROC analysis trade-off is needed. Assuming four possible outcomes, the ROC accuracy (ACCR) and error (ERR) are defined as
$$\mathrm{ACCR} = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}}, \qquad \mathrm{ERR} = \frac{N_{FP} + N_{FN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}} = 1 - \mathrm{ACCR} \tag{5}$$
If we denote the data by $X$ and the set of class labels as $\{C_1, C_2\}$, the probability of accuracy can be computed as follows,
$$P(\mathrm{ACCR}) = \sum_{i=1}^{2} \int_{R_i} p(x \mid C_i)\, P(C_i)\, dx \tag{6}$$
where $R_i$ is the region of $X$ allocated to class $C_i$ and the integral is over both classes.
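As a worked illustration of the sensitivity, specificity, accuracy, and error proportions described above, the following sketch computes them from the four outcome counts. The function name and the counts are hypothetical and do not come from either dataset.

```python
def roc_proportions(n_tp, n_fn, n_tn, n_fp):
    """Sensitivity, specificity, accuracy and error from outcome counts."""
    total = n_tp + n_fn + n_tn + n_fp
    return {
        "SST": n_tp / (n_tp + n_fn),   # diseased correctly flagged
        "SPT": n_tn / (n_tn + n_fp),   # healthy correctly cleared
        "ACCR": (n_tp + n_tn) / total,  # all correct decisions
        "ERR": (n_fp + n_fn) / total,   # all wrong decisions
    }


# Hypothetical counts for 100 diseased and 200 healthy patients.
m = roc_proportions(n_tp=80, n_fn=20, n_tn=150, n_fp=50)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the test has high sensitivity but a moderate specificity, which is the kind of imbalance that makes SST or SPT alone an insufficient summary.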
The main goal of predictive modelling is to maximise $P(\mathrm{ACCR})$ or, equivalently, minimise $P(\mathrm{ERR})$ consistently across applications. By appropriately costing each of the class allocation measures, we can make the outcome depend not only on the diagnostic tools and techniques but also on the definition of the state of the tested condition. For instance, one would rather set "low specificity" for a cancer diagnostic test, i.e., let it trigger on low-risk symptoms, than miss the symptoms. Our model implementations are focused on striking this balance. One way of determining the optimal cut-off point for the ROC curves is to use the Youden Index (Youden, 1950). Its main idea is that for any binary classification model with corresponding cumulative distribution functions $F(\cdot)$ and $G(\cdot)$, say, of the scores in the positive and negative groups, then for any threshold $t$, the relationship $\mathrm{SST}(t) = 1 - F(t)$, $\mathrm{SPT}(t) = G(t)$ holds. We can then compute the index as the maximum difference between the two as
$$J = \max_t \{ \mathrm{SST}(t) + \mathrm{SPT}(t) - 1 \} = \max_t \{ G(t) - F(t) \} \tag{7}$$
Within a model, the Youden Index is the maximum difference between the true and false positive rates; between competing models, the ordering of the indices highlights the performance order.
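The Youden Index can be sketched as a sweep over candidate thresholds that keeps the largest gap between the true and false positive rates. The sketch assumes that an observation is labelled positive when its score is at or above the threshold; the scores and labels are hypothetical.

```python
def youden_index(scores, labels):
    """Return (index, threshold) maximising TPR - FPR.

    Assumes label 1 marks the positive class and that a score at or
    above the threshold is classified as positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_j, best_t = float("-inf"), None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # ties keep the lowest threshold
            best_j, best_t = j, t
    return best_j, best_t


# Hypothetical classifier scores and true labels.
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
j, t = youden_index(scores, labels)
print(f"Youden Index {j:.2f} at threshold {t}")
```

Computing this index for each competing model and sorting the results gives the between-model ordering referred to above.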

MODELLING STRATEGY, IMPLEMENTATION, RESULTS, AND DISCUSSIONS
The modelling strategy is based on the methods described in Section 2 and seeks to facilitate the selection of optimal models based on consistency of performance. Two datasets, the Pima Indians diabetes data, with 768 observations on 9 variables (NIDDK, 1990), and the Bupa liver disorders data, with 345 observations on 7 variables (Forsyth, 1990), as described in Table 2, are used. The former relate to females at least 21 years old while the latter relate to blood tests thought to be sensitive to liver disorders arising from excessive alcohol consumption. Because variations in $\alpha$ affect reliability, we trained and tested three decision tree models on each of the datasets.

Implementation and results
Graphical results for both datasets are presented in Figure 1. Left to right, the first two panels correspond to Pima ROC and predictive patterns while the remaining two correspond to those of the Bupa dataset.

Figure 1. Pima (left) and Bupa (right) ROC curves and model over-fitting points
Based on the ROC convention, a classifier is optimal only if it yields results in the top left corner, given the set conditions. Thus, the performance ranking in both cases was DT1, DT2, and DT3. The maximum differences (Youden Indices) between the TPR and FPR for each model, the areas under the curve, and the minimum prediction errors are highlighted in Table 3, agreeing with this ranking. Note that Bupa's gaps between DT1 and DT2 and between DT2 and DT3 are much wider than Pima's, implying that DT1 and DT2 performance on the Pima Indians data is almost indistinguishable. Further, results from repeated simulations are expected to vary depending on factors such as data sources and the settings defined in Section 2.2. Accuracy of the test is a function of how well it separates the groups, and it is assessed by the corresponding area under the curve. Traditionally, a 90%-100% area under the curve is considered excellent while an area of about 60% or less is considered a failure. Because ROC curves may mask or over-fit the parameters estimated from data, we propose a strategy for enhancing the formulation of a generalised error.
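For readers who wish to reproduce the area-under-the-curve comparison, a minimal sketch using the trapezoidal rule is given here. The ROC points are hypothetical and are not read from Figure 1 or Table 3; the trapezoidal rule is one common choice and is an assumption rather than the method confirmed for the reported areas.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve by the trapezoidal rule.

    fpr and tpr are matching sequences of false and true positive
    rates; the points are sorted by fpr before integration.
    """
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points running from (0, 0) to (1, 1).
fpr = [0.0, 0.25, 0.5, 1.0]
tpr = [0.0, 0.75, 1.0, 1.0]
area = auc_trapezoid(fpr, tpr)
print(f"area under the curve: {area:.4f}")
```

A curve hugging the top left corner drives this area towards 1, while the diagonal gives 0.5.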

Proposed strategy for optimal model selection
Typically, one classifier is preferred to another if it yields a higher class posterior probability than the other (Web, 2005; Mwitondi, 2003). We have shown that the rate at which the models out-perform each other is data-dependent, and so the fitting patterns may provide a good starting point in the search for optimality. If we assume a Bayesian error from a notional population on which the performance of a predictive model is assessed to be (data-dependent error), then the relationship below holds.
We can then measure model reliability by tracking the quantity [ ] as an indicator of stability/variability across models. The algorithm, described in the following simple steps, seeks to minimise the risk of over-fitting or under-fitting the data across applications.

Given a set of K competing classifiers
  Extract the vectors TP and FP
  For j := 1:K
    For each pair of sequential fitted values
      Compute the moving differences in TP, in FP, and between TP and FP
    End For
    Store the differences DIFFS, TP and FP
  End For
  Set a long bandwidth vector (typically for a Gaussian kernel)
  While NOT END of the bandwidth vector Do
    Compute and plot the densities of DIFFS, TP and FP
  End While
  Examine the resulting plots and choose the one that best separates the groups
End.
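The steps of the algorithm can be sketched in a few lines for a single classifier. This is an illustrative reading rather than the implementation used for Figure 2. It assumes that DIFFS denotes the moving differences of TP minus FP, uses a hand-written Gaussian kernel density estimate, and leaves out the plotting step. The function names, the TP and FP vectors, the bandwidth vector, and the evaluation grid are all hypothetical.

```python
import math


def moving_differences(values):
    """Differences between each value and its predecessor."""
    return [b - a for a, b in zip(values, values[1:])]


def gaussian_kde(sample, grid, bandwidth):
    """Gaussian kernel density estimate of sample evaluated on grid."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm
        for x in grid
    ]


def density_profiles(tp, fp, bandwidths, grid):
    """Densities of the sequential differences at every bandwidth."""
    series = {
        "TP": moving_differences(tp),
        "FP": moving_differences(fp),
        # Assumption: DIFFS is the moving difference of the TP - FP gap.
        "DIFFS": moving_differences([t - f for t, f in zip(tp, fp)]),
    }
    return {
        h: {name: gaussian_kde(vals, grid, h) for name, vals in series.items()}
        for h in bandwidths
    }


# Hypothetical sequential TP and FP rates for one classifier.
tp = [0.0, 0.4, 0.7, 0.85, 0.95, 1.0]
fp = [0.0, 0.05, 0.15, 0.35, 0.6, 1.0]
grid = [i / 100 for i in range(-150, 151)]
bandwidths = [0.02, 0.05, 0.1, 0.2]
profiles = density_profiles(tp, fp, bandwidths, grid)
for h in bandwidths:
    peak = max(profiles[h]["DIFFS"])
    print(f"bandwidth {h}: peak DIFFS density {peak:.2f}")
```

Repeating the call for each competing classifier and plotting the entries of `profiles` side by side gives the panels that are then inspected visually for group separation.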
Graphical illustrations of the algorithm results based on both datasets at different bandwidths are shown in Figure 2. The spiky lines at the foot of each of the four density panels represent a slightly noised univariate vector of TP and FP from left to right. Because the main purpose is to separate each of the two classes, we are interested not only in the positioning of the ROC curves as in Figure 1 but also in the sequential differences in the fitted parameters, which suggest that DT1 is suitable for the BUPA and DT2 for the PIMA data.

Figure 2. Differences between sequential TP and FP values and those of differences between them
The Gaussian kernel was used in estimating the densities of the differences in the algorithm. The optimal choice of the bandwidth is estimated as $h = (4\hat{\sigma}^5 / 3n)^{1/5} \approx 1.06\,\hat{\sigma}\, n^{-1/5}$ (Silverman, 1984), where $\hat{\sigma}$ is the standard deviation of the samples and $n$ is the sample size. Runs based on these optimal bandwidth values yielded results similar to those in Figure 2, suggesting DT1 for BUPA and DT2 for PIMA. Because the area under each of the ROC curves in Figure 1 measures the probability that the corresponding model will rank a randomly chosen positive instance higher than it will rank a randomly chosen negative case, the patterns in Figure 2 provide an insight into the level of class separation and can be used to guide model selection.
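The bandwidth choice can be sketched with the normal-reference rule of thumb usually attributed to Silverman, which scales the sample standard deviation by the sample size raised to the power of minus one fifth. The function name and the sample of differences are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics


def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (4 * sigma^5 / (3 * n)) ** (1/5).

    Assumes a Gaussian kernel and uses the sample (n - 1) standard
    deviation as the estimate of sigma.
    """
    n = len(sample)
    sigma = statistics.stdev(sample)
    return (4.0 * sigma ** 5 / (3.0 * n)) ** 0.2


# Hypothetical moving differences between sequential TP values.
sample = [0.4, 0.3, 0.15, 0.1, 0.05]
h = silverman_bandwidth(sample)
print(f"rule-of-thumb bandwidth: {h:.4f}")
```

The result is numerically the same as multiplying the standard deviation by roughly 1.06 and by the sample size to the power of minus one fifth.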

CONCLUDING REMARKS AND POTENTIAL FUTURE DIRECTIONS
Selecting the "best" model from a potentially large set of competing models is a conventional challenge in data science, and this paper sought to demonstrate an optimisation procedure for making that decision. Guided by the Bayesian rule, ROC curves, and the Youden Index, we empirically demonstrated the variation of the allocation rule using three decision tree models. For the purpose of addressing specific application requirements, a practical reality in a data sharing environment, we introduced a novel strategy for model selection. Based on graphical visualisation, the strategy seeks to help minimise data over-fitting and performance obscurity while remaining easily understood by non-specialists. The results from this paper serve to highlight the importance of paying attention to the allocation rules in Table 1 and the associated generalisation error. The strategy can be adapted to other data-dependent domain-partitioning models, such as neural networks and support vector machines. The quantity [ ] can generally be accepted as a measure of performance that, for domain-partitioning purposes, we can align alongside similar measures, such as the ROC curves. The proposed strategy can readily be adapted to all applications of a binary nature or those that can be converted into such, and our results highlight novel paths towards tackling various real-life challenges in areas such as remote sensing, seismology, oceanography, ionosphere, and many others. Since error costing differs across applications, the decisions relating to model selection will typically remain application-specific. However, the proposed strategy provides prospects of interactivity and multi-disciplinary compliance in a general data sharing framework.