A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters

Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models on the Pima Indians Diabetes and the Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequentially fitted parameters are then extracted, and their respective probability density estimates are used to track their variability using an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity, and minimises over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance.


INTRODUCTION
Choosing from a range of competing models is a common practice in predictive modelling, in which the selection of the optimal model and its performance depends on a combination of factors. In classification, for instance, the consequences of misclassification largely depend on the true class of the object being classified and the way it is ultimately labelled. Generally, the performance of both parametric and non-parametric models depends exclusively on the chosen model, the sampled data, and the available knowledge of the underlying problem. For instance, the accuracy and reliability of, say, a medical test will depend not only on the diagnostic tools but also on the definition of the state of the condition being tested. Such variations make model complexity a natural challenge in data modelling (Mwitondi, 2010). Thus, when data sources, repositories, and modelling tools are shared, it is imperative to work out a unifying environment with the potential to yield consistent results across applications. Achieving this goal requires striking a balance between model accuracy and reliability across applications (Mwitondi & Said, 2011). This paper examines how multiple model performances can be used to devise a generalised strategy for attaining the foregoing balance in predictive modelling. Using a generic two-class scenario, it addresses the underlying issues relating to prediction errors and combines model-generated numerical and graphical outputs to decipher data patterns for optimality. More specifically, the paper sets off from conventional approaches for group separation, ROC curves (Egan, 1975) and the Youden Index (Youden, 1950), to propose an iterative algorithm for detecting separation levels based on estimated data densities. The paper is organised as follows. Section 2 provides an overview of the methods and simulations. Section 3 outlines the modelling strategy, implementation, results, and discussions. The concluding remarks and potential future applications are outlined in Section 4.

METHODS AND SIMULATIONS
The ultimate goals are to illustrate the nature of variation in the performance of various models given the random nature of the data used to train and test them, and to devise a modelling strategy for optimising the model selection process. The illustrations and the strategy derive from the Bayesian rule as outlined in Berger (1985), the decision tree (DT) domain-partitioning technique as described in Breiman et al. (1984), and the receiver operating characteristics (ROC) analysis as outlined in Egan (1975).

Allocation rule errors due to data randomness
As shown in Table 1, the total empirical error is typically associated with randomness due to the allocation region and randomness due to assessing the rule by random training and validation data (Mwitondi, 2003).
Table 1. Error types associated with domain-partitioning modelling (Source: Mwitondi, 2003). Column headings: POPULATION, TRAINING, CROSS-VALIDATION, TEST.
Thus, given data points in c different classes, the overall misclassification error is computed as the sum of the weighted probabilities of observing data belonging to a particular class given that we are not in that class, i.e.,

epsilon = sum_{i=1..c} sum_{j != i} pi_j * P(x in R_i | class j),

where R_i and pi_j represent the partition regions and the class priors respectively in a typical Bayesian context. Various approaches for minimising this error have been proposed; see, for instance, Reilly and Patino-Leal (1981), Wan (1990), Freund and Schapire (1997), and Mwitondi et al. (2002). A commonly acceptable practice is to vary the allocation rule in order to address specific requirements of an application. We illustrate the scenario based on decision trees as in Breiman et al. (1984) and ROC curves (Egan, 1975).
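As a hedged numerical sketch of this weighted-error sum, consider a two-class case; the priors and misallocation probabilities below are invented for illustration and are not taken from either dataset in this paper:

```python
# Hypothetical two-class example: the overall misclassification error is
# the sum of the class priors pi_j weighted by the probability of
# allocating an observation of class j outside its region R_j.
priors = {"positive": 0.35, "negative": 0.65}        # pi_j (illustrative)
p_misalloc = {"positive": 0.12, "negative": 0.08}    # P(x not in R_j | class j)

total_error = sum(priors[c] * p_misalloc[c] for c in priors)
print(round(total_error, 4))  # 0.35*0.12 + 0.65*0.08 = 0.094
```

Varying the allocation rule changes the misallocation probabilities, and hence the weighted total, which is the quantity the approaches cited above seek to minimise.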

Decision trees modelling
Growing a tree amounts to sequentially splitting the data into, typically, two super-sets, A and B, based on a single predictor at a time. The observations in A and B lie on either side of a hyper-plane chosen in such a way that a given measure of impurity is minimised. The splitting continues until an adopted stopping criterion is reached. Selecting an optimal model is one of the major challenges data scientists face. Breiman et al. (1984) propose an automated cost-complexity measure, described as follows. Let the complexity of any sub-tree T be defined by its number of terminal nodes, |T~|. Then, if we define the cost-complexity parameter alpha >= 0, the cost-complexity measure can be defined as

R_alpha(T) = R(T) + alpha * |T~|,

where R(T) is the resubstitution cost of T. Let T_t be any branch of the sub-tree rooted at node t, and define R(T_t) = sum over t' in T~_t of R(t'), where T~_t represents the set of all terminal nodes in T_t. They further show that, given t any non-terminal node of T, the inequality R(t) > R(T_t) holds. It can be shown that for any sub-tree we can define a measure of impurity as a function of alpha. Typically, growing a large tree yields high accuracy but risks over-fitting, while growing a small tree does the opposite. The measure of impurity will typically return different estimates for different values of alpha, directly impinging on accuracy and reliability. One way of assessing model performance is to use ROC curves.
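The trade-off can be sketched in Python; the sub-trees, their resubstitution costs R(T), and terminal-node counts below are invented for illustration, not taken from the fitted models in this paper:

```python
# Cost-complexity measure R_alpha(T) = R(T) + alpha * |T~|, following
# Breiman et al. (1984), where |T~| is the number of terminal nodes.
def cost_complexity(resub_cost, n_terminal, alpha):
    return resub_cost + alpha * n_terminal

# Hypothetical sub-trees: (resubstitution cost R(T), terminal-node count).
subtrees = {"small": (0.25, 3), "medium": (0.15, 7), "large": (0.10, 15)}

# Sweeping alpha shifts the optimum from the large (accurate but
# over-fit-prone) tree towards smaller, more reliable trees.
results = {}
for alpha in (0.0, 0.01, 0.05):
    results[alpha] = min(subtrees,
                         key=lambda k: cost_complexity(*subtrees[k], alpha))
print(results)  # {0.0: 'large', 0.01: 'medium', 0.05: 'small'}
```

The sweep makes concrete how the penalty parameter alpha mediates the accuracy/reliability balance discussed above.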

Implementation and results
Graphical results for both datasets are presented in Figure 1. Left to right, the first two panels correspond to the Pima ROC and predictive patterns, while the remaining two correspond to those of the Bupa dataset. Based on the ROC convention, a classifier is optimal only if it yields results in the top left corner, given the set conditions. Thus, the performance ranking in both cases was DT1, DT2, and DT3. The maximum differences (Youden Indices) between the TPR and FPR for each model, the areas under the curve, and the minimum prediction errors are highlighted in Table 3, agreeing with this ranking. Note that Bupa's gaps between DT1 and DT2 and between DT2 and DT3 are much wider than Pima's, implying that DT1 and DT2 performance on the Pima Indians data is almost indistinguishable. Further, repeated simulations are expected to vary depending on factors such as data sources and the settings defined in Section 2.2.
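The Youden Index and AUC computations used here can be sketched as follows; the scores and labels are synthetic, and the AUC is a simple trapezoidal estimate rather than any of the values reported in Table 3:

```python
import numpy as np

def roc_points(scores, labels):
    """TPR and FPR at each candidate threshold (scores sorted descending)."""
    order = np.argsort(-scores)
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()
    # Prepend the (0, 0) corner of the ROC curve.
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

# Synthetic binary labels and noisy classifier scores (illustrative only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
scores = labels + rng.normal(0.0, 0.8, 200)

fpr, tpr = roc_points(scores, labels)
youden = np.max(tpr - fpr)                        # Youden Index J = max(TPR - FPR)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal AUC
print(round(youden, 3), round(auc, 3))
```

Ranking competing classifiers by their Youden Indices and AUCs in this way reproduces the comparison summarised in Table 3.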

Proposed strategy for optimal model selection
Typically, one classifier is preferred to another if it yields a higher class posterior probability than the other (Webb, 2005; Mwitondi, 2003). We have shown that the rate at which the models out-perform each other is data-dependent, and so the fitting patterns may provide a good starting point in the search for optimality. If we denote the Bayesian error from a notional population on which the performance of a predictive model is assessed by epsilon_B and the data-dependent error by epsilon_D, then the total error decomposes as epsilon = epsilon_B + epsilon_D, where epsilon_B is irreducible and epsilon_D varies with the training and validation samples.
We can then measure model reliability by tracking the quantity E[Delta], the expected sequential difference between fitted parameters, as an indicator of stability/variability across models. The algorithm, described in the following simple steps, seeks to minimise the risk of over-fitting or under-fitting the data across applications.

Given a set of competing classifiers {C_1, C_2, ..., C_K}:
1. Extract the fitted TP and FP vectors for each classifier.
2. For j := 1 to K - 1: compute and store the sequential differences DIFFS between the TP and FP vectors of classifiers j and j + 1. End For.
3. Set a long bandwidth vector H for a (typically Gaussian) kernel.
4. While NOT END of H: compute and plot the densities of DIFFS(TP) and DIFFS(FP). End While.
5. Examine the resulting plots and choose the model that best separates the groups. End.
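A minimal Python sketch of the steps above, assuming synthetic TP/FP vectors for K = 3 hypothetical classifiers and a hand-rolled Gaussian kernel density estimator (all names and numbers are illustrative):

```python
import numpy as np

def gaussian_kde(data, grid, h):
    """Gaussian kernel density estimate of `data` evaluated on `grid`."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Step 1: synthetic sequentially fitted TP/FP rates for K = 3 classifiers.
rng = np.random.default_rng(1)
tp = np.sort(rng.uniform(0.60, 0.95, (3, 50)), axis=1)
fp = np.sort(rng.uniform(0.05, 0.40, (3, 50)), axis=1)

# Step 2: sequential differences between the fitted parameters of
# classifiers j and j + 1 (the DIFFS vectors).
diffs = np.vstack([np.diff(tp, axis=0), np.diff(fp, axis=0)])

# Steps 3-4: sweep a bandwidth vector and estimate the density of each
# DIFFS vector; in practice these densities are plotted and compared
# visually. Here we only check each density integrates to roughly 1.
grid = np.linspace(-1.0, 1.0, 400)
areas = []
for h in (0.02, 0.05, 0.10):
    for d in diffs:
        density = gaussian_kde(d, grid, h)
        areas.append(np.sum(density) * (grid[1] - grid[0]))
print([round(a, 2) for a in areas])  # each area close to 1.0
```

Step 5 is the interactive part of the algorithm: an analyst inspects the density plots across the bandwidth sweep and selects the model whose difference densities best separate the groups.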
Graphical illustrations of the algorithm results based on both datasets at different bandwidths are shown in Figure 2. The spiky lines at the foot of each of the four density panels represent a slightly noised univariate vector of TP and FP, from left to right. Because the main purpose is to separate the two classes, we are interested not only in the positioning of the ROC curves, as in Figure 1, but also in the sequential differences in the fitted parameters, which suggest that DT1 is suitable for the BUPA data and DT2 for the PIMA data. The Gaussian kernel was used in approximating the differences in the algorithm. The optimal choice of the bandwidth is estimated by the rule of thumb h = 1.06 * sigma * n^(-1/5) (Silverman, 1984), where sigma is the standard deviation of the sample and n its size. Runs based on these optimal bandwidth values yielded results similar to those in Figure 2, suggesting DT1 for BUPA and DT2 for PIMA. Because each of the ROC curves in Figure 1 measures the probability that the corresponding model will rank a randomly chosen positive instance higher than it will rank a randomly chosen negative one, the patterns in Figure 2 provide an insight into the level of class separation and can be used to guide model selection.
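Silverman's rule of thumb referenced above can be sketched as follows; the vector of differences is synthetic:

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb bandwidth: h = 1.06 * sigma * n^(-1/5)."""
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-1 / 5)

# Synthetic vector of sequential differences (illustrative only).
rng = np.random.default_rng(2)
diffs = rng.normal(0.0, 1.0, 400)

h = silverman_bandwidth(diffs)
print(round(h, 3))  # roughly 0.32 for n = 400 and sigma near 1
```

The resulting h would then be used as the reference value within the bandwidth sweep described in the algorithm.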

CONCLUDING REMARKS AND POTENTIAL FUTURE DIRECTIONS
Selecting the "best" model from a potentially large set of competing models is a conventional challenge in data science, and this paper sought to demonstrate an optimisation procedure for making that decision. Guided by the Bayesian rule, ROC curves, and the Youden Index, we empirically demonstrated the variation of the allocation rule using three decision tree models. For the purpose of addressing specific application requirements, a practical reality in a data-sharing environment, we introduced a novel strategy for model selection. Based on graphical visualisation, the strategy seeks to help minimise data over-fitting and performance obscurity while remaining easily understood by non-specialists. The results from this paper serve to highlight the importance of paying attention to the allocation rules in Table 1 and the associated generalisation error. The strategy can be adapted to other data-dependent domain-partitioning models, such as neural networks and support vector machines. The quantity E[Delta], the expected sequential difference between fitted parameters, can generally be accepted as a measure of performance that, for domain-partitioning purposes, we can align alongside similar measures, such as the ROC curves. The proposed strategy can readily be adapted to all applications of a binary nature, or those that can be converted into such, and our results highlight novel paths towards tackling various real-life challenges in areas such as remote sensing, seismology, oceanography, and ionospheric studies, among many others. Since error costing differs across applications, the decisions relating to model selection will typically remain application-specific. However, the proposed strategy provides prospects of interactivity and multi-disciplinary compliance in a general data-sharing framework, irrespective of the nature of the applications. We hope that this study will supplement previous studies that have focused on methods for selecting optimal models in our increasingly expanding cross-disciplinary research environment.

Figure 1 .
Figure 1. Pima (left) and Bupa (right) ROC curves and model over-fitting points

Figure 2 .
Figure 2. Differences between sequential TP and FP values and the densities of those differences

Table 2 .
Data attributes for the Pima Indians diabetes and the Bupa liver disorders datasets

Table 3 .
Performance table for each of the three models on each of the two datasets. The accuracy of the test is a function of how well it separates the groups, and it is assessed by the corresponding area under the curve. Traditionally, an area under the curve of 90%-100% is considered excellent, while anything at about 60% or less is considered a failure. Because ROC curves may mask or over-fit estimated parameters, we propose a strategy for enhancing the formulation of a generalised error.