CHISC-AC: COMPACT HIGHEST SUBSET CONFIDENCE-BASED ASSOCIATIVE CLASSIFICATION

The associative classification method integrates association rule mining and classification. Constructing an efficient classifier from a small set of high-quality rules is important but challenging. The lazy learning associative classification method removes the need to build a classifier but suffers from high computation costs. This paper proposes a Compact Highest Subset Confidence-Based Associative Classification (CHISC-AC) scheme that generates compact subsets based on information gain and classifies new samples without constructing a classifier. Experimental results show that the proposed system outperforms both the traditional and the existing lazy learning associative classification methods on most of the benchmark datasets tested.


INTRODUCTION
Data mining deals principally with extracting knowledge from data. In the present world where data is all around us, there is a pressing need to extract knowledge or interesting information that is hidden in the available data. Association rule mining and classification are data mining functionalities. Association rule mining is concerned with extracting a set of highly correlated features shared among a large number of records in a given database. It uses unsupervised learning where no class attribute is involved in finding the association rule. On the other hand, classification uses supervised learning where a class attribute is involved in the construction of a classifier to predict the class of a new instance. Both association and classification are significant and efficient data mining techniques.
Associative classification is a recent and rewarding technique that applies the methodology of association rule mining to classification and achieves high classification accuracy. Associative classifier construction has two steps. In the first, association rule mining is applied to discover class association rules. The important element in controlling the number of rules generated in association rule mining is the support threshold. If the support value is high, then the number of rules generated is very small, but many high-confidence rules may get eliminated. On the other hand, if the support value is set to a minimum, large numbers of rules are generated. In the second step, the generated rules are ranked. After rule ranking, only the high-ranking rules are chosen to build a classifier while the rest are pruned.
Generating, ranking, and selecting a small subset of high-quality rules with which to construct the classifier is a challenging task indeed. Merschmann and Plastino (2007, 2010) proposed a lazy learning method based on the probabilistic analysis of patterns to classify datasets. This method delays the processing of the data until a new sample needs to be classified. Syed et al. (2011) proposed a lazy learning method based on support and confidence measures. These two methods improve the classification accuracy but lead to high computation cost. Chen et al. (2006) and Zhang et al. (2011) proposed a new approach for associative classification based on information gain where more informative attributes are chosen for rule generation.
The above proposals motivated us to investigate a new computational technique for lazy learning associative classification. In these methods, the rule generation phase in associative classification is a difficult step that requires a large amount of computation. To reduce the computation cost, the proposed method uses an information gain-based lazy learning associative classification method that computes subset confidence only for the information gain attribute. This method reduces the number of subsets generated. Our proposed method does not build a generalized classifier from training data for classification of new samples but directly predicts the class using compact subset confidence.
The rest of the paper is organized as follows: Section 2 discusses related past work in this field, and Section 3 explains the process of the proposed system. Section 4 shows the working principle of the proposed system using an example. The final section presents the experimental results and observations followed by the conclusion.

RELATED WORK
Associative classification was first introduced by Liu et al. (1998). It integrates two well-known data mining tasks, association rule discovery and classification. The integration is focused on a special subset of association rules whose right-hand side is restricted to a class attribute; for example, in a rule R: X → Y, Y must be a class label. Generally there are two stages in associative classification. In the first stage, association rule generation methods such as Apriori candidate generation (Agarwal & Srikant, 1994) or Frequent Pattern (FP) growth algorithms (Han et al., 2000) are used to generate class association rules. For example, the Classification based on Association (CBA) method (Liu et al., 1998) employs Apriori candidate generation and other associative methods such as Classification based on Predictive Association Rules (CPAR) (Yin & Han, 2003) and Classification based on Multiple Association Rules (CMAR) (Li et al., 2001) while lazy associative classification (Baralis et al., 2002, 2004, 2008) uses the FP growth algorithm for rule generation.
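To make the notion of a class association rule X → Y concrete, the following is a minimal sketch of how the support and confidence of such a rule could be computed. The function name, the data layout (a set of attribute=value strings plus a class label), and the toy transactions are illustrative assumptions, not taken from the methods cited above.

```python
from typing import FrozenSet, List, Tuple

# Assumed layout: each transaction is (set of "attribute=value" items, class label).
Transaction = Tuple[FrozenSet[str], str]


def rule_support_confidence(
    data: List[Transaction], antecedent: FrozenSet[str], label: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the class association rule antecedent -> label."""
    # Class labels of all transactions that contain every item of the antecedent.
    covered = [cls for items, cls in data if antecedent <= items]
    # Transactions that contain the antecedent AND carry the rule's class label.
    matched = sum(1 for cls in covered if cls == label)
    support = matched / len(data) if data else 0.0
    confidence = matched / len(covered) if covered else 0.0
    return support, confidence


# Hypothetical toy data, invented for illustration only.
toy = [
    (frozenset({"fever=yes", "cough=yes"}), "sick"),
    (frozenset({"fever=yes", "cough=no"}), "sick"),
    (frozenset({"fever=no", "cough=yes"}), "healthy"),
    (frozenset({"fever=yes", "cough=yes"}), "healthy"),
]
support, confidence = rule_support_confidence(toy, frozenset({"fever=yes"}), "sick")
```

Here the rule {fever=yes} → sick covers three transactions, two of which carry the class sick, so its support is 2/4 and its confidence is 2/3.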
The rule generation step can generate a very large number of rules and is a difficult step requiring a large amount of computation. Experimental results reported by Baralis et al. (2008) have shown that the CBA method, which uses the Apriori association rule mining algorithm, generates more than 80,000 rules for some datasets. This leads to memory size issues and other severe problems, such as overfitting. If all the generated rules are used in the classifier, the accuracy of the classifier is high, but the process of classification is slow and time-consuming (Syed, Chandran, & Jabez, 2011).
In the next step, generated rules are ranked. After rule ranking, only the high-ranking rules are chosen to build a classifier with the remaining rules being pruned. The rule ranking can be based on several different types of parameters and interestingness measures, such as confidence, support, lexicographical order of items, etc.
In the CBA method, the rules are arranged based on their confidence value. If two rules have the same confidence value, then the rules are sorted based on their support. If both confidence and support values are the same for two rules, then the rules are sorted based on their rule length. Even after considering confidence, support, and cardinality measures, if some rules exist with the same values for all three measures, the rules are sorted based on their lexicographic order as in the lazy pruning method (Baralis et al., 2008). After rule ranking, the CBA method uses a database coverage method to construct the classifier. CMAR (Li et al., 2001) applies the chi-square test (Snedecor & Cochran, 1989), which gives positively correlated rules that can be used in the classifier. CPAR (Yin & Han, 2003) uses the Laplace accuracy measure to estimate the expected error rate for each rule, and the best rules are selected to construct the classifier.
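The ranking order described above (confidence, then support, then rule length, then lexicographic order) can be sketched as a single sort key. The `Rule` class and `rank_rules` function are hypothetical names introduced for illustration. The text does not state the direction of the length comparison, so preferring shorter rules is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[str, ...]  # items on the left-hand side, kept in sorted order
    label: str                   # class label on the right-hand side
    confidence: float
    support: float


def rank_rules(rules: List[Rule]) -> List[Rule]:
    """Sort rules by descending confidence, then descending support,
    then ascending length (assumption: shorter rules first),
    then lexicographic order of the antecedent."""
    return sorted(
        rules,
        key=lambda r: (-r.confidence, -r.support, len(r.antecedent), r.antecedent),
    )


# Hypothetical rules, invented for illustration only.
r1 = Rule(("a=1",), "yes", 0.90, 0.2)
r2 = Rule(("b=2",), "no", 0.90, 0.3)
r3 = Rule(("a=1", "b=2"), "yes", 0.90, 0.3)
r4 = Rule(("c=3",), "no", 0.95, 0.1)
ranked = rank_rules([r1, r2, r3, r4])
```

With these values, `r4` ranks first on confidence, `r2` precedes `r3` because it is shorter at equal confidence and support, and `r1` comes last because of its lower support.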
A recent approach for rule pruning is lazy pruning (Baralis et al., 2008), where a rule is pruned only if it misclassifies the data. The entire ruleset is segregated into three sets, namely, useful rules, harmful rules, and spare rules. A rule that classifies at least one data item correctly is said to be a useful rule, and that which misclassifies a data item is a harmful rule. The leftovers are spare rules, which are not pruned but used when needed. Though the lazy pruning strategy works well for small datasets, in the case of large datasets, there are constraints in memory space and ruleset quality. An evolutionary based associative classification method (Syed, Chandran, & Jabez, 2011) was proposed recently. This approach randomly uses a subset of rules to construct the classifier. The richness of the ruleset improves over generations.
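The three-way split used by lazy pruning can be illustrated with a short sketch. It assumes that each training item is classified by the first matching rule in ranked order, and that a rule with at least one correct classification counts as useful even if it also misclassifies other items. Both points, and the name `partition_rules`, are assumptions made for this example and are not details confirmed by the description above.

```python
from typing import FrozenSet, List, Tuple

RuleT = Tuple[FrozenSet[str], str]   # (antecedent items, predicted class)
ItemT = Tuple[FrozenSet[str], str]   # (items of a training record, true class)


def partition_rules(
    ranked_rules: List[RuleT], training: List[ItemT]
) -> Tuple[List[RuleT], List[RuleT], List[RuleT]]:
    """Split ranked rules into (useful, harmful, spare) lists."""
    correct = [0] * len(ranked_rules)
    wrong = [0] * len(ranked_rules)
    for items, true_class in training:
        # Assumption: only the first matching rule classifies the item.
        for i, (antecedent, label) in enumerate(ranked_rules):
            if antecedent <= items:
                if label == true_class:
                    correct[i] += 1
                else:
                    wrong[i] += 1
                break
    useful, harmful, spare = [], [], []
    for i, rule in enumerate(ranked_rules):
        if correct[i] > 0:
            useful.append(rule)    # classified at least one item correctly
        elif wrong[i] > 0:
            harmful.append(rule)   # only ever misclassified items
        else:
            spare.append(rule)     # never used on the training data
    return useful, harmful, spare


# Hypothetical rules and training items, invented for illustration only.
rules = [
    (frozenset({"a"}), "yes"),
    (frozenset({"b"}), "no"),
    (frozenset({"c"}), "yes"),
]
train = [
    (frozenset({"a", "b"}), "yes"),
    (frozenset({"b"}), "yes"),
    (frozenset({"a"}), "no"),
]
useful, harmful, spare = partition_rules(rules, train)
```

In this toy run the first rule is useful, the second is harmful, and the third is spare because no training item reaches it.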
Syed, Chandran, and Abinaya (2011) proposed a compact weighted associative classification scheme based on information gain, wherein the class association rule generation algorithm chooses non-class information gain attributes from the dataset, and all the items are generated based only on that attribute. Thus this algorithm reduces the number of itemsets generated. The algorithm calculates the weighted support and weighted confidence for each item and determines whether the item is frequent or not. All of the above recent associative classification methods construct a generalized model (classifier) to classify the new data sample, but our proposed work eliminates the classifier construction and uses information gain to directly choose the best attribute, predicting the class based on the highest subset confidence value. The following section discusses our proposed method.

Attribute selection based on information gain
Generally, an Apriori-based rule generation algorithm can generate up to $2^k$ rules for a dataset with $k$ items. To reduce the number of rules generated, an information gain attribute can be used. Information gain is a measure used in information theory to quantify the 'information content' of messages (Han & Kamber, 2001). In the ID3 decision tree algorithm (Quinlan, 1986) information gain is used to choose the best split attribute.
In this paper, information gain is used to generate class association rules. In the process of generating these rules, instead of considering all the attributes, the information gain measure selects the best splitting attribute. In this way, our work generates a limited number of good quality rules. Figure 1 shows the system flow of the proposed system, which is called the Compact Highest Subset Confidence-Based Associative Classification (CHISC-AC) method.
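The expected information needed to classify a transaction in D is defined as:

$$\mathrm{Info}(D) = -\sum_{i=1}^{g} p_i \log_2 p_i, \qquad p_i = \frac{|C_{i,D}|}{|D|} \tag{1}$$

where $|C_{i,D}|$ is the number of transactions in D that belong to class $C_i$, $|D|$ is the number of transactions in database D, and $g$ is the number of classes.

After the dataset D is partitioned into subsets $D_1, \dots, D_n$ by the $n$ values of attribute A, the expected information requirement is defined as:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \tag{2}$$

The information gained by partitioning D according to attribute A is defined as:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{3}$$

The best split attribute is the one that maximizes the information gain in dataset D. These best attributes are then used to generate the subsets.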
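The attribute selection step can be sketched in a few lines, using the standard entropy-based definition of information gain (Han & Kamber, 2001). The function names, the dictionary-based row layout, and the toy rows are assumptions made for this illustration.

```python
from collections import Counter
from math import log2
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Expected information (in bits) needed to identify the class of an item."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], attribute: str, class_key: str) -> float:
    """Entropy of the whole dataset minus the expected entropy after splitting on attribute."""
    labels = [row[class_key] for row in rows]
    partitions: Dict[str, List[str]] = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[class_key])
    expected = sum(len(part) / len(rows) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected


def best_attribute(rows: List[Dict[str, str]], attributes: List[str], class_key: str) -> str:
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, class_key))


# Hypothetical toy rows: attribute "x" separates the classes perfectly, "y" does not.
rows = [
    {"x": "a", "y": "p", "cls": "yes"},
    {"x": "a", "y": "q", "cls": "yes"},
    {"x": "b", "y": "p", "cls": "no"},
    {"x": "b", "y": "q", "cls": "no"},
]
chosen = best_attribute(rows, ["x", "y"], "cls")
```

In this toy dataset the gain of `x` is 1 bit and the gain of `y` is 0, so `x` is selected as the information gain attribute.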

Subset evaluation
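After the information gain attribute is identified, the subsets are generated based on this attribute. For each generated subset, confidence values are calculated.
Let $T_X$ be the set of transactions in the dataset that contain subset $X$. Supposing that there are $m$ classes $C_1, \dots, C_m$, the confidence of $X$ for class $C_i$ is

$$\mathrm{conf}(X \rightarrow C_i) = \frac{|\{t \in T_X : \mathrm{class}(t) = C_i\}|}{|T_X|}, \qquad i = 1, \dots, m \tag{4}$$

where X is the subset and C is the class.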
Deciding which class will be assigned to the instance X is based on an analysis of the subsets of attribute values associated with higher confidence.
All confidence values are sorted, and a lower limit is set to limit the number of possibilities.
The lower limit is calculated from M, the number of classes.
The class with the maximum confidence is assigned to the test sample.
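One possible reading of the subset evaluation step is sketched below. Several points are assumptions rather than details confirmed by the text: subsets are taken to be all combinations of the test instance's attribute values that include the information gain attribute, the lower limit is passed in as a plain parameter (`lower_limit`) because its formula is not reproduced here, and the final class is the one with the most subset confidences above that limit, with ties broken by the highest confidence. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List, Optional, Tuple

Row = Dict[str, str]
Subset = FrozenSet[Tuple[str, str]]


def subsets_with_anchor(instance: Row, anchor: str) -> Iterator[Subset]:
    """Yield every subset of the instance's attribute values that contains the anchor attribute."""
    others = [(a, v) for a, v in sorted(instance.items()) if a != anchor]
    base = ((anchor, instance[anchor]),)
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            yield frozenset(base + combo)


def confidence(training: List[Row], subset: Subset, label: str, class_key: str) -> float:
    """Fraction of training rows matching the subset that carry the given class label."""
    covered = [row for row in training if all(row.get(a) == v for a, v in subset)]
    if not covered:
        return 0.0
    return sum(1 for row in covered if row[class_key] == label) / len(covered)


def classify(training: List[Row], instance: Row, anchor: str,
             class_key: str, lower_limit: float) -> Optional[str]:
    """Predict a class without building a classifier (lazy evaluation per instance)."""
    classes = sorted({row[class_key] for row in training})
    votes: Counter = Counter()
    best: Dict[str, float] = {}
    for subset in subsets_with_anchor(instance, anchor):
        for label in classes:
            c = confidence(training, subset, label, class_key)
            if c > lower_limit:  # keep only confidences above the lower limit
                votes[label] += 1
                best[label] = max(best.get(label, 0.0), c)
    if not votes:
        return None  # no subset confidence exceeded the limit
    return max(votes, key=lambda lab: (votes[lab], best[lab]))


# Hypothetical training rows and test instance, invented for illustration only.
training = [
    {"marker": "low", "fever": "yes", "cls": "yes"},
    {"marker": "low", "fever": "no", "cls": "yes"},
    {"marker": "low", "fever": "yes", "cls": "no"},
    {"marker": "high", "fever": "yes", "cls": "no"},
    {"marker": "high", "fever": "no", "cls": "no"},
]
predicted = classify(training, {"marker": "low", "fever": "no"}, "marker", "cls", 0.5)
```

Because only subsets containing the anchor attribute are enumerated, an instance with k attributes yields 2^(k-1) subsets instead of 2^k - 1, which is where the reduction in computation comes from under this reading.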

SAMPLE COMPUTATION
Let us consider the sample training and testing dataset of medical observations (artificially constructed) as given in Tables 1 and 2, respectively, where the training dataset contains 14 transactions and 2 class values, and the task is to predict the class label for the test (new) data instance.Among the four attributes, attribute 'CD4 Cell Count' is selected as the information gain attribute as it has the maximum information gain value.
The CHISC-AC algorithm calculates the confidence value for each subset of the testing dataset, and class labels are assigned based on high subset confidence.Consider that the information attribute is CD4 Cell Count.
For the test dataset, subsets are created from the CD4 Cell Count value, and rules are generated using information gain-based associative classification. The proposed system generates only 4 rules whereas the existing CBA algorithm generates 13 rules. This illustrates how restricting rule generation to the information gain attribute keeps the number of rules small, which matters most for large datasets with many attributes. Two yes-class entries and one no-class entry exceed the lower limit. Therefore, the yes class is assigned to the test data.

EXPERIMENTAL RESULTS AND DISCUSSION
The proposed CHISC-AC system was tested using benchmark datasets from the University of California at Irvine Repository (UCI Repository) (Blake & Merz, 1998). A brief description of the datasets is presented in Table 4. The experiments were carried out on a PC with an Intel Core 2 Duo CPU with a clock rate of 2.00 GHz and 4 GB of main memory. The 10-fold cross-validation approach (Han & Kamber, 2001) was used, where 90% of the data was randomly chosen from the dataset and used as the training dataset, and the remaining 10% was used as the testing dataset. The training dataset was used to construct a model for classification. After constructing the classifier, the test dataset was used to estimate classifier performance.
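A 10-fold split of the kind described above can be sketched as follows. This is a generic illustration using only the standard library and is not the authors' experimental code. The function name and the `seed` parameter are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds over n records."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random assignment of records to folds
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 50 records and 10 folds, each test fold holds 10% of the data.
splits = list(k_fold_indices(50, k=10, seed=1))
```

Each record appears in exactly one test fold, so every record is used for testing once and for training nine times.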

Accuracy Computation
Accuracy measures the ability of the classifier to correctly classify unlabeled data. It is the ratio of the number of correctly classified data items to the total number of transactions in the test dataset. Table 4 shows the description of the various datasets. Table 5 shows the accuracy comparison. The first column gives the dataset name; the next column gives the highest information gain attribute that is used in the proposed system. The third column shows the accuracy for various datasets using the traditional associative classification method (Liu et al., 1998). The fourth column shows the accuracy for HiSP-GC, a lazy learning associative classification method proposed by Merschmann and Plastino (2010). The last column shows accuracy values for various datasets using the proposed CHISC-AC system. Classification of the test dataset was done ten times, and the overall accuracy was obtained by calculating the average of the accuracy values obtained from the ten different runs.
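The accuracy measure and the averaging over runs can be written down directly. The sketch below uses hypothetical function names and made-up labels purely for illustration.

```python
from statistics import mean
from typing import List, Sequence


def accuracy(actual: Sequence[str], predicted: Sequence[str]) -> float:
    """Ratio of correctly classified items to all items in the test set."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def average_accuracy(run_accuracies: List[float]) -> float:
    """Overall accuracy as the mean of the per-run accuracies."""
    return mean(run_accuracies)


# Hypothetical labels for a single run: 3 of 4 predictions are correct.
single_run = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])
```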
Table 5 shows the accuracy comparison with a minimum support of 1% and a minimum confidence greater than 50%. The proposed system uses the information gain attribute to generate rules. The proposed algorithm shows an improvement of about 3.87 percent over the traditional associative classification method and about 3.39 percent over the existing lazy learning (HiSP-GC) method. There are 3 datasets where the new method gives poorer results. This is due to class distribution. In these datasets one class occurs with very low frequency. For example, the Balance Scale dataset contains 3 class labels, namely L, B, and R, distributed as 46.08%, 7.84%, and 46.08%. In these cases the proposed system may not learn about the class B. Therefore, further research is needed to improve the algorithm.
Table 6 shows the average time taken to predict a single instance. For example, the existing system takes 0.591 seconds to predict a single instance for the mushroom dataset because the mushroom dataset consists of 8124 records, 23 attributes, and 2 class labels. Here the number of records and attributes is high. Therefore, the existing system takes a large amount of time to predict this single instance. On the other hand, the proposed system using the information gain attribute takes only 0.0592 seconds to predict a single instance from the mushroom dataset.

Sensitivity and Specificity
An imbalance in class occurs when one of the classes in a dataset is represented by a smaller number of samples than the other classes (Barandela et al., 2003; Weiss, 2004). The accuracy measure is used extensively to compare the performance of classifiers but may not be suitable for evaluating imbalanced datasets (Tan, 2006).
The class imbalance issue might skew the prediction accuracy of classification models (Guha & Schurer, 2008; Hsieh et al., 2008), resulting in a weakened performance of machine learning algorithms (Kang & Cho, 2006). Therefore, other metrics such as sensitivity, specificity, precision, and recall are used.
To use these metrics, the following terminology is generally used.
1. The rare class is denoted as a positive class and the majority class is denoted as a negative class.
Table 7 shows the sensitivity and Table 8 shows the specificity for six datasets. The proposed system has a high sensitivity (TPR) and specificity (TNR) for most of the datasets. The only dataset with low sensitivity is Breastw. This is due to missing values. Further research on replacing missing values may improve the classification accuracy.
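Sensitivity (true positive rate) and specificity (true negative rate) follow their standard definitions, TP / (TP + FN) and TN / (TN + FP). The sketch below computes both for a binary problem. The function name and the toy labels are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    actual: Sequence[str], predicted: Sequence[str], positive: str
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) treating `positive` as the rare class."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # rare class predicted correctly
            else:
                fn += 1   # rare class missed
        else:
            if p == positive:
                fp += 1   # majority item wrongly flagged as rare
            else:
                tn += 1   # majority class predicted correctly
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical labels: "p" is the rare (positive) class.
actual = ["p", "p", "n", "n", "n", "n"]
predicted = ["p", "n", "n", "n", "n", "p"]
sens, spec = sensitivity_specificity(actual, predicted, "p")
```

Here one of the two positive items is found (sensitivity 0.5) and three of the four negative items are correctly rejected (specificity 0.75).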

Precision and recall
Sensitivity and specificity measures are used in binary class datasets whereas precision and recall measures are used in multiclass problems.
The definition of precision and recall is given below.
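$$\text{Precision } (P) = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Recall } (R) = \frac{TP}{TP + FN} \tag{12}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for a class.
Precision and recall measures are useful when the dataset consists of multiple classes. Thus we compute precision and recall for the following datasets: Balance Scale, Car, Ecoli, Flare, and Glass.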
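For a multiclass dataset, precision and recall are commonly computed per class in a one-versus-rest fashion. The sketch below follows that convention, which is an assumption here since the text does not say how the per-class values are aggregated. The function name is hypothetical, and the labels merely borrow the L, B, R class names of the Balance Scale dataset without using its data.

```python
from typing import Dict, Sequence, Tuple


def precision_recall_per_class(
    actual: Sequence[str], predicted: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Map each class label to its (precision, recall), treating it as the positive class."""
    result: Dict[str, Tuple[float, float]] = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        result[label] = (precision, recall)
    return result


# Hypothetical predictions in which the rare class "B" is never predicted.
actual = ["L", "L", "B", "R", "R", "R"]
predicted = ["L", "R", "L", "R", "R", "R"]
scores = precision_recall_per_class(actual, predicted)
```

The rare class `B` receives a recall of 0 in this toy case, which mirrors the kind of effect a skewed class distribution can have on recall.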

ACCURACY PERFORMANCE VALIDATION
To validate the statistical significance of the proposed associative classification algorithm, a cross-validated paired t-test (Dietterich, 1998) was performed. The test was performed at significance level p = 0.05 for all the datasets. The CHISC-AC algorithm was compared with the CBA and the existing HiSP-GC associative classifiers using 12 datasets. Table 11 reports the significance evaluation results.
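The statistic of a paired t-test over per-fold accuracies can be computed with the standard library alone. The sketch below is a generic illustration, not the authors' evaluation script. The function names are hypothetical, and the critical value 2.262 is the standard two-sided 5% threshold of the t distribution with 9 degrees of freedom, which applies when 10 folds are compared.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Two-sided critical value of Student's t at the 0.05 level with 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """t statistic of the per-fold differences a - b (degrees of freedom = folds - 1)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long sequences with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def significantly_different(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """Decision for exactly 10 folds at the 0.05 level, using the tabulated critical value."""
    if len(scores_a) != 10:
        raise ValueError("the tabulated critical value assumes 10 folds")
    return abs(paired_t_statistic(scores_a, scores_b)) > T_CRITICAL_DF9
```

A positive statistic above the critical value means the first method is significantly better on that dataset, and a negative one below the negated critical value means it is significantly worse.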

CONCLUSION
The CHISC-AC method is a computationally efficient classifier. This method predicts the class for each sample in a database based on an evaluation of its subsets of attribute values. This method uses the information gain attribute to generate a compact subset and calculates the a posteriori confidence for each subset. Then it predicts the class based on that knowledge. CHISC-AC was tested with thirteen datasets from the UCI Machine Learning Repository. Compared with the traditional associative classifier and the existing lazy learning associative classification method, the experimental results show that the proposed system performs better most of the time and takes an average of 0.0133 seconds to predict an instance, less time than the other two methods.

Table 5. Accuracy comparison

Table 6. Average time taken to predict a single instance

Table 9. Precision

High precision implies that most of the predicted classes are correct. But high precision at low recall levels indicates that the classifier might not have predicted the rare class correctly. Tables 9 and 10 depict the precision and recall of the existing and proposed associative classifiers for multiclass datasets. It should be noted that for the Car dataset the proposed system has the same precision as the existing system. Recall is lower for both the Balance Scale and Flare datasets because one class has few samples in the class distribution.

Table 11. Paired t-test. X/Y indicates that the proposed method is significantly better X times and worse Y times than the existing method.

Table 11 shows paired t-test values for the average class accuracy of 10-fold cross-validation. The test shows that the average classification accuracy over 10 different runs is significantly better for our proposed CHISC-AC algorithm on 9 and 6 datasets (and worse on 2 and 4 datasets) when compared with the CBA and HiSP-GC methods, respectively. The results in Table 11 indicate that CHISC-AC gives statistically significantly better performance in these cases.