
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JK JG. Performed the experiments: JK JG. Analyzed the data: JK IG JG. Contributed reagents/materials/analysis tools: JK JG. Wrote the paper: JK IG JG. Implemented the software: JK JG.

Precision-recall curves are highly informative about the performance of binary classifiers, and the area under these curves is a popular scalar performance measure for comparing different classifiers. However, for many applications class labels are not provided with absolute certainty, but with some degree of confidence, often reflected by weights or soft labels assigned to data points. Computing the area under the precision-recall curve requires interpolating between adjacent supporting points, but previous interpolation schemes are not directly applicable to weighted data. Hence, even in cases where weights were available, they had to be neglected for assessing classifiers using precision-recall curves. Here, we propose an interpolation for precision-recall curves that can also be used for weighted data, and we derive conditions for classification scores yielding the maximum and minimum area under the precision-recall curve. We investigate commonalities and differences between the proposed interpolation and previous ones, and we demonstrate that taking into account existing weights of test data is important for the comparison of classifiers.

In both theoretical and applied machine learning, assessing the performance of a classifier is of fundamental importance as a crucial step in model selection and comparison

Several of these performance measures

Varying the threshold leads to a series of confusion matrices. In case of binary classification, this series of confusion matrices can be visualized by curves, which can then be compared quantitatively by the area under curve (AUC). One popular curve is the receiver operating characteristic (ROC) curve, which plots the true positive rate (sensitivity, recall) against the false positive rate (1 - specificity)
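As a concrete illustration, the supporting points of a ROC curve and its AUC under linear interpolation can be computed with a short sketch. This is illustrative code (function names are our own, and distinct classification scores are assumed), not the authors' software:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) supporting points for decreasing score thresholds."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under a curve given as (x, y) points with non-decreasing x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Scores and hard labels assumed for illustration (1 = positive class)
points = roc_points([2.54, 2.37, 1.56, 1.35, 0.06, -1.08],
                    [1, 1, 0, 0, 1, 0])
print(round(auc_trapezoid(points), 4))  # -> 0.7778
```

Because the interpolation between ROC supporting points is linear, the trapezoidal rule computes the AUC-ROC exactly.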

For this reason, the precision-recall (PR) curve

For computing the AUC-PR and AUC-ROC, the interpolation between two adjacent points of the curve is based on a linear interpolation between the underlying confusion matrices

In recent years, soft-labeling has gained increased attention, as for many classification problems the labeled input data are associated with some measure of confidence, generically denoted as

While weights are widely used for learning classifiers from training data, the assessment of classifier performance on test data is often restricted to the hard-labeled or unweighted case. However, determining a single confusion matrix for a given threshold is straightforward: the entries of the confusion matrix are accumulated weights. Hence, it is also straightforward to compute scalar performance measures such as precision and recall for the weighted case. Consequently, the supporting points of ROC and PR curves can also be computed for weighted data, and ROC curves and AUC-ROC can be derived as in the unweighted case due to the linear interpolation.

When computing the AUC-PR for weighted data, however, it is unclear how to interpolate between such real-valued confusion matrices. For this reason, previous publications considering weighted test data resorted to approximations of the AUC-PR. Examples of such approaches are average precision

In a case study, we investigate differences between the AUC-PR computed using the continuous and the discrete interpolation. Furthermore, we investigate whether taking into account given weights when computing the AUC-PR may lead to different conclusions when comparing classifiers or when performing model selection based on this performance measure.

In this section, we first formally define weights and then revisit confusion matrices. In the remainder of this section, we revisit discrete interpolations for PR curves, propose a generalization yielding a continuous interpolation, and finally show how this interpolation can be applied to weighted data.

We consider the case of binary classification and denote the two classes by

In this paper, we additionally consider weighted data, where the following types of weights may be assigned to data point

i) Soft class labels

ii) Values

iii) Multiplicities

Since the methods presented in the remainder of this paper are applicable to all these types of weights, we generically refer to such soft labels

In

|                    | real label positive | real label negative |
| predicted positive | TP                  | FP                  |
| predicted negative | FN                  | TN                  |

Based on the classification score

(a) Classification scores, labels and weights

| classification score | class    | foreground weight | background weight |
| 2.54                 | positive | 0.90              | 0.10              |
| 2.37                 | positive | 0.92              | 0.08              |
| 1.56                 | negative | 0.22              | 0.78              |
| 1.35                 | negative | 0.07              | 0.93              |
| 0.06                 | positive | 0.67              | 0.33              |
| −1.08                | negative | 0.09              | 0.91              |

(b) Confusion matrices for a classification threshold between 1.35 and 1.56

| unweighted |        | weighted  |           |
| TP = 2     | FP = 1 | TP = 2.04 | FP = 0.96 |
| FN = 1     | TN = 2 | FN = 0.83 | TN = 2.17 |
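The weighted confusion matrix of this example can be reproduced with a few lines of code. The following sketch (Python; names are illustrative) accumulates foreground and background weights on either side of the threshold:

```python
data = [  # (classification score, foreground weight, background weight)
    (2.54, 0.90, 0.10),
    (2.37, 0.92, 0.08),
    (1.56, 0.22, 0.78),
    (1.35, 0.07, 0.93),
    (0.06, 0.67, 0.33),
    (-1.08, 0.09, 0.91),
]

def weighted_confusion(data, threshold):
    """Entries of the confusion matrix are accumulated weights."""
    tp = sum(w_fg for s, w_fg, w_bg in data if s > threshold)
    fp = sum(w_bg for s, w_fg, w_bg in data if s > threshold)
    fn = sum(w_fg for s, w_fg, w_bg in data if s <= threshold)
    tn = sum(w_bg for s, w_fg, w_bg in data if s <= threshold)
    return tp, fp, fn, tn

tp, fp, fn, tn = weighted_confusion(data, 1.5)
print(round(tp, 2), round(fp, 2), round(fn, 2), round(tn, 2))
# -> 2.04 0.96 0.83 2.17
```

The four entries always sum to the total weight of the test data, here 6.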

Varying the classification threshold leads to a series of confusion matrices and corresponding performance measures that can be visualized by ROC and PR curves. However, previous interpolations for computing PR curves from a number of supporting points

For unweighted data, Davis and Goadrich note that it is usually not reasonable to use linear interpolation for the PR curve if two adjacent points of the PR curve differ by more than one true positive

The interpolation introduces intermediate points, which make a piecewise linear interpolation more reasonable than a direct linear interpolation between the original supporting points
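A minimal sketch of this discrete interpolation along the true positives, in the spirit of Davis and Goadrich: between adjacent supporting points A and B, one intermediate point is inserted per additional true positive, with the false positives increased proportionally. (Python; function names and the example counts are illustrative.)

```python
def interpolate_tp(tp_a, fp_a, tp_b, fp_b, num_pos):
    """Yield (recall, precision) points from supporting point A to B inclusive."""
    d_tp = tp_b - tp_a
    fp_per_tp = (fp_b - fp_a) / d_tp   # false positives added per true positive
    for x in range(d_tp + 1):
        tp = tp_a + x
        fp = fp_a + x * fp_per_tp
        yield tp / num_pos, tp / (tp + fp)

# Example: A = (TP=2, FP=1), B = (TP=5, FP=7), 5 positives in total;
# two intermediate points are inserted between A and B.
points = list(interpolate_tp(2, 1, 5, 7, num_pos=5))
```

The AUC between A and B can then be approximated by linear interpolation between these finer-grained points.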

One alternative to a stepwise interpolation along the true positives is a stepwise interpolation along the false positives, yielding

In analogy to the method of

Here, we propose a piecewise-defined function that allows computing the AUC-PR as a sum of integrals. In general, we can compute the AUC-PR by parameterizing the PR curve by

We use a piecewise definition for computing the PR curve and the area under this curve. Specifically, we compute
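As a sketch of the idea, the contribution of one segment to the AUC-PR can be obtained by interpolating the confusion matrix linearly in a parameter s in [0, 1] and integrating precision over recall. Here the integral is evaluated by simple numerical quadrature rather than a closed-form expression; names are illustrative, and real-valued (i.e., weighted) confusion matrices work equally well:

```python
def segment_auc(tp_a, fp_a, tp_b, fp_b, num_pos, steps=100000):
    """AUC-PR contribution of the segment between supporting points A and B."""
    d_tp, d_fp = tp_b - tp_a, fp_b - fp_a
    area = 0.0
    for i in range(steps):
        s = (i + 0.5) / steps              # midpoint rule on [0, 1]
        tp = tp_a + s * d_tp               # linearly interpolated counts
        fp = fp_a + s * d_fp
        area += tp / (tp + fp) * (d_tp / num_pos) / steps  # precision * d(recall)
    return area

print(round(segment_auc(2, 1, 5, 7, num_pos=5), 4))  # -> 0.2924
```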

(Algorithm: piecewise computation of the PR curve and the area under it between adjacent supporting points A and B.)

The goal of this manuscript is the computation of PR curves and AUC-PR values for weighted data as introduced in section. In this case, the entries of the confusion matrix are accumulated weights of the corresponding data points (cf.

Discrete interpolations depend on the step size of the interpolation. In case of unweighted data, a step size of one is reasonable, because it corresponds to one data point. However, it is not obvious how to choose a reasonable step size for weighted data.

In contrast to discrete interpolations, the continuous interpolation based on
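An end-to-end computation of the AUC-PR for weighted data via a continuous interpolation might look as follows. This is an illustrative Python re-implementation, not the authors' software: supporting points accumulate weights at each distinct score, the confusion matrix is interpolated linearly within each segment, and the resulting integral of precision over recall has a closed form.

```python
import math

def auc_pr_weighted(data):
    """AUC-PR under continuous interpolation; data: (score, w_fg, w_bg) triples."""
    data = sorted(data, key=lambda t: -t[0])
    total_fg = sum(w_fg for _, w_fg, _ in data)
    # supporting points: accumulated weights at each distinct score
    points, tp, fp, i = [(0.0, 0.0)], 0.0, 0.0, 0
    while i < len(data):
        j = i
        while j < len(data) and data[j][0] == data[i][0]:  # handle score ties
            tp += data[j][1]
            fp += data[j][2]
            j += 1
        points.append((tp, fp))
        i = j
    area = 0.0
    for (tp_a, fp_a), (tp_b, fp_b) in zip(points, points[1:]):
        a, b = tp_a, tp_b - tp_a                     # TP(s) = a + b*s
        c, d = tp_a + fp_a, (tp_b + fp_b) - (tp_a + fp_a)
        if b == 0:                                   # recall does not change
            continue
        if d == 0:
            integral = (a + b / 2) / c
        elif a * d == b * c:                         # precision constant on segment
            integral = b / d
        else:                                        # closed-form integral over s
            integral = b / d + (a * d - b * c) / d ** 2 * math.log((c + d) / c)
        area += integral * b / total_fg              # d(recall) = b / total weight
    return area

# perfect separation of two unit-weight points yields an AUC-PR of 1
print(auc_pr_weighted([(2.0, 1.0, 0.0), (1.0, 0.0, 1.0)]))  # -> 1.0
```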

Three characteristics are central to each performance measure: its maximum, its average for a random classifier, and its minimum. In the

1.

2.

Based on the minimum and the maximum, normalized performance measures such as the normalized AUC-PR can be computed

First, we consider the optimal classifier, i.e., the classifier with the maximal AUC-PR. For unweighted test data, the optimal classifier always yields an AUC-PR of 1.

In

The blue and the red curves indicate estimators of the best and the worst curve, respectively. The gray curves represent 1,000 PR curves based on random score-based classifications, which are also summarized by the green boxplots. The pink dashed line indicates the level of the class ratio

Second, we consider the worst classifier, i.e., the classifier with the minimal AUC-PR. Such a classifier always decides for the class with the lower confidence. In

Finally, we consider the AUC-PR for random guessing. Since random guessing cannot be represented by a single random classifier, we investigate an ensemble of

In this section, we investigate (i) theoretical and practical differences between the discrete and continuous interpolations and (ii) whether the extension of PR and ROC curves to weighted data may possibly allow a more detailed classifier assessment.

First, we investigate in which situations the discrete and the continuous interpolations yield identical segments of the curve and, hence, identical contributions to the AUC. If a segment of the continuous interpolation is linear, it is identical to the discrete interpolations. We obtain linear segments in two situations. On the one hand, we obtain a vertical linear segment if

Second, we investigate situations that result in a large deviation between discrete and continuous interpolation. Such situations occur if the discrete interpolations span a large range with few intermediate points. In

The figure shows for each discrete interpolation (along true positives or along false positives) one example of a larger and smaller AUC between two supporting points.

In

In

In summary, we find that the continuous interpolation fits the discrete interpolation with more intermediate points in all four cases.

In this section, we compare the interpolations for complete curves. To this end, we sample classification scores at a fixed class ratio of 1 to 10 for foreground versus background. Analyzing the AUC-PR for the discrete-TP interpolation and the continuous interpolation, we vary the size of the foreground data set as well as the uniqueness of the classification scores. To achieve the latter, we sample the classification scores from different numbers of bins.

Trying to obtain almost equally distributed AUC-PRs, we sample the classification scores from a normal distribution with mean normally distributed around 1.64 for the foreground and fixed mean 0 for the background. (The value 1.64 is based on the class ratio of 1 to 10 and the quantile function,
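The sampling scheme described above might be sketched as follows; the standard deviations and the seed are illustrative assumptions not specified in the text:

```python
import random

def sample_scores(n_fg, n_bg, seed=0):
    """Foreground scores: normal with a mean itself drawn around 1.64;
    background scores: normal with fixed mean 0. Spreads are assumptions."""
    rng = random.Random(seed)
    mu_fg = rng.gauss(1.64, 1.0)            # assumed spread of the fg mean
    fg = [rng.gauss(mu_fg, 1.0) for _ in range(n_fg)]
    bg = [rng.gauss(0.0, 1.0) for _ in range(n_bg)]
    return fg, bg

fg, bg = sample_scores(10, 100)             # class ratio 1 : 10
```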

Panel (a) depicts the results for 10 bins equivalent to at most 10 different classification scores, whereas panel (b) depicts the results for 1,000 bins.

At first glance, we observe that the difference between the two interpolations can be up to 0.03. We find this difference for the smallest data sets comprising 10 and 100 test data points for foreground and background, respectively, and for the least number of unique classification scores with at most ten different values. However,

In addition, when increasing the size of the data sets or the number of bins, we find that the difference between both interpolations drops to almost 0. Both findings can be explained by a greater number of supporting and intermediate points and, hence, a more fine-grained coverage along the recall axis, leading to converging AUC-PR values for the continuous and discrete-TP interpolation. Hence, the discrete-TP and the continuous interpolation are similarly good approximations. This means that conclusions drawn from the discrete-TP interpolation for unweighted data are usually also valid for the proposed continuous interpolation and vice versa. However, the continuous interpolation additionally allows directly computing the PR curve and the AUC-PR for weighted data.

In this section, we illustrate the main benefit of the proposed continuous interpolation, which is its applicability to weighted data. More specifically, we show in a simulation study that classifiers that yield an indistinguishable performance using unweighted test data may indeed achieve a considerably different performance using weighted test data. Inspecting the relationship of the classification scores of these classifiers to the given weights, we show that the ranking of classifiers using ROC and PR curves for weighted test data is reasonable.

To this end, we generate simulated data as follows: We first sample weights (i.e.,

Panel (a) shows a histogram of foreground weights (

We further sample 10,000 classification scores, one for each data point assigned to the positive class and one for each data point assigned to the negative class according to the unweighted case as shown in

We generate three different hypothetical classifiers by different assignments of classification scores to the data points within each class as visualized in

We create a

We create a

We further create a

Due to this setup, the distribution of classification scores within each class according to the unweighted case is identical and, hence, all three classifiers obtain identical ROC and PR curves for unweighted data depicted in

However, if we consider the relationship of classification scores and weights as shown in

A similar picture emerges for the areas under the ROC and PR curves as listed in

Hence, we may conclude that the applicability of PR curves to weighted test data, which has been achieved by the interpolation of

This example also illustrates that the transition from unweighted to weighted data for computing ROC and PR curves changes the objective measured by these curves. While traditional ROC and PR curves using unweighted test data only consider the distribution of classification scores within the two classes, ROC and PR curves using weighted test data additionally take into account the confidence of the labeling. Hence, these curves measure the ability of a classifier to reconstruct the ordering of data points according to the weighted

In this section, we evaluate the efficacy of AUC-PR for weighted data in practical applications. To this end, we compare the rankings of classifiers based on their AUC-PR for weighted and unweighted data in two real-world examples.

In

The AUC-PR for unweighted test data is depicted in black, whereas the AUC-PR for weighted test data is depicted in red.

In a second case study, we perform a reassessment of classifiers from bioinformatics

In the original publication

In

The team name and the ranking are depicted on the abscissa, while the mean result for AUC-ROC and AUC-PR is depicted on the ordinate. Teams are displayed in the order of the original ranking of Weirauch

In panel (a), we plot the predicted log-intensity values of classifiers A, D, and E against the measured log-intensity values. Panel (b) visualizes the class border in the unweighted case (red line) and the weights of the foreground class (

We find that the rankings for both mean AUC-ROC and mean AUC-PR change considerably going from unweighted to weighted test data. Focusing on the mean AUC-PR, we find that the ranking obtained by AUC-PR using weighted test data is in better accordance with the original ranking of Weirauch

Three classifiers with exceptionally different rankings are A, D, and E, which obtain ranks 1, 3, and 7 considering unweighted test data, and ranks 9, 1, and 2 using weighted test data, respectively. Hence, we further investigate AUC-PR and PR-curves for classifiers A, D, and E for one exemplary data set (data set 11) in

In

The PR curves of the three classifiers using weighted and unweighted test data are shown in

Thresholds or other rules for labeling data points are often chosen arbitrarily, and mildly different choices could often be justified just as well as the selected one. For this reason, we consider the stability of performance measures under mild changes of the class border an important property.

Hence, we finally investigate the stability of AUC-PR for unweighted and weighted test data using different thresholds for the labeling. To this end, we compute the mean AUC-PR using a threshold of mean intensity plus one standard deviation, and compare the results with those for the above-mentioned threshold of mean intensity plus four times standard deviation for all 11 classifiers considered in

We measure the stability of the assessments by the Pearson correlation of the AUC-PR values obtained for each of the 11 teams and each of the 66 data sets using either of the two thresholds. We present the results of this analysis in
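The stability measure is a plain Pearson correlation over paired AUC-PR values. A minimal sketch with hypothetical numbers (the listed AUC-PR values are invented for illustration, not results from the paper):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

auc_at_1sd = [0.42, 0.55, 0.61, 0.38, 0.70]   # hypothetical AUC-PR values
auc_at_4sd = [0.45, 0.50, 0.66, 0.35, 0.74]
print(pearson(auc_at_1sd, auc_at_4sd))        # high value: stable assessment
```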

In panel (a), we consider unweighted test data and plot the AUC-PR values for a threshold of mean intensity plus one standard deviation (ordinate) against the AUC-PR values for a threshold of mean intensity plus four times the standard deviation (abscissa). In panel (b), we consider weighted test data and plot the AUC-PR values in analogy to panel (a). We find a substantially greater Pearson correlation between the AUC-PR values for the two thresholds for weighted data than for unweighted data.

This indicates that the AUC-PR for appropriately weighted data is more stable, leading to fewer changes in the ranking than the mean AUC-PR for unweighted data. Hence, the influence of the somewhat arbitrary classification threshold is reduced.

PR curves and the areas under these curves have gained increasing importance in machine learning in recent years. Computing the area under the precision-recall curve depends on the interpolation between adjacent points for given confusion matrices.

Here, we introduced a continuous interpolation for computing the area under the precision-recall curve. We compared discrete and continuous interpolations theoretically and practically and showed that the interpolations are in agreement for unweighted data.

The continuous interpolation can also be used for weighted data sets. The optimal AUC-PR is not necessarily equal to 1 in this case.

Based on artificial and real-world data sets, we found that the ranking of classifiers based on their AUC-PR may differ severely between unweighted and weighted test data sets. We also found that AUC-PR using weighted test data is less sensitive to small changes of the class border than AUC-PR using hard labels.

We implemented


We are grateful to Jaana Kekäläinen and Matthew T. Weirauch for providing classification results and additional information on