Learning with General Performance Metrics

Project Summary

Traditional machine learning methods such as SVM and logistic regression have become the de facto choices for practitioners, thanks to the availability of scalable, fast, and, importantly, easy-to-use software implementations such as LIBLINEAR. On the theoretical side, advances in learning theory have led to a great deal of understanding of the statistical properties and performance guarantees of these algorithms. However, practitioners are rarely interested in the type of performance guarantees the learning theory community has developed; in fact, the traditional methods are often not designed for the performance metrics that the application area demands. Different application areas have independently adopted a variety of evaluation metrics, including Precision at top-k, Recall at top-k, and compound metrics such as the F-measure. This motivates our work: to bridge the gap between statistical learning theory and the evaluation metrics of interest.

Project Description

Traditional machine learning methods such as SVM and logistic regression have become the de facto choices for practitioners, thanks to the availability of scalable, fast, and, importantly, easy-to-use software implementations such as LIBLINEAR. On the theoretical side, advances in learning theory have led to a great deal of understanding of the statistical properties and performance guarantees of these algorithms. However, practitioners are rarely interested in the type of performance guarantees the learning theory community has developed; in fact, the traditional methods are often not designed for the performance metrics that the application area demands.

For example, consider the problem of ranking documents relevant to a search query, which arises frequently in Information Retrieval and Text Mining. Here, the so-called “positive examples” (the relevant documents) are often very few, while the “negative examples” (essentially the rest of the document collection) are vastly more numerous. In such cases, optimizing for classification error (the number of mistakes made by a classifier) is highly misleading: classifying all documents as “negative” (i.e. “irrelevant”) already achieves a very low error. Clearly, optimizing the error is not useful here, yet that is precisely what SVM is guaranteed to do. Different application areas have independently adopted a variety of evaluation metrics, including Precision at top-k, Recall at top-k, and compound metrics such as the F-measure. This motivates our work: to bridge the gap between statistical learning theory and the evaluation metrics of interest.

Goals

Let P denote a joint probability distribution over instances X and (binary) labels Y. Broadly, we are interested in the following two questions:

  1. Can we characterize the “optimal classifier” for general performance metrics such as the F-measure? (For comparison, we know that the optimal classifier that minimizes the generalization error simply thresholds P(Y=1|x) at 1/2; a short derivation is sketched after this list.)
  2. Can we develop learning algorithms that are efficient and come with statistical guarantees for the performance metrics of interest? (For comparison, we know that SVM “eventually” minimizes the probability of making a mistake on a random example drawn from P.)
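
To recall the classical fact cited in question 1: under the 0/1 error, predicting label 1 at an instance x incurs conditional risk 1 - P(Y=1|x), while predicting 0 incurs risk P(Y=1|x), so the Bayes optimal classifier is

\[
\theta^*(x) = \mathbf{1}\left[ P(Y=1 \mid x) \geq \tfrac{1}{2} \right].
\]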

A Family of General Performance Metrics

We will now focus on the above two questions for a family of performance metrics that are functions of the entries of the so-called confusion matrix, namely True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN):

Confusion matrix (“Contingency table”) for evaluating binary classifiers
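
For concreteness, the four counts can be computed from a vector of true labels and a vector of predicted labels as in the following MATLAB sketch (y and ypred are illustrative variable names, not taken from the released code):

    % y:     n x 1 vector of true binary labels (0/1)
    % ypred: n x 1 vector of predicted labels (0/1)
    TP = sum(ypred == 1 & y == 1);   % true positives
    FP = sum(ypred == 1 & y == 0);   % false positives
    FN = sum(ypred == 0 & y == 1);   % false negatives
    TN = sum(ypred == 0 & y == 0);   % true negatives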

The linear-fractional family of (binary) metrics we consider is characterized by:

\[
\mathcal{L}(\theta) = \frac{a_0 + a_1\,\mathrm{TP}(\theta) + a_2\,\mathrm{FP}(\theta) + a_3\,\mathrm{FN}(\theta) + a_4\,\mathrm{TN}(\theta)}{b_0 + b_1\,\mathrm{TP}(\theta) + b_2\,\mathrm{FP}(\theta) + b_3\,\mathrm{FN}(\theta) + b_4\,\mathrm{TN}(\theta)}
\]

where the a’s and b’s are non-negative constants. We can show that the optimal classifier θ* with respect to such a metric is simply characterized by thresholding P(Y=1|x) at a metric-dependent value δ*. Implications of this main result for some popular performance metrics of interest are summarized below:

Optimal thresholds (corresponding to the Bayes optimal classifiers) for popular performance metrics used in evaluating binary classifiers
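
As one concrete instance of the linear-fractional family (in the notation of the formula above, with a_1 = 2, b_1 = 2, b_2 = b_3 = 1, and all remaining constants zero), the F1 measure is

\[
F_1(\theta) = \frac{2\,\mathrm{TP}(\theta)}{2\,\mathrm{TP}(\theta) + \mathrm{FP}(\theta) + \mathrm{FN}(\theta)}.
\]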

In practice, how can we learn a classifier that optimizes a general performance metric using training data? The following algorithm provably optimizes any given performance metric (from the family above).

[Algorithm figure: the proposed procedure for optimizing a given performance metric; see the publications below for the full pseudocode.]

In Step 3, L denotes the value of the metric computed on the training data sample (the argmax can be computed efficiently).
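
To make the recipe concrete, here is a minimal MATLAB sketch of a two-step plug-in procedure of this flavor: estimate P(Y=1|x) on one split of the training data with any probabilistic classifier (e.g. logistic regression), then choose the threshold that maximizes the empirical metric (F1 here) on the other split. The function name and variables are our illustrative choices and not part of the released implementation.

    % phat: n x 1 estimated probabilities P(Y=1|x) on a held-out split,
    %       e.g. from logistic regression fit on the other split
    % y:    n x 1 true labels (0/1) on the same held-out split
    function [delta_best, f1_best] = tune_threshold_f1(phat, y)
        candidates = unique(phat);               % candidate thresholds
        delta_best = 0.5; f1_best = -Inf;
        for i = 1:numel(candidates)
            delta = candidates(i);
            ypred = double(phat >= delta);       % threshold the estimated probabilities
            TP = sum(ypred == 1 & y == 1);
            FP = sum(ypred == 1 & y == 0);
            FN = sum(ypred == 0 & y == 1);
            f1 = 2*TP / max(2*TP + FP + FN, 1);  % empirical F1 (0/0 guarded)
            if f1 > f1_best
                f1_best = f1; delta_best = delta;
            end
        end
    end

Any metric from the linear-fractional family above can be substituted for the empirical F1 inside the loop.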

Results on Real Data Sets

Performance of the proposed algorithms “Two-Step EUM” and “Weighted ERM” for optimizing the F-measure. ERM (standard logistic regression) serves as the baseline.

Extension to Multi-label Learning

In many applications, instances may be assigned to several labels/classes simultaneously (e.g. document categorization, image tagging). As in the binary case, traditional multi-label learning methods do not necessarily optimize the performance metrics of interest (such as multi-label extensions of the binary metrics discussed above). The natural question, then, is whether the optimality characterization and the learning algorithms can be extended to multi-label metrics as well. The answer is indeed in the affirmative. An illustrative example of the optimality characterization is given below. The plots correspond to the Bayes optimal classifier for the multi-label F1 measure on synthetic data with 4 labels and a distribution P supported on 6 instances. The plots, from left to right, show the Bayes optimal predictions on the 6 instances for labels 1 through 4. Note that the threshold δ* at which each label-wise marginal P(Y_j = 1|x) is thresholded is shared across labels, which is precisely our key result.

Optimal classifier thresholds for the multi-label F-measure, on a synthetic dataset with 6 instances and 4 labels. Each plot gives the optimal assignment of a label to each of the 6 instances. We observe that the optimal threshold is shared by all labels.
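
The same plug-in idea extends naturally to this setting: estimate the label-wise marginals and tune a single threshold shared by all labels. Below is a minimal MATLAB sketch under our own illustrative choices (Phat, Y, and delta are hypothetical variable names, and micro-averaged F1 over all instance-label pairs is used as one concrete multi-label metric; this is not the released code):

    % Phat:  n x q matrix of estimated marginals P(Y_j = 1 | x_i)
    % Y:     n x q matrix of true labels (0/1)
    % delta: a single threshold shared by all q labels, tuned as in the
    %        binary sketch above but on the multi-label metric
    Ypred = double(Phat >= delta);              % one threshold for every label
    TP = sum(Ypred(:) == 1 & Y(:) == 1);
    FP = sum(Ypred(:) == 1 & Y(:) == 0);
    FN = sum(Ypred(:) == 0 & Y(:) == 1);
    f1_micro = 2*TP / max(2*TP + FP + FN, 1);   % micro-averaged F1 (0/0 guarded)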

Code and datasets

The MATLAB code for the proposed algorithms and the datasets used in the related papers are available here.

Main Publications