Large-scale Multi-label Learning

Primary researcher(s): Hsiang-Fu Yu

Project Summary

Multi-label classification problems abound in practice, and many methods have recently been proposed for them. However, two key challenges have not been adequately addressed: (a) the number of labels can be very large, for example in the millions, and (b) the training data can be riddled with missing labels. In this project, we study a generic framework for multi-label classification that directly addresses both challenges. In particular, we pose the problem as one of empirical risk minimization (ERM), where the prediction function is parameterized by a low-rank matrix. A key facet of our approach is that we handle missing labels in the training set by applying techniques from the domain of matrix completion. To obtain a scalable algorithm that can handle a large number of labels, we use alternating minimization to find the low-rank parameter matrix. Finally, we present empirical results on a variety of benchmark datasets and show that our method performs significantly better than existing label-compression-based methods.

Project Description

Motivated by several real-life applications, recent research on multi-label classification has largely shifted its focus to the other end of the spectrum, where the number of labels is extremely large and the key challenge is to design scalable algorithms that offer real-time predictions and have a small memory footprint. In such settings, simple methods such as one-vs-all or Binary Relevance (BR), which treat each label as a separate binary classification problem, fail miserably. For a problem with, say, ten thousand labels and one million features, which is common in several applications, these methods require on the order of 100 gigabytes of memory and are slow at prediction time.
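To make the footprint concrete (a back-of-the-envelope calculation; the dense double-precision storage assumption is ours): one-vs-all stores one weight vector per label, so the parameter matrix alone needs

```latex
\underbrace{10^{4}}_{\text{labels}} \times \underbrace{10^{6}}_{\text{features}} \times \underbrace{8~\text{bytes}}_{\text{double precision}}
  = 8 \times 10^{10}~\text{bytes} \approx 80~\text{GB},
```

before any per-label overhead, which is consistent with the roughly 100 GB figure above.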

A common technique used to handle label proliferation in several recent works is “label space reduction”. The key idea is to reduce the dimensionality of the label space using either random projections or projections based on canonical correlation analysis (CCA). These methods then perform prediction in the lower-dimensional label space and recover the original labels by projecting back onto the high-dimensional label space.
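As an illustration, here is a minimal sketch of that compress-train-decode pipeline using a Gaussian random projection (a toy example with synthetic data and illustrative dimensions; it is not the code used in this project):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d, L, k = 1000, 500, 2000, 50                 # samples, features, labels, reduced dim

X = rng.standard_normal((n, d))                  # feature matrix
Y = (rng.random((n, L)) < 0.01).astype(float)    # sparse binary label matrix

# Compress: project the L-dimensional labels down to k dimensions.
P = rng.standard_normal((L, k)) / np.sqrt(k)     # random projection matrix
Z = Y @ P                                        # compressed labels (n x k)

# Train: one regressor per compressed label dimension.
reg = Ridge(alpha=1.0).fit(X, Z)

# Decode: predict in the compressed space, then map back to the
# original label space and rank labels per instance.
Y_scores = reg.predict(X) @ P.T                  # n x L relevance scores
top5 = np.argsort(-Y_scores, axis=1)[:, :5]      # top-5 predicted labels
```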

  • LEML: Low-rank ERM for Multi-Label Learning

In this project we take a more direct approach, formulating the problem as that of learning a low-rank linear model Z. We cast this learning problem in the standard empirical risk minimization (ERM) framework, which allows us to use a variety of loss functions and regularizers for Z. This framework unifies several existing dimensionality reduction approaches.
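Written out (in our notation, with x_i the feature vector of instance i, Y the n-by-L label matrix, and Ω the set of observed label entries), the training problem is a rank-constrained ERM:

```latex
\min_{Z \in \mathbb{R}^{d \times L}} \;
  \sum_{(i,j) \in \Omega} \ell\!\left(Y_{ij},\, \mathbf{x}_i^{\top} Z \mathbf{e}_j\right)
  \;+\; \lambda \, r(Z)
\qquad \text{s.t.} \quad \operatorname{rank}(Z) \le k,
```

where ℓ is a decomposable loss (e.g., squared L2 or logistic), r(Z) is a regularizer, and k is the target rank; restricting the sum to Ω is what accommodates missing labels.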

  • LEML: Matrix Factorization with Side Features

To learn the LEML model efficiently, we consider the decomposed form Z = WH^T and apply alternating minimization to obtain W and H. Unlike in traditional matrix factorization, updating W is more challenging. Based on the structure of the ERM problem, we develop efficient Hessian-vector multiplication routines so that iterative solvers such as conjugate gradient (CG) can handle the subproblem efficiently. From this point of view, LEML can be regarded as matrix factorization with side features on the rows.
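The following is a minimal sketch of this alternating scheme for the fully observed, squared-L2-loss case (a simplification written for illustration; the project's solver also handles missing entries and other losses). The point is that the W-subproblem is solved by CG using only Hessian-vector products, so the (dk x dk) Hessian is never formed:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def leml_squared_loss(X, Y, k, lam=1.0, outer_iters=10, seed=0):
    """Alternating minimization for  min ||X W H^T - Y||_F^2
    + lam (||W||_F^2 + ||H||_F^2),  with W (d x k), H (L x k)."""
    n, d = X.shape
    L = Y.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, k)) * 0.01
    H = rng.standard_normal((L, k)) * 0.01

    for _ in range(outer_iters):
        # --- W-step: the normal equations are
        #     X^T X W (H^T H) + lam W = X^T Y H.
        # CG only needs the Hessian-vector product below.
        G = H.T @ H                               # k x k
        rhs = (X.T @ (Y @ H)).ravel()             # vec(X^T Y H)

        def hess_vec(s):
            S = s.reshape(d, k)
            return (X.T @ (X @ S) @ G + lam * S).ravel()

        A = LinearOperator((d * k, d * k), matvec=hess_vec)
        w, _ = cg(A, rhs, x0=W.ravel(), maxiter=50)
        W = w.reshape(d, k)

        # --- H-step: with W fixed, each row of H is a ridge
        # regression against M = X W, so a k x k solve suffices.
        M = X @ W                                 # n x k
        H = np.linalg.solve(M.T @ M + lam * np.eye(k), M.T @ Y).T

    return W, H
```

Each Hessian-vector product costs two multiplications with X plus a small k x k product, so with sparse X the per-iteration cost scales with nnz(X) times k rather than with d times L; this is what keeps the approach practical when both the feature and label dimensions are large.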

  • Experimental Results: We perform extensive experiments on a variety of benchmark multi-label datasets.

    Our empirical results show that LEML outperforms other state-of-the-art methods in the following settings:

  1. Full observation + squared L2 loss
  2. Missing labels + various losses

Main Publications