A Dirty Model for Multi-task Learning

Abstract: We consider the multiple linear regression problem, in a setting where some of the relevant features could be shared across the tasks. Much recent research has studied the use of ℓ1/ℓq norm block-regularization with q > 1 for such (possibly) block-structured problems, establishing strong guarantees on recovery even under high-dimensional scaling, where the number of features scales with the number of observations. However, these papers also caution that the performance of such block-regularized methods depends heavily on the extent to which the features are shared across tasks. Indeed, they show [NWJoint] that if the extent of overlap is less than a threshold, or even if the parameter values in the shared features are highly uneven, then block ℓ1/ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. This leaves us far from a realistic multi-task setting: not only does the set of relevant features have to be exactly the same across tasks, but their values have to be as well. Here, we ask the question: can we leverage support and parameter overlap when it exists, but not pay a penalty when it does not? This falls under a more general question of whether we can model such dirty data, which may not fall into a single neat structural bracket (all block-sparse, or all low-rank, and so on). We take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we decompose the parameters into two components and regularize these differently. We show, both theoretically and empirically, that our method strictly and noticeably outperforms both ℓ1 and ℓ1/ℓq methods over the entire range of possible overlaps. We also provide theoretical guarantees that the method performs well under high-dimensional scaling.
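
The decomposition idea in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes a single design matrix X shared by all tasks, puts an elementwise ℓ1 penalty on one component S and a row-wise ℓ1/ℓ2 penalty on the other component B (q = 2 is chosen here only because its proximal operator has a closed form), and solves the resulting problem by plain proximal gradient descent. The function names, the choice of q, the solver, and the parameters lam_s, lam_b, and step are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def soft_threshold(M, t):
    """Prox of t * (elementwise l1 norm): shrink each entry toward zero by t."""
    return [[math.copysign(max(abs(v) - t, 0.0), v) for v in row] for row in M]


def group_soft_threshold(M, t):
    """Prox of t * (sum of row l2 norms): shrink each feature's row as a block."""
    out = []
    for row in M:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


def dirty_fit(X, Y, lam_s, lam_b, step, iters=500):
    """Proximal gradient sketch for a two-component ("dirty") decomposition.

    Minimizes, over p x r matrices S and B (rows = features, columns = tasks):
        1/(2n) * ||Y - X (S + B)||_F^2 + lam_s * ||S||_1 + lam_b * sum_j ||B[j]||_2
    X is an n x p list of lists shared by all tasks (an assumption of this
    sketch); Y is n x r. Returns (S, B); the fitted parameters are S + B.
    """
    n, p, r = len(X), len(X[0]), len(Y[0])
    S = [[0.0] * r for _ in range(p)]
    B = [[0.0] * r for _ in range(p)]
    for _ in range(iters):
        # Residual of the combined parameters S + B on every task.
        R = [[sum(X[i][j] * (S[j][k] + B[j][k]) for j in range(p)) - Y[i][k]
              for k in range(r)] for i in range(n)]
        # Gradient of the squared loss; it is the same for S and for B.
        G = [[sum(X[i][j] * R[i][k] for i in range(n)) / n
              for k in range(r)] for j in range(p)]
        # Same gradient step, different regularizers (different prox maps).
        S = soft_threshold(
            [[S[j][k] - step * G[j][k] for k in range(r)] for j in range(p)],
            step * lam_s)
        B = group_soft_threshold(
            [[B[j][k] - step * G[j][k] for k in range(r)] for j in range(p)],
            step * lam_b)
    return S, B
```

In this sketch the step size should be at most 1/(2L), where L is the largest eigenvalue of XᵀX/n, because the same gradient drives both components. Rows that survive in B correspond to features treated as shared across tasks, while nonzero entries of S capture task-specific parameters.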

Citation

A Dirty Model for Multi-task Learning
A. Jalali, P. Ravikumar, S. Sanghavi, C. Ruan.
In Neural Information Processing Systems (NIPS), December 2010.

Bibtex:
@inproceedings{jalali2010adirtymo,
  author = "Ali Jalali AND Pradeep Ravikumar AND Sujay Sanghavi AND Chao Ruan",
  title = "A Dirty Model for Multi-task Learning",
  booktitle = "Neural Information Processing Systems (NIPS)",
  year = "2010",
  month = "dec"
}