Large-Scale Recommender Systems

Primary researcher(s): Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, Inderjit Dhillon

Project Summary

Low-rank matrix factorization in the presence of missing values has become one of the most popular techniques for estimating dyadic interactions between entities in many applications, such as friendship prediction in social networks (e.g., Facebook) and preference estimation in recommender systems (e.g., Netflix). Although existing methods such as alternating least squares (ALS) and stochastic gradient descent (SGD) are available, scalable computation remains the main challenge when the matrix contains millions of rows/columns and billions of observed entries. In this project, we are developing various approaches for different variants of the low-rank matrix factorization problem.
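
Concretely, the factorization is obtained by minimizing the regularized squared error over the observed entries only; unobserved entries do not contribute to the loss. The snippet below is a minimal sketch of this objective for a sparse ratings matrix (the function name and arguments are illustrative, not code from the project):

```python
import numpy as np
from scipy.sparse import coo_matrix

def mf_objective(A, W, H, lam):
    """Regularized squared error over the observed entries of A.

    A   : sparse (m x n) matrix holding the observed ratings
    W   : (m x k) factor matrix, one latent vector per row entity
    H   : (n x k) factor matrix, one latent vector per column entity
    lam : L2 regularization weight
    """
    A = coo_matrix(A)
    # Predicted values only at the observed (row, col) positions.
    pred = np.sum(W[A.row] * H[A.col], axis=1)
    loss = np.sum((A.data - pred) ** 2)
    reg = lam * (np.linalg.norm(W) ** 2 + np.linalg.norm(H) ** 2)
    return loss + reg
```

The approaches described below optimize this same objective; they differ in how the updates are organized and parallelized.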

Project Description

The aim of this project is to develop efficient parallel and distributed algorithms for matrix completion. We are specifically interested in solving industrial-scale matrix factorization problems on commodity hardware with limited computing power, memory, and interconnect speed, such as the machines found in data centers. The widespread availability of cloud computing platforms such as Amazon Web Services (AWS) makes the deployment of such systems feasible. We have designed the following approaches for large-scale matrix factorization, which are shown to be faster and more scalable than other state-of-the-art methods.

  1. Parallel coordinate descent solver (CCD++): We studied coordinate descent methods for matrix factorization and proposed a new algorithm called CCD++. This algorithm not only has a more efficient update rule than ALS but also enjoys faster and more stable convergence than SGD. In addition, CCD++ is easy to parallelize on both multi-core and distributed systems. On a distributed HPC cluster with 256 computing nodes, CCD++ takes only 16 seconds to factorize a 20M × 10M matrix with 14.6 billion observed entries, which is 40 times faster than a parallel SGD implementation. A simplified sketch of the CCD++ rank-one update appears after this list.

    Prediction Root Mean Square Error (RMSE) versus computation time on an 8-core system for different methods. The results show that our proposed algorithm, CCD++, is faster than other approaches.

  2. Nonblocking and asynchronous stochastic gradient solver (NOMAD): To reduce unnecessary waiting caused by uneven load balancing in traditional approaches, we also studied asynchronous approaches for matrix factorization and developed an efficient parallel distributed algorithm named NOMAD. The algorithm relies on the "owner computes" principle to avoid conflicts without explicit locking. To the best of our knowledge, it is the first parallel stochastic gradient solver for matrix factorization with 1) a lock-free implementation, 2) non-blocking communication, 3) fully asynchronous computation, and 4) a serializable update sequence. Our empirical evaluation demonstrated that NOMAD not only performs well on Amazon Web Services (AWS) but also outperforms state-of-the-art algorithms on an HPC cluster (Stampede). A simplified sketch of the NOMAD update loop appears after this list.
    Illustration of the NOMAD algorithm.

    Comparison of NOMAD, DSGD, and CCD++ on an HPC cluster.
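
As referenced in item 1, here is a simplified, single-threaded sketch of the CCD++ rank-one refinement: the t-th columns of W and H are re-fit against the residual that excludes factor t, using closed-form single-variable updates. Variable names and the fixed number of inner iterations are illustrative; the actual solver maintains the sparse residual incrementally and runs these sweeps in parallel.

```python
import numpy as np
from scipy.sparse import coo_matrix

def ccdpp_rank_one(A, W, H, t, lam, inner_iters=3):
    """Re-fit the t-th columns of W and H (one CCD++ rank-one refinement).

    The residual r excludes factor t, and (u, v) are fit to it with
    closed-form single-variable coordinate updates.
    """
    A = coo_matrix(A)
    u, v = W[:, t].copy(), H[:, t].copy()
    # Residual on the observed entries, with the t-th factor added back.
    r = A.data - np.sum(W[A.row] * H[A.col], axis=1) + u[A.row] * v[A.col]

    for _ in range(inner_iters):
        # u_i <- (sum_{j in Omega_i} r_ij v_j) / (lam + sum_{j in Omega_i} v_j^2)
        num = np.bincount(A.row, weights=r * v[A.col], minlength=A.shape[0])
        den = np.bincount(A.row, weights=v[A.col] ** 2, minlength=A.shape[0]) + lam
        u = num / den
        # v_j <- (sum_{i in Omega_j} r_ij u_i) / (lam + sum_{i in Omega_j} u_i^2)
        num = np.bincount(A.col, weights=r * u[A.row], minlength=A.shape[1])
        den = np.bincount(A.col, weights=u[A.row] ** 2, minlength=A.shape[1]) + lam
        v = num / den

    W[:, t], H[:, t] = u, v
    return W, H
```

Cycling this refinement over the k rank-one factors gives one outer iteration of the method.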
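
For item 2, the following is a single-machine sketch of the NOMAD-style update loop, with worker threads standing in for machines and queues standing in for network messages. Each worker owns a block of rows of W; column factors circulate as tokens, and whichever worker currently holds a token updates it together with its own rows. All names and parameters here are illustrative assumptions, not the project's implementation, which is distributed and uses non-blocking communication.

```python
import queue
import threading
import time
import numpy as np
from scipy.sparse import csc_matrix

def nomad_sgd(A, k=10, lam=0.05, step=0.005, passes=5, workers=4, seed=0):
    """Single-machine sketch of a NOMAD-style asynchronous SGD solver."""
    A = csc_matrix(A)
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, k))
    H = 0.1 * rng.standard_normal((n, k))
    owner = np.arange(m) % workers                 # static row ownership
    queues = [queue.Queue() for _ in range(workers)]
    for j in range(n):                             # seed the column tokens
        queues[j % workers].put(j)

    stop = threading.Event()
    visits = [0] * workers                         # column visits per worker

    def run(w_id):
        while not stop.is_set():
            try:
                j = queues[w_id].get(timeout=0.1)
            except queue.Empty:
                continue
            h = H[j]                               # this worker now owns h_j
            lo, hi = A.indptr[j], A.indptr[j + 1]
            for i, a_ij in zip(A.indices[lo:hi], A.data[lo:hi]):
                if owner[i] != w_id:               # update only rows we own
                    continue
                err = a_ij - W[i] @ h
                w_old = W[i].copy()
                W[i] += step * (err * h - lam * w_old)
                h += step * (err * w_old - lam * h)  # updates H[j] in place
            queues[(w_id + 1) % workers].put(j)    # pass the token onward
            visits[w_id] += 1

    threads = [threading.Thread(target=run, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    # One sweep over the data = every column token visiting every worker once.
    while sum(visits) < passes * n * workers:
        time.sleep(0.05)
    stop.set()
    for t in threads:
        t.join()
    return W, H
```

Because each h_j is held by exactly one worker at a time and every row of W has a single owner, concurrent updates never touch the same variables, which is what makes the lock-free, serializable update sequence possible.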

Main Publications