Gene-Disease Prediction: A Link Prediction Approach

Project Summary

We develop methods for predicting gene-disease associations, an important problem in computational biology. The prediction problem can be posed as link prediction in a heterogeneous network consisting of a bipartite gene-disease network, a gene-interactions network, and a disease similarity network. In the partially supervised formulation (called positive-unlabeled learning), the goal is to distinguish known positive associations from “negative” associations. We also formulate the prediction problem as inductive matrix completion on the gene-disease associations matrix, which incorporates “row” (gene) and “column” (disease) features obtained from multiple biological sources.

Project Description

Identifying causal disease genes is a fundamental problem in biology. The associated machine learning problem of predicting potential gene-disease associations is challenging because of the extreme sparsity of known associations and the lack of “negative” associations. For the prediction task, we exploit heterogeneous sources of information such as the gene-interactions network, disease similarities, and studies in non-human species. We have developed network-based methods inspired by social network analysis, positive-unlabeled learning methods (which use only partial supervision during training), and “inductive” matrix completion methods that incorporate gene and disease features.

I. Link Prediction Formulation


Heterogeneous network of genes, diseases, and phenotypes from multiple other species. Edges indicate causal relationships or functional associations. On the right are the corresponding adjacency matrix and the computation of Katz, a path-based similarity measure between nodes.

We can view the problem of predicting associations between genes and diseases as a link prediction problem in a biological network composed of gene and disease nodes. Furthermore, we can incorporate studies on diseases and traits of other species into the network. This yields a heterogeneous biological network with multiple types of nodes and multiple types of edges (capturing relationships between gene nodes, causal relationships between gene and disease nodes, etc.). We want to compute a path-based similarity measure between gene and disease nodes in the combined network. One popular similarity measure used for link prediction in social networks is Katz, which aggregates similarities based on the number of paths of different lengths between a pair of nodes in the network.
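To make the Katz measure concrete, here is a minimal sketch in Python with NumPy. It assumes the standard definition of Katz as a damped sum over walk counts, sum over l of beta^l A^l, on a combined adjacency matrix A. The toy matrices `G`, `P`, `D`, the damping value `beta`, and the function names are illustrative assumptions, not values or code from this project.

```python
import numpy as np

def katz_truncated(A, beta, k):
    """Truncated Katz: sum of beta^l * A^l for l = 1..k (walks up to length k)."""
    S = np.zeros(A.shape, dtype=float)
    power = np.eye(A.shape[0])
    for l in range(1, k + 1):
        power = power @ A              # A^l counts walks of length l
        S += (beta ** l) * power
    return S

def katz_closed_form(A, beta):
    """Infinite series (I - beta*A)^{-1} - I.

    Only valid when beta < 1 / (spectral radius of A), so the series converges.
    """
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Hypothetical toy network: nodes 0-1 are genes, nodes 2-3 are diseases.
G = np.array([[0, 1], [1, 0]], dtype=float)   # gene-gene interactions
P = np.array([[1, 0], [0, 0]], dtype=float)   # known gene-disease associations
D = np.array([[0, 1], [1, 0]], dtype=float)   # disease-disease similarity

# Combined adjacency matrix of the heterogeneous network.
A = np.block([[G, P], [P.T, D]])

katz = katz_truncated(A, beta=0.1, k=3)
gene_disease_scores = katz[:2, 2:]            # rows: genes, columns: diseases
print(gene_disease_scores)
```

In this toy network, gene 1 has no known association with disease 0 but still receives a nonzero score through the length-2 walk gene 1, gene 0, disease 0. Small `beta` values emphasize short walks, and a short truncation such as `k=3` avoids inverting a large matrix.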

II. CATAPULT: Positive-Unlabeled Learning Formulation


Obtaining path-based features for classifying gene-disease pairs

We pose the prediction problem as distinguishing “positive” gene-disease associations from “negative” associations. The learning setting differs from the traditional supervised approach, where both positive and negative examples are available. Here, we have only positive examples (corresponding to known associations); negative examples are infeasible to obtain biologically (say, by wet lab experiments). The machine learning formulation that we employ here is Positive-Unlabeled (PU) learning. Our method, CATAPULT (Combining dATA from multiple species using Positive-Unlabeled Learning Technique), uses a biased support vector machine to penalize false positives and false negatives differently. The terms in the expansion of the Katz measure (discussed above) correspond to walks along the different biological networks that make up the heterogeneous network. CATAPULT uses features derived from these walks to represent gene-disease pairs, and learns a supervised classifier in this feature space. The feature map for gene i and disease j is obtained as follows:
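As a rough illustration of the idea only (this is not the exact CATAPULT feature map or training procedure), the sketch below builds walk-count features for every gene-disease pair and trains a linear classifier with a class-weighted hinge loss, so that errors on known positives cost more than errors on unlabeled pairs. The particular walk types, the cost values `c_pos` and `c_unl`, the plain subgradient solver, and all names are assumptions made for this example.

```python
import numpy as np

def walk_features(G, P, D):
    """Stack walk-count matrices (genes x diseases) into per-pair feature vectors.

    Each matrix counts one hypothetical type of walk of length 2 or 3
    that starts at a gene and ends at a disease.
    """
    walks = [G @ P, P @ D, G @ G @ P, G @ P @ D, P @ D @ D, P @ P.T @ P]
    return np.stack(walks, axis=-1)   # shape: (n_genes, n_diseases, n_features)

def train_biased_linear_svm(X, y, c_pos=1.0, c_unl=0.1, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on a class-weighted hinge loss.

    y is +1 for known positives and -1 for unlabeled pairs, which are
    treated as weak negatives with the smaller cost c_unl.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    cost = np.where(y > 0, c_pos, c_unl)
    for _ in range(epochs):
        margin = y * (X @ w + b)
        coef = cost * (margin < 1) * y          # only margin violators contribute
        w -= lr * (lam * w - (coef @ X) / n)
        b -= lr * (-coef.sum() / n)
    return w, b

# Hypothetical toy data: 3 genes, 2 diseases.
G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
P = np.array([[1, 0], [0, 0], [0, 1]], dtype=float)
D = np.array([[0, 1], [1, 0]], dtype=float)

F = walk_features(G, P, D)
X = F.reshape(-1, F.shape[-1])                  # one row per gene-disease pair
y = np.where(P.ravel() > 0, 1.0, -1.0)          # known associations are positives

w, b = train_biased_linear_svm(X, y)
scores = (X @ w + b).reshape(P.shape)           # higher score = more likely association
print(scores)
```

In a realistic setting the features for a pair would be computed with that pair's own association held out, and the costs would be tuned by validation. The toy example skips both steps for brevity.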

Try our web interface here to obtain predictions by entering a few genes already known to be linked to the disease or phenotype of interest.

III. Inductive Matrix Completion Formulation


Schematic: We construct gene and disease features using different sources and then perform IMC using row and column features. The shaded region in the P matrix corresponds to genes or diseases with at least one known association.

It is natural to model the prediction problem as matrix completion, an approach popular in recommender systems (for example, the Netflix problem); here, the matrix to be completed is the gene-disease associations matrix. A main limitation of existing methods is that they cannot make predictions for diseases (or genes) that have no known associations. Our matrix completion formulation is inductive: it incorporates features associated with rows (genes) and columns (diseases), which enables predictions for diseases or genes that were not seen during training and for which only features, but no linkage information, are available. We obtain gene and disease features from multiple sources such as microarray gene expression, functional interactions, and text mining.
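The inductive idea can be sketched as follows, assuming a common low-rank bilinear form in which the association score for gene i and disease j is x_i^T W H^T y_j, where x_i and y_j are the gene and disease feature vectors. The squared loss over all entries, the plain gradient descent solver, the hyperparameters, and the toy data are assumptions for illustration, not the project's implementation.

```python
import numpy as np

def fit_imc(P, X, Y, rank=2, lam=0.01, lr=0.05, epochs=500, seed=0):
    """Fit P ~ X W H^T Y^T by gradient descent on a regularized squared error.

    X: gene features (n_genes x d_g), Y: disease features (n_diseases x d_d).
    Unobserved entries of P are treated as zeros in this simplified sketch.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], rank))
    H = 0.1 * rng.standard_normal((Y.shape[1], rank))
    for _ in range(epochs):
        R = X @ W @ H.T @ Y.T - P              # residual, genes x diseases
        grad_W = X.T @ R @ Y @ H + lam * W
        grad_H = Y.T @ R.T @ X @ W + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H

# Hypothetical toy data: 4 genes, 2 training diseases, 2-dimensional features.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # gene features
Y_train = np.array([[1, 0], [0, 1]], dtype=float)             # disease features
P_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

W, H = fit_imc(P_train, X, Y_train)

# A new disease with no known associations: only its feature vector is available.
y_new = np.array([1.0, 0.0])
pred_new = X @ W @ H.T @ y_new     # one score per gene for the unseen disease
print(pred_new)
```

Because the learned parameters `W` and `H` act on features rather than on row and column indices, scoring a disease that never appeared in training requires only its feature vector. This is what distinguishes the inductive formulation from standard matrix completion.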


Below are some comparison results on the OMIM data consisting of 3209 diseases. The vertical axis in the plots gives the probability that a true gene association is recovered in the top-r predictions, for the values of r on the horizontal axis. Our IMC method consistently and significantly outperforms other state-of-the-art methods proposed for the problem over all values of r. In particular, IMC has close to a 25% chance of retrieving a true gene in the top-100 predictions for a disease, whereas the second-best method, CATAPULT, has only 15%. The second plot shows the comparison restricted to diseases with no known associations in the training data, where the benefit of using disease features in IMC is evident.
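The top-r metric described above can be computed along these lines. This is a minimal sketch that assumes held-out (gene, disease) pairs are ranked among all genes not already associated with that disease in the training data; the function name, the tie handling, and the toy numbers are illustrative assumptions and not results from this project.

```python
import numpy as np

def recall_at_r(scores, held_out, known, r):
    """Fraction of held-out (gene, disease) pairs whose gene is ranked
    in the top r for that disease.

    scores: predicted score matrix (genes x diseases)
    known:  training association matrix; its positives are excluded from ranking
    """
    hits = 0
    for gene, disease in held_out:
        col = scores[:, disease].astype(float).copy()
        col[known[:, disease] > 0] = -np.inf      # do not rank training positives
        rank = int(np.sum(col > col[gene])) + 1   # rank 1 is the best score
        hits += rank <= r
    return hits / len(held_out)

# Hypothetical toy scores for 4 genes and 2 diseases.
scores = np.array([[0.9, 0.1],
                   [0.8, 0.7],
                   [0.3, 0.6],
                   [0.1, 0.2]])
known = np.zeros((4, 2))
known[0, 0] = 1                     # gene 0 / disease 0 was seen in training
held_out = [(1, 0), (3, 1)]         # true associations hidden from training

for r in (1, 2, 3):
    print(r, recall_at_r(scores, held_out, known, r))
```

Plotting this value against r for each method gives curves of the kind described in the paragraph above.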


Comparison of state-of-the-art disease gene prioritization methods. IMC consistently outperforms competing methods by a large margin; the CATAPULT and Katz methods are also competitive.


Evaluation restricted to diseases with no known associations in training data. The benefit of using disease features is clear: IMC performs much better (the legend in the figure above applies).


1. OMIM term-document counts.
2. MATLAB Code (contains features and OMIM diseases as MAT files).

Main Publications