
The authors have declared that no competing interests exist.

Conceived and designed the experiments: SM QM. Performed the experiments: SM. Analyzed the data: AG SM. Wrote the paper: SM AG QM.

The properties (or labels) of nodes in networks can often be predicted based on their proximity and their connections to other labeled nodes. So-called “label propagation algorithms” predict the labels of unlabeled nodes by propagating information about local label density iteratively through the network. These algorithms are fast and simple, scale to large networks, and nonetheless regularly perform better than slower and much more complex algorithms on benchmark problems. We show here, however, that these algorithms have an intrinsic limitation that prevents them from adapting to some common patterns of network node labeling; we introduce a new algorithm, 3Prop, that retains all their advantages but is much more adaptive. As we show, 3Prop performs very well on node labeling problems ill-suited to label propagation, including predicting gene function in protein and genetic interaction networks and gender in friendship networks, and also performs slightly better on problems already well-suited to label propagation, such as labeling blogs and patents based on their citation networks. 3Prop gains its adaptability by assigning separate weights to label information from different steps of the propagation. Surprisingly, we found that for many networks, the third iteration of label propagation receives a negative weight.

The code is available from the authors by request.

In protein interaction networks, proteins linked by short paths of interactions tend to have similar functions

These algorithms take as input a network that represents a set of objects as nodes whose pairwise relationships are encoded as the links in the network. Then, based on a query list of nodes with a particular property (or label) of interest,

Labeling problems like these have proved difficult when the number of positive examples is small relative to the number of nodes in the network; to date the best performing algorithms for these problems are so-called label propagation algorithms

Here we introduce a unifying framework that generalizes a large class of algorithms for node label prediction. We will refer to this general framework as Generic Label Propagation (GLP). This framework allows us to highlight a limiting underlying assumption shared by all algorithms that fall under this class. Further, using this framework, we introduce a new algorithm called 3Prop that retains all of the advantages of label propagation but can adapt to diverse node labeling patterns. In particular, 3Prop gains this adaptivity by learning independent weights for the first three steps of label propagation, thereby overcoming an inherent limitation that we show restricts other label propagation algorithms. 3Prop can be applied to large networks, computes node scores quickly, and is easy to implement. Furthermore, as we will show, because the topological structure of many real world networks limits the amount of node label information available through label propagation, 3Prop will likely perform well on a wide range of network labeling problems. Specifically, 3Prop predicts node labels more accurately (and sometimes much more accurately) than label propagation on five separate social and biological network labeling problems where the networks range in size from 750 to 3 million nodes and the proportion of labeled nodes ranges from 0.0002% to 40%.

To motivate our 3Prop algorithm and to illustrate why label propagation fails for some networks, we introduce a general framework, which we will refer to as Generic Label Propagation (GLP), that encompasses common variations of label propagation algorithms. As we show, these algorithms fall into one of two classes, which we call

Below, we will first establish the GLP framework, and then use the random walk interpretation of the GLP scores to illustrate the intrinsic limitation of algorithms that fall into this framework.

Label propagation algorithms address the following problem: given an undirected, possibly weighted, network over

Label propagation assigns scores to nodes by an iterative process which propagates “evidence for positiveness” out from positive nodes through the links in the network to nearby nodes; this process is often compared to heat diffusion

In particular, in each iteration of GLP, the score of node

Two different normalizations of

In SLP, the matrix

In ALP,

The solutions of SLP and ALP are closely related: a slightly modified version of the former can be used to compute node scores for the latter (and vice versa). This similarity arises because
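The iterative propagation and the two normalizations can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the update rule f ← αSf + (1 − α)y, the symmetric normalization S = D^{−1/2}WD^{−1/2} (SLP), and the random-walk normalization P = D^{−1}W (ALP) follow the standard label propagation literature, and all function and variable names are our own.

```python
import numpy as np

def glp_scores(W, y, alpha=0.9, symmetric=True, n_iter=100):
    """Sketch of Generic Label Propagation (GLP).

    W: symmetric adjacency matrix (n x n, nonnegative weights)
    y: initial label vector (1 for positives, 0 for unlabeled)
    symmetric=True uses the SLP normalization S = D^{-1/2} W D^{-1/2};
    symmetric=False uses the ALP (random-walk) normalization P = D^{-1} W.
    """
    d = W.sum(axis=1)
    if symmetric:
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        S = D_inv_sqrt @ W @ D_inv_sqrt
    else:
        S = np.diag(1.0 / d) @ W
    # Iteratively propagate label evidence outward from the positives,
    # re-injecting the original labels at each step.
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * (S @ f) + (1 - alpha) * y
    return f
```

Because α < 1 and the spectral radius of the normalized matrix is at most 1, the iteration converges; scores decay with distance from the positive examples.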

If we interpret

Because

Displayed networks are subnetworks of a protein interaction network (top row) and a genetic interaction network (bottom row). Both networks are derived from the BioGRID database, and the true positive examples are derived from Gene Ontology. (A): Large red nodes indicate proteins involved in the meiotic cell cycle. (B): Large red nodes indicate proteins involved in transcription initiation. (C,D): Large red nodes indicate four positives selected at random from (A) and (B), respectively, for training 3Prop. (E-J): Node size reflects the relative magnitude of scaled random walk probabilities,

The 3Prop algorithm makes GLP more adaptive by assigning independent weights to each of the first three summands (corresponding to random walks of up to length three) in

3Prop uses linear discriminant analysis (LDA) (see,
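A minimal sketch of the 3Prop construction under the random-walk (ALP) normalization: the three features for each node are the label information reached by walks of length one, two, and three, and the combining weights come from a Fisher-style linear discriminant. The small ridge term and all names are illustrative additions; the paper's exact LDA formulation may differ.

```python
import numpy as np

def three_prop_features(W, y):
    """One feature per walk length: label mass reached by random walks
    of length 1, 2, and 3 (P = D^{-1} W, the ALP normalization)."""
    P = W / W.sum(axis=1, keepdims=True)
    f1 = P @ y
    f2 = P @ f1
    f3 = P @ f2
    return np.column_stack([f1, f2, f3])

def lda_weights(X, labels):
    """Fisher LDA direction w = S_w^{-1} (mu_pos - mu_neg), with a small
    ridge term for numerical stability when few examples are labeled."""
    Xp, Xn = X[labels == 1], X[labels == 0]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    Sw = np.cov(Xp, rowvar=False) + np.cov(Xn, rowvar=False)
    return np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), mu_p - mu_n)
```

Node scores are then `three_prop_features(W, y) @ w`; learning the three weights separately, including possibly negative ones, is what distinguishes 3Prop from fixing a single decay parameter as in GLP.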

The other panels in

experiment | 1^{st} | 2^{nd} | 3^{rd}
--- | --- | --- | ---
Protein Interaction | 0.022 | 0.68 | 0.29
Genetic Interaction | −0.11 | −0.22 | −0.66
Caltech | 0.072 | 0.45 | −0.477
Princeton | 0.063 | 0.45 | −0.48
Georgetown | 0.054 | 0.46 | −0.48
Oklahoma | 0.0081 | 0.51 | −0.48
UNC | 0.022 | 0.49 | −0.48

In our experiments we use a diverse collection of networks including: two types of molecular networks, protein-protein interaction (PI) and genetic interaction (GI) (downloaded from BioGRID

Dataset | nodes | edges | average shortest distance | diameter | labels
--- | --- | --- | --- | --- | ---
Protein Interaction | 5,405 | 414,242 | 2.5 | 7 | 47 protein functions
Negative Genetic Interaction | 4,563 | 152,188 | 2.8 | 6 | 47 protein functions
Facebook (Caltech) | 769 | 33,312 | 2.3 | 6 | gender
Facebook (Georgetown) | 9,414 | 851,276 | 2.7 | 11 | gender
Facebook (Princeton) | 6,596 | 586,640 | 2.7 | 9 | gender
Facebook (Oklahoma) | 17,425 | 1,785,056 | 2.7 | 9 | gender
Facebook (UNC) | 18,163 | 1,533,600 | 2.8 | 7 | gender
Political Blogs | 1,224 | 33,433 | 2.7 | 8 | liberal or conservative
Patent Citation | 3,774,768 | 33,037,894 | 8.5 | 23 | 381 patent categories

We compare the performance of symmetric 3Prop with that of symmetric GLP (SLP), which has been shown to perform well in gene function prediction problems

We evaluate GLP and 3Prop using two standard measures: area under the ROC curve (AUROC) and average precision (AUP). The ROC curve is a graphical plot of recall (number of true positives divided by the total number of positives) as a function of false positive rate (number of false positives divided by the total number of negatives) for a binary classifier as we vary the discrimination threshold. The area under this curve (AUROC) can achieve a maximum value of 1 and a minimum of 0; a random classifier will result in an AUROC of 0.5. AUROC can also be interpreted as the probability that a randomly chosen positive example is assigned a discriminant score that is higher than a randomly chosen negative example. Precision at a given recall is defined as the fraction of predictions that are true positives and is given by TP/(TP+FP), where TP is the number of true positives and FP is the number of false positives at the given recall rate. A classifier that performs better in terms of AUROC is not guaranteed to perform better in terms of average precision, or vice versa. In general, average precision is a more suitable measure when there are many more non-positive than positive examples
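Both measures follow directly from the definitions above; a minimal sketch (function names are our own):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via its rank interpretation: the probability that a random
    positive outscores a random negative (ties count one half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Average precision: the mean of the precision values measured at
    each true positive, scanning predictions from highest score down."""
    order = np.argsort(-scores)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / rank   # precision at this recall level
    return ap / int(labels.sum())
```

The quadratic pairwise loop in `auroc` is fine for illustration; for large networks one would use a rank-based formula instead.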

As we have described, 3Prop assigns non-zero weights only to random walk probabilities for walks of length three or shorter. The choice of three is motivated by our observations about the performance of GLP with increasing random walk lengths, and by the average shortest distances between nodes in several real-world networks (

Here, we learn the random walk weights using the LDA algorithm (as in 3Prop). These plots show that considering random walks of length longer than 3 is unnecessary for accurate prediction of protein function from PI and GI networks.

For all networks except the patent network, the weights assigned by 3Prop are similar for problems with the same task or network type (

network | 1^{st} | 2^{nd} | 3^{rd}
--- | --- | --- | ---
Caltech | 0.66 | 0.67 | 0.63
Georgetown | 0.59 | 0.60 | 0.56
Oklahoma | 0.67 | 0.65 | 0.61
Princeton | 0.69 | 0.65 | 0.61
UNC | 0.6394 | 0.61 | 0.57

In contrast, the 3Prop weights for the different patent categories vary considerably (

The colors depict the age of the patent (when the patent was assigned).

Despite its limitations, label propagation has become the algorithm of choice for many node labeling problems. It is easy to implement and resists overfitting because it has only a single free parameter, yet it performs as well as or better than much more complex algorithms on benchmark problems

3Prop retains all of the advantages of GLP but is faster and more accurate. If provided with

Surprisingly, we have found that for many networks, the third iteration of label propagation receives a negative weight. Assigning these negative weights gives 3Prop the flexibility to exploit the “over-counting” of random walks in real-world networks, where short random walks can contribute substantially to the probabilities of longer ones.

3Prop only considers random walks of length up to three; a natural question is, “Why three and not more (or fewer)?”. For example, even in assortatively mixed networks, having many paths of length two between two nodes is evidence that they are members of the same network module or “community”
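The rapid mixing that motivates stopping at three can be illustrated by tracking the total variation distance between the length-k walk distribution and the walk's stationary distribution: once the walk has mixed, longer walks carry essentially no information about the starting node. A sketch, assuming a connected, non-bipartite network (names our own):

```python
import numpy as np

def tv_distance_profile(W, start, max_len=6):
    """Total variation distance between the distribution of a random walk
    of length k from `start` and the stationary distribution d_i / sum(d),
    for k = 1..max_len."""
    d = W.sum(axis=1)
    P = W / d[:, None]              # row-stochastic transition matrix
    stationary = d / d.sum()
    p = np.zeros(len(W))
    p[start] = 1.0
    dists = []
    for _ in range(max_len):
        p = p @ P                   # advance the walk one step
        dists.append(0.5 * np.abs(p - stationary).sum())
    return dists
```

On networks with small average shortest distances, such as those in our collection, this distance typically becomes negligible within a handful of steps.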

Total variation distance between random walks of increasing length as a function of walk length

(TIF)


Supplementary methods.

(PDF)