Lilly Endowment, Inc. is a commercial funder of this work through its support for the Indiana University Pervasive Technology Institute. This does not alter the authors’ adherence to PLOS ONE policies on sharing data and materials.

Conceived and designed the experiments: SE SK KB. Performed the experiments: SE MG. Analyzed the data: SE SK KB. Contributed reagents/materials/analysis tools: SE MG. Wrote the paper: SE SK KB.

Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the relationship between different cluster quality metrics is lacking. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms: Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes.

We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information.

Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.

Clustering is the task of assigning a set of objects to groups (also called classes or categories) so that the objects in the same cluster are more similar (according to a predefined property) to each other than to those in other clusters. This is a fundamental problem in many fields, including statistics, data analysis, bioinformatics, and image processing. Some of the classical clustering methods date back to the early 20th century, and they cover a wide spectrum: connectivity clustering, centroid clustering, density clustering, etc. The result of clustering may be a hierarchy or a partition with disjoint or overlapping clusters. Cluster attributes such as count (number of clusters), average size, minimum size, maximum size, etc., are often of interest.

To evaluate and compare network clustering algorithms, the literature has given much attention to algorithms’ performance on “benchmark graphs” [

However, judging clustering algorithms based solely by their performance on benchmark graph tests assumes that the embedded clustering truly is a “gold standard” that captures the entirety of an algorithm’s performance. It ignores other properties of clustering, such as modularity, conductance, and coverage, to which the literature has given much attention in order to decide the best clustering algorithm to use in practice for a particular application [

Furthermore, previous papers that have evaluated clustering algorithms on benchmark graphs have used a single metric, such as normalized mutual information, to measure the amount of “gold standard” information recovered by each algorithm [

In this paper, we experimentally evaluate the robustness of clustering algorithms by their performance on small (1,000 nodes, 12,400 undirected edges) to large-scale (1M nodes, 13.3M undirected edges) benchmark graphs. We cluster these graphs using a variety of clustering algorithms and simultaneously measure both the information recovery of each clustering and the quality of each clustering with various metrics. Then, we test the performance of the clustering algorithms on real-world network graph data (Flickr related images dataset and DBLP co-authorship network) and compare the results to those obtained for the benchmark graphs.

Specifically, we address the following questions:

How sensitive is a clustering algorithm’s performance on benchmark graphs to the choice of information recovery metric?

How does a clustering algorithm’s performance on the metric of information recovery in benchmark graphs compare to its performance on other metrics such as modularity, conductance, and coverage?

How does a clustering algorithm’s performance on benchmark graphs scale as the size of the graphs increases?

How does an algorithm’s performance on benchmark graphs compare to its performance on real-world graphs?

Implementations of all algorithms and all metrics together with links to the synthetic datasets used in this study can be found at

Work on benchmark graphs includes the Girvan-Newman (GN) benchmark [

Lancichinetti et al. [

The LFR benchmark has become a standard on which to test algorithms. Lancichinetti and Fortunato used it to compare the performance of twelve clustering algorithms [

In addition to LFR benchmark synthetic graphs, we also consider real-world graphs to help us gain some intuition about the performance of the clustering algorithms under consideration. Specifically, we use two datasets, one of Flickr related images comprised of 105,938 nodes and 2,316,948 undirected edges [

Clustering is the task of assigning a set of objects to communities such that objects in the same community are more similar to each other than to those in other communities. In network clustering, the literature defines “similarity” based on topology. Clustering algorithms seek to capture the intuitive notion that nodes should be connected to many nodes in the same community (intra-cluster density) but connected to few nodes in other communities (inter-cluster sparsity). We compare four clustering algorithms in this study. Each scales to networks of greater than one million nodes.

The Louvain algorithm [

The smart local moving (SLM) algorithm [

The Infomap algorithm [

The label propagation algorithm [

A cluster in a network is intuitively defined as a set of densely connected nodes that is sparsely connected to other clusters in the graph. However, there exists no universal, precise mathematical definition of a cluster that is accepted in the literature [

The modularity of a graph compares the presence of each intra-cluster edge of the graph with the probability that that edge would exist in a random graph [

Modularity takes the form $Q = \sum_k (e_{kk} - a_k^2)$, where $e_{kk}$ is the probability of intra-cluster edges in cluster $S_k$, $a_k$ is the probability of either an intra-cluster edge in cluster $S_k$ or of an inter-cluster edge incident on cluster $S_k$, and $S_k \subseteq V$.
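As a minimal sketch of how modularity can be computed for an undirected, unweighted graph, the snippet below assumes the standard form $Q = \sum_k (e_{kk} - a_k^2)$, with $e_{kk}$ taken as the fraction of edges inside cluster $k$ and $a_k$ as the fraction of edge endpoints that fall in cluster $k$. The function name, the edge-list representation, and the toy graph are illustrative assumptions, not the code used in this study.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity Q = sum_k (e_kk - a_k^2) of an undirected, unweighted graph.

    edges      -- list of (u, v) pairs, each undirected edge listed once
    cluster_of -- dict mapping every node to its cluster label
    Assumption: a_k is the fraction of edge endpoints in cluster k.
    """
    m = len(edges)
    intra = defaultdict(int)      # edges with both endpoints in cluster k
    endpoints = defaultdict(int)  # edge endpoints that fall in cluster k
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        endpoints[cu] += 1
        endpoints[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[k] / m - (endpoints[k] / (2 * m)) ** 2 for k in endpoints)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(modularity(toy_edges, two_clusters))
print(modularity(toy_edges, one_cluster))
```

Placing every node in a single cluster yields a modularity of zero under this form, which is why modularity, unlike coverage, does not reward the trivial clustering.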

The color of each node defines its cluster.

We define the conductance of a cluster as the number of inter-cluster edges for the cluster divided by either the number of edges with an endpoint in the cluster or the number of edges that do not have an endpoint in the cluster, whichever is smaller. The conductance for a cluster is given by $\phi(S_k) = \frac{\sum_{i \in S_k} \sum_{j \notin S_k} A_{ij}}{\min\{A(S_k), A(\overline{S_k})\}}$, where $S_k \subset V$ and $A(S_k) = \sum_{i \in S_k} \sum_{j \in V} A_{ij} - \sum_{i \in S_k} \sum_{j \in S_k} A_{ij}$, the number of edges with an endpoint in $S_k$.

We define the conductance of a graph

There are several possible ways to define the conductance of a graph that has already been clustered. In this paper we use inter-cluster conductance as opposed to intra-cluster conductance because the next metric (coverage) deals with intra-cluster density. Still, it is worth mentioning that this definition of conductance emphasizes the notion of inter-cluster sparsity but does not wholly capture intra-cluster density. For a more detailed discussion of measures of conductance, including intra-cluster conductance, see Almeida et al. [
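The verbal definition of per-cluster conductance given earlier can be sketched in a few lines. The snippet follows the wording literally (inter-cluster edges divided by the smaller of "edges with an endpoint in the cluster" and "edges without an endpoint in the cluster"). The function name, the edge-list input, and the choice to return 0.0 when the denominator is zero are assumptions made for illustration.

```python
def cluster_conductance(edges, cluster):
    """Conductance of one cluster in an undirected, unweighted graph.

    edges   -- list of (u, v) pairs, each undirected edge listed once
    cluster -- iterable of the nodes that belong to the cluster
    """
    members = set(cluster)
    cut = 0       # edges with exactly one endpoint in the cluster
    touching = 0  # edges with at least one endpoint in the cluster
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside >= 1:
            touching += 1
        if inside == 1:
            cut += 1
    not_touching = len(edges) - touching
    denominator = min(touching, not_touching)
    # Assumption: report 0.0 when the ratio is undefined.
    return cut / denominator if denominator else 0.0


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(cluster_conductance(toy_edges, {0, 1, 2}))
```

Lower values indicate sparser connections to the rest of the graph, so under this metric a lower score is better.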

Coverage [_{i} is the cluster to which node

While coverage captures the notion of intra-cluster density, optimizing too heavily for the measure leads to a trivial clustering in which all nodes are assigned to the same cluster. For example, the graph shown in
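To make the trivial-clustering issue concrete, here is a small sketch that assumes the usual definition of coverage as the fraction of edges whose endpoints share a cluster. The function name and the toy graph are hypothetical.

```python
def coverage(edges, cluster_of):
    """Fraction of edges whose two endpoints lie in the same cluster.

    edges      -- list of (u, v) pairs, each undirected edge listed once
    cluster_of -- dict mapping every node to its cluster label
    """
    intra = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    return intra / len(edges)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

print(coverage(toy_edges, two_clusters))  # 6 of 7 edges are intra-cluster
print(coverage(toy_edges, one_cluster))   # every edge is intra-cluster
```

The single-cluster assignment reaches the maximum coverage of 1 even though it conveys no structure, which illustrates why coverage alone is easy to over-optimize.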

When working with an input graph with well-defined clusters, we would like to be able to compare how well a particular clustering algorithm finds the correct clusters. It is not trivial to quantify the agreement between the community assignments returned by a clustering algorithm with the “gold standard” community assignments embedded in the LFR benchmark graph. Two popular metrics to measure the similarity of clusters are the adjusted Rand score, which is based on counting, and normalized mutual information, which is based on the Shannon entropy of information theory [

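The adjusted Rand index is based on counting. If $X$ and $Y$ are two community assignments of the same nodes, we define the following pair counts:

$a_{11}$: the number of pairs of nodes that are in the same community in both $X$ and $Y$

$a_{00}$: the number of pairs of nodes that are in different communities in both $X$ and $Y$

$a_{10}$: the number of pairs of nodes that are in the same community in $X$ but in different communities in $Y$

$a_{01}$: the number of pairs of nodes that are in different communities in $X$ but in the same community in $Y$

Intuitively, $a_{11}$ and $a_{00}$ indicate agreement between $X$ and $Y$, while $a_{10}$ and $a_{01}$ indicate disagreement between $X$ and $Y$. The Rand index is the fraction of agreeing pairs, $R(X, Y) = \frac{a_{11} + a_{00}}{a_{11} + a_{00} + a_{10} + a_{01}}$.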

While the Rand index has a range of [0, 1], chance leads it generally to fall within the more restricted range of [0.5, 1]. To correct for chance, the adjusted Rand index, given by

The adjusted Rand index equals 0 when the agreement between clusterings equals that which is expected due to chance, and 1 when the agreement between clusterings is maximum. In our experiments we use the Scikit-learn implementation of adjusted Rand score [
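For readers who want to see the pair counting spelled out, the following is a minimal pure-Python sketch of the Rand index and its chance-corrected form, written from the standard contingency-table formulation. It is an illustration only and is not the Scikit-learn implementation used in our experiments. The function names are hypothetical, and the sketch assumes at least two nodes.

```python
from collections import Counter
from math import comb


def pair_counts(x, y):
    """Return (same_both, same_x, same_y, all_pairs) for two label lists."""
    same_both = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    same_x = sum(comb(c, 2) for c in Counter(x).values())
    same_y = sum(comb(c, 2) for c in Counter(y).values())
    return same_both, same_x, same_y, comb(len(x), 2)


def rand_index(x, y):
    """Fraction of node pairs on which the two clusterings agree."""
    same_both, same_x, same_y, pairs = pair_counts(x, y)
    different_both = pairs - same_x - same_y + same_both
    return (same_both + different_both) / pairs


def adjusted_rand_index(x, y):
    """Rand index corrected for the agreement expected by chance."""
    same_both, same_x, same_y, pairs = pair_counts(x, y)
    expected = same_x * same_y / pairs
    maximum = (same_x + same_y) / 2
    if maximum == expected:  # degenerate case, e.g. all singletons
        return 1.0
    return (same_both - expected) / (maximum - expected)


gold = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]     # splits every gold community in half

print(adjusted_rand_index(gold, relabeled))
print(adjusted_rand_index(gold, crossed))
```

Because only co-membership of pairs is counted, the score does not depend on how the clusters are named. A clustering that agrees with the gold standard less often than chance would predict receives a negative adjusted score.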

Normalized mutual information is built on the Shannon entropy of information theory. Let partitions _{i}} and {_{i}} for each node _{x} _{y} _{x}∑_{y}

In order to normalize the value of mutual information in the range 0 to 1, we define the normalized mutual information [
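The entropy-based computation can be sketched as follows. The normalization shown here, $2I(X;Y) / (H(X) + H(Y))$, is one common choice and is an assumption on our part for this illustration; other normalizations (for example, dividing by $\sqrt{H(X)H(Y)}$) are also in use. The function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a label list (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information between two label lists of equal length."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


def normalized_mutual_information(x, y):
    """Assumed normalization: 2 * I(X;Y) / (H(X) + H(Y))."""
    total_entropy = entropy(x) + entropy(y)
    if total_entropy == 0:  # both clusterings place all nodes in one cluster
        return 1.0
    return 2 * mutual_information(x, y) / total_entropy


gold = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]     # statistically independent of the gold labels

print(normalized_mutual_information(gold, relabeled))
print(normalized_mutual_information(gold, crossed))
```

With this normalization, identical partitions score 1 and statistically independent partitions score 0.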

Lancichinetti et al. [_{i}} and {_{i}} for each node _{i}} and {_{i}} are binary arrays whose lengths equal the number of different communities in _{i} is in the _{i} to a random variable _{k} of probability distribution _{k} is the number of nodes of community _{i} in the

Lancichinetti et al. [_{k} = 1, _{l} = 1), _{k} = 0, _{l} = 1), _{k} = 1, _{l} = 0), and _{k} = 0, _{l} = 0). The additional information of a given _{k} to a given _{l},
_{k} from _{k} from all choices of _{l} from _{k}) to normalize the expression and averaging the value of each assignment _{norm} is defined equivalently. Lancichinetti et al. [

Note that the normalized mutual information variant of

First, we generated a total of 930 undirected LFR benchmark graphs using the parameters outlined in

Param. | Description | Value | Notes |
---|---|---|---|
N | Number of nodes | [1 | In power of ten increments |
k | Average node degree | 25 | Same constant for all sizes |
maxk | Maximum node degree | N/10 | To scale with size of graph |
μ | Mixing parameter | 0.4, 0.5, 0.6 | To see impact on performance |
t_{1} | Node degree distrib. exp. | -2 | The default value |
t_{2} | Community size distrib. exp. | -1 | The default value |
minc | Min community size | 50 | Same constant for all sizes |
maxc | Max community size | N/10 | To scale with size of graph |
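The parameter table can be turned into generator invocations with a short script. The sketch below only builds argument lists and does not run anything. The flag names (`-N`, `-k`, `-maxk`, `-mu`, `-t1`, `-t2`, `-minc`, `-maxc`) are assumed to match the command-line interface of the LFR generator, where `-t1` and `-t2` are assumed to take the negated exponents. The `./benchmark` path and the helper names are hypothetical, and the number of graphs generated per parameter combination is deliberately left out.

```python
from itertools import product

# Sizes from 1,000 to 1,000,000 nodes in power-of-ten increments.
GRAPH_SIZES = [10 ** exponent for exponent in range(3, 7)]
MIXING_PARAMETERS = [0.4, 0.5, 0.6]


def lfr_arguments(n, mu):
    """Argument list for one benchmark graph with n nodes and mixing mu."""
    return [
        "-N", str(n),
        "-k", "25",              # average node degree
        "-maxk", str(n // 10),   # maximum node degree scales with n
        "-mu", str(mu),          # mixing parameter
        "-t1", "2",              # degree exponent -2 (assumed sign convention)
        "-t2", "1",              # community size exponent -1 (assumed)
        "-minc", "50",           # minimum community size
        "-maxc", str(n // 10),   # maximum community size scales with n
    ]


# One command per (size, mixing parameter) combination.
commands = [
    ["./benchmark"] + lfr_arguments(n, mu)  # hypothetical binary path
    for n, mu in product(GRAPH_SIZES, MIXING_PARAMETERS)
]

for command in commands[:3]:
    print(" ".join(command))
```

Keeping the parameter grid in one place makes it straightforward to check that maxk and maxc scale with N while k and minc stay fixed.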

We also used two large real-world graphs obtained from analyzing a dataset of related images on Flickr [

Second, we clustered each of the 932 graphs using undirected implementations of the Louvain, smart local moving, Infomap, and label propagation algorithms. We used undirected implementations of all algorithms for consistency with Lancichinetti et al.’s comparison of Louvain and Infomap [

We clustered the 930 benchmark graphs using a Dell C6145 cloud server with 4 central processing units, 64 cores, and 256 gigabytes of random-access memory. We clustered the 2 real-world data sets using Karst, a supercomputer of Indiana University. Karst’s compute nodes are IBM NeXtScale nx360 M4 servers. Each contains two Intel Xeon E5-2650 v2 8-core processors, 32 gigabytes of random-access memory, and 250 gigabytes of local disk space.

Finally, we computed the information recovery of the 930 produced clusterings of the benchmark graphs with the embedded gold standard clusterings using the metrics of adjusted Rand index, traditional normalized mutual information, and the variant of normalized mutual information used by Lancichinetti et al. in [

The code we developed to implement this study, including all scripts, statistics, and analyses, is available and documented at

We present our results using violin plots. A violin plot is an adaptation of the box plot that enables viewers to make better inferences about the data shown, by capturing sample density, in addition to summary statistics, such as the min, mean and max values [

We drew each “violin” using a Gaussian kernel density estimation. Red lines indicate the minimum, maximum, and mean of the data.

For a given matrix, each of the clustering algorithms in our study defines a row, and each of the cluster quality metrics in our study defines a column. In this way, each cell in these matrices is a violin plot of the performance of one clustering algorithm by one cluster quality metric. The structure of these matrices allows one to compare the performance of different algorithms by scanning the columns, and to compare performance of different metrics by scanning the rows.

A fairly clear overall trend is that performance decreases as

Our results show that the choice of information recovery metric has a significant impact on the measured performance of algorithms. For example, at

Lancichinetti’s variant of normalized mutual information does not match traditional normalized mutual information when there is no overlap between clusters, which we expected. Unexpectedly, our results show that the variant can differ from the traditional formulation by as much as 0.4; see Louvain’s performance at N = 1,000,000 in

Our results suggest that coverage is a poor cluster quality metric. Although we would expect metrics of cluster quality to decrease as

Our results show that modularity is also an unreliable metric to indicate benchmark graph performance. A clustering algorithm’s performance can deteriorate on information recovery metrics without dropping in modularity. Louvain’s performance at

These results question the validity of using metrics such as coverage and modularity to evaluate an algorithm’s clustering performance when a gold standard is not known. Because coverage and modularity do not reflect performance on benchmark graph tests, these two measures capture fundamentally different properties of clustering than does benchmark graph testing.

Conductance is the metric that best indicates benchmark graph performance in our experiments. The performance of Louvain and Infomap in

A surprising result of our work is Louvain’s performance, which surpasses Infomap’s in nearly all of our experiments. This contradicts the previous work of Lancichinetti et al. [

The “resolution limit” of modularity and the “field-of-view limit” of both Louvain and Infomap explain how our choice of a relatively large maximum community size leads to this contradictory result. The resolution limit of modularity is the well-known limitation that modularity has in detecting small communities [

Note that while our experiments use the bottom hierarchical level of Infomap, which suffers from the field-of-view limit, Schaub et al. have shown how to overcome the field-of-view limit [

Lancichinetti et al. [

Label propagation shows the widest variability in performance of the four clustering algorithms, which is illustrated by the length of its distribution curve in

Label propagation’s relative sensitivity to

Smart local moving is by far the best-performing algorithm in our study. It scores equal to or higher than the other algorithms on traditional normalized mutual information and adjusted Rand score on virtually all benchmark graph sizes at all values of

In order to inform the choice of which clustering algorithm to use in practice, we would like to be able to rank the performance of clustering algorithms on real-world data sets that do not have a “gold standard” clustering using stand-alone quality metrics. However, our earlier results from the synthetic graph analysis reveal that such an absolute ranking of clustering algorithms based on stand-alone quality metrics does not exist. There is disagreement on the performance of clustering algorithms both amongst the different stand-alone quality metrics and between the information recovery metrics and the stand-alone quality metrics.

We are not able to make definitive statements about the superiority of clustering algorithms, but it is possible to compute the stand-alone quality metrics, such as those shown in

(B) A comparison of clustering algorithm performance by conductance on the real-world graphs. (C) A comparison of clustering algorithm performance by coverage on the real-world graphs.

We evaluate clustering algorithms and cluster quality metrics on graphs ranging from 1,000 to 1M nodes. Our results show overall disagreement between stand-alone quality metrics and information recovery metrics, with conductance as the best of the stand-alone quality metrics. Our results show that the variant of normalized mutual information employed by Lancichinetti et al. [

Overall, smart local moving is the best performing algorithm in our study. Note that disagreement between stand-alone quality metrics and information recovery metrics prevents us from claiming that smart local moving is absolutely superior to the other clustering algorithms. Additionally, the high performance of smart local moving on our LFR benchmark graph tests must be taken with a caveat. The LFR benchmark graphs rely on a synthetic model for their construction with assumptions such as a power law distribution of node degrees. There is inherent circularity in judging a clustering algorithm by its performance on benchmark graphs, and smart local moving’s high performance on the LFR benchmark graphs shows that it is based on a model similar to that of the LFR model. However, one may still challenge the LFR model, and potential future work includes analyzing models such as “CHIMERA” that enable more precise control of network structure than the LFR benchmark [

Practitioners seeking to use the best clustering algorithm for a particular application must rely on testing of effectiveness in their respective domain. Lack of a rigorously defined notion of “community”, which is intuitively appealing but remains in general to be mathematically defined, is the root of discrepancies amongst stand-alone quality metrics and information recovery metrics. Without a rigorous notion of a community, which may vary depending on the domain, absolute statements about the superiority of clustering algorithms cannot be made.

Our results suggest future work in unifying various notions of community, as well as precisely quantifying how current notions of community differ. Additionally, better understanding of the significance of cluster quality metric values (e.g., what does it mean when one clustering algorithm scores 0.1 higher than another in modularity?), will enable more meaningful claims based on these metrics.

We would like to thank Santo Fortunato for suggestions regarding experimental design, Yong-Yeol Ahn for feedback on the choice of clustering algorithms, Ludo Waltman for editing a completed draft of this work and for helping run the smart local moving code, Martin Rosvall for explaining the relative performance of Louvain and Infomap, and Bahador Saket for discussing early drafts of this work.

Our code draws on Lancichinetti’s implementation of the LFR benchmark (

This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU is also supported in part by Lilly Endowment, Inc.