Conceived and designed the experiments: RA IM. Performed the experiments: RA. Analyzed the data: RA. Contributed reagents/materials/analysis tools: RA. Wrote the paper: IM.

The authors have declared that no competing interests exist.

The analysis of complex networks permeates all sciences, from biology to sociology. A fundamental, unsolved problem is how to characterize the community structure of a network. Here, using both standard and novel benchmarks, we show that maximization of a simple global parameter, which we call Surprise (

A network of interacting units is often the best abstract representation of real-life situations or experimental data. This has led to a growing interest in developing methods for network analysis in scientific fields as diverse as mathematics, physics, sociology and, most especially, biology, to study both organismic (e.g. populational, ecological) and cellular (metabolic, genomic) networks

Some years ago, we suggested determining the community structure of a network by evaluating the distributions of intra- and inter-community links with a cumulative hypergeometric distribution

Where ^{2}−k
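As an illustration of the cumulative hypergeometric evaluation described above, the following is a minimal sketch, not the authors' implementation. It assumes that Surprise is the negative logarithm of the probability of drawing at least the observed number of intra-community links, and it assumes a natural logarithm. The names `F`, `n`, `M` and `p`, and the function `surprise`, are illustrative choices made here.

```python
import math

def surprise(k, links, communities):
    """Sketch of Surprise for one partition (assumed form, natural log).

    k: number of units; links: list of (u, v) pairs;
    communities: dict mapping each unit to a community label.
    """
    F = k * (k - 1) // 2                     # maximum possible number of links
    n = len(links)                           # observed number of links
    sizes = {}
    for label in communities.values():
        sizes[label] = sizes.get(label, 0) + 1
    # Maximum possible number of intra-community links for this partition
    M = sum(s * (s - 1) // 2 for s in sizes.values())
    # Observed number of intra-community links
    p = sum(1 for u, v in links if communities[u] == communities[v])
    # Upper tail of the hypergeometric distribution, kept as exact integers
    tail = sum(math.comb(M, j) * math.comb(F - M, n - j)
               for j in range(p, min(M, n) + 1))
    return math.log(math.comb(F, n)) - math.log(tail)

# Toy network: two pairs of units joined by a single bridging link
links = [(0, 1), (2, 3), (1, 2)]
two_groups = {0: "a", 1: "a", 2: "b", 3: "b"}
one_group = {0: "x", 1: "x", 2: "x", 3: "x"}
s_two = surprise(4, links, two_groups)
s_one = surprise(4, links, one_group)
```

In this toy case the two-community partition scores higher than the trivial single-community partition, whose Surprise is zero under the assumed definition.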

In this work, we show that

Testing the performance of a global parameter to determine community structure requires both a set of efficient algorithms for community detection and a set of standard benchmarks consisting of synthetic networks of known structure. In this study, six selected algorithms (see

For each benchmark, we estimated _{max} and Q_{max}) provided the partitions used to compare with the known community structures. As in previous works

NMI_{S} = 1) even when that structure is blurred by a large number of inter-community links, generated by increasing the mixing parameter μ up to 0.5–0.7 (see NMI_{S}<1). This suggests that the original community structure is no longer present, which is in good agreement with the fact that S_{max}≫S_{orig}, where S_{orig} is the NMI_{S}>NMI_{Q} in 2827/3600 = 78.5% of the cases, NMI_{Q}>NMI_{S} in just 4.1% of them and the rest are ties. Interestingly, NMI_{Q}≪NMI_{S} in quasi-random and random networks (

a) Results for the four standard LFR networks. B and S indicate big and small communities respectively and 1000 or 5000 the number of nodes. μ: mixing parameter. NMI measures the congruence between the known and the deduced community structures. Each point is based on 100 different networks; standard errors of the mean are too small to be visualized. Values for 100 random (R) networks with the same number of units and degree distributions are also shown. b) Comparison of S_{max} and Q_{max}: NMI_{Q}/NMI_{S} ratios, which are almost always below 1, are shown. c) Results for the RC benchmark. The parameter Degradation (D) indicates the percentage of both deleted and shuffled links. Each black dot is based on 100 networks; again, standard errors are so small that they cannot be visualized at this scale. For each value of D, results for 100 random networks with the same number of links are also shown (open circles). d) Relative quality of the partitions generated by maximizing S and Q: NMI_{Q}/NMI_{S} ratios are shown. White dots: results for random networks with different D values.

The algorithms used were described by Arnau

The discovery of the resolution limit of Q showed that heterogeneous community sizes may greatly affect the ability of global parameters to detect structure NMI_{S}>NMI_{Q} in 848/900 = 94.2% of the cases, while NMI_{Q}>NMI_{S} in just 3.3% of the cases). As with the LFR benchmarks, none of the algorithms obtained the best results in all networks (

The results just shown indicate that using S_{max} to detect community structure has obvious advantages over maximizing Q. However, they do not allow us to evaluate how optimal that criterion is, given that the potential maximum NMIs are unknown. To overcome this limitation, we generated closed LFR and RC benchmarks, in which we had an a priori expectation of the maximum NMI values. Results are shown in S_{max} was used, an almost perfectly symmetrical dynamics was observed. In the process of converting the original structure into the final one (by increasing the Conversion parameter; see (1+NMI_{IF})/2, where NMI_{IF} is obtained comparing the initial and final structures (S_{max} was always identical or higher than S_{orig} (

a) LFR benchmark with 1000 units and big communities. For each Conversion (C) value, NMIs comparing the S_{max} partition with the initial (black dots) or final (red squares) community structures were obtained. The symmetrical results led to NMI averages (blue diamonds) that, with great precision, fell on a straight line of value (1+NMI_{IF})/2. Dots are based on 100 independent analyses. b–d) LFR benchmarks with, respectively, 1000 units, small communities (b), 5000 units, big communities (c) and 5000 units, small communities (d). Results are very similar to those in panel a). e) Average NMI values for partitions obtained maximizing Q are worse than those obtained maximizing S. S_{max}/S_{orig} ratio ≥1, i.e. either the original structure or a different one with higher

Three networks with different heterogeneity in community sizes (Pielou's indexes equal to 0.70, 0.85 and 1.00 respectively) were used as examples. a) PI = 1; b) PI = 0.85; c) PI = 0.70. Results similar to those in S_{max}/S_{orig}<1 with heterogeneous community sizes. The algorithms used did not detect in those cases the maximum possible S. S_{max}/S_{orig}≫1 with C<0.50 and PI = 0.70 (blue diamonds) implies that the algorithms are detecting structures different from the initial one.

S_{max} results for three real networks. The first example is based on the CYC2008 database, which compiles 1604 proteins that belong to 324 protein complexes S_{max} and a priori defined protein complexes is almost perfect, NMI_{S} = 0.91. On NMI_{Q} = 0.57. The largest five communities alone almost cover the whole network (NMI_{S} = 0.93). Finally,

Community structure of the CYC2008 network (a, b), College football network (c) and Zachary's karate club network (d), according to S_{max} analyses. While S_{(2 communities)} = 13.61, the optimal division found has S_{(19 communities)} = 25.69. Twelve of these optimal communities are singletons (white dots).

In this study, we have shown the potential of maximizing the global parameter Surprise (NMI_{Q}>NMI_{S} (3–4% of all the cases examined in the open benchmarks) could also be explained by an incomplete success in determining S_{max} with these algorithms.

The commonly used open benchmarks are useful for general evaluations of the performance of different algorithms, but they do not allow us to establish how optimal the results obtained are. For that purpose, we have devised novel closed benchmarks in which an initial known community structure is progressively transformed into a second, also known, community structure. Provided that both community structures are identical, it can be demonstrated that, at any point of the transformation from one to the other, the average of the NMIs of the solution found with respect to the initial and final structures should approximate a constant value ([1+NMI_{IF}]/2), if that solution is optimal (see

When S_{max}. On the other hand, the structure of Zachary's karate network is far from obvious (S_{max}, the network contains some small groups plus many singletons is, at least a posteriori, not so unexpected. A natural question is then why the scientific community has been so keen on exploring this particular network, often to establish whether or not an algorithm was able to detect the putative two communities

Six of the best available algorithms, selected either by their exceptional performance in artificial benchmarks or their success in previous analyses of real and simulated networks

First, the recently developed LFR benchmarks, specifically devised for testing alternative community detection strategies

Having found that these LFR benchmarks generated networks with communities of very similar sizes, we decided to implement RC benchmarks in which these sizes were more variable. All networks in these benchmarks had 512 units divided into 16 communities. One hundred networks with random community sizes, determined using a broken-stick model

In the LFR and RC benchmarks just described it was possible to compare networks having obvious community structures (generated with low μ or D parameters) with others that were increasingly random. We call benchmarks of this type open. We also generated closed LFR and RC benchmarks. In them, links were shifted in a directed way, in order to convert the original community structure of a network into a second, also predefined, structure. In this way, it is possible to monitor when the original structure is substituted by the final one according to the solutions provided by S_{max} or Q_{max}. In the LFR and RC closed benchmarks, the starting networks were the same as those described in the previous paragraphs, with μ = 0.1 (LFR) or D = 0 (RC) respectively, and the final networks were obtained by randomly relabeling the nodes. Therefore, the initial and final networks had identical community structures but the nodes within each community were different. Conversion (C) is defined as the percentage of links exclusively present in the initial network that are substituted by links only present in the final one (i.e. C = 0: initial structure present; C = 100: final structure present).
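The conversion step of a closed benchmark can be sketched as follows. This is only an illustration under stated assumptions: links are stored as sorted node pairs, the links to swap are chosen uniformly at random, and the helper names `relabel` and `convert_links` are hypothetical rather than taken from the original software.

```python
import random

def relabel(links, mapping):
    """Apply a node relabeling (a bijection) to a set of undirected links."""
    return {tuple(sorted((mapping[u], mapping[v]))) for u, v in links}

def convert_links(initial, final, conversion, seed=0):
    """Replace `conversion` percent of the links found only in `initial`
    by the same number of links found only in `final`."""
    rng = random.Random(seed)
    only_initial = sorted(initial - final)   # sorted for reproducibility
    only_final = sorted(final - initial)
    n_swap = round(len(only_initial) * conversion / 100)
    n_swap = min(n_swap, len(only_final))
    removed = set(rng.sample(only_initial, n_swap))
    added = set(rng.sample(only_final, n_swap))
    return (initial - removed) | added

# Toy initial network and a final network obtained by relabeling its nodes
initial = {(0, 1), (2, 3), (1, 2)}
final = relabel(initial, {0: 2, 1: 0, 2: 3, 3: 1})
halfway = convert_links(initial, final, 50)
```

Because the final network is a relabeled copy of the initial one, both contain the same number of links, so the total number of links stays constant for every value of C.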

In our closed benchmarks, a peculiar symmetrical behavior of NMI values with respect to the initial and final partitions is expected. Imagine that a putative optimal partition is estimated according to a given criterion. Let us now consider the following triangle inequality: (NMI_{IE} + NMI_{EF})/2 ≤ (1 + NMI_{IF})/2 (2), where NMI_{IE} is the normalized mutual information calculated for the initial structure (I) and the estimated partition (E), NMI_{EF} is the normalized mutual information for the final structure (F) versus the estimated partition and NMI_{IF} is the normalized mutual information for the comparison between the initial and final structures. Inequality (2) holds true if the structures of I, F and E are identical (i.e. both the number and sizes of the communities are the same, although the nodes within each community are not necessarily the same). This follows from the fact that

Where VI_{XY} is the Variation of Information for both partitions

If, as indicated, the structures of all partitions are identical, then all their entropies are also identical. In that case, the following inequality can be deduced from formulae (3) and (4):
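This step can be reconstructed under two assumptions that the surviving text does not confirm: that NMI is the usual symmetric normalization of the mutual information, and that the relation invoked for VI is the triangle inequality it satisfies as a metric. The symbols I_{XY} (mutual information) and H (the common entropy of the partitions) are introduced here only for illustration.

```latex
\begin{align}
\mathrm{NMI}_{XY} &= \frac{2\,I_{XY}}{H_X + H_Y}, \qquad
\mathrm{VI}_{XY} = H_X + H_Y - 2\,I_{XY}
\;\Longrightarrow\;
\mathrm{NMI}_{XY} = 1 - \frac{\mathrm{VI}_{XY}}{H_X + H_Y} \\
\mathrm{VI}_{AB} + \mathrm{VI}_{BC} &\ge \mathrm{VI}_{AC} \\
H_A = H_B = H_C = H \;&\Longrightarrow\; \mathrm{VI}_{XY} = 2H\,\bigl(1 - \mathrm{NMI}_{XY}\bigr) \\
2H\,(1 - \mathrm{NMI}_{AB}) + 2H\,(1 - \mathrm{NMI}_{BC}) &\ge 2H\,(1 - \mathrm{NMI}_{AC}) \\
\mathrm{NMI}_{AB} + \mathrm{NMI}_{BC} &\le 1 + \mathrm{NMI}_{AC}
\end{align}
```

Dividing the last line by two gives the bound on the average of the two NMI values.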

From this inequality, and substituting A, B and C with I, E and F, respectively, formula (2) can be deduced. Formula (2) therefore means that, provided that I, E and F have the same structure, the average of NMI_{IE} and NMI_{EF} can reach a maximum value of (1+NMI_{IF})/2. Inequality (2) will also hold approximately true if the entropies of I, E and F are very similar (i.e. many identical communities). In our closed benchmarks the I and F structures are identical, and we progressively convert one into the other. It is thus expected that the optimal partition along this conversion is similar in structure to both I and F. Hence, deviations from the expected average value (1+NMI_{IF})/2 are a cause of concern, as they probably mean that the optimal partition has not been found. On the other hand, finding values equal to (1+NMI_{IF})/2 is a strong indication that the optimal partition has indeed been found.
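The bound can be checked numerically on small partitions. The sketch below assumes that NMI is the mutual information normalized by the mean of the two entropies, which may differ in detail from the normalization used in the original analyses. All function names are illustrative, and the code does not handle the degenerate case of a single community (zero entropy).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a partition given as a label list."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Mutual information between two partitions of the same units."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def nmi(x, y):
    """Assumed normalization: 2 I(X;Y) / (H(X) + H(Y))."""
    return 2 * mutual_information(x, y) / (entropy(x) + entropy(y))

def variation_of_information(x, y):
    return entropy(x) + entropy(y) - 2 * mutual_information(x, y)

# Three partitions with identical structure (two communities of four units)
I = [0, 0, 0, 0, 1, 1, 1, 1]   # initial structure
F = [0, 0, 1, 1, 0, 0, 1, 1]   # final structure, independent of I
E = [0, 0, 0, 1, 1, 1, 1, 0]   # an estimated partition

average = (nmi(I, E) + nmi(E, F)) / 2
bound = (1 + nmi(I, F)) / 2
```

Here `average` cannot exceed `bound`, since the three partitions have equal entropies. An estimated partition whose community sizes differ from those of I and F is not covered by this guarantee.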

It is worth noting that, although NMI has been commonly used in this field

Two of the three networks explored, known as

(1+NMI_{IF})/2 when Q is maximized.

S_{max} is again qualitatively better than that of Q_{max}, except when all communities are identical.

_{IF}/2 (blue line). Results are almost identical to those shown in S_{max} behavior is clearly better than Q_{max} behavior.

S_{max} is again qualitatively better than that of Q_{max}, confirming the results shown in

S_{max} and S_{orig} (i.e. the S_{max}>S_{orig}, meaning that the original structure is no longer the one present. In those cases, NMIs are expected to rapidly decrease, as indeed is observed.
