
Conceived and designed the experiments: IAK PC. Performed the experiments: IAK RP MSS. Analyzed the data: IAK RP MSS PC. Wrote the paper: IAK RP MSS PC.

The authors have declared that no competing interests exist.

Network communities help the functional organization and evolution of complex networks. However, developing a method that is both fast and accurate, and that provides modular overlaps and partitions of a heterogeneous network, has proven to be rather difficult.

Here we introduce the novel concept of ModuLand, an integrative method family determining overlapping network modules as hills of an influence function-based, centrality-type community landscape, and including several widely used modularization methods as special cases. As various adaptations of the method family, we developed several algorithms, which provide an efficient analysis of weighted and directed networks, and (1) determine pervasively overlapping modules with high resolution; (2) uncover a detailed hierarchical network structure allowing an efficient, zoom-in analysis of large networks; (3) allow the determination of key network nodes and (4) help to predict network dynamics.

The concept opens a wide range of possibilities to develop new approaches and applications including network routing, classification, comparison and prediction.

In real networks, module or community structure plays a central role in the understanding of topology and dynamics. Numerous module determination methods are based on the intuitive picture that identifies network communities as dense groups of the network, whose nodes have a much stronger influence on each other than on the rest of the network. Developing a method that translates this intuitive definition of modules into a practically applicable, fast, accurate and widely usable algorithm has turned out to be a very challenging problem. So far, a wide variety of ideas and powerful approaches based on very different physical or algorithmic grounds have been applied to solve this problem. At the moment there is no ‘best method’ available to find network modules, and even the widely used algorithms may suffer from serious problems (see Figure S1.1, and Tables S1.1 and S1.2 in the

In 2002 Girvan and Newman published a seminal paper

In this paper we introduce an integrative network module determination method family, called ModuLand (see

Keeping in mind the emerging needs for an integrative approach for the determination of network modules, we have developed the ModuLand method family (

Determination of influence functions: If a node lies in a module, then its influence on the links of the given module is typically larger than on more distant links of the network. As a first step, we determine the influence function, _{i}

Construction of a community landscape: The influence functions of different nodes in the same module are generally different. Nevertheless, a module is a set of nodes that mutually have a large influence on each other. In order to take this mutuality into account, we sum the influence function values of Step 1 over each link of the network:

Determination of hills of the community landscape: Modules are determined as the ‘mountains and hills’ of the community landscape of Step 2. We present two different approaches:

Modules are the connected components above a chosen centrality-threshold.

Modules are determined by local maxima of the community landscape and their surrounding region (Figures S1.4 and S1.5 in the

Determination of a hierarchy of higher level networks: We note that a higher level network of the modules of Step 3 can also be constructed, where each former module is a node of this higher level network. If the higher level networks are re-assessed with the ModuLand method again and again, a set of hierarchical layers of modules can be defined until the giant component of the whole network coalesces to a single node (Figure S1.6 in the

For this illustrative example we used the network science co-authorship network

In the following, we describe the four major steps of the ModuLand method in detail.

In principle, the determination of the influence functions (or indirect impact of a node or link) requires a network-dependent perturbation-flow simulation on the network (as an example, see our PerturLand algorithm in Section IV.2. in the

NodeLand algorithm: starting from a given node _{ij}_{s}

LinkLand algorithm: the LinkLand algorithm, used in our module determinations of the main text below, differs from the NodeLand algorithm in two points.

In the LinkLand algorithm the influence functions are assigned to starting links instead of starting nodes, thus initially

In contrast to the NodeLand algorithm, while calculating the influence function the weight of the starting link (

On undirected networks we prefer to use the LinkLand algorithm, which is found to provide an acceptable compromise between precision and speed. Identification of the influence function of a node or link in the case of NodeLand and LinkLand algorithms is structurally similar to a breadth-first search, therefore the worst-case runtime complexity of the two algorithms for all nodes or links is

As an example for the results obtained by the LinkLand algorithm,

In order to find the regions with nodes mutually having a relatively large influence on each other, we calculate the sum of the individual influence functions on a given network link resulting in the

Here we present two main approaches of hill-determination suitable for the determination of modules.

Centrality threshold based hills: as a natural choice, hills may be identified as the connected components of the community landscape above a given threshold. This approach results in distinct network modules without overlaps, as in the case of the widely used Girvan and Newman method

Local maxima based hills: in this method we start the identification of the modules by finding the module centers, which are identical with the hill-tops or local maxima of the community landscape, defined as follows:

Undirected networks: A hill-top of the community landscape contains all connected links having the same, locally maximal centrality value, while having all of their neighboring links with lower centrality values.

Directed networks: The definition of hill-tops is more complicated in directed networks, but we also show it here for clarity. Let the outbound links of a link

By this definition the number of local maxima automatically yields the number of modules, and at the same time all small and large modules are identified simultaneously. This is in strong contrast to the previously described threshold-based approach, which often needs special criteria to determine the threshold value.
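The contrast between the two hill definitions can be illustrated with a minimal sketch for undirected networks. It assumes that the community landscape is already available as a mapping from links to centrality values, and that two links are neighbors when they share an end node; the function names (`link_neighbors`, `threshold_hills`, `hill_tops`) and the toy landscape are hypothetical and are not part of the published algorithms.

```python
from collections import defaultdict

def link_neighbors(landscape):
    """Map each link to the links sharing an end node with it (assumed adjacency)."""
    by_node = defaultdict(set)
    for link in landscape:
        u, v = link
        by_node[u].add(link)
        by_node[v].add(link)
    return {l: (by_node[l[0]] | by_node[l[1]]) - {l} for l in landscape}

def connected_groups(links, neighbors):
    """Connected components of a set of links, using the link adjacency above."""
    links = set(links)
    seen, groups = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        group, stack = set(), [start]
        seen.add(start)
        while stack:
            link = stack.pop()
            group.add(link)
            for other in neighbors[link]:
                if other in links and other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(group)
    return groups

def threshold_hills(landscape, threshold):
    """Hills as connected components of links lying above a chosen threshold."""
    neighbors = link_neighbors(landscape)
    above = [l for l, height in landscape.items() if height > threshold]
    return connected_groups(above, neighbors)

def hill_tops(landscape):
    """Hill-tops: connected links of equal height whose neighboring links are all lower."""
    neighbors = link_neighbors(landscape)
    by_height = defaultdict(list)
    for link, height in landscape.items():
        by_height[height].append(link)
    tops = []
    for height, same in by_height.items():
        for plateau in connected_groups(same, neighbors):
            border = set().union(*(neighbors[l] for l in plateau)) - plateau
            if all(landscape[n] < height for n in border):
                tops.append(plateau)
    return tops

# Toy landscape on the path a-b-c-d-e-f-g (hypothetical centrality values).
landscape = {
    ("a", "b"): 1, ("b", "c"): 5, ("c", "d"): 2,
    ("d", "e"): 4, ("e", "f"): 4, ("f", "g"): 1,
}
print(hill_tops(landscape))             # one single-link top and one two-link plateau
print(threshold_hills(landscape, 1.5))  # a low threshold merges everything into one hill
print(threshold_hills(landscape, 3))    # a higher threshold separates the two hills
```

In this toy case the local maxima yield two module centers without any parameter, whereas the threshold-based result depends on the chosen threshold value.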

At this stage only the central links or plateaus of the modules have been identified. In the next step, the modules will be extended towards lower regions of the community landscape. We have developed several methods for this extension process detailed in Section V.2. in the

ProportionalHill method for the determination of network modules: here we present the algorithm of the ProportionalHill method for undirected networks, while the analogous directed version can be found in Section V.2.b. in the

If we need smaller or larger overlaps between the modules than those obtained with the ProportionalHill method, we may use the GradientHill or TotalHill methods, respectively, as described in Section V.2. in the

Optionally, a higher level hierarchical representation of the network can also be created, where the nodes of the higher level correspond to the modules of the original network, and the links of the higher level correspond to the overlaps between the respective modules (

In the description of the calculation of the higher hierarchical level let us consider here the undirected case only (the directed case is described in Section VII. in the _{ij}(n)_{ij}(n)_{i}(n)_{j}(n)

The steps leading to a higher level hierarchical representation can be applied repetitively until the giant component of the whole original network is represented by a single node allowing a fast, zoom-in type analysis of large networks (Section VII. in the
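The construction of one higher hierarchical level can be sketched as follows, under a loudly stated assumption: the overlap between two modules is approximated here by summing, over all nodes, the smaller of the node's two module membership values. This overlap measure is only a stand-in for illustration and is not the formula of the paper; the function name `higher_level_network` and the example memberships are likewise hypothetical.

```python
from itertools import combinations

def higher_level_network(node_memberships):
    """Return {(module_a, module_b): overlap_weight} for every overlapping module pair.

    Each module becomes a node of the higher level; each returned key is a
    higher-level link. The overlap measure is an illustrative assumption.
    """
    weights = {}
    for memberships in node_memberships.values():
        for a, b in combinations(sorted(memberships), 2):
            # Assumed contribution of a node: the smaller of its two membership values.
            shared = min(memberships[a], memberships[b])
            if shared > 0:
                weights[(a, b)] = weights.get((a, b), 0.0) + shared
    return weights

# Hypothetical node-level module memberships.
node_memberships = {
    "a": {"M1": 2.0},
    "b": {"M1": 1.5, "M2": 0.5},
    "c": {"M1": 1.0, "M2": 1.0},
    "d": {"M2": 2.0, "M3": 0.25},
    "e": {"M3": 2.0},
}
print(higher_level_network(node_memberships))
```

Re-applying a modularization to the resulting weighted network, and repeating the construction, would give the successive hierarchical layers described above.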

A simple case illustrating this scenario can be seen on Figure S1.6 in the

The ModuLand method family, even with its simplest NodeLand influence function calculation method correctly identified the observed split of the gold-standard Zachary karate club network

Application of the LinkLand influence function calculation method to the University of South Florida word association network

The application of the ModuLand method on the benchmark graphs of Lancichinetti et al.

To obtain a more detailed picture we directly compared the method-pair of NodeLand, or LinkLand influence function calculation algorithm and the ProportionalHill hillfinder method with the InfoMap method

Panel A: Comparison of the identified modules with the modules of the benchmark graph of Lancichinetti et al. _{max} = 50, the average degree was K = 15 and the network fuzziness μ of the x-axis of Panel A) was ranging from 0.1 to 0.85, where μ>0.5 means that the modules are no longer defined in the strong sense. Higher normalized mutual information (shown on the y-axis) represents a better recovery of the original modules. The panel shows the averaged results of 50 representations. Panel B: comparison of module assignment of the cAMP-dependent protein kinase family in the yeast protein-protein interaction network. The panel shows the modular assignment of the 3 catalytic and the regulatory subunit of the yeast cAMP-dependent protein kinase together with that of their first neighbors in the high fidelity protein-protein interaction network of Ekman et al.

Benchmark graphs have been criticized recently due to their limited capacity to reflect the complexity of real-world networks

In conclusion, both i.) the comparison of ModuLand-derived modules with those obtained by other methods and ii.) the experimental data of the literature showed that the pervasive overlaps of the ModuLand method give an adequate representation of the functional multiplicity of protein-protein interaction network nodes. It is important to note that, in contrast to the other methods tested, the ModuLand method gives this rich background of functional information at the single node level as opposed to the subnetwork level of other methods. Moreover, the ModuLand-based, different modular assignment strengths of related nodes (such as those of the 3 cAMP-kinase catalytic subunits; Table S1.4 in the

Extending the analysis of the gold-standard Zachary karate club network, we examined the much larger University of South Florida word association network having 10,617 nodes and 63,788 links

Modules of the University of South Florida word association network

The modular hierarchy of the high school friendship Community-44 of the Add-Health dataset

We have determined the modular structure of Community-44 of the Add Health survey

To test whether the ModuLand method family can identify key network nodes, we calculated the change of network integrity

The figure shows the decreasing integrity of the USA Western Power Grid network

Discrimination of date- and party-hubs of protein interaction networks, i.e. proteins sequentially or simultaneously interacting with a large number of neighbors, is a rather difficult task

Overlapping modules of the yeast protein-protein interaction network of Ekman et al.

Having shown the utility of the ModuLand method family for determining overlapping modules in a variety of model and real-world networks, in this section we summarize the characteristics of the method family. In principle, both the calculation of the influence functions and the determination of the community landscape hills are demanding problems, requiring specific solutions depending on the precise nature of the analyzed network. However, by constructing the community landscape, the small details of the influence functions get averaged out; therefore, in practical cases, fast and approximate solutions of these problems become possible and sufficient. This is the reason why rather simple influence function calculation methods (like the NodeLand algorithm) perform well on various kinds of real-world networks. On the other hand, the module membership value of any given node is obtained as the sum of the module membership values of the links of the given node, thus the small details of the hill determination step also get averaged out. The summation of the link module membership values provides an overlapping modularization of the nodes even in the absence of an overlapping modularization of the links themselves. (A similar situation is described in ref.
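The summation of link module membership values into node module membership values can be illustrated with a minimal sketch. It assumes that each link already carries its membership value for every module it belongs to; the function name `node_memberships` and the toy data are hypothetical, and any normalization used in the actual implementation is not reproduced here.

```python
from collections import defaultdict

def node_memberships(link_memberships):
    """Sum, per module, the membership values of all links attached to each node."""
    nodes = defaultdict(lambda: defaultdict(float))
    for (u, v), modules in link_memberships.items():
        for module, value in modules.items():
            nodes[u][module] += value
            nodes[v][module] += value
    return {node: dict(modules) for node, modules in nodes.items()}

# Toy example: every link belongs to exactly one module (no overlap at the link level).
link_memberships = {
    ("a", "b"): {"M1": 1.0},
    ("b", "c"): {"M1": 1.0},
    ("c", "d"): {"M2": 1.0},
    ("d", "e"): {"M2": 1.0},
}
memberships = node_memberships(link_memberships)
# Node "c" touches one M1 link and one M2 link, so it belongs to both modules.
print(memberships["c"])
```

Although no link is shared between the two modules in this example, node "c" obtains a non-zero membership in both of them, which is the node-level overlap described above.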

Several widely used efficient network modularization methods

As an important example for the first case, Bagrow and Bollt

As examples for the second case, namely, for the direct identification of the community landscape in previous methods, we briefly summarize the previously described network landscapes. Previous network landscape construction methods used clustering coefficients

New modularization methods can easily be generated by taking an existing ModuLand modularization protocol, and changing any of its influence function calculation, community landscape generation, or hill determination methods. Additionally, former methods yielding non-overlapping modules (which can be interpreted as the application of the threshold-based hill determination method) can be upgraded to overlapping modularization methods using the local maxima-based module determination approach of the ModuLand method family (for details see Section IV.4. in the

Enriching the binary, yes/no module membership assignment of many previous methods, the ModuLand method family gives a continuous scale for the association of each link and node to all modules (Figure S1.7 in the

In the ModuLand approach we divided the very challenging problem of module determination into two likewise hard subproblems: (1) the determination of the influence functions; and (2) the determination of the hills of the resulting community landscape. Although in most cases a relatively fast, approximate treatment of these subproblems provides sufficiently fine modularizations in the end, in the following section we give a brief guide to selecting the optimal algorithms for these subproblems.

As we mentioned earlier, the determination of the influence functions requires a network-dependent perturbation-flow simulation on the network. However, we saw that the details of the influence functions usually average out during the community landscape construction, which justifies the use of less specific, faster approximations. We prefer to use the LinkLand algorithm on undirected networks, which is found to provide an acceptable compromise between precision and speed. However, on directed networks we suggest using the PerturLand algorithm (see Section IV.2. in the

For the hill determination on the community landscape we presented two main approaches (for other possibilities see Section V. in the

We note that although the local maxima-based approaches we described in this paper (including the ProportionalHill method suggested above) outperform the traditional threshold-based approach in terms of overcoming the giant-component problem and producing continuously overlapping modules, they nevertheless have their own drawbacks. When applying the local maxima-based approach on a ‘noisy’ community landscape, each local maximum will result in a new (and possibly highly overlapping) module. Therefore we routinely applied a simple, yet effective post-processing step for merging the groups of extremely overlapping modules (having a correlation higher than 90%) (see Section VI. in the
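A merging step of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: the correlation is computed as the Pearson correlation of the module membership vectors over all nodes, modules are grouped transitively whenever a pair exceeds the limit, and the merged module is the element-wise sum of the grouped vectors. The names `pearson` and `merge_overlapping` and the example vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long membership vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def merge_overlapping(modules, limit=0.9):
    """Merge groups of modules whose membership vectors correlate above `limit`."""
    names = list(modules)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(modules[a], modules[b]) > limit:
                parent[find(b)] = find(a)

    merged = {}
    for name in names:
        total = merged.setdefault(find(name), [0.0] * len(modules[name]))
        for k, value in enumerate(modules[name]):
            total[k] += value  # assumed merge rule: element-wise sum
    return merged

# Hypothetical membership vectors over six nodes; A and B are nearly identical.
modules = {
    "A": [5, 4, 3, 0, 0, 0],
    "B": [5, 4, 2.5, 0, 0, 0.1],
    "C": [0, 0, 1, 4, 5, 4],
}
print(merge_overlapping(modules))
```

The 0.9 default mirrors the 90% limit quoted above; other grouping or merging rules would be equally compatible with the description in the text.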

To summarize, the hill-finding approach, which is the second phase of the ModuLand methods, gives an additional layer of flexibility, where the relatively inaccurate results of simpler hill definitions, and the large computational costs of more accurate optimization processes can be tailored to the network and to the experimenter's needs and possibilities.

The ModuLand method family we introduced in this paper and in part in an earlier patent application

The extensive and rich overlaps, network hierarchy, as well as the novel centrality and bridgeness measures uncovered by the ModuLand method can be used for the identification of long-range, stabilizing weak links, for the determination of the recently described creative, trend-setting nodes governing network development and evolution

The giant component of the undirected, un-weighted network science co-authorship network contained 379 nodes and 914 links

The weighted and undirected social network of a karate club has been reported by W. Zachary

The giant component of Appendix A of the University of South Florida word association network (

The giant component of the high school friendship Community-44 of the Add-Health database (

The un-weighted and undirected network of the USA Western Power Grid

The giant component of the un-weighted and undirected yeast protein-protein interaction network

In this Electronic Supplementary Material S1 we give a detailed description of the ModuLand network module determination method family. This integrative method is based on the construction of community landscapes from influence functions. In Section IV. we describe three versions of the influence function calculation algorithms, the NodeLand, LinkLand and PerturLand algorithms, in detail. As the next step, the combination of influence functions into a community landscape is shown. We demonstrate the wide applicability of the ModuLand method to accommodate previous community detection methods in the examples of the BetweennessCentralityLand (BCLand) and CliqueLand community landscape determination methods resulting in distinct and overlapping network modules, respectively. In Section V. we show the local maxima-based identification of modules as hills of the community landscape. The module membership of network nodes and links is calculated using one of the developed module membership assignment methods, such as the GradientHill, ProportionalHill or TotalHill methods yielding modules of minimal, fair or detailed overlaps, respectively. In Sections VII. and VIII. we also show that the ModuLand method family enables a hierarchical analysis of network topology and the construction of a zoom-in network visualization method. Besides the detailed description of the ModuLand method, the Electronic Supplementary Material S1 also contains 14 Supplementary Figures and their Supplementary Discussion, as well as a detailed summary of 18 module definitions, 129 different modularization methods, 13 module comparison methods as 5 Supplementary Tables and 396 references.

(4.88 MB PDF)

We thank Gábor Szuromi and Balázs Zalányi for help in the analysis of networks, members of the LINK-group (^{th} June 2005 giving us the starting encouragement to work on the ideas of this paper and for his continuous suggestions.