
The authors have declared that no competing interests exist.

Despite the fact that many important problems (including clustering) can be described using hypergraphs, theoretical foundations as well as practical algorithms using hypergraphs are not well developed yet. In this paper, we propose a hypergraph modularity function that generalizes its well-established and widely used graph counterpart, a measure of how clustered a network is. In order to define it properly, we generalize the Chung-Lu model for graphs to hypergraphs. We then provide the theoretical foundations to search for an optimal solution with respect to our hypergraph modularity function. A simple heuristic algorithm is described and applied to a few illustrative examples. We show that using a strict version of our proposed modularity function often leads to a solution where a smaller number of hyperedges get cut, as compared to optimizing the modularity of the 2-section graph of a hypergraph.

An important property of complex networks is their community structure, that is, the organization of vertices in clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters [

Yet another example could be financial markets, where several groups of financial instruments might be correlated with each other in several different groups. Such groups can be represented as hyperedges, and hence the detection of communities in such a hypergraph could lead to a better understanding of the dependencies between financial instruments.

Hypergraphs can also be used to model transportation systems. For example in [

Being able to identify communities in a network could help us exploit this network more effectively. For example, clusters in citation graphs may help to find similar scientific papers; discovering users with similar interests is important for targeted advertisement; clustering can also be used for network compression and visualization.

The key ingredient for many clustering algorithms is

A myriad of problems can be described in hypergraph terms; however, despite being formally defined in the 1960s (and various realizations studied long before that), hypergraph theory is patchy and often not sufficiently general. The result is a lack of machinery for investigating hypergraphs, leading researchers and practitioners to create the 2-section graph of a hypergraph of interest [

The paper is organized as follows. In Section 2, we review the Chung-Lu model for graphs and its link to the modularity function. We then propose a generalization of the Chung-Lu model for hypergraphs, as well as a hypergraph modularity function. In Section 3, we provide the framework to develop algorithms using our hypergraph modularity function. We propose a hypergraph partitioning algorithm and a few illustrative examples in Section 4. This is a new measure we are proposing, and there is plenty of future work to do, which we summarize in Section 5. Additionally, we made the source code available online [

In this section, we recall the definition of the modularity function for graphs, and we propose its generalization for hypergraphs. Throughout the paper we will use

Let $G = (V, E)$ be a graph, where $V = \{v_1, \dots, v_n\}$ are the vertices and the edges $E$ are multisets of $V$ of cardinality 2 (loops are allowed). The degree of a vertex $v$ is denoted $\deg_G(v)$. For $A \subseteq V$, we define the volume $\mathrm{vol}_G(A) = \sum_{v \in A} \deg_G(v)$; in particular, $\mathrm{vol}_G(V) = \sum_{v \in V} \deg_G(v) = 2|E|$.

We define $\mathcal{G}(\mathbf{w})$, the Chung-Lu random graph with expected degree sequence $\mathbf{w} = (\deg_G(v_1), \dots, \deg_G(v_n))$, as follows: each pair $\{v_i, v_j\}$, $v_i, v_j \in V$, is independently sampled as an edge with probability
$$P(v_i, v_j) = \frac{\deg_G(v_i)\deg_G(v_j)}{2|E|} \quad (i \neq j), \qquad P(v_i, v_i) = \frac{\deg_G(v_i)^2}{4|E|}.$$
(It is possible that $P(v_i, v_j)$ is greater than one, and so it should really be regarded as the expected number of edges between $v_i$ and $v_j$; alternatively, one can introduce a Poisson-distributed number of edges with mean $P(v_i, v_j)$ between each pair of vertices. Requiring $\max_i \deg_G(v_i)^2 \leq 2|E|$ ensures that $P(v_i, v_j) \leq 1$ for all pairs.)
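For illustration, the expected edge counts of the Chung-Lu null model can be computed directly from an edge list. The sketch below is ours, not part of the original implementation, and assumes a simple pair-list representation of a multigraph:

```python
from collections import Counter

def chung_lu_expected_edges(edges):
    """Expected number of edges between each pair under the Chung-Lu
    null model: deg(u)*deg(v)/(2|E|) for u != v, deg(v)^2/(4|E|) for loops."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    two_m = 2 * len(edges)
    verts = sorted(deg)
    expected = {}
    for i, u in enumerate(verts):
        for v in verts[i:]:
            if u == v:
                # loop: deg(u)^2 / (4|E|)
                expected[(u, v)] = deg[u] ** 2 / (2 * two_m)
            else:
                expected[(u, v)] = deg[u] * deg[v] / two_m
    return expected

# A triangle: every vertex has degree 2, |E| = 3.
exp = chung_lu_expected_edges([(1, 2), (2, 3), (1, 3)])
```

A quick sanity check: summing the expectations over all pairs (including loops) returns $|E|$, so the model preserves the number of edges, and hence the degrees, in expectation.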

This model is a function of the degree sequence of $G$.

The definition of modularity for graphs was first introduced by Newman and Girvan in [

For a graph $G = (V, E)$ and a partition $\mathbf{A} = \{A_1, \dots, A_k\}$ of $V$, the modularity function is defined as
$$q_G(\mathbf{A}) = \sum_{A_i \in \mathbf{A}} \frac{e_G(A_i)}{|E|} - \sum_{A_i \in \mathbf{A}} \left(\frac{\mathrm{vol}_G(A_i)}{\mathrm{vol}_G(V)}\right)^2,$$
where $e_G(A_i) = |\{\{v_j, v_k\} \in E : v_j, v_k \in A_i\}|$ is the number of edges in the subgraph of $G$ induced by $A_i$. The modularity measures the deviation of the number of edges within the parts $A_i$ from the corresponding expected value under the Chung-Lu null model with the same degree sequence: the first term, the edge contribution, is the fraction of edges that fall within some part, while the second term, the degree tax, is the expected value of that fraction under the null model. Note that $q_G(\mathbf{A}) \leq 1$; if $\mathbf{A} = \{V\}$, then $q_G(\mathbf{A}) = 0$, and if $\mathbf{A} = \{\{v_1\}, \dots, \{v_n\}\}$, then $q_G(\mathbf{A}) = -\sum_{v \in V} (\deg_G(v)/\mathrm{vol}_G(V))^2 < 0$ (assuming $G$ has no loops).

The maximum modularity of a graph $G$ is defined as $q^*(G) = \max_{\mathbf{A}} q_G(\mathbf{A})$, where the maximum is taken over all partitions $\mathbf{A}$ of $V$. Modularity-based graph clustering aims to find a partition achieving, or approximating, $q^*(G)$.
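To make the definition concrete, here is a small self-contained sketch (our illustration, not the authors' implementation) that evaluates $q_G(\mathbf{A})$ on an edge list, with a partition given as a list of vertex sets:

```python
from collections import Counter

def graph_modularity(edges, parts):
    """q_G(A) = sum_i [ e(A_i)/|E| - (vol(A_i)/vol(V))^2 ]."""
    m = len(edges)
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    vol_V = 2 * m
    q = 0.0
    for part in parts:
        # edges with both endpoints inside the part
        e_in = sum(1 for u, v in edges if u in part and v in part)
        vol = sum(deg[v] for v in part)
        q += e_in / m - (vol / vol_V) ** 2
    return q

# Two triangles joined by one edge: a natural two-cluster structure.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
q = graph_modularity(edges, [{1, 2, 3}, {4, 5, 6}])  # 5/14, about 0.357
```

The single-part partition $\{V\}$ gives modularity 0 for this graph, as the definition predicts.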

Consider a hypergraph $H = (V, E)$ with $V = \{v_1, \dots, v_n\}$, where the hyperedges $e \in E$ are multisets of $V$ of arbitrary cardinality. For $v \in V$ and $e \in E$, let $\deg_e(v)$ denote the number of times $v$ appears in $e$; the degree of $v$ is then $\deg_H(v) = \sum_{e \in E} \deg_e(v)$. When the edge $e$ is clear from the context, we will simply write $d_i$ to denote $\deg_e(v_i)$.

A hypergraph is said to be $d$-uniform if all of its hyperedges have size $d$. Any hypergraph $H$ can be expressed as the disjoint union of $d$-uniform hypergraphs $H_d$, where $H_d = (V, E_d)$ and $E_d \subseteq E$ is the set of hyperedges of size $d$. As for graphs, for $A \subseteq V$ we define $\mathrm{vol}_H(A) = \sum_{v \in A} \deg_H(v)$.

Similarly to what we did for graphs, we define a random null model on hypergraphs, which preserves the expected degree sequence and generates, for each $d$, exactly $|E_d|$ hyperedges of size $d$ with probability $P_H(e)$ defined below, where $E_d$ denotes the edges of $H$ of size $d$.

Let $F_d$ be the family of multisets of size $d$ with elements from $V$. For each $d$ with $|E_d| > 0$, the probability of generating the edge $e \in F_d$ is given by:
$$P_H(e) = \frac{d!}{\prod_{v_i \in e} d_i!} \prod_{v_i \in e} \left(\frac{\deg_H(v_i)}{\mathrm{vol}_H(V)}\right)^{d_i}.$$
(Recall that $d_i = \deg_e(v_i)$.) Equivalently, each edge of size $d$ is formed by $d$ independent draws in which vertex $v_i$ is selected with probability $\deg_H(v_i)/\mathrm{vol}_H(V)$. We repeat this process independently $|E_d|$ times, so the expected number of times edge $e$ is generated equals $P_H(e) \cdot |E_d|$. Finally, as with the graph Chung-Lu model, if such quantities exceed one they should be regarded as expected edge multiplicities rather than probabilities.
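The edge probability above is straightforward to evaluate. The following sketch (our illustration, assuming an edge is represented as a list of vertices with repetition) computes $P_H(e)$ from the multinomial coefficient and the vertex degrees:

```python
from collections import Counter
from math import factorial, prod

def hyperedge_probability(e, deg_H, vol_V):
    """P_H(e) for a multiset edge e of size d: the multinomial coefficient
    d!/prod(d_i!) times prod((deg_H(v_i)/vol_V)^{d_i})."""
    mult = Counter(e)          # d_i = number of times v_i appears in e
    d = len(e)
    coeff = factorial(d) // prod(factorial(di) for di in mult.values())
    return coeff * prod((deg_H[v] / vol_V) ** di for v, di in mult.items())

# Toy degree sequence: deg(a) = 1, deg(b) = 3, vol(V) = 4.
degs = {"a": 1, "b": 3}
p_ab = hyperedge_probability(["a", "b"], degs, 4)   # 2 * (1/4) * (3/4) = 3/8
```

By the multinomial theorem, the probabilities $P_H(e)$ summed over all multisets $e \in F_d$ equal one, so this is indeed a probability distribution on $F_d$.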

In order to compute the expected degree of $v_i \in V$ under this model, note that each of the $d$ draws of an edge of size $d$ selects $v_i$ with probability $\deg_H(v_i)/\mathrm{vol}_H(V)$. Summing the contributions over all edges, we get:
$$\mathbb{E}[\deg(v_i)] = \sum_{d \geq 2} d\,|E_d| \cdot \frac{\deg_H(v_i)}{\mathrm{vol}_H(V)} = \deg_H(v_i),$$
since $\mathrm{vol}_H(V) = \sum_{d \geq 2} d\,|E_d|$.

We will use the generalization of the Chung-Lu model to hypergraphs as a null model allowing us to define hypergraph modularity.

Consider a hypergraph $H = (V, E)$ and $\mathbf{A} = \{A_1, \dots, A_k\}$, a partition of $V$. In the graph case, an edge contributes to the modularity of a part only if both of its endpoints belong to that part. For hyperedges, there are several natural ways to generalize this condition:

(a) all vertices of an edge have to belong to one of the parts (clusters) to contribute; this is a strict definition;

(b) the majority of vertices of an edge belong to one of the parts;

(c) at least 2 vertices of an edge belong to the same part; this is implicitly used when we replace a hypergraph with its 2-section graph representation.

We see that the choice of a hypergraph modularity function is not unique; in fact, it depends on how strongly we believe that a hyperedge is an indicator that the vertices belonging to it fall into one community. More importantly, one needs to decide how often vertices in one community “blend” together with vertices from other communities; that is, how hermetic the communities are. In particular, option (c) is the least restrictive one and essentially corresponds to measuring modularity on the 2-section graph; in the remainder of this section, we focus on the strict definition, option (a).

In this case, the edge contribution for $A_i \subseteq V$ is $e_H(A_i) = |\{e \in E : e \subseteq A_i\}|$, the number of hyperedges entirely contained in $A_i$. Recalling that $F_d$ is the family of multisets of size $d$ with elements from $V$, the corresponding degree tax follows from the null model: the probability that all $d$ draws of a size-$d$ edge fall in $A_i$ is $(\mathrm{vol}_H(A_i)/\mathrm{vol}_H(V))^d$. The strict hypergraph modularity is therefore
$$q_H(\mathbf{A}) = \sum_{A_i \in \mathbf{A}} \frac{e_H(A_i)}{|E|} - \sum_{A_i \in \mathbf{A}} \sum_{d \geq 2} \frac{|E_d|}{|E|} \left(\frac{\mathrm{vol}_H(A_i)}{\mathrm{vol}_H(V)}\right)^d.$$
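The strict modularity is simple to evaluate on small instances. The sketch below is our illustration (not the paper's Julia code), assuming hyperedges are given as lists of vertices and a partition as a list of vertex sets:

```python
from collections import Counter

def strict_hypergraph_modularity(edges, parts):
    """Strict modularity: q_H(A) = sum_i e_H(A_i)/|E|
       - sum_i sum_d (|E_d|/|E|) * (vol(A_i)/vol(V))^d,
    where e_H(A_i) counts edges entirely inside A_i."""
    m = len(edges)
    deg = Counter(v for e in edges for v in e)     # deg_H(v)
    vol_V = sum(len(e) for e in edges)             # sum_d d*|E_d|
    sizes = Counter(len(e) for e in edges)         # |E_d| per size d
    q = 0.0
    for part in parts:
        e_in = sum(1 for e in edges if set(e) <= part)
        vol = sum(deg[v] for v in part)
        tax = sum(cnt * (vol / vol_V) ** d for d, cnt in sizes.items())
        q += e_in / m - tax / m
    return q

# Two size-3 "communities" plus one crossing 2-edge.
edges = [[1, 2, 3], [4, 5, 6], [1, 4]]
q = strict_hypergraph_modularity(edges, [{1, 2, 3}, {4, 5, 6}])  # 1/3
```

On this toy example, the two-part partition keeps both 3-edges intact and cuts only the crossing 2-edge, giving $q_H = 1/3$, while the one-part partition gives 0.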

As with graphs, one can easily generalize the modularity function to allow for weighted hyperedges. As we already mentioned, we focused on the strict definition of modularity, but it is straightforward to adjust the degree tax to many natural definitions of edge contribution. In particular, for the majority definition (see option (b) at the beginning of this section), one can simply replace $(\mathrm{vol}_H(A_i)/\mathrm{vol}_H(V))^d$ in the formula above with the probability that a majority of the $d$ sampled vertices fall into $A_i$. Alternatively, normalizing the contribution of each edge size by $|E_d|$, we get the following degree-independent modularity function:
$$\tilde{q}_H(\mathbf{A}) = \sum_{A_i \in \mathbf{A}} \sum_{d \geq 2} \left[ \frac{e_{H_d}(A_i)}{|E_d|} - \left(\frac{\mathrm{vol}_H(A_i)}{\mathrm{vol}_H(V)}\right)^d \right],$$
where the inner sum is taken over each $d$ with $|E_d| > 0$.

In this section, we show that the solution that maximizes the strict modularity function can be found within a restricted family of partitions, namely those induced by subsets of edges, whose size is at most $2^{|E|}$, thus avoiding the search over the full set of all partitions of $V$.

Let $H = (V, E)$ be a hypergraph. For $S \subseteq E$, let $\mathbf{P}(S)$ be the partition of $V$ whose parts are the connected components of the sub-hypergraph $(V, S)$ (vertices not covered by $S$ form singletons). Writing $S_1 \sim_p S_2$ whenever $\mathbf{P}(S_1) = \mathbf{P}(S_2)$, and noting that $\sim_p$ is an equivalence relation (based on equality of the induced partitions), we can define the corresponding quotient set of $2^E$. The number of equivalence classes is trivially at most $2^{|E|}$; however, it is typically much smaller than this trivial upper bound.
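The induced partition $\mathbf{P}(S)$ is just the set of connected components of the sub-hypergraph $(V, S)$, which a standard union-find computes directly; a short sketch of ours, for illustration:

```python
def partition_from_edges(vertices, S):
    """P(S): the partition of `vertices` into connected components of the
    sub-hypergraph (V, S); vertices in no edge of S become singletons."""
    parent = {v: v for v in vertices}

    def find(v):                       # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for e in S:                        # union all vertices of each hyperedge
        it = iter(e)
        r = find(next(it))
        for v in it:
            parent[find(v)] = r
    comps = {}
    for v in vertices:
        comps.setdefault(find(v), set()).add(v)
    return sorted(comps.values(), key=min)

# S = {{1,2,3}, {4,5}} over V = {1,...,6} induces {1,2,3}, {4,5}, {6}.
parts = partition_from_edges(range(1, 7), [[1, 2, 3], [4, 5]])
```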

Now, let us define the canonical representative of an equivalence class. For $S_1, S_2 \in [S]$, one can verify that the union $S_1 \cup S_2$ is also in $[S]$; hence each class $[S]$ contains a unique maximal element,
$$S^* = \bigcup_{S' \in [S]} S',$$
which we call the canonical representative of $[S]$, with induced partition $\mathbf{P}(S^*) = \{P_1, \dots, P_k\}$.

The set of canonical representatives, the image of the map $S \mapsto S^*$, is exactly the family of candidates we will need to consider.

The next lemma shows how the degree tax behaves under partition refinement.

Let $\mathbf{B}$ be a refinement of $\mathbf{A} = \{A_1, \dots, A_k\}$. Since $\mathbf{B}$ refines $\mathbf{A}$, for each $A_i$ there exists $\mathbf{B}_i$, a subset of parts of $\mathbf{B}$, that partitions $A_i$. Hence, for each $i$ and for each $d \geq 2$,
$$\left(\sum_{B \in \mathbf{B}_i} \mathrm{vol}_H(B)\right)^d \;\geq\; \sum_{B \in \mathbf{B}_i} \mathrm{vol}_H(B)^d,$$
so the degree tax of $\mathbf{B}$ is at most that of $\mathbf{A}$, with equality if and only if $|\mathbf{B}_i| = 1$ for all $i$.

The next result, the main result of this section, shows that one can restrict the search space to the canonical representatives from the quotient set defined above.

For any hypergraph $H$ and the strict modularity function $q_H(\cdot)$, there exists a partition maximizing $q_H(\cdot)$ that is induced by a canonical representative; that is, $\max_{\mathbf{A}} q_H(\mathbf{A}) = \max_{S^*} q_H(\mathbf{P}(S^*))$.

Suppose $\mathbf{A} = \{A_1, \dots, A_k\}$ maximizes the strict modularity function $q_H(\cdot)$. We will show that there exists $S \subseteq E$ such that $q_H(\mathbf{P}(S)) \geq q_H(\mathbf{A})$. Take $S = \{e \in E : e \subseteq A_i \text{ for some } i\}$, the set of edges entirely contained in some part of $\mathbf{A}$. Then $\mathbf{P}(S)$ is a refinement of $\mathbf{A}$ with the same edge contribution, and by the previous lemma its degree tax is at most that of $\mathbf{A}$; hence $q_H(\mathbf{P}(S)) \geq q_H(\mathbf{A})$. Passing to the canonical representative $S^*$ of $[S]$ does not change the induced partition, which completes the proof.

In this section, we first illustrate the correlation between our hypergraph modularity and the Hcut measure, which counts the number of edges touching more than one part (cluster). We then propose an algorithm for hypergraph partitioning based on our hypergraph modularity function, which we apply to a real dataset. However, let us stress again that the aim of this paper is to introduce a generalization of modularity to hypergraphs, not to introduce new algorithms for actually finding good partitions. We are currently working on designing and testing such algorithms and, after that, we plan to run extensive experiments; the results will be included in a forthcoming paper.

We generate hyperedges from synthetic points lying on lines in $\mathbb{R}^2$, 30 points per line, each point perturbed with Gaussian noise $N(0, \sigma^2)$, where $\sigma$ is a fixed noise parameter.

We build hyperedges of size 3 (3-edges) by sampling sets of 3 points and keeping only those for which an alignment criterion falls below a threshold of 0.02. This amounts to selecting 3-edges consisting of sets of 3 well-aligned points. We do the same with sets of 4 points to generate the 4-edges.
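One concrete way to implement such an alignment criterion, under our assumption that it can be taken as the area of the triangle spanned by the three points (the original criterion is not fully specified here), is:

```python
import random

def triangle_area(p, q, r):
    """Area of the triangle spanned by three points in R^2 (shoelace formula)."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2

def sample_3_edges(points, n_edges, threshold=0.02, rng=random):
    """Sample triples of point indices, keeping only well-aligned ones
    (triangle area below `threshold`, the value used in the text)."""
    edges = []
    while len(edges) < n_edges:
        i, j, k = rng.sample(range(len(points)), 3)
        if triangle_area(points[i], points[j], points[k]) < threshold:
            edges.append((i, j, k))
    return edges
```

Collinear points span a degenerate triangle of area zero, so noiseless points on a common line always pass the test, while triples mixing distant lines are usually rejected.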

The hyperedges can either consist of points all coming from the same line (which we call “signal”) or not (which we call “noise”). We sample hyperedges so that the expected proportion of signal vs. noise is 2:1, and we consider 3 different regimes for the mix of edge sizes: (i) 75% 3-edges, (ii) 75% 4-edges, or (iii) a balanced mix of 3- and 4-edges. For each of the 3 regimes, we generate 100 hypergraphs and, for each hypergraph, we apply the fast Louvain clustering algorithm (see [

In the left plot of the figure, the fit is essentially flat, with an R² value of 0.0008 (for a majority of 4-edges the slope is 0.0768 and R² is 0.0734, while for a majority of 3-edges the slope is 0.0061 and R² is 0.0004).

Comparing graph and hypergraph modularity.

In the right plot of the figure, the fit is strongly negative, with an R² value of 0.9693 (for a majority of 4-edges the slope is −0.7079 and R² is 0.9696, while for a majority of 3-edges the slope is −0.7079 and R² is 0.9696). This illustrates the fact that when we measure our proposed hypergraph modularity for different partitions, we are favouring keeping hyperedges within the same parts (clusters).

While we can compute the modularity of very small hypergraphs by exhaustively searching over all possible partitions, this is generally not a viable option. We propose a generalization to hypergraphs of the CNM algorithm for graph partitioning [

Our proposed algorithm for hypergraphs is presented in Algorithm 1. The idea is that we start with the partition in which each node is in its own part. Then, in each step, we loop through every hyperedge touching two or more parts and select the one which, when we merge all the parts it touches, yields the best modularity (in a simplified stochastic variant, we instead just randomly choose a single hyperedge), provided the resulting modularity is at least as high as the modularity from the previous step. We stop when no such edge exists. We use this algorithm in the next example.

Input: a hypergraph H = (V, E)
Output: A_opt, a partition of V, and the corresponding modularity q_opt

1  initialize A_opt as the partition with every vertex in its own part, and q_opt as the corresponding modularity;
2  repeat
3      if the stochastic variant is used then
4          randomly select a single hyperedge e* touching two or more parts of A_opt;
5          compute the partition A_{e*} obtained by merging all parts in A_opt touched by e*, and the corresponding modularity q_{e*};
6      else
7          foreach hyperedge e touching two or more parts of A_opt do
8              compute the partition A_e obtained by merging all parts in A_opt touched by e;
9              compute the corresponding modularity q_e;
10         end
11         select the edge e* maximizing q_e, with partition A_{e*} and modularity q_{e*};
12     end
13     if q_{e*} ≥ q_opt then
14         set A_opt = A_{e*} and q_opt = q_{e*};
15         update the list of hyperedges touching two or more parts of A_opt;
16     end
17 until q_{e*} < q_opt;
18 output: A_opt and q_opt
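The stochastic variant of Algorithm 1 can be sketched in a few dozen lines of Python. This is our simplified illustration (not the paper's Julia implementation): it accepts the first non-decreasing merge it finds rather than the best one, and recomputes modularity from scratch at every step, so it is far from optimized:

```python
import random
from collections import Counter

def strict_modularity(edges, parts, deg, vol_V, sizes):
    """Strict hypergraph modularity of a partition (list of vertex sets)."""
    m = len(edges)
    q = 0.0
    for part in parts:
        e_in = sum(1 for e in edges if set(e) <= part)
        vol = sum(deg[v] for v in part)
        q += e_in / m - sum(c * (vol / vol_V) ** d for d, c in sizes.items()) / m
    return q

def greedy_merge(vertices, edges, rng=random):
    """Start from singletons; repeatedly pick a random hyperedge touching
    two or more parts and merge the parts it touches whenever that does
    not decrease the strict modularity."""
    deg = Counter(v for e in edges for v in e)
    vol_V = sum(len(e) for e in edges)
    sizes = Counter(len(e) for e in edges)
    parts = [{v} for v in vertices]
    q_opt = strict_modularity(edges, parts, deg, vol_V, sizes)
    while True:
        # hyperedges still touching two or more parts
        cut = [e for e in edges
               if len({i for i, p in enumerate(parts) for v in e if v in p}) >= 2]
        improved = False
        rng.shuffle(cut)
        for e in cut:
            touched = [p for p in parts if p & set(e)]
            merged = [p for p in parts if not (p & set(e))]
            merged.append(set().union(*touched))
            q_new = strict_modularity(edges, merged, deg, vol_V, sizes)
            if q_new >= q_opt:
                parts, q_opt, improved = merged, q_new, True
                break
        if not improved:
            return parts, q_opt
```

On a toy hypergraph with two repeated 3-edges per community, the procedure recovers the two communities regardless of the random order in which edges are examined.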

We have implemented the simplified stochastic version of the algorithm in Julia and used it on a hypergraph generated under the aforementioned regime (i), with 75% edges of size 3 and 25% edges of size 4. In our implementation of Algorithm 1, we consider hyperedges in random order (that is, at each step one hyperedge is randomly selected). In order to validate the algorithm, we ran it 1920 times with 500 steps in each run. A single run on a hypergraph consisting of 150 vertices and 5094 hyperedges, using a modern CPU and a single thread, took around 7 seconds. Note that the computational complexity of the proposed algorithm is

The results presented in

The thick red line represents the average performance, the dashed lines represent the standard deviation of the performance, and the wide pink area spans the smallest and largest performance found across all experiments.

The DBLP computer science bibliography database contains open bibliographic information on major computer science journals and proceedings. The DBLP database is operated jointly by the University of Trier and Schloss Dagstuhl. The DBLP paper data is available at

We consider a co-authorship hypergraph where each vertex represents an author and the hyperedges are papers. In order to properly match author names across papers, we enhance the data with information scraped from journal web pages. The DBLP database contains the doi.org identifier of each paper; we use this identifier to obtain the journal name and retrieve the paper author data directly from the journal, updating the available author name data using the ACM, IEEE Xplore, Springer and Elsevier/ScienceDirect databases. Since the same author name can be written in different ways, we match the author names of each paper across all of these data sources, which gives a good representation of author names for later matching. For the analysis, we only kept the (single) largest connected component. We obtained a hypergraph with 1637 nodes, 865 edges of size 2, 470 of size 3, 152 of size 4 and 37 of size 5 to 7. The complete source code for this example, along with instructions and data files, is available online [
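As an illustration of this preprocessing step (our sketch, not the authors' pipeline), restricting a co-authorship hypergraph to its largest connected component and tabulating edge sizes can be done with a union-find:

```python
from collections import Counter

def largest_component(edges):
    """Restrict a hypergraph, given as a list of author sets (one per paper),
    to its largest connected component."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:            # path compression
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for e in edges:                      # union all authors of each paper
        vs = list(e)
        for v in vs[1:]:
            parent[find(v)] = find(vs[0])
    comp = Counter(find(v) for v in parent)
    root = comp.most_common(1)[0][0]     # root of the largest component
    kept = [e for e in edges if find(next(iter(e))) == root]
    nodes = {v for e in kept for v in e}
    return nodes, kept

edges = [{"a", "b"}, {"b", "c", "d"}, {"x", "y"}]
nodes, kept = largest_component(edges)
sizes = Counter(len(e) for e in kept)    # edge-size histogram, cf. the counts above
```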

In the table below, we show the results of the Louvain algorithm applied to the 2-section graph, optimizing the graph modularity $q_G(\cdot)$, as well as the results of our CNM algorithm on hypergraphs. Comparing the Louvain and CNM algorithms, we see that there is a tradeoff between $q_H$ and $q_G$; moreover, the Hcut value is lower with the CNM algorithm. The increased number of parts with our algorithm is mainly due to the presence of singletons.

algorithm | q_H | q_G | Hcut | #parts |
---|---|---|---|---|
Louvain | 0.8613 | 0.8805 | 0.1181 | 40 |
CNM | 0.8671 | 0.8456 | 0.0945 | 92 |

Another observation is that the actual partitions obtained with the objective functions $q_G(\cdot)$ (Louvain) and $q_H(\cdot)$ (CNM) are different. Comparing the Louvain and CNM partitions, we found values of 0.4355 for the adjusted Rand index (ARI) and 0.4416 for its graph-aware counterpart (see [

One of the differences lies in the fraction of edges of size 2, 3 and 4 that are cut by the different algorithms, as shown in the table below. Optimizing $q_H(\cdot)$ tends to cut fewer of the larger edges than the Louvain algorithm does, at the expense of cutting more size-2 edges.

Algorithm | 2-edges | 3-edges | 4-edges |
---|---|---|---|
Louvain | 0.0382 | 0.1815 | 0.3158 |
CNM | 0.0590 | 0.1277 | 0.1842 |

In this paper, we presented a generalization of the Chung-Lu model to hypergraphs, which we used to define a modularity function on hypergraphs. Interestingly, in the hypergraph case there is no unique way to define modularity, and we showed that it depends on how strongly we believe that a hyperedge indicates that its members belong to the same community. If the belief is soft, this leads to the standard 2-section graph modularity; if it is strong, a natural definition is the strict hypergraph modularity, which we tested on numerical examples. We also proposed the in-between majority-based modularity function.

The objective of this paper is to develop a definition of hypergraph modularity. However, in order to show that this notion is numerically tractable, at least approximately, we provided the theoretical foundations for the development of algorithms using this modularity function that greatly reduce the solution search space.

A key natural question for any new measure is whether it provides qualitatively different outcomes than existing ones. We therefore compared the strict hypergraph modularity with the standard 2-section graph modularity, using a simple heuristic algorithm. We illustrated that, compared to 2-section graph modularity (optimized using the Louvain algorithm), optimization of the strict modularity function tends to cut a smaller number of hyperedges. The proposed measure is therefore potentially valuable in application scenarios where a hyperedge is a strong indicator that the vertices it contains belong to the same community.

Hypergraph modularity is a new measure, and there is still a lot of work to be done. First of all, the development of good, efficient heuristic algorithms would allow us to look at larger hypergraphs. Such algorithms would also allow us to perform a study over hypergraphs with different edge-size distributions, comparing the hypergraph modularity function with other definitions, such as graph modularity over the 2-section representation of the hyperedges, and hypergraph modularity using the less strict majority rule.

Finally, let us mention that the method of modularity maximization (in its generalized form, which incorporates a resolution parameter controlling the size of the communities discovered) is equivalent to another widely used method of community detection in networks: maximum likelihood applied to the special case of the stochastic block model known as the planted partition model, in which all communities in a network are assumed to have statistically similar properties [

The authors would like to thank Claude Gravel for useful discussions while developing the algorithm.