The authors have declared that no competing interests exist.
Conceived and designed the experiments: AMN SK. Performed the experiments: SK IS ND. Analyzed the data: SK. Contributed reagents/materials/analysis tools: IK. Wrote the paper: SK AMN IK.
Finding motifs in biological, social, technological, and other types of networks has become a widespread method for gaining more knowledge about these networks' structure and function. However, this task is computationally very demanding, because it is closely tied to graph isomorphism, a problem in NP that is not yet known to belong to either P or the NP-complete subset. Accordingly, this research endeavors to decrease the need to call the NAUTY isomorphism detection method, which is the most time-consuming step in many existing algorithms. The work provides an extremely fast motif detection algorithm called QuateXelero, which has a quaternary tree data structure at its heart. The proposed algorithm is based on the well-known ESU (FANMOD) motif detection algorithm. The results of experiments on some standard model networks confirm the overall superiority of the proposed algorithm, QuateXelero, over two of the fastest existing algorithms, GTries and Kavosh. QuateXelero is especially fast in constructing the central data structure of the algorithm from scratch from the input network.
Milo et al.
Since the introduction of this concept by Milo et al. in a seminal paper
Motif detection in networks consists of two main steps: first, calculating the number of occurrences of a subgraph in the network and, second, evaluating the subgraph significance. Various methods proposed so far differ mainly in the first step, the enumeration of subgraphs. These methods can be grouped roughly into two categories regarding this aspect:
Methods counting subgraph occurrences exactly.
Methods using sampling and statistical approximations for the enumeration.
In this work, the focus is on the first category, which is also much more computationally demanding. The methods in this group require classifying the subgraphs after enumerating them in the network. In other words, the non-isomorphic classes of enumerated subgraphs must be determined. This can be done in two ways. First, one can generate all different non-isomorphic classes of a prescribed size and then calculate the frequency of each in the network (i.e., count the number of matches of each class in the network). The drawback is that the number of non-isomorphic classes grows exponentially with the given size of the subgraph. Grochow-Kellis
The classification step is the most time-consuming step of the methods in the second category. The reason is the application of isomorphism detection algorithms, mostly NAUTY
The approach is different in GTries, in which a multi-way tree of depth
Although NAUTY is one of the fastest isomorphism detection methods, its computational cost is
Given the above, it seems rational to search for methods that eliminate or decrease the number of executions of NAUTY in finding motifs. In fact, as stated above, this is the reason for GTries's success as the fastest method so far. The GTries algorithm eliminates the need to call NAUTY during the census on random networks. But it still uses FANMOD for enumerating the subgraphs of the original network, which is very time-consuming and sometimes infeasible when the network and subgraph sizes are large. GTries also provides other options that improve its performance on the original network, but applying these options needs some prior knowledge or preprocessing. These options will be discussed later.
This paper provides a new algorithm with the aim of decreasing the number of calls to NAUTY. To this end, the authors propose embedding a quaternary tree data structure in ESU (the algorithm used in FANMOD). A quaternary tree is a rooted tree data structure in which each internal node has at most four children (see
The root node and internal nodes have at most four children.
Each edge connecting a parent to one of its children can be labeled with a mark, which can be a number, character, or any other symbol. A labeled quaternary tree can be searched using a given string that consists of the same set of symbols used for labeling that tree. The search starts at the tree's root. In each step, one symbol is read from the input string, and the current pointer, initially set to the root, moves to the child of the current node whose connecting edge corresponds to the symbol just read. Because it is allowed to add nodes during the search, if a node on the path has no child for an input symbol, a child is added to the current node for that symbol and the current pointer moves to that child. Thus, the search continues until the input string is read completely. See
Searching starts at the root of the tree. After respectively visiting children 3 and 2 throughout the path, the search finishes in a newly added leaf, corresponding to number 1.
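The labeled search with on-the-fly insertion described above can be sketched as follows (a minimal illustration with our own type names, not the paper's code; the four edge symbols are represented here by the characters '0' through '3'):

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// One node of the labeled quaternary tree: at most four children,
// one per edge symbol.
struct QNode {
    std::array<std::unique_ptr<QNode>, 4> child;
};

// Walk the tree from `root` following `symbols`; any missing child on
// the path is created, so the walk always ends in a leaf. `added` is set
// if at least one node was created during the walk.
QNode* searchOrInsert(QNode& root, const std::string& symbols, bool& added) {
    QNode* cur = &root;
    added = false;
    for (char c : symbols) {
        int i = c - '0';                       // symbol -> child index
        if (!cur->child[i]) {
            cur->child[i] = std::make_unique<QNode>();
            added = true;
        }
        cur = cur->child[i].get();
    }
    return cur;
}
```

For the string "321" in the figure, the first call ends in a newly added leaf; a second search with the same string reaches the same leaf without adding any nodes.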
This quaternary tree performs a partial classification of the enumerated subgraphs in the proposed algorithm. This data structure, which is similar in some aspects to the GTrie data structure, is used before calling NAUTY and eliminates the need to use it most of the time. According to the experimental results, the proposed algorithm outperforms the existing algorithms in most cases.
Like GTries, Kavosh, and FANMOD, QuateXelero consists of three main phases: enumeration, classification, and motif detection. Although enumeration and classification phases are intertwined, describing them separately makes them more understandable. Below, these phases are elaborated.
For enumerating all subgraphs of size
[Pseudocode listing of the enumeration algorithm (27 numbered lines); the listing text was lost in extraction, except for the fragment "remove random chosen" on line 11. Lines 6-8 perform the classification step discussed below.]
Lines 6, 7, and 8 classify a subgraph after it is fully expanded. This is described in detail in the next section. Here, the
In this figure, −1 indicates a one-way connection from the existing vertex to the added vertex, 0 indicates no connection between them, 1 stands for a one-way connection in the reverse direction, and 2 shows a two-way connection. The order of the numbers in the input string is the order in which the corresponding vertices are added while expanding the subgraph (that is, 1, 2, 3, and then 4 in this example).
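A sketch of this per-vertex encoding (names are ours, not the paper's; `hasEdge` stands in for a hypothetical adjacency test on the input network):

```cpp
#include <functional>
#include <vector>

// For a newly added vertex v, emit one symbol per previously added
// vertex u, following the figure's convention:
//   -1 : edge u -> v only        0 : no edge between u and v
//    1 : edge v -> u only        2 : edges in both directions
std::vector<int> encodeNewVertex(int v, const std::vector<int>& existing,
                                 const std::function<bool(int, int)>& hasEdge) {
    std::vector<int> symbols;
    for (int u : existing) {
        bool uv = hasEdge(u, v);   // existing -> added
        bool vu = hasEdge(v, u);   // added -> existing
        if (uv && vu)      symbols.push_back(2);
        else if (uv)       symbols.push_back(-1);
        else if (vu)       symbols.push_back(1);
        else               symbols.push_back(0);
    }
    return symbols;  // appended to the quaternary-tree search string
}
```

The four possible symbols map one-to-one onto the four children of a quaternary tree node, which is why a quaternary tree suffices for directed networks.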
After searching the quaternary tree, the
During the enumeration, the appropriate leaf of the quaternary tree is returned by the
The
The leaf already existed in the tree and was not newly added: in this case, the leaf will have a previously set pointer to a leaf in the binary tree (i.e., the condition in line 6 is not satisfied), which indicates the isomorphism class to which the current subgraph belongs (see
1) The quaternary tree is searched and the new leaf is added. 2) Because the leaf is new and its pointer is not set, NAUTY is executed for the subgraph being enumerated. 3) After finding the canonical label for the subgraph, the binary tree is searched using that label and the corresponding leaf in the binary tree is identified. 4) The subgraph counter of that leaf (which indicates the number of subgraphs of that class found so far in the network) is increased by one. 5) The pointer of the quaternary tree leaf is set to the identified leaf of the binary tree.
1) The quaternary tree is searched and the corresponding leaf is identified. 2) Using the identified leaf's pointer to the corresponding binary tree leaf, the latter's counter is incremented.
In either of the above cases, the next step is to increase the counter of the corresponding leaf in the binary tree. This is performed in line 8 of
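The two cases can be condensed into the following sketch (type and function names are ours; `runNautyAndLookup` stands in for the NAUTY call followed by the binary tree search of steps 2 and 3 above):

```cpp
#include <functional>

struct BinLeaf { long count = 0; };          // per-isomorphism-class counter
struct QLeaf   { BinLeaf* cls = nullptr; };  // pointer set on first visit

// Classify the subgraph whose search ended in `leaf`. NAUTY and the
// binary tree search run only when the quaternary-tree leaf is new;
// afterwards the cached pointer is followed directly.
void classify(QLeaf& leaf,
              const std::function<BinLeaf*()>& runNautyAndLookup) {
    if (leaf.cls == nullptr)                 // case 1: newly added leaf
        leaf.cls = runNautyAndLookup();      // the only place NAUTY is needed
    ++leaf.cls->count;                       // both cases: bump the counter
}
```

Since most enumerated subgraphs hit an existing leaf, the expensive branch runs only once per distinct encoding string.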
The rationale underlying this classification is that if two different subgraphs reach the same leaf in the proposed quaternary tree, then those subgraphs are isomorphic to each other. It should be noted, however, that the reverse is not true; in other words, two isomorphic subgraphs may reach two different leaves of the quaternary tree. Thus, there may be two or more different quaternary tree leaves pointing to the same binary tree leaf.
Accordingly, in this algorithm (lines 6 to 7) the need to invoke NAUTY and search the binary tree is eliminated in many cases by exploiting the proposed quaternary tree. That is, the cost of
There is a subtle difference between the census on the original network (
At first glance, the algorithm might seem similar to the ESU option of the GTries algorithm
After the census on the original network with the help of a quaternary tree, each leaf of the binary tree will contain the number of subgraphs belonging to the corresponding isomorphism class. Then, some random networks are generated by rewiring and the census is repeated on them. As the random generation method, we used the same method applied in GTries (3 swaps per edge with a random Markov chain process). The generated networks were checked against those generated by GTries, and the results indicate the consistency of the random generation method.
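A degree-preserving randomization by double edge swaps might look like the following (a simplified sketch under our own assumptions, not the actual implementation; in particular, the handling of rejected moves and the stopping rule are guesses):

```cpp
#include <random>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // directed edge (from, to)

// Repeatedly pick two edges (a,b) and (c,d) and rewire them to (a,d)
// and (c,b). Every vertex keeps its in- and out-degree; moves that would
// create a self-loop or a duplicate edge are rejected.
void randomizeBySwaps(std::vector<Edge>& edges, int swapsPerEdge,
                      std::mt19937& rng) {
    std::set<Edge> present(edges.begin(), edges.end());
    std::uniform_int_distribution<size_t> pick(0, edges.size() - 1);
    long steps = static_cast<long>(swapsPerEdge) * edges.size();
    for (long done = 0; done < steps; ++done) {
        size_t i = pick(rng), j = pick(rng);
        Edge e1 = edges[i], e2 = edges[j];
        Edge n1 = {e1.first, e2.second}, n2 = {e2.first, e1.second};
        if (n1.first == n1.second || n2.first == n2.second ||
            present.count(n1) || present.count(n2))
            continue;                        // rejected move, chain advances
        present.erase(e1);  present.erase(e2);
        edges[i] = n1;      edges[j] = n2;
        present.insert(n1); present.insert(n2);
    }
}
```

Because each swap preserves the out-degree of the source vertices and the in-degree of the target vertices, the randomized network keeps the original degree sequence.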
Finally, the numbers of subgraphs of each isomorphism class in the original and random networks are used to calculate the z-score of each isomorphism class as below:
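The formula itself did not survive extraction; the standard motif z-score, which is what FANMOD-family tools compute, is reconstructed below (with f_i the frequency of isomorphism class i):

```latex
Z_i \;=\; \frac{f_i^{\mathrm{orig}} \;-\; \overline{f_i^{\mathrm{rand}}}}
               {\sigma\!\left(f_i^{\mathrm{rand}}\right)}
```

where f_i^orig is the frequency of class i in the original network, and the mean and standard deviation are taken over the ensemble of random networks.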
We used six standard networks for evaluating our algorithm. These were three biological networks: the metabolic pathway of bacteria
Network  Directionality  Vertices  Edges  Description  Source
Yeast  Directed  688  1079  Yeast transcription network  
E. coli  Directed  672  1275  Metabolic pathway of bacteria  
Social  Directed  67  182  A real social network  
Electronic  Directed/Undirected  252  399 (both directed and undirected)  Electronic circuit  
YeastPPI  Undirected  2361  6646  Protein-protein interaction network in budding yeast  
Dolphins  Undirected  62  159  Frequent associations within a group of dolphins  
Because Kavosh and GTries are the best among the existing motif finders, they were chosen for comparison with QuateXelero. GTries is superior in speed and Kavosh is better in memory usage.
For comparing QuateXelero with Kavosh, both algorithms were executed on the same computer with a Quad-Core AMD Opteron™ Processor 2354 and the CentOS Linux Release 6.0 (Final) operating system. The number of random networks was set to two in all experiments, which is enough for valid results in these comparisons. It is important to note that this number of random networks is not suitable for motif detection in practice and is only used here for getting fast results for comparison. Moreover, different motif sizes were considered in the experiments in order to assess the effect of motif size on the performance of the algorithms.
The results are illustrated in
Processing Times (s)  Comparison vs. Kavosh  
Network  Size  Subgraphs  Classes  Subgraphs/Classes  Kavosh  QX  Speedup  
Yeast  5  2508149  174  14414.65  23.4  0.5  46.80x 
Yeast  6  32883898  888  37031.42  438.5  8.9  49.27x 
Yeast  7  416284878  4809  86563.71  14056.2  166.4  84.47x 
Yeast  8  5184710063  27003  192004.97  224497  2609.5  86.03x 
Yeast  9  64730339589  156025  414871.59    53852.1   
Average Run Time Growth Ratio:  
Electronic  5  19675  49  401.53  0.13  0  N/A 
Electronic  6  97038  199  487.63  0.8  0.08  10.00x 
Electronic  7  495274  907  546.06  5.9  0.3  19.67x 
Electronic  8  2572125  4333  593.61  38.7  1.9  20.37x 
Electronic  9  13512688  20692  653.04  278.2  11.9  23.38x 
Electronic  10  71614362  96483  742.25  2614.2  71.2  36.72x 
Electronic  11  381985209  437821  872.47    493.3   
Average Run Time Growth Ratio:  
E.coli  5  80724  590  136.82  0.48  0.05  9.60x 
E.coli  6  558080  3884  143.69  4.3  0.3  14.33x 
E.coli  7  4019781  23587  170.42  45.3  2.8  16.18x 
E.coli  8  29294103  136569  214.50  410.7  23.6  17.40x 
E.coli  9  212782282  768121  277.02  4000  190.7  20.98x 
Average Run Time Growth Ratio:  
Social  5  10599  773  13.71  0.11  0.06  1.83x 
Social  6  52156  5062  10.30  0.82  0.36  2.28x 
Social  7  254674  30217  8.43  5.4  2.6  2.08x 
Social  8  1224376  165958  7.38  33.3  16.3  2.04x 
Social  9  5764767  854023  6.75  220.3  96.22  2.29x 
Average Run Time Growth Ratio: 
However, generally, the results indicate that QuateXelero outperforms Kavosh regarding processing time in all cases. This is also illustrated in
In the charts, the horizontal axis indicates the size of motif and the vertical axis is the log of running time. The bases of logarithms are set to integer numbers close to the average running time growth rates shown in
The only drawback of the proposed algorithm is the considerable amount of memory used to construct the quaternary tree for larger motif sizes and for networks containing a larger number of non-isomorphic subgraph classes. For example, among the experiments mentioned in
To compare QuateXelero with GTries, three groups of experiments were conducted. First, both algorithms were tested on smaller motif sizes on directed networks; second, the same experiments were performed for larger sizes to understand the effect of motif size on the run time of the two algorithms; and finally, the algorithms' performance was tested on undirected networks.
Here, before explaining the experimental results, there is a point worth noting. Currently, GTries provides an important and useful option for the census on networks: given a list of non-isomorphic classes whose occurrences are to be counted, one can generate a GTrie based on those subgraphs and then apply that GTrie for enumerating the subgraphs of both the original and random networks.
However, if the goal is to exploit this option to enumerate all subgraphs occurring in a given network, two rough solutions might come to mind initially: 1) knowing all non-isomorphic classes occurring in the given network in advance, one can generate a GTrie based on those subgraphs and then apply the GTrie for enumeration, and 2) one can generate a GTrie containing all possible non-isomorphic classes of a given size and then use it for enumeration. The first solution is obviously impossible, as we need to first enumerate all subgraphs of a network before knowing their complete list of non-isomorphic classes. In other words, before being able to use this option to generate the solution, we need the solution itself. The second solution, although useful for smaller motif sizes, becomes impractical for sizes larger than 7 or 8 for directed and 11 or 12 for undirected networks, since the number of non-isomorphic classes grows exponentially and storing the generated GTries would need a tremendous amount of memory.
The provided option in GTries is useful when we are performing a set-centric subgraph enumeration (i.e., counting the occurrences of a given set of subgraphs) or when the motif size is small. This option can (and is planned to) also be embedded easily in QuateXelero, as the general structures of QuateXelero and GTries are similar. However, the aim of this paper is not to compare the performance of the two algorithms in set-centric searches; rather, this work aims at comparing these algorithms in both steps of generating and applying the quaternary tree and GTrie data structures, especially for larger motifs where the set-centric option becomes inapplicable. Thus, here we emphasize the ESU option of GTries, which we call ESU+GTries. So the algorithm has two steps: ESU (the algorithm of FANMOD) for the census on the original network, and GTries for the census on randomized networks. Comparing the other options of GTries with the equivalent options in the proposed algorithm (which are planned to be implemented) warrants separate research.
Having said this, we continue with the comparison results. For comparing the algorithms, a metric called the "Equality Point" is defined. The equality point (
This concept is also illustrated in
Positive and negative equality points are illustrated respectively in the left and the right charts. The vertical axis
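Our reading of this metric (the linear model below is an interpretation, not a formula given in the text): if the census on R random networks costs algorithm A a total of T_A(R) = T_A^orig + R · t_A^rand, the equality point is the R at which the two curves cross:

```latex
EP \;=\; \frac{T_{\mathrm{QX}}^{\mathrm{orig}} - T_{\mathrm{GT}}^{\mathrm{orig}}}
              {t_{\mathrm{GT}}^{\mathrm{rand}} - t_{\mathrm{QX}}^{\mathrm{rand}}}
```

For Yeast at size 5, this gives (0.733 − 30.846)/(0.693 − 0.955) ≈ 115, close to the reported 114.35; the remaining gap presumably comes from costs outside the two census steps, so this is only a first-order estimate. A negative EP means the curves never cross for a positive number of random networks, i.e., the algorithm that is faster on the original network stays ahead.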
First, the results for small motifs are discussed. These results are presented in
Numbers in parentheses show the motif size for which the experiments were conducted (the results can be generalized to other motif sizes). The vertical axis indicates the ratio (in percentage) of run time to the run time for 20 random networks. Except
Census on Original  Avg. Census on Randoms  Total Time  Memory  
Network  Size  Subgraphs  Classes  Subgraphs/Classes  Avg. # of subgraphs in random nets  ESU of GTries  QX  GTries  QX  ESU+GTries  QX  GTries  QX  Equality Point  
Yeast  5  2508149  174  14414.65  3277239  30.846  0.733  0.693  0.955  37.85  10.51  1.5 MB  1.8 MB  114.35 
Yeast  6  32883898  888  37031.42  51982245  532.806  11.201  11.909  17.856  651.07  190.20  2.3 MB  2.5 MB  87.50 
Yeast  7  416284878  4809  86563.71  872973082  12314.314  164.596  220.656  344.539  14494.60  3611.77  7.1 MB  8.8 MB  97.85 
Social  5  10559  773  13.66  16060.49  0.094  0.031  0.019  0.009  3.55  1.31  5.4 MB  2.7 MB  −124.00 
Social  6  52156  5062  10.30  90430.94  0.581  0.218  0.118  0.070  22.74  9.00  30.7 MB  13.9 MB  −186.25 
Social  7  254674  30217  8.43  499632.89  3.532  1.451  0.725  0.612  154.91  72.78  184.9 MB  143.7 MB  −626.81 
E.coli  5  80724  590  136.82  89831.69  0.612  0.063  0.126  0.037  15.63  5.51  4.5 MB  7.7 MB  −13.71 
E.coli  6  558080  3884  143.69  639690.34  5.604  0.546  0.910  0.303  104.07  33.65  22.1 MB  13.6 MB  −16.01 
E.coli  7  4019871  23587  170.43  4800418.40  51.092  4.430  7.195  2.600  822.42  274.45  135.4 MB  74.6 MB  −19.25 
Electronic  5  19675  49  401.53  20316.55  0.184  0.015  0.014  0.009  2.13  1.28  1.2 MB  3.4 MB  −70.00 
Electronic  6  97038  199  487.63  99766.42  1.097  0.063  0.068  0.051  8.49  5.59  2.4 MB  3.9 MB  −70.59 
Electronic  7  495274  907  546.06  522890.20  7.780  0.390  0.376  0.302  45.81  31.29  8.6 MB  8.5 MB  −96.22 
Now, we return to
Taking into account the results for larger motifs shown in
Census on Original  Avg. Census on Randoms  Total Time  Memory  
Network  Size  Subgraphs  Classes  Subgraphs/Classes  Avg. # of subgraphs in random nets  ESU of GTries  QX  GTries  QX  ESU+GTries  QX  GTries  QX  Equality Point  
Yeast  9  64730339589  156025  414871.6  255298149957  6205.98  23950.802  125962.31  889 M  
Social  9  5764767  854023  6.8  15950595  21.25  9.62  5.830  12.228  119.34  83.37  1.5 G  2.8 G  10.62 
Social  10  26429201  4161477  6.4  81106854  121.01  54.30  34.570  85.252  731.66  558.70  7.9 G  18 G  8.41 
Social  11  117219394  19285152  6.1  392209489  669.35  273.54  237.761  653.014  4527.95  4368.38  40.0 G  59 G  5.38 
E.coli  9  212782828  768121  277.0  281406579  728.69  47.52  86.493  41.078  1223.27  264.96  1.2 G  2.4 G  −16.10 
E.coli  10  1529707241  4223040  362.2  2564178587  6357.95  352.46  929.200  402.146  11461.69  2443.21  7.6 G  19 G  −12.11 
E.coli  11  10854043472  22764206  476.8  13801545748  53819.37    8834.432    101184.86    44.0 G    
Electronic  9  13512688  20692  653.0  17031795  65.89  2.34  2.360  2.604  79.18  16.11  42 M  130 M  263.48 
Electronic  10  71614362  96483  742.3  78568259  483.41  13.89  11.626  14.962  550.76  90.76  206 M  678 M  142.89 
Electronic  11  381985209  437821  872.5  464546660  3998.61  82.75  76.793  113.920  4438.76  663.11  1.0 G  4.6 G  106.70 
Electronic  12  2045287405  1943681  1052.3  2450710026    504.40    796.268    4557.15    25 G  
For the Yeast network the situation is different. While the limited experiments here are not enough to make a judgment about this, regarding
However, for
The third series of experiments was about undirected networks. These results are displayed in
The ratio in the left chart indicates the ratio of average time spent by QuateXelero for census on random networks to the same time required for GTries.
Census on Original  Avg. Census on Randoms  Total Time  Memory  
Network  Size  Subgraphs  Classes  Subgraphs/Classes  Avg. # of subgraphs in random nets  ESU of GTries  QX  GTries  QX  ESU+GTries  QX  GTries  QX  Equality Point  
YeastPPI  4  2003998  6  333999.7  2447825  3.457  0.120  0.134  0.138  5.62  1.79  1.5 G  17.3 M  1031.26 
YeastPPI  5  48870476  21  2327165.5  69599664  100.466  3.310  2.658  4.604  127.92  49.63  7.9 G  17.4 M  50.24 
YeastPPI  6  1292780544  112  11542683.4  2158612083  3755.902  105.780  91.173  169.054  4667.06  1796.6  40.0 G  17.7 M  46.86 
Electronic  5  19675  11  1788.6  21682  0.040  0.001  0.002  0.001  0.09  0.03  0.3 M  1.1 M  −43.90 
Electronic  6  97038  33  2940.6  109648  0.267  0.010  0.010  0.012  0.39  0.14  0.4 M  1.1 M  166.26 
Electronic  7  495274  89  5564.9  570167  1.544  0.060  0.056  0.073  2.13  0.81  0.6 M  1.5 M  89.25 
Electronic  8  2572125  293  8778.6  3002254  10.118  0.370  0.310  0.427  13.26  4.65  1.3 M  3.0 M  83.26 
Electronic  9  13512688  1001  13499.2  18291623  61.370  2.110  1.945  2.755  80.99  29.68  4.2 M  11.6 M  73.37 
Electronic  10  71614360  3659  19572.1  104346200  393.620  12.110  11.872  16.959  513.26  181.82  19.3 M  49.7 M  75.15 
Dolphins  6  107775  101  1067.1  251036  0.227  0.015  0.024  0.025  0.48  0.28  0.9 M  1.3 M  228.88 
Dolphins  7  550428  633  869.6  1651879  1.338  0.140  0.174  0.200  3.18  2.20  1.8 M  2.6 M  47.88 
Dolphins  8  2683740  4940  543.3  9602379  8.201  0.733  1.197  1.370  20.89  14.87  9.6 M  17.2 M  44.91 
Dolphins  9  12495833  39963  312.7  53553629  42.379  4.618  8.403  8.800  133.24  96.27  79.8 M  170.2 M  103.15 
Dolphins  10  55824707  295236  189.1  283463110  220.406  26.628  81.293  77.200  1096.96  828.15  638.3 M  1722.0 M  −55.68 
Generally, based on the experiments, the following can be concluded:
QuateXelero is always faster than the ESU of GTries in the census on original networks.
QuateXelero is generally faster in the census on random networks for smaller motifs.
GTries is in most cases (especially for directed networks) faster in the census on random networks for larger motif sizes.
QuateXelero is always better than ESU+GTries for the tested motif sizes on the E.coli network, regardless of the number of random networks (negative
QuateXelero is generally better than ESU+GTries for smaller motif sizes.
QuateXelero surpasses ESU+GTries in most of our experiments for larger motif sizes in directed networks; however, it seems that ESU+GTries will be better for larger sizes not achievable with the facilities available to the authors.
For undirected networks, QuateXelero surpasses ESU+GTries for smaller and, seemingly, larger motifs; however, ESU+GTries is better for medium motif sizes.
There are two points that should be noted here. First, given the exponential growth in occupied memory, it seems infeasible to go further in motif size than we have, since that would require huge amounts of memory found only at limited scales in supercomputers. Second, most current research focuses on motifs of size under 8, because the dynamical features of bigger motifs are still unknown. Accordingly, the performed tests seem sufficient to provide reliable data.
For small-size experiments, we employed a laptop computer with an Intel Core™ 2 Duo 2.5 GHz CPU and 4 GB of RAM. For larger experiments, a master node with a Quad-Core AMD Opteron™ Processor 2384 at 800 MHz and 64 GB of main memory was used. The experiments for each network were conducted up to as large a motif size as possible. However, some experiments were limited by the available memory and time. Generally, QuateXelero was mainly limited by the available memory, while ESU+GTries was sometimes limited by time and sometimes by memory. These limitations and their details are listed in
Network  Motif Size  Algorithm  Stopping Reason 
Yeast  9  ESU+GTries  Long run time (close to 11 days) 
Yeast  10  QX  Long run time (about 26 days) 
Social  12  GTries  Memory 
Social  12  QX  Memory 
E.coli  12  GTries  Memory 
E.coli  11  QX  Memory 
Electronic  12  GTries  Core Dumped 
Electronic  13  QX  Memory 
Electronic (Undir)  11  GTries  Core Dumped 
Network motif detection is a challenging problem regarding the computational time and memory it requires, and there have been remarkable efforts to solve it efficiently. This paper provides a new solution to this problem, which is claimed to be superior to the existing solutions in processing time in particular cases. This claim is supported by the experimental results on some standard complex networks. The results of comparing the proposed algorithm, QuateXelero, with the well-known existing method Kavosh indicated its superiority to Kavosh in processing time in all cases. But QuateXelero uses a massive amount of memory compared with Kavosh. Another, more important analysis was the comparison against the ESU+GTries algorithm (the ESU option of the GTries algorithm). Generally, the results indicate that QuateXelero is always much faster than the ESU of GTries in constructing the central data structure (i.e., the census on the original network), but slower in the census on random networks for larger motif sizes in most of the directed cases. The results for undirected networks illustrate the superiority of QuateXelero in small and probably large motif detection, but not in medium-size problems. Furthermore, while QuateXelero is faster in most of the attempted experiments, it seems that the two algorithms, QuateXelero and ESU+GTries, will converge and the situation will reverse when the size of the
In any case, the proposed algorithm still seems improvable. With respect to the above, future work can focus on comparing the other options of the GTries algorithm with the equivalent options in QuateXelero. Besides, combining the strengths of QuateXelero (e.g., faster census on the original network) with the strengths of GTries (e.g., generally faster census on random networks and lower memory occupation), to achieve a more efficient motif detection tool for problems in which the motif size is large and other options are infeasible, is another topic for further research. Furthermore, the question "When is QuateXelero faster than GTries, or vice versa, in the census on random networks?" is not answered completely yet. So, another point of focus can be the development of a strategy for choosing the more appropriate of the two algorithms for the census on random networks for a particular input network. Finally, one can use more compact data structures to compress the constructed quaternary tree and improve the memory complexity of QuateXelero.
QuateXelero is implemented in the C++ programming language under the Linux operating system. The program is also applicable under Windows (please refer to the help file). The source code and sample networks are available for download at:
AMN would like to acknowledge the DAAD visiting professorship research program at Frankfurt University. The authors also acknowledge the support of Dr. Pedro Ribeiro from the University of Porto, Portugal, for providing the source code of GTries and for his invaluable comments on improving the manuscript.