Conceived and designed the experiments: RRV AR AR. Performed the experiments: RRV DC SL. Analyzed the data: RRV DC SL. Wrote the paper: RRV AR AR.

The authors have declared that no competing interests exist.

In spite of the scale-free degree distribution that characterizes most protein interaction networks (PINs), it is common to define an

In this paper we present three objective methods to define hub proteins in PINs: one is a purely topological method and the other two are based on gene expression and function. By applying these methods to four distinct PINs, we examine the extent of agreement among the methods and the implications of these results for network construction.

We find that the methods agree well for networks that contain a balance between error-free and unbiased interactions, indicating that the hub concept is meaningful for such networks.

A large number of cellular processes are mediated by physical interactions among proteins, including signal transduction, enzyme activity, and post-translational modification. The elucidation of large networks of protein-protein interactions has contributed to the identification of biochemical and signaling pathways, and to functional annotation of genes. Such networks have been systematically determined and explored in the baker's yeast

Given the largely

The class of hub nodes in a protein interaction network may be defined in one of two ways: by specifying, as stated above, a degree threshold such that all proteins with degree higher than this threshold are hubs; or by specifying a number threshold such that, when proteins are ranked by their degree, a certain number of proteins from the top of the ranked list are hubs. In either case, an objective definition of hub proteins requires criteria for specifying these thresholds that can be applied in the same manner to different networks.
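The two kinds of threshold can be illustrated with a minimal Python sketch. The function names, the toy edge list, and the tie-breaking rule (ties in degree are broken by node name) are illustrative assumptions, not part of the original analysis; the edge list is assumed to contain no duplicate interactions.

```python
from collections import Counter


def degrees(edges):
    """Count the degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        if u == v:
            continue  # assumption: self-interactions are ignored
        deg[u] += 1
        deg[v] += 1
    return deg


def hubs_by_degree(deg, degree_threshold):
    """Hubs are all nodes with degree strictly higher than the threshold."""
    return {node for node, d in deg.items() if d > degree_threshold}


def hubs_by_rank(deg, number_threshold):
    """Hubs are the top `number_threshold` nodes ranked by decreasing degree.

    Assumption: ties in degree are broken alphabetically by node name.
    """
    ranked = sorted(deg, key=lambda node: (-deg[node], node))
    return set(ranked[:number_threshold])


# Hypothetical toy network, for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
deg = degrees(toy_edges)
print(hubs_by_degree(deg, 2))  # nodes with degree > 2
print(hubs_by_rank(deg, 2))    # top 2 nodes by degree
```

The two definitions need not coincide: in the toy network only one node exceeds a degree of 2, while a number threshold of 2 must also admit one of the three nodes tied at degree 2.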

What might these criteria be? Hub proteins (defined in an

In contrast, it is also possible to define the set of hub proteins using the biological properties themselves. For example, it has been reported that the set of hub proteins in

Yet another biologically important property that hub proteins are found to have is that they are significantly enriched for essential proteins

The three criteria discussed above are not meant to represent an exhaustive set; rather, they represent three properties most commonly attributed to hubs: lack of intra-set connectivity, bimodality of coexpression, and enrichment for essentiality. As a way of examining the meaning (or lack thereof) of the hub concept, we apply hub definitions based on these three criteria to four differently constructed high confidence protein interaction networks in

In order to test our criteria for defining hubs in protein interaction networks, we used four different multi-validated protein networks constructed for ^{h} network (with 1787 nodes and 3004 edges)

The coexpression criterion for defining hubs (Results) requires the use of gene expression data. Yeast mRNA expression data profiles corresponding to five different conditions

One way to define hubs is to use the mutual connectivity properties of high degree nodes in protein interaction networks. It has been reported, for example, that hub-hub connections in PINs are suppressed

A simple measure of topological connectivity of a graph is the relative size of its largest component, i.e., the number of nodes in the largest component divided by the total number of nodes in the graph. We will call this measure the relative connectivity
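As a minimal sketch of this measure (the function names and the toy graph are assumptions made for illustration), the relative size of the largest component can be computed with a depth-first traversal. Every edge is assumed to connect two nodes from the supplied node list.

```python
from collections import defaultdict


def component_sizes(nodes, edges):
    """Return the sizes of all connected components of an undirected graph."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    sizes = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first traversal of one component
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        sizes.append(size)
    return sizes


def relative_connectivity(nodes, edges):
    """Number of nodes in the largest component divided by the total number of nodes."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    return max(component_sizes(nodes, edges)) / len(nodes)


# Toy graph: components {1, 2, 3} and {4, 5}, so the largest holds 3 of 5 nodes.
print(relative_connectivity([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```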

Suppose we are now given a large network _{n}_{n}_{n}

Successive subgraphs are generated from a ranked degree list, and the relative connectivity _{4}

Note that other natural measures of relative connectivity are also possible. One example is a suitably normalized entropy of the distribution of component sizes. This entropy would be zero for connected networks and maximal for completely fragmented networks where every node is isolated. We found similar results when using this measure but ultimately chose the simpler definition of relative connectivity.
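One way such an entropy could be normalized is sketched below. The specific normalization (Shannon entropy of the component-size fractions divided by the logarithm of the total number of nodes) is an assumption chosen to reproduce the two limiting cases described above; it is not necessarily the exact form used in the analysis.

```python
import math


def normalized_component_entropy(component_sizes):
    """Entropy of the component-size distribution, scaled to lie in [0, 1].

    Assumption: each component of size s contributes probability s / N,
    and the entropy is divided by log(N), its value when all N nodes are isolated.
    """
    total = sum(component_sizes)
    if total <= 1:
        return 0.0
    entropy = sum(-(s / total) * math.log(s / total) for s in component_sizes if s > 0)
    return entropy / math.log(total)


print(normalized_component_entropy([5]))              # connected network
print(normalized_component_entropy([1, 1, 1, 1, 1]))  # every node isolated
print(normalized_component_entropy([3, 2]))           # intermediate case
```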

We examined whether the boundary between hubs and non-hubs, i.e., the occurrence of a minimum followed by a rapid increase in the subgraph connectivity is a statistically significant feature. To assess this, we constructed, corresponding to each network studied, 10,000 random networks of the same size and degree sequence, following the configuration model of Newman _{n}_{n}_{n}_{n}_{+k}−_{n}_{n}_{n}

In order to assess the difference in composition of essential genes among hubs and non-hubs (see _{1}_{2}_{x} _{2} _{1} and Π_{2} for the individual distributions are constrained to lie between 0 and 1 and to satisfy Π_{1}+Π_{2} = 1. Here, we choose each weight to be proportional to the number of genes/proteins in the corresponding set, i.e., Π_{1} = _{2} =

The relative connectivity (_{n}^{h} and FYI networks reach a minimum around

The four yeast protein interaction networks studied are HC and LC (panel (a), first 100 nodes), HC^{h} and FYI (panel (b), first 500 nodes), showing regions of interest where the relative subgraph connectivity increases from a minimum.
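A relative subgraph connectivity profile of this kind could be computed along the following lines. This is a sketch under stated assumptions: nodes are ranked by decreasing degree with ties broken by name, each subgraph is the one induced on the top-ranked nodes, and the function names and toy network are hypothetical. The straightforward recomputation at each step is quadratic in the worst case, which is adequate for illustration only.

```python
from collections import defaultdict


def largest_component_fraction(nodes, adjacency):
    """Relative size of the largest component of the subgraph induced on `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                # Only follow edges that stay inside the induced subgraph.
                if neighbor in nodes and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best / len(nodes) if nodes else 0.0


def connectivity_profile(edges, max_nodes=None):
    """Relative connectivity of the subgraph on the top-n degree nodes, for n = 1, 2, ..."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    # Assumption: ties in degree are broken by node name.
    ranked = sorted(adjacency, key=lambda n: (-len(adjacency[n]), str(n)))
    limit = len(ranked) if max_nodes is None else min(max_nodes, len(ranked))
    return [largest_component_fraction(ranked[:n], adjacency) for n in range(1, limit + 1)]


# Toy network: two high degree nodes that are linked only through node "m".
toy_edges = [("H1", "a"), ("H1", "b"), ("H1", "c"), ("H1", "m"),
             ("H2", "d"), ("H2", "e"), ("H2", "f"), ("H2", "m")]
profile = connectivity_profile(toy_edges)
print(profile[:3])
```

In the toy network the profile dips when the second high degree node is added, because the two are not directly connected, and recovers once the intermediate degree node that links them is included.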

| Dataset | Nodes | Edges | Relative connectivity: degree cutoff | Relative connectivity: number of hubs | Literature: degree cutoff | Literature: number of hubs |
|---|---|---|---|---|---|---|
| HC | 2998 | 9258 | 33 | 40 | 16–21 | 150–300 |
| LC | 3307 | 14169 | 85 | 11 | 82; 17 | 12; 294 |
| HC^{h} | 1787 | 3004 | 17 | 20 | 7–10 | 90–180 |
| FYI | 1379 | 2493 | 5 | 300 | 5 | 320 |

Next, the statistical significance of sharp increases within subgraph connectivity profiles was assessed, as described in ^{−4}), although other statistically significant regions corresponding to local fluctuations in subgraph connectivity were also identified (

Empirical P-values (dashed lines) for significance of the relative connectivity measure (solid lines) for all the four networks were computed using 10,000 random networks corresponding to each real network. P-values that are less than 10^{−4} can be identified by the circles on the x-axis in each panel.
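The general shape of such an empirical test can be sketched as follows. Several choices here are assumptions rather than details taken from the analysis: random networks are generated by uniform stub matching (which preserves the degree sequence but may create self-loops and repeated edges), the P-value is the plain fraction of random networks whose statistic is at least as large as the observed one, and the statistic itself is left as an input.

```python
import random
from collections import Counter


def configuration_model(degree_sequence, rng):
    """Pair up edge 'stubs' at random so that node i keeps degree degree_sequence[i].

    Assumption: self-loops and repeated edges are allowed, as in plain stub matching.
    """
    stubs = [node for node, d in enumerate(degree_sequence) for _ in range(d)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


def empirical_p_value(observed, null_values):
    """Fraction of null values that are at least as large as the observed value."""
    null_values = list(null_values)
    return sum(value >= observed for value in null_values) / len(null_values)


# Hypothetical degree sequence; a fixed seed keeps the example reproducible.
toy_degrees = [3, 2, 2, 2, 1]
random_edges = configuration_model(toy_degrees, random.Random(0))
realized = Counter(node for edge in random_edges for node in edge)
print([realized[i] for i in range(len(toy_degrees))])  # matches toy_degrees
print(empirical_p_value(5, [1, 2, 5, 7]))
```

With this convention and 10,000 random networks, a P-value below 10^{−4} corresponds to no random network reaching the observed value.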

We note that, although it has been reported that hub-hub interactions are not suppressed in the HC network

Hub proteins have been reported to have special properties with respect to their level of co-expression with neighboring proteins in a protein interaction network. Han et al.

In contrast to the approaches mentioned above, by following the connectivity profile analysis carried out in the previous subsection, if we identify only the top 40 or so high degree proteins as hubs in the HC network, it is apparent that they do exhibit a bimodal coexpression distribution with their protein interaction neighbors under several expression conditions (^{h} network and bimodality does not occur at all in the LC network. In the case of the LC network, this is most likely due to the fact that there are few nodes (according to the relative connectivity criterion) that can be treated as hubs, resulting in lack of significance due to small sample size.

The panels display distributions of the average Pearson correlation coefficient (PCC) between expression profiles of hubs with their interaction partners (solid line) and non-hubs with their interaction partners (dashed line) for HC ((a)–(f)), LC ((g)–(l)), HC^{h} ((m)–(r)) and FYI ((s)–(x)). The set of hubs for each network was determined from the relative subgraph connectivity analysis. Average PCC values were computed using normalized gene expression profiles over the full yeast compendium that includes expression data under all five conditions each of which is also analyzed individually.
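The average PCC for a single protein could be computed as in the sketch below. The data layout (a dictionary mapping each protein to its expression profile), the function names, and the toy values are assumptions made for illustration; profiles are assumed to be non-constant and of equal length.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return covariance / (norm_x * norm_y)


def average_neighbor_pcc(protein, neighbors, expression):
    """Mean PCC between a protein and those interaction partners that have expression data."""
    values = [pearson(expression[protein], expression[partner])
              for partner in neighbors if partner in expression]
    return sum(values) / len(values) if values else None


# Hypothetical profiles: one perfectly correlated and one perfectly anticorrelated partner.
expression = {"hub": [1, 2, 3, 4], "p1": [2, 4, 6, 8], "p2": [4, 3, 2, 1]}
print(average_neighbor_pcc("hub", ["p1", "p2"], expression))  # close to 0
```

Collecting this value over all hubs, and separately over all non-hubs, gives the two distributions compared in each panel.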

Inclusion of non-hub nodes into the list of HC hubs leads to reduction in bi-modality of the average PCC distribution. This can be seen as the number of hubs included increases from 40 to 419 in the HC dataset. The panel on the left displays smoothed probability density functions corresponding to the average PCC distribution while the panel on the right displays the cumulative distribution functions. Percentiles refer to the percentages of top high degree nodes included in the hub set, following

To assess whether a given hub set can be significantly decomposed into party and date hubs, we make use, as in earlier work

As before, we ordered protein nodes in each of the four networks studied in decreasing order of degree and successively included more and more nodes in the hub set. For each constructed hub set, we computed the Pearson correlation coefficient (PCC) between hubs and their protein interaction neighbors, then computed the dip statistic for the PCC distribution. The result of this analysis is shown in ^{h} network does not admit a date and party hub decomposition for any choice of the hub set.

Values of the dip statistic for all four networks studied as a function of the number of top degree nodes included in the hub set. The straight line marks the boundary between statistically significant and insignificant dip values (at 95% confidence).

Note that Han et al.

It is well known that proteins with high degree in a protein interaction network are more likely to be essential than proteins of low degree

There are two natural measures of difference of composition of nodes of a certain type (here, essential proteins) between two sets of nodes (here, hubs and non-hubs). One well known measure is the P-value for the Kolmogorov-Smirnov test for difference in distributions. If _{1}_{2}_{1}_{1}_{1}_{2}_{2}_{2}

Enrichment for essential genes among hubs relative to non-hubs, as measured by the Jensen-Shannon divergence (upper panels) and the P-value for the Kolmogorov-Smirnov test (lower panels).
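A weighted Jensen-Shannon divergence of this kind can be sketched as follows. The sketch assumes that each set is summarized by a two-outcome distribution (essential versus non-essential) and that logarithms are taken to base 2; the weights are proportional to the number of proteins in each set, as described in the Methods. The function names and counts are hypothetical.

```python
import math


def shannon_entropy(distribution):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)


def essentiality_js_divergence(hub_essential, hub_total, nonhub_essential, nonhub_total):
    """Weighted Jensen-Shannon divergence between hub and non-hub essentiality.

    JS = H(w1*p1 + w2*p2) - w1*H(p1) - w2*H(p2), with w1 + w2 = 1 and
    each weight proportional to the size of the corresponding set.
    """
    p1 = [hub_essential / hub_total, 1 - hub_essential / hub_total]
    p2 = [nonhub_essential / nonhub_total, 1 - nonhub_essential / nonhub_total]
    w1 = hub_total / (hub_total + nonhub_total)
    w2 = 1 - w1
    mixture = [w1 * a + w2 * b for a, b in zip(p1, p2)]
    return shannon_entropy(mixture) - w1 * shannon_entropy(p1) - w2 * shannon_entropy(p2)


# Hypothetical counts: 30 of 40 hubs essential versus 25 of 100 non-hubs.
print(essentiality_js_divergence(30, 40, 25, 100))
# Equal essential fractions give a divergence of (numerically) zero.
print(essentiality_js_divergence(10, 40, 25, 100))
```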

It is clear that a hub definition based on the statistical significance of compositional differences in essential proteins between hubs and non-hubs is far less constrained than the previous two hub definitions we have considered. Most choices of hub set lead to statistically significant (P-value≤0.05) compositional differences of essential genes, with hubs having significantly more essential genes than non-hubs: this holds when anywhere between the top 3 and 2973 high degree nodes are chosen as hubs for HC (99.01% of nodes); between 15 and 3240 for LC (97.52%); between 17 and 1756 for HC^{h} (97.31%); and between 138 and 1232 for FYI (79.33%). Specifically, except for the LC network, relative connectivity based hub definitions and coexpression based ones are consistent with statistically significant essential gene compositional differences, although none of the definitions corresponds to a maximally significant compositional difference (this maximum occurs when approximately 900 top degree nodes are included in the hub set for the HC network, 1200 for LC, 550 for HC^{h}, and 550 for FYI). For the LC network, the connectivity based hub definition and the essentiality based one do not quite overlap: by the connectivity criterion, only about the top 10 high degree nodes can be included in the hub set, whereas by the essentiality criterion, at least the top 15 high degree nodes must be identified as hubs for a statistically significant difference in the number of essential genes among hubs versus non-hubs. However, given errors and incompleteness in protein interaction network data, these numbers appear close enough to pronounce weak agreement between the connectivity-based and essentiality-based criteria for defining the LC hub set. We address this and related issues in greater detail in the Discussion.

The four yeast protein interaction networks that we examine, although high confidence, are still subject to some error. Specifically, it is reasonable to expect that the false negative rates for these networks could be quite high, although the false positive rates are low. In such situations, it is useful to work with clean, simulated data in order to test the applicability of a new concept or algorithm (for example, reverse engineering algorithms for gene regulatory networks can be tested using simulated gene expression data

For HC^{h} and FYI, there is similarly no appreciable change to the subgraph connectivity profiles upon random removal of up to 15% of edges, but the profiles do change upon random addition of edges. These two networks contain far fewer edges than the other two networks: HC^{h} and FYI each contain about a third of the number of edges in HC and about a fifth of the number of edges in LC. It is therefore expected that HC^{h} and FYI would have a high false negative rate, and that addition of edges would reduce the false negative rate and substantially change the connectivity profiles. It is also expected that addition of edges would bring the connectivity profiles of these two networks closer to those of LC and HC, as observed. Moreover, we find that the change in the location of the sharp rise in relative connectivity does not substantially affect the degree value at which that rise occurs, even though it affects the number of nodes classified as hubs by the connectivity criterion: the degree cutoff value changes from 17 (unperturbed value) to 16 (15% edges added) for HC^{h}, and from 5 (unperturbed value) to 7 (15% edges added) for FYI. We thus find that the connectivity-based criterion for hubs is reasonably robust with respect to random edge deletion and addition. We also found that the other two criteria are extremely robust: random addition and deletion of up to 15% of the edges has no appreciable effect on the statistically significant ranges of cutoff values for all four networks (data not shown).

Relative subgraph connectivity profiles for unperturbed versions of all four networks are shown, along with the corresponding profiles upon random addition and removal of 10% and 15% of the edges in the unperturbed networks.
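Perturbations of this kind could be generated as in the following sketch. The sampling scheme is an assumption: removed edges are drawn uniformly from the existing edges, and added edges are drawn uniformly from pairs of distinct nodes that are not yet connected. The input is assumed to be a simple graph (no self-loops or duplicate edges), and the function names and toy network are hypothetical.

```python
import random


def remove_random_edges(edges, fraction, rng):
    """Return a copy of the edge list with a random `fraction` of edges removed."""
    edges = list(edges)
    keep = len(edges) - round(fraction * len(edges))
    return rng.sample(edges, keep)


def add_random_edges(nodes, edges, fraction, rng):
    """Return the edge list extended by `fraction * len(edges)` new random edges."""
    nodes, edges = list(nodes), list(edges)
    present = {frozenset(edge) for edge in edges}
    n_new = round(fraction * len(edges))
    if len(present) + n_new > len(nodes) * (len(nodes) - 1) // 2:
        raise ValueError("not enough unconnected node pairs to add that many edges")
    added = []
    while len(added) < n_new:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in present:  # skip pairs that already interact
            present.add(pair)
            added.append((u, v))
    return edges + added


# Toy network with 10 nodes and 20 distinct edges.
toy_nodes = list(range(10))
toy_edges = ([(i, (i + 1) % 10) for i in range(10)]
             + [(i, (i + 2) % 10) for i in range(10)])
rng = random.Random(1)
fewer = remove_random_edges(toy_edges, 0.15, rng)
more = add_random_edges(toy_nodes, toy_edges, 0.15, rng)
print(len(fewer), len(more))
```

Recomputing the connectivity profile on each perturbed edge list and comparing it with the unperturbed profile gives a robustness check of the kind shown in the figure.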

Our aim in this work was to examine objective criteria that could be used to define hubs in protein interaction networks. We presented three such criteria here - one based purely on network topology and the other two involving gene expression and function. We applied these criteria to four differently constructed protein interaction networks in

First, we found that all four networks displayed a characteristic, relatively sharp, and statistically significant increase in relative subgraph connectivity as successively lower degree nodes are added to the subgraph. This increase identifies a clear “scale” in these power law networks, and marks a transition between high degree nodes (hubs) and intermediate degree nodes such that these two classes have very different topological properties: hubs by themselves form a highly fragmented subgraph, while intermediate degree nodes mediate connections among hubs, so that the subgraph formed by hubs plus intermediate degree nodes has high connectivity.

Second, we found that, for two networks, namely FYI and HC, the hub notion as defined by this transition agrees well with the hub notion defined by the ability to split the hub set into date and party hubs based on their neighbor coexpression characteristics. In the process, we also found that the split between date and party hubs is quite sensitive to how hubs are defined in the first place, an issue that has largely been ignored in recent controversies regarding the separation of date and party hubs. We also found no agreement between the connectivity based hub notion and the expression based one for the HC^{h} and LC networks. We note that both FYI and HC networks are constructed by combining literature-curated and high-throughput data so that the resulting network is, to a large extent, balanced in terms of both bias and error. However, HC^{h} is most likely error-prone (although more unbiased than FYI or HC) and LC is most likely biased (although more error-free than FYI or HC). It is intriguing that two very different objective criteria for defining hubs agree well for networks that have a balance of error-free and unbiased interactions.

Third, we find, in all four networks, that virtually any characterization of the hub set results in a significant difference in essential gene composition among hubs versus non-hubs. This is of course a result of previously reported strong correlations between degree and essentiality, and it implies that statistical significance of difference in essential gene composition is not a very precise way to define hubs in protein interaction networks.

To summarize, it appears that the hub concept is more meaningful for “balanced” networks (because of the agreement between three independent notions of a hub) than it is for networks that are dominated by error-prone, high-throughput data or networks that are compendia of error-free but biased literature-curated interactions. This observation, coupled with the methods presented here, could therefore be used both to test a protein interaction network constructed by a combination of methods and to define hub proteins in such a network.

Finally, we remark that our methods can be generalized beyond the simple notion of degree centrality to other more complicated centrality measures that also have functional significance. Just as the sharp rise in connectivity at a certain degree defines a degree “scale” that can be used to differentiate hubs from non-hubs, other centrality measures could possess characteristic scales in protein interaction networks.