PLoS ONEplosplosonePLOS ONE1932-6203Public Library of ScienceSan Francisco, CA USAPONE-D-20-0737310.1371/journal.pone.0241790Research ArticlePhysical sciencesMathematicsProbability theoryRandom variablesMedicine and health sciencesEpidemiologyMedical risk factorsTraumatic injury risk factorsViolent crimeAssaultMedicine and health sciencesPublic and occupational healthTraumatic injury risk factorsViolent crimeAssaultSocial sciencesSociologyCriminologyCrimeViolent crimeAssaultPhysical sciencesMathematicsProbability theoryProbability distributionPhysical sciencesMathematicsApplied mathematicsAlgorithmsResearch and analysis methodsSimulation and modelingAlgorithmsComputer and information sciencesNetwork analysisSocial networksSocial sciencesSociologySocial networksPhysical sciencesMathematicsGeometryGeodesicsComputer and information sciencesInformation theoryGraph theoryClustering coefficientsPhysical sciencesMathematicsGraph theoryClustering coefficientsPhysical sciencesMathematicsNumerical analysisMeasuring event concentration in empirical networks with different types of degree distributionsMeasuring event concentration in empirical networkshttps://orcid.org/0000-0002-6049-620XCamposJuanConceptualizationData curationFormal analysisInvestigationWriting – original draft^{1}*FinkeJorgeMethodologySupervisionValidationWriting – review & editing^{2}Sports Models Division, Genius Sports, Medellín, ColombiaDepartment of Electrical Engineering and Computer Science, Pontificia Universidad Javeriana, Cali, ColombiaCherifiHocineEditorUnviersity of Burgundy, FRANCE
The author Juan Camilo Campos received a salary from Genius Sports. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
* E-mail: juan.campos@geniussportsmedia.com202021220201512e0241790842020201020202020Campos, FinkeThis is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Measuring event concentration often involves identifying clusters of events at various scales of resolution and across different regions. In the context of a city, for example, clusters may be characterized by the proximity of events in the metric space. However, events may also occur over urban structures such as public transportation and infrastructure systems, which are naturally represented as networks. Our work provides a theoretical framework to determine whether events distributed over a set of interconnected nodes are concentrated on a particular subset. Our main analysis shows how the proposed or any other measure of event concentration on a network must explicitly take into account its degree distribution. We apply the framework to measure event concentration (i) on a street network (i.e., approximated as a regular network where events represent criminal activities); and (ii) on a social network (i.e., a power law network where events represent users who are dissatisfied after purchasing the same product).
Center of Excellence and Appropriation in Big Data and Data Analytics (CAOBA)FP44842-anex46-2015FinkeJorgeCAOBA and Genius Sports provided, respectively, support in the form of salaries for the authors Jorge Finke and Juan Camilo Campos; but did not have any additional role in the design of the study, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.Data AvailabilityThe datasets analyzed during the current study are available at https://data.cityofchicago.org/Transportation/Major-Streets/ueqs-5wr6 (Chicago Major Streets), https://data.cityofchicago.org/Public-Safety/Crimes-2001/8v97-unyc (Chicago Crimes Dataset), and https://github.com/juancamilocampos/Amazon-Ratings-Konect (Amazon Co-purchase network).Introduction
Consider a non-uniformly distribution of events over different regions. Past efforts to explain the mechanisms through which some regions reveal a high concentration of events (i.e., form hotspots) range from agent-based [1], game theoretic [2, 3], reaction-diffusion [4], and predator-prey [5] modeling. Generally, these approaches account for the relative location of events on the metric space, and use kernel density techniques to identify and recreate hotspots [6, 7]. For a number of scenarios, however, the concentration of events can be better captured by their distribution over a set of interconnected nodes [8, 9].
The work in [8] adapts the main idea behind kernel density techniques [6, 7] to identify hotspots on networks. In particular, a hotspot indicates a subnetwork that contains the maximum number of events on the smallest total path length. The authors consider two types of networks, namely, binary trees and regular networks. To compute the optimal subnetwork for binary trees, they introduce an algorithm that identifies hotspots based on dynamic programming. For regular networks, identifying hotspots requires that all possible subnetworks be evaluated, which becomes computationally costly for networks of large size. A second shortcoming of the work in [8] is that the approach does not extend to networks with more realistic topologies. In most cases, empirical networks exhibit degree distributions under which the nodes connect to different numbers of neighbors (ranging orders of magnitude in degree values, e.g., for networks with power law degree distributions).
The work in [9] introduces an alternative approach, which evaluates the concentration of events on networks with different types of degree distributions (namely, networks with with regular, Poisson, and power law degree distributions) based on Voronoi diagrams [10]. Nodes that are associated with the occurrence of a certain number of events are marked as generator nodes. Voronoi cells are then defined according the geodesic distances from generator nodes to all other nodes of the network. The measure of concentration of events that the authors propose builds on a key property of Voronoi diagrams: groups of small, adjacent cells (created by generator nodes) correspond to subnetworks with a high event concentration.
Simulation results in [9] illustrate that evaluating event concentration in networks depends on their degree distribution. This paper extends the work in [9] in two ways. First, we provide the mathematical foundation to characterize the theoretical distribution of the sizes of the Voronoi cells when events are located uniformly at random over a network with an arbitrary distribution. Second, we use the resulting distribution and apply the criterion in [9] to measure event concentrations in empirical networks. In particular, we consider events that represent (i) criminal activities on a street network (approximated as a regular network), and (ii) dissatisfied users on a social network (a power law network). Our results illustrate the importance of understanding the relationship between the degree distribution and the dispersion of events on a network in order to identify and recreate the formation of hotspots.
The remaining sections are organized as follows. Section 2 introduces some notation and preliminaries. Section 3 presents the proposed framework for measuring event concentration, and introduces the criterion for detecting hotspots based on a summary statistic derived from the proposed framework. Section 4 applies the criterion to the two empirical networks and compares the outcome to that of detecting hotspots by the proximity of events in the metric space. Finally, Section 5 draws some conclusions and future research directions.
Preliminaries
Consider an undirected network G = (V, E), where V = {v_{1}, ⋯, v_{n}} denotes the set of nodes and E ⊆ V × V the set of edges. The geodesic distance between nodes v_{i} and v_{j} is denoted by ρ(v_{i}, v_{j}). We borrow the definition of a Voronoi diagram of a network from [11].
Definition 1. Suppose that there exists a subset of nodes marked as generator nodes. This subset is denoted by U = {u_{1}, ⋯, u_{m}} ⊆ V. The Voronoi diagram of G = (V, E) associated to the nodes belonging to U is a partition {V(u_{1}), ⋯, V(u_{m})} of V, such that:
If v_{i} ∈ V(u_{s}), then ρ(v_{i}, u_{s}) ≤ ρ(v_{i}, u_{s′}) for all s′ ∈ {1, ⋯, m}.
If ρ(v_{i}, u_{s}) = ρ(v_{i}, u_{s′}), then node v_{i} is equally likely to be assigned to V(u_{s}) or V(u_{s′}).
Let u_{i} represent a generator node which is associated to the occurrence of at least a certain number of events ε∈Z,ε≥0. A generator node may capture, for example, an intersection on a representation of a street network where ε or more criminal activities occur. Regular nodes, on the other hand, represent intersections at which less than ε events occurs. Such nodes belong to the set U^{c} = V − U. Note that the generator node associated to cell V(u_{s}) is denoted by u_{s} and a cell refers to an element of the Voronoi partition. Note also that any cell V(u_{s}) contains one generator node. Finally, note that if the network G is a connected network, then any regular node v_{i} ∈ V belongs to some cell.
Based on Definition 1, n_{s} = |V(u_{s})| ≥ 1 denotes the size of cell V(u_{s}). The distribution of n_{s} for all u_{s} ∈ U for G, that is, the distribution of the sizes of all cells determines whether events on G = (V, E) are uniformly distributed. Deviations from a uniform distribution evidence a concentration of events that results from a non-uniform allocation.
Fig 1 shows the Voronoi cells in a scenario with two particular cases. Note that in Fig 1(a) the non-uniform allocation of the generator nodes yields relative small geodesic distances between them. In comparison to Fig 1(b), where generator nodes (events) are uniformly distributed, most cells of the Voronoi diagram in Fig 1(a) contain a small number of regular nodes. Fig 2 depicts the probability mass function (pmf) of the sizes of the cells for uniform and non-uniform event allocations.
10.1371/journal.pone.0241790.g001Voronoi cells.
Resulting Voronoi cells when generator nodes (defined by the occurrence of at least ε events) are (a) concentrated, and (b) distributed uniformly at random. For this scenario, a node is labelled as a regular node only if no event occurs at that node (ε = 1). Otherwise, if one or more events occur, the node is marked as a generator node. In general, the event threshold ε defines the minimum number of events that must occur for a node to be marked as a generator node.
10.1371/journal.pone.0241790.g002Probability distributions of the sizes of the Voronoi cells.
Next, consider a randomly selected node v, with degree d_{v} = d. Let Nδv={v′∈V:ρ(v,v′)=δ} represent the neighborhood of nodes located at a distance δ from node v. Furthermore, let D denote a random variable that represents the degree of a randomly selected node. And finally, let Ddδ denote a random variable that represents the degree of a randomly selected node in Nδv.
To derive the pmf of the sizes of the Voronoi cells resulting from a uniform distribution of events, consider the following assumptions. Suppose that the degree distribution (i.e., the pmf of D) and the conditional degree distribution (i.e., the pmf of Dd1) are known [12]. Furthermore, suppose that the pmf of Ddδ for δ ≥ 2 can be approximated as
P[Ddδ=di]≈nP[D=di]di2|E|=P[D=di]dik¯
where k¯ is the average degree of all nodes of the network.
Under the above assumptions, we are now able to introduce a framework that defines the pmf of the sizes of Voronoi cells when events are distributed uniformly at random. The notation for the framework is summarized in Table 1.
10.1371/journal.pone.0241790.t001Notation for determining event concentration on networks.
Notation
Meaning
X
Size of the cell of a randomly selected generator node
X_{d}
Size of the cell of a randomly selected generator node u with d_{u} = d
Xd1
Number of nodes in V(u)∩N1u, where d_{u} = d
Xd2i
Number of nodes in V(u)∩N2u, d_{u} = d, that are themselves neighbors of node v_{i} in N1u
D
Degree of a randomly selected node
D_{g}
Degree of a randomly selected generator node
Dd1
Degree of a randomly selected node in N1u
Dd2
Degree of a randomly selected node from N2u, that is itself a neighbor of a node in V(u)∩N1u
Framework and hotspot detection criterion
Let D_{g} denote a random variable that represents the degree of a randomly selected generator node. Furthermore, the proportion of nodes that are generator nodes is denoted by p = m/n, 0 < p < 1. Consider a randomly selected generator node, denoted node u. For convenience, we will write N_{δ} to denote Nδu.
Assumption 1. Suppose that
The degree distribution of the generator nodes resembles the degree distribution of all nodes of G.
If v ∈ V(u), v ≠ u, then v ∈ N_{1} ∪ N_{2}.
The local clustering coefficient of node u is negligible (less than 0.1).
Nodes in N_{2} have a single neighbor in N_{1}.
Fig 3 illustrates conditions 2-4 of Assumption 1. These conditions are satisfied for Fig 3(a) but not for Fig 3(b).
10.1371/journal.pone.0241790.g003Illustration of the conditions of Assumption 1 (except the first condition).
The boundary indicates the cell V(u). (a) Scenario where conditions 2, 3 and 4 are satisfied, and (b) scenario where neither condition is satisfied.
The degree of generator node u is denoted by d_{u} = d. Note that the random variable Ddδ represents the degree of a randomly selected node in N_{δ}. Let X_{d} denote a random variable that represents the size of V(u). Moreover, let Xd1 denote a random variable that represents the number of nodes in V(u) ∩ N_{1}. Similarly, let Xd2i denote a random variable that represents the number of nodes in V(u) ∩ N_{2}, that are themselves neighbors of node v_{i}, located in V(u) ∩ N_{1}. Note that Xd2i and Xd2j are independent and identically distributed random variables.
Note that if conditions 3 and 4 of Assumption 1 are satisfied, then
Xd=(∑i=1Xd1Xd2i)+Xd1+1
The first term in Eq (2) characterizes the number of nodes in N_{2}, which depends on the realization of Xd1. The term Xd1+1 characterizes the number of nodes in N_{1} plus the generator node. Let F1d(x)=P[Xd1=x] and F2d(x)=P[Xd2i=x] represent the pmfs of Xd1 and Xd2i. Moreover, let k_{m} represent the minimum degree of all nodes. The following theorem characterizes F_{1d}(x).
Theorem 1. The pmf of Xd1 is given by
F1d(x)=∑i=0dP[Rd1=i]P1(x,i,P[Wd1=1])
where
P[Rd1=x]=P1(x,d,1−p)P[Zd1=x]=∑di=kmnP[Dd1=di]P1(x,di−1,p)P[Wd1=x]=∑i=0nP[Zd1=i]P2(x,11+i)P1(x,d,q)=(dx)qx(1−q)d−xP2(x,q)=qx(1−q)1−x
Note that F_{1d}(x) can be obtained if p and the pmf of Dd1 are known. The proof of Theorem 1 and all other theorems can be found in appendices. Next, Theorem 2 characterizes F_{2d}(x).
Theorem 2. The pmf of Xd2i is given by
F2d(x)=∑i=0nP[Yd2=i]P1(x,i,P[Wd2=1])
where
P[Rd2=x]=∑dj=i+1nP[Dd1=dj]P1(x,dj−1,1−p)(1dj−i)∑k=0n∑dj=k+1nP[Dd1=dj]P1(k,dj−1,1−p)(1dj−k),P[Gd2=x]≈∑di=kmnP[D=di]dik¯P1(x,di−1,p)P[Yd2=x]=∑i=0nP[Rd2=i]P1(x,i,P[Gd2=0])P[D¯d2=di]=P[Dd2=di]P1(0,di−1,p)∑di=kmnP[Dd2=di]P1(0,di−1,p)P[Zd2=x]=∑di=kmnP[D¯d2=di]P1(x,di−1,1−∑di=kmnP[D=di]dik¯P1(0,di−1,p))P[Wd2=x]=∑i=0nP[Zd2=i]P2(x,1i+1)
Similarly to F_{1d}(x), note that F_{2d}(x) can be derived, if in addition to p and the pmf of Dd1, the pmf of D22 is known. Let F_{d}(x) = P[X_{d} = x] represent the pmf of X_{d}. The following result characterizes F_{d}(x).
Theorem 3. The pmf of X_{d} is given by
Fd(x)=∑i=0dF1d(i)F2di(x−i−1)
where
F2d0(x)={1x=00otherwiseF2d1(x)=F2d(x)F2di(x)=∑m=−∞∞F2di−1(x−m)F2d(m)
Note that if F_{2d} is known, then F2di can be obtained recursively. Let F(x) = P[X = x] represent the pmf of the sizes of the Voronoi cells in the case where generator nodes (events) are uniformly distributed. Note that
F(x)=∑d=kmnP[Dg=d]Fd(x)
Based on Theorems 1-3, we can now compute F(x) using the following algorithm.
Algorithm 1 Computing the theoretical pmf of X.
Input: Pmfs of D and D1d, and p.
Output: F(x)
1: k_{m} ← minimum degree of all nodes in G
2: ford ← k_{m} to ndo
3: Compute the pmf of Rd1,Zd1, Wd1, and F_{1d} (using Eqs (3)–(6))
4: Approximate the pmf of Dd2 (using Eq (1))
5: Compute the pmf of Rd1,Gd2, Yd2, D¯d2, Zd2, Wd2, and F_{2d} (using Eqs (9)–(15))
6: Compute F2d0 and F2d1 (using Eqs (17) and (18))
7: Fd(x)=F1d(0)F2d0(x−1)+F1d(1)F2d1(x−2)
8: forj ← 2 to ddo
9: Compute F2dj (using Eq (19))
10: Fd(x)←Fd(x)+F1d(j)F2dj(x−j−1)
11: end for
12: end for
13: Compute F (using Eq (20))
14: returnF(x)
Algorithm 1 enables us to define the following criterion for determining whether the distribution of events in a network obeys a non-uniform distribution.
Criterion 1 (Hotspot criterion). Let F represent the pmf of the sizes of Voronoi cells when generator nodes are uniformly distributed and F_{e}the empirical pmf of the sizes of Voronoi cells for a given network.
For regular or Poisson networks, there is event concentration ifχ2(Fe,F)≥c(α)where c > 0 is a threshold (which determines the significance level α of the Chi Square distribution χ^{2}).
For power law networks, there is event concentration ifΨ(Fe,F)=n¯(F)−n¯(Fe)n¯(F)−1≥βwhere β > 0 represents a threshold, andn¯(F)=4((∑i=1Q1(F)−1F(i)i)+(14−∑i=1Q1(F)−1F(i))Q1(F))represents the average size of the cells in the first quartile (Q_{1}(F)) of F.
Note that the criterion for identifying hotspots depends on the distribution of the sizes of the Voronoi cells, which in turn depends on the degree distribution of the network. The criterion compares the output distribution of Algorithm 1 with the empirical distributions of the sizes of Voronoi cells. Deviations from F indicate the amount of concentration of events on the network. For regular and Poisson networks, deviations are measured using the χ^{2} test. For power law networks, deviations are measured based on the average size of the cells in the first quartiles of F and F_{e}.
Empirical networksChicago street network
We use data between January 1 and December 31, 2017, to evaluate the formation of assault hotspots on the street network of the city of Chicago [13]. The street network considers expressways, collectors, and arterials. It has 2902 edges, which represent streets, and 1650 nodes, which represent street intersections. An assault is represented as an event and associated to the nearest intersection. We first consider only handgun assaults and then widen the analysis for all types of assaults reported in the 12 months.
In particular, we evaluate the dispersion of assaults over time based on Criterion 1. To identify the length of the observation period that is required for a stationary, high-concentration outcome to be observed, we consider the following steps.
Define an observation period of t days, t = 7, 14, 21, ….
Consider assaults reported within t days after January 1, 2017.
Associate each reported assault to the closest street intersection (node) and mark each node with ε = 1 or more assaults as a generator node.
Apply Criterion 1 to determine if there is a concentration of events (i.e., evaluate whether generator nodes create a significant number of small, adjacent Voronoi cells).
Move the observation period by one day and repeat steps 1-4 (i.e, consider the assaults reported within t days after January 2, 2017).
Repeat steps 1-5 until the observation period starts on December 31, 2017.
For each observation period of length t, calculate the percentage of instances (throughout 2017) where the proposed criterion determines a high event concentration (that is, the percentage of instances that the null hypothesis of there being no hotspot formation is rejected).
Since the street network of Chicago resembles a lattice, we need to consider Criterion 1.1, that is, the criterion for detecting hotspot on a regular network. The average local clustering coefficient is 0.07 (a negligible value close to 0), meaning that there are hardly any connections in the network neighborhoods of any node. However, note that for a street network, nodes in N_{2} share two instead of a single neighbor in N_{1}. As a consequence, condition 4 of Assumption 1 is not satisfied, and the derivation of the random variable X_{d} (Eq (2)) is no longer a precise expression. In particular, note that, if nodes in N_{2} share an additional neighbor, then the first term in Eq (2) will take into account some nodes at level 2 twice. Under such a scenario, the expression for X_{d} is hard to compute and an important part of our current research efforts. Nonetheless, Eq 2 serves as an approximation for the size of the cell of a randomly selected generator node with degree d. Accordingly, Algorithm 1 provides an approximation rather than a precise expression for F. Finally, to minimize the number of false positives in detecting hotspots, we use a significance level of α = 10^{−4}.
The solid line in Fig 4 represents the percentage of instances that the null hypothesis is rejected for observation periods of different lengths. Note that the minimum period for which the criterion consistently identifies the formation of hotspots throughout 2017 is t = 21 days, in which case 100 percent of all 365 evaluations of the null hypothesis are rejected. That is, for an observation period of 21 days (or longer), the proposed criterion suggests that hotspots of handgun assaults are formed over the network. In contrast, the dashed curve in Fig 4 represents the percentage of instances that the null hypothesis is rejected based on the proximity of events in the metric space. In particular, it depicts the outcome of determining whether events are concentrated (i.e., step 4 above) when applying the Hopkins test [14] (instead of Criterion 1). Identifying the formation of hotspots for a Hopkins score below 0.25, the test requires an observation period of more than two months in order to reject the null hypothesis over 90 percent of all instances.
10.1371/journal.pone.0241790.g004Percentage of instances that the null hypothesis is rejected for different observation periods of length <italic>t</italic> according to the proposed criterion (solid curve) and the Hopkins test (dashed curve).
Next, Fig 5 shows the values of the χ^{2} test (Eq (21)) for different observation periods. The error bars represent one standard deviation. Note that for observation periods that are longer than two months, the values of χ^{2} remain approximately constant, meaning that the amount of event concentration on the network does not change significantly when longer observation periods are considered.
10.1371/journal.pone.0241790.g005Value of <italic>χ</italic><sup>2</sup> test for different observation periods.
The analysis so far classifies as a generator node any intersection (node) associated on the street network to a single assault (event). That is, ε = 1. However, it is often of interest to distinguish between intersections where sporadic criminal activities take place and those where criminal activities are comparatively more frequent. We now evaluate the effect of classifying a generator node based on a stronger condition, that is, if at least a particular number of assaults (of any type) is associated to that node within a given observation period.
In particular, we revisit the procedure for determining the length of a period for a stationary, high-concentration outcome to be observed, and consider a varying event threshold ε ≥ 1 for different observations periods in step 3 above. As before, each reported assault is associated to the closest node in the street network. However, only nodes with a number of assaults of at least ε = t/7 are marked as generator nodes for observation periods of length t = 7, 14, 21, …. In other words, an intersection represents a generator node if, on average, more than one assault occurs every 7 days.
Note that defining an event threshold ε that varies depending on the length of the observation period represents a stronger condition for classifying generator nodes. Step 3 above marks fewer nodes as generator nodes (compared to the case where ε = 1), since fewer intersections have a persistent high rate of assaults. Nonetheless, the solid line in Fig 6 shows that Criterion 1 identifies the formation of hotspots for observation periods of any length. The horizontal line indicates that, regardless of the length of the observation period, hotspots on the street network are formed as the outcome of high rates of assaults at intersections which are located relatively close to each other. The dashed line in Fig 6, in contrast, shows that when applying the Hopkins test for the same observation periods, the percentage of instances the null hypothesis is rejected (i.e, hotspot are detection) tends to decrease as longer periods are considered. According the Hopkins test, intersections with high rate of assaults are not close to each other in the metric space. Unlike to the proposed approach, the Hopkins test is not a decisive approach to identify a high concentration of events on the network (it now rejects the null hypothesis only about 40-50% of all instances).
10.1371/journal.pone.0241790.g006Percentage of instances that the null hypothesis is rejected for different observation periods of length <italic>t</italic> and varying event threshold <italic>ε</italic> = <italic>t</italic>/7 according to the proposed criterion (solid curve) and the Hopkins test (dashed curve).Co-purchase network of Amazon
Next, consider a co-purchase network of products from Amazon [15]. After a purchase, users can rate their satisfaction with the product they bought on a scale from 1 (very dissatisfied) to 5 (very satisfied). The co-purchase network consists of nodes that represent users and edges that connect users who bought the same product. The average rating of a user indicates overall user satisfaction. Dissatisfied users are defined as the users whose average rating is below to the 10^{th} percentile. Analyzing event concentration on the user co-purchase network enables us to evaluate whether dissatisfied users, who purchase a shared set of products, are concentrated on some parts of the network. The network contains 8444 nodes and 38492 edges.
The average local clustering of the network is 0.68, which implies that the network does not satisfy Assumption 1. Fig 7 assesses the quality of the approximation provided by Algorithm 1. It shows the percentage of simulations where the null hypothesis is accepted based on the Chi Square test for different significance levels α. Note that with α = 10^{−4} the null hypothesis is accepted more than 90% of all simulations. In other words, for a significance level α ≤ 10^{−4}, the simulated distribution is equal to the theoretical approximation for more than the 90% of the runs. Though Assumption 1.3 is not satisfied, the approximation provided by Algorithm 1 is quite good.
10.1371/journal.pone.0241790.g007Percentage of instances that the the null hypothesis is accepted when generator nodes are located uniformly at random for 100 simulations.
Fig 8 shows the complementary cumulative degree distribution. Given the power law that characterizes the tail of the degree distribution, we use Criterion 1.2 to determine whether events are concentrated.
10.1371/journal.pone.0241790.g008Complementary cumulative degree distribution of the co-purchase network of Amazon.
Fig 9 shows the pmf of the distribution of the sizes of Voronoi cells and the theoretical distribution. The proposed criterion determines that dissatisfied users are not concentrated. Indeed, note that the empirical distribution resembles the theoretical distribution for which events are located uniformly at random. If we apply the χ^{2} test to both distributions, then the null hypothesis is accepted (for α = 10^{−4}), which means that dissatisfied users are uniformly distributed across the network.
10.1371/journal.pone.0241790.g009Distributions <italic>F</italic> and <italic>F</italic><sub><italic>e</italic></sub> when dissatisfied users are marked as generator nodes on the co-purchase network of Amazon.
Finally, we modify the initial ratings to obtain an artificial concentration of dissatisfied users. To generate these hotspots, we first divide the co-purchase network into communities based on the measure of community modularity. Second, we select two communities such that the total number of members of both communities is approximately 844 (p = 0.1). Third, we select products that have been bought by at least two members of the two communities, and assign a rating of 1 to the transactions that involve these products. Finally, we compute the set of dissatisfied users based on the new average rating of each user. Note that dissatisfied users are now arbitrary concentrated across the two selected communities. Criterion 1.2 determines that dissatisfied users are now indeed concentrated. Fig 10 shows the pmf of the sizes of Voronoi cells when dissatisfied users are marked as events and the theoretical pmf from a uniform allocation. As expected, the number of cells of small size increases when events are concentrated. Note that the number of cells of size one in the empirical distribution is approximately four times larger than the number in the theoretical distribution.
10.1371/journal.pone.0241790.g010Distributions <italic>F</italic> and <italic>F</italic><sub><italic>e</italic></sub> when dissatisfied users are marked as events on the co-purchase network.Discussion
The proposed framework enables us to derive a summary statistic for measuring event concentration based on Voronoi diagrams. It provides an approximation for the distribution of the sizes of Voronoi cells for regular, Poisson, and power law networks in which events are distributed uniformly at random. When the distribution of events obeys a non-uniform allocation, groups of small, adjacent Voronoi cells indicate subnetworks where events (generator nodes) are highly concentrated (hotspots are formed).
Building on this key property of Voronoi diagrams, the proposed criterion for detecting hotspots enables us to measure concentration across a variety of scenarios in which events are distributed over a network. Its applications range from determining whether events such as traffic accidents or fire outbreaks are concentrated in certain parts of a city, to evaluating whether influencers in a topic area (e.g., sports or politics) are gathered together in particular subgraphs of a social network.
Our work illustrates the criterion by analyzing the distribution of assaults on the street network of Chicago at various time scales, and considering various event thresholds ε. We show how the criterion can be used to estimate the smallest observation period for explaining the formation of stationary concentrations of events. Our analysis of the distribution of events over urban structures such as the street network aims to complement traditional approaches that identify clusters on the metric space. We compare the outcome of the proposed criterion to that of detecting hotspots by evaluating the proximity of events in the metric space (using the Hopkins test). Finally, we also measure event concentration in a co-purchase network and show that dissatisfied users are uniformly distributed over the network.
The results presented in this paper should be considered in the light of some limitations. The theoretical framework used to derive the hotspot criterion requires certain assumptions on the topological properties of the network, which are generally only approximately met by empirical networks. In particular, we assume that (i) the average local clustering coefficient is negligible (below 0.1), and (ii) the degree distribution of the nodes with events resembles that of the entire network. Satisfying these assumptions guarantees that Algorithm 1 can compute the pmf of the sizes of the cells for a network with a uniform event distribution. Analyzing the behavior of the proposed framework for networks with high clustering remains an interesting direction for future research.
AppendixA: Proof of Theorem 1
Theorem 1. The pmf of Xd1 is given by
F1d(x)=∑i=0dP[Rd1=i]P1(x,i,P[Wd1=1])
where
P[Rd1=x]=P1(x,d,1−p)P[Zd1=x]=∑di=kmnP[Dd1=di]P1(x,di−1,p)P[Wd1=x]=∑i=0nP[Zd1=i]P2(x,11+i)P1(x,d,q)=(dx)qx(1−q)d−xP2(x,q)=qx(1−q)1−x
Proof. To define the pmf of Xd1, consider both regular and generator nodes in N_{1}. Let Rd1 be a random variable that denotes the number of regular nodes in N_{1}. Since p is the probability of randomly selecting a generator node, we know that
Rd1∼Bin(d,1−p)
Remark 1. Based on definition 1, if a regular node v_{i}satisfies that ρ(v_{i}, u) ≤ ρ(v_{i}, u_{j}) for all u_{j} ∈ U, and i = |{u_{i} ∈ U \ {u}:ρ(v_{i}, u_{i}) = ρ(v_{i}, u)}|, then the probability that node v_{i}belongs to V(u) is1i+1. Otherwise, the probability that v_{i}belongs to V(u) is 0.
Note that a regular node in N_{1} does not necessarily belongs to V(u). According to remark 1, the probability that a regular node in N_{1} belongs to V(u) depends on the number of neighboring generator nodes of that node. Let Zd1 be a random variable that represents the number of generator nodes in N_{2}, that are neighbors of a regular node in N_{1}.
Remark 2. Note that the distribution of the number of generator neighbors in N_{δ+1}of a regular node with degree d_{i}, located in N_{δ}, obeys a binomial distribution Bin(d_{i} − 1, p).
According to Remark 2, the pmf of Zd1 is a mixture of binomial distributions
P[Zd1=x]=∑di=kmnP[Dd1=di]P1(x,di−1,p)
where
P1(x,d,q)≔(dx)qx(1−q)d−x
represents the pmf of the Binomial distribution.
Let Wd1 be a Bernoulli random variable that indicates if a regular node in N_{1} belongs to V(u). According to Remark 1, note that
P[Wd1=x]=∑i=0nP[Zd1=i]P2(x,11+i)
where
P2(x,q)≔qx(1−q)1−x
represents the pmf of the Bernoulli distribution.
Note that P[Wd1=1] denotes the probability that a regular node in N_{1} belongs to V(u). Moreover, if there are a total of i regular nodes in N_{1}, then Bin(i,P[Wd1=1]) is the distribution of the number of nodes that belong to the cell in N_{1}. The distribution of Xd1 obeys
F1d(x)=∑i=0dP[Rd1=i]P1(x,i,P[Wd1=1])
B: Proof of Theorem 2
Theorem 2. The pmf of Xd2i is given by
F2d(x)=∑i=0nP[Yd2=i]P1(x,i,P[Wd2=1])
where
P[Rd2=x]=∑dj=i+1nP[Dd1=dj]P1(x,dj−1,1−p)(1dj−i)∑k=0n∑dj=k+1nP[Dd1=dj]P1(k,dj−1,1−p)(1dj−k),P[Gd2=x]≈∑di=kmnP[D=di]dik¯P1(x,di−1,p)P[Yd2=x]=∑i=0nP[Rd2=i]P1(x,i,P[Gd2=0])P[D¯d2=di]=P[Dd2=di]P1(0,di−1,p)∑di=kmnP[Dd2=di]P1(0,di−1,p)P[Zd2=x]=∑di=kmnP[D¯d2=di]P1(x,di−1,1−∑di=kmnP[D=di]dik¯P1(0,di−1,p))P[Wd2=x]=∑i=0nP[Zd2=i]P2(x,1i+1)
Proof. We extend the analysis described in the appendix A: Proof of Theorem 1 of N_{1} to N_{2}. Let Rd2 denote a random variable that represents the number of regular nodes, located in N_{2}, that are neighbors of a single node in V(u) ∩ N_{1}. Note that a node with degree d_{j} in N_{1} has, with probability P_{1}(i, d_{j} − 1, 1 − p), a total of i regular neighbors in N_{2}. Based on Remark 1, this node belong to V(u) with probability 1dj−i. The probability that a node that belongs to V(u), located in N_{1}, has i regular nodes in N_{2} is given by
P[Rd2=x]=∑dj=i+1nP[Dd1=dj]P1(x,dj−1,1−p)(1dj−i)∑k=0n∑dj=k+1nP[Dd1=dj]P1(k,dj−1,1−p)(1dj−k)
Note that the probability that a regular node, located in N_{1} of node u, belongs to V(u) is greater than 0. However, according to Remark 1, this is not true for N_{δ} when δ ≥ 2. For instance, if a regular node, located in N_{2}, has a neighboring generator node, located in N_{3}, then the probability that the regular node belongs to V(u) is 0. We refer to a regular node that has a probability greater than 0 to belong to V(u), as a candidate node. In other words, a regular node v is a candidate node if for all u′ ∈ U: ρ(v, u) ≤ ρ(v, u′). Note that each regular node, located in N_{1}, is a candidate node. Fig 11 highlights the candidates nodes in green.
10.1371/journal.pone.0241790.g011The concept of candidate node.
Generator nodes are marked in red, regular nodes in blue, and candidates nodes of V(u) in green.
Consider a randomly selected regular node v, located in N_{2}, that is a neighbor of a single node in V(u) ∩ N_{1}. Let Gd2 denote a random variable that represents the number of neighboring generator nodes of v, located in N_{3}. According to Remark 2 and using the approximation given in Eq (1), note that
P[Gd2=x]=∑di=kmnP[Dd2=di]P1(x,di−1,p)≈∑di=kmnP[D=di]dik¯P1(x,di−1,p)
Furthermore, let Yd2 denote a random variable that represents the number of candidate nodes in N_{2} that are neighbors for a single node that belongs to V(u) ∩ N_{1}. A regular node in N_{2} is a candidate node if there is not a neighboring generator node in N_{3}. So the probability of a regular node in N_{2} of being a candidate node of V(u) is P[Gd2=0]. Then, note that if a node v, located in N_{1}, has i neighboring regular nodes, the distribution of the number of neighboring candidate nodes in N_{2} for node v obeys Bin(i,P[Gd2=0]). That is
P[Yd2=x]=∑i=0nP[Rd2=i]P1(x,i,P[Gd2=0])
Based on the distribution of candidate nodes in N_{2}, we now calculate the distribution of the number of nodes that belong to V(u) ∩ N_{2}. Let D¯d2 denote a random variable that characterizes the degree of a randomly selected candidate node in N_{2}. According to Remark 2, P_{1}(0, d_{i}−1, p) is the probability that a node with degree d_{i}, located in N_{2}, has no generator nodes as neighbors in N_{3}. So the probability that a randomly selected candidate node in N_{2} has degree d_{i} is given by
P[D¯d2=di]=P[Dd2=di]P1(0,di−1,p)∑di=kmnP[Dd2=di]P1(0,di−1,p)
Note that a candidate node in N_{2} satisfies that its distance to any generator node is greater or equal to 2. Let Zd2 denote a random variable that represents the number of generator nodes, except node u, that are at distance of 2 for a candidate node that is located in N_{2}. In other words, Zd2 represents the number of Voronoi cells, except V(u), that can contain a candidate node, located in N_{2} of node u. Fig 12 depicts a candidate node of V(u) that can be potentially contained in three Voronoi cells including V(u), i.e., the probability that the candidate node belongs to any of the three cells is 13. Note that for each regular node in N_{3} that has at least one neighboring generator in N_{4}, there is an additional Voronoi cell that can potentially contain the candidate node in N_{2}. Based on Eq (1), an approximation of the probability that a node in N_{3} has at least one neighboring generator node in N_{4} is
1−∑di=kmnP[D=di]dik¯P1(0,di−1,p)
10.1371/journal.pone.0241790.g012The concept of candidate node in <italic>N</italic><sub>2</sub>.
Representation of a candidate node of V(u) in N_{2} that can potentially belong to three different Voronoi cells.
Furthermore, note that the distribution of the number of generator nodes, except node u, that can contain in their cells a candidate node in N_{2} with degree d_{i} is given by
Bin(di−1,1−∑di=kmnP[D=di]dik¯P1(0,di−1,p))
According to Remark 1, the pmf of Zd2 is the mixture of binomial distributions
P[Zd2=x]=∑di=kmnP[D¯d2=di]P1(x,di−1,1−∑di=kmnP[D=di]dik¯P1(0,di−1,p))
If Wd2 is a Bernoulli random variable that indicates the probability that a candidate node in N_{2} belongs to V(u), then
P[Wd2=x]=∑i=0nP[Zd2=i]P2(x,1i+1)
Based on the pmf of Wd2, the distribution of Xd2 is given by
F2d(x)=∑i=0nP[Yd2=i]P1(x,i,P[Wd2=1])
C: Proof of Theorem 3
Theorem 3. The pmf of X_{d} is given by
Fd(x)=∑i=0dF1d(i)F2di(x−i−1)
where
F2d0(x)={1x=00otherwiseF2d1(x)=F2d(x)F2di(x)=∑m=−∞∞F2di−1(x−m)F2d(m)
Proof. To define the pmf of X_{d}, recall that P[Xd2i=x]=F2d(x). It can be shown that
P[Xd21+Xd22=x]=∑m=−∞∞F2d(x−m)F2d(m)
Let F2d2(x)=∑m=−∞∞F2d(x−m)F2d(m). The distribution of the sum of multiple instances of a random variable can be generalized as
P[Xd21+Xd22+⋯+Xd2(j+1)︸j+1=x]=∑m=−∞∞F2dj(x−m)F2d(m)
where F2dj(x)≔∑m=−∞∞F2dj−1(x−m)F2d(m). Based on Eqs (2) and (35), the pmf of X_{d} can be written as a mixture distribution
Fd(x)≔P[Xd=x]=∑i=0dF1d(i)F2di(x−i−1)
where
F2d0(x)={1x=00otherwiseF2d1(x)=F2d(x)
The first term in Eq (36) indicates the probability that i nodes in N_{1} belong to V(u). The second term represents the pmf of the sum of i instances of Xd2, with a shift of i + 1 to account for the number of nodes in N_{1} and for the generator node. Note that if the number of nodes in N_{1} that belong to V(u) is zero, then it is not possible that a node in N_{2} belongs to V(u), according to Eq (37).
ReferencesEckJE, LiuL. Contrasting simulated and empirical experiments in crime prevention. GroffER. Simulation for theory testing and experimentation: An example using routine activity theory and street robbery. GordonMB, NadalJP, PhanD, SemeshenkoV. Discrete choices under social influence: Generic properties. ShortMB, BrantinghamPJ, BertozziAL, TitaGE. Dissipation and displacement of hotspots in reaction-diffusion models of crime. PitcherAB. Adding police to a mathematical model of burglary. Gonzales AR. Mapping Crime: Understanding Hot Spots; 2005. National Institute of Justice report. Available at: http://discovery.ucl.ac.uk/11291/1/11291.pdf.ChaineyS, RatcliffeJ. BuchinK, CabelloS, GudmundssonJ, LöfflerM, LuoJ, RoteG, et al. Detecting hotspots in geographic networks. In: CamposJ, FinkeJ. Detecting Hotspots on Networks. In: CherifiH, GaitoS, MendesJF, MoroE, RochaLM, editors. MeyerW. DemiryurekU, ShahabiC. Indexing network Voronoi diagrams. FotouhiB, RabbatMG. Degree correlation in scale-free graphs. Chicago Police Department (2016). Crimes—2001 to Present. Chicago Open Data Portal https://data.cityofchicago.org/Public-Safety/Crimes-2001/8v97-unyc.HopkinsBrian and SkellamJohn GordonBing Lu (2010). Amazon ratings. Koblenz Network Collection http://konect.cc/networks/amazon-ratings/