
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JL HS DATC. Performed the experiments: JL HS MKG. Analyzed the data: JL HS MKG. Contributed reagents/materials/analysis tools: JL HS. Wrote the paper: JL HS MKG DATC.

Current address: Department of Biology, University of Florida, Gainesville, Florida, United States of America

Global spatial clustering is the tendency of points, here cases of infectious disease, to occur closer together than expected by chance. The extent of global clustering can provide a window into the spatial scale of disease transmission, thereby providing insights into the mechanism of spread, and informing optimal surveillance and control. Here the authors present an interpretable measure of spatial clustering, τ, which can be understood as a measure of relative risk. When biological or temporal information can be used to identify sets of potentially linked and likely unlinked cases, this measure can be estimated without knowledge of the underlying population distribution. The greater our ability to distinguish closely related (i.e., separated by few generations of transmission) from more distantly related cases, the more closely τ will track the true scale of transmission. The authors illustrate this approach using examples from the analyses of HIV, dengue and measles, and provide an

The spatial and temporal scales over which cases are likely to be found are among the most fundamental determinants of the population dynamics of infectious disease, but are also among the hardest to measure. In particular, heterogeneities in the spatial distribution of the underlying population are often unknown, and may confound the measurement of spatial clustering resulting from the disease process. Here we discuss the potential uses of global clustering statistics in infectious disease epidemiology, and introduce an approach that can be used to measure clustering resulting from the disease process even when the underlying spatial distribution of the population is unknown. We discuss the performance characteristics of this approach and present three examples of its use.

Measures of clustering can be broadly divided into two types: local (or first order) clustering statistics that measure the tendency of events (i.e., cases) to occur around a particular point in space, and global (or second order) clustering statistics that measure the tendency of events to cluster in space in general [

The location of a case in relation to their infector defines a spatial transmission kernel for an infectious disease (

While a spatial transmission kernel is an essential property of disease transmission, it can be hard to measure. Even if we account for clustering in where people live, the actual distribution of related cases that we see in a population represents the results of multiple generations of transmission. This distribution is driven in part by the transmission kernel for a disease, but can also be highly influenced by both exogenous risk factors and clustering in disease susceptibility [

In many cases the observed spatial clustering of cases after multiple generations of transmission may be of greater public health importance than the transmission kernel (

Measures of spatial dependence, particularly when compared across times and settings, may also provide important clues to other aspects of disease epidemiology and biology. Spatial dependence not driven by transmission may indicate clustering in underlying risk factors or behaviors (e.g., clustering of vaccine refusal in the United States [

A number of popular global clustering statistics exist; they either use the exact locations of the points or aggregate them into grid cells. Aggregate (or quadrat) approaches first place a grid over the study area and count the number of points falling within each grid cell. Area-level statistics can then be used to determine if cells with many points tend to occur near each other (e.g., Moran’s I or Geary’s c) [^{2} [

If a global clustering statistic is to be useful in the study of infectious disease, it must be easily interpretable in terms of disease risk, be comparable across settings, and distinguish spatial variation due to the transmission process from variation due to clustering in the underlying population. While commonly used techniques like the K-function and Moran’s I are highly useful, none quite satisfies these criteria. In an attempt to meet this challenge, we have sought to develop a measure of spatial dependence that is easily interpretable in terms of disease risk, and to extend this measure to make efficient use of pathogen strain (as measured by serotype, genotype, or other measure) so that valid inferences can be made when the spatial distribution of the underlying population is unknown. Hence, this approach should remain useful even when neither the catchment area for cases nor the underlying distribution of the population is known.

Epidemiologists typically characterize populations or exposures leading to elevated risk of a health outcome using a measure or approximation of relative risk. Measures such as the hazard ratio, incidence rate ratio and the cumulative incidence ratio will be familiar to all epidemiologists. We propose the τ-statistic, a measure of the relative risk of someone at a particular spatial distance from a case also being a case, versus the risk of anyone in the population being a case. That is:
$$\tau(d_1, d_2) = \frac{\hat{\lambda}(d_1, d_2)}{\hat{\lambda}(0, \infty)}$$
where $\hat{\lambda}(d_1, d_2)$ is the expected incidence rate of someone in distance range $d_1$, $d_2$ of a case, and $\hat{\lambda}(0, \infty)$ is the expected incidence rate of anyone in the population.
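As a rough illustration of this definition, the following Python sketch computes the ratio when the whole population is observed. It makes several assumptions that are not part of the text: the proportion of individuals who are cases stands in for the incidence rate, each case is excluded from its own neighbourhood, and the function name `tau_known_population` and the toy coordinates are hypothetical.

```python
import math

def tau_known_population(points, d1, d2):
    """points: list of (x, y, is_case). Returns the relative risk for [d1, d2)."""
    near_total = near_cases = 0  # people within [d1, d2) of some case
    all_total = all_cases = 0    # people at any distance from some case
    for i, (xi, yi, ci) in enumerate(points):
        if not ci:
            continue  # only cases act as reference points
        for j, (xj, yj, cj) in enumerate(points):
            if i == j:
                continue  # assumption: a case is not its own neighbour
            dist = math.hypot(xi - xj, yi - yj)
            all_total += 1
            all_cases += bool(cj)
            if d1 <= dist < d2:
                near_total += 1
                near_cases += bool(cj)
    if near_total == 0 or all_cases == 0:
        return float("nan")  # ratio undefined for this distance band
    return (near_cases / near_total) / (all_cases / all_total)

# Toy population: two neighbouring cases and four distant non-cases.
population = [
    (0, 0, True), (1, 0, True),
    (10, 0, False), (11, 0, False), (0, 10, False), (1, 10, False),
]
tau_near = tau_known_population(population, 0, 2)
```

In this toy population everyone within two distance units of a case is also a case, while one in five of the other individuals seen from a case is a case, so the ratio is well above 1, which is the pattern expected under spatial clustering.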

In infectious disease epidemiology, we are often interested in the relative risk of infection at various spatial scales in situations where we do not know the underlying distribution of the population at risk. In these circumstances we can adapt the τ-statistic to use information on the infecting pathogen (such as serotype/genotype) to distinguish between pairs of cases that are consistent with coming from the same transmission chain and pairs that are inconsistent with coming from the same chain. For example, there are four serotypes of the dengue virus that often co-circulate. Pairs of cases infected by the same serotype (homotypic cases) may come from the same chain whereas cases from different serotypes (heterotypic cases) must come from separate chains. Simply approximating _{1}, _{2}) by proportion of the cases within a spatial region that are homotypic biases spatial estimates towards the null (see _{1}, _{2} of a case is homotypic (see _{1}, _{2} of a case is the same as the probability of a heterotypic case occurring anywhere in the population. If this is not the case, then _{1}, _{2}) is provided (appendices 5, 6 and 7).
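The serotype-based variant can be sketched in the same spirit. The code below is one possible reading of the approach, not the authors' implementation: it assumes the estimator is the ratio of homotypic to heterotypic case pairs within the distance band, divided by the same ratio over all distances. The function names and the toy data are hypothetical.

```python
import math

def homotypic_odds(cases, d1, d2):
    """Ratio of homotypic to heterotypic case pairs separated by [d1, d2)."""
    same = diff = 0
    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            xi, yi, si = cases[i]
            xj, yj, sj = cases[j]
            if d1 <= math.hypot(xi - xj, yi - yj) < d2:
                if si == sj:
                    same += 1
                else:
                    diff += 1
    # Undefined when no heterotypic pair falls in the band.
    return same / diff if diff else float("nan")

def tau_by_serotype(cases, d1, d2):
    """Band-specific odds relative to the odds over all distances."""
    return homotypic_odds(cases, d1, d2) / homotypic_odds(cases, 0.0, math.inf)

# Toy data: (x, y, serotype). Two tight groups plus one stray serotype-2 case.
dengue_cases = [
    (0, 0, 1), (1, 0, 1), (0, 1, 1), (0.5, 0.5, 2),
    (10, 10, 2), (11, 10, 2), (10, 11, 2),
]
tau_serotype_near = tau_by_serotype(dengue_cases, 0, 2)
```

Only case locations and serotypes enter the calculation, which is why no information on the underlying population is needed under the stated assumption about heterotypic cases.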

Where we do not have information on genotype or serotype, we may be able to use temporal information to distinguish those cases that cannot be closely related from those that may be closely related. A case occurring within a small number of serial intervals (e.g. one to two) from another case has the potential to be that case’s direct offspring or share a recent common ancestor, whereas those cases separated by longer intervals are less likely to be so. In such circumstances, _{1}, _{2} occurred within a short time frame.
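A temporal version can be sketched by swapping the serotype comparison for a time window. This is an illustrative assumption rather than the authors' stated estimator: pairs of cases whose onset times differ by at most `max_gap` are treated as potentially related, all other pairs as unrelated, and the ratio of the two counts within a distance band is compared with the same ratio over all distances. The function name, the 14-day window and the toy data are hypothetical.

```python
import math
from itertools import combinations

def tau_by_time(cases, d1, d2, max_gap):
    """cases: list of (x, y, onset_day). Returns the temporal-pair ratio."""
    def odds(lo, hi):
        close = far = 0
        for (xa, ya, ta), (xb, yb, tb) in combinations(cases, 2):
            if lo <= math.hypot(xa - xb, ya - yb) < hi:
                if abs(ta - tb) <= max_gap:
                    close += 1  # potentially related in time
                else:
                    far += 1    # too far apart in time to be closely related
        return close / far if far else float("nan")
    return odds(d1, d2) / odds(0.0, math.inf)

# Toy data: two households, each with a pair of cases close in time.
timed_cases = [
    (0, 0, 0), (1, 0, 5), (0, 1, 60),
    (20, 0, 100), (21, 0, 108), (20, 1, 200),
]
tau_time_near = tau_by_time(timed_cases, 0, 2, max_gap=14)
```

The window would normally be chosen from the serial interval of the disease, for example one to two serial intervals as described above.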

The τ-statistic is equivalent to ratios of particular multitype pair correlation functions. For example, where both a sample of cases (points of type _{ii}_{ij}

Simulation methods can be used to calculate statistical significance and confidence intervals for the τ-statistic. The null distribution of the τ-statistic can be determined by repeatedly randomly permuting the locations of the point pattern data (i.e., randomly assigning each covariate vector to one of the existing point locations, one covariate vector per location) and recalculating the τ-statistic after each permutation, then taking quantiles of the resulting distribution as one would do in point pattern data (a similar process has been used to determine the null distribution for other spatial statistics [
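The permutation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the covariate is a single serotype label per case, the statistic is a pair-count ratio of homotypic to heterotypic pairs, permutations where the statistic is undefined are dropped, and all function names and data are hypothetical.

```python
import math
import random
from itertools import combinations

def tau_serotype(locations, serotypes, d1, d2):
    """Homotypic:heterotypic pair ratio in [d1, d2) over the all-distance ratio."""
    def odds(lo, hi):
        same = diff = 0
        for a, b in combinations(range(len(locations)), 2):
            (xa, ya), (xb, yb) = locations[a], locations[b]
            if lo <= math.hypot(xa - xb, ya - yb) < hi:
                if serotypes[a] == serotypes[b]:
                    same += 1
                else:
                    diff += 1
        return same / diff if diff else float("nan")
    return odds(d1, d2) / odds(0.0, math.inf)

def permutation_null(locations, serotypes, d1, d2, n_perm=999, seed=0):
    """Reassign serotype labels to the fixed locations and recompute each time."""
    rng = random.Random(seed)
    shuffled = list(serotypes)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # one label per existing location
        value = tau_serotype(locations, shuffled, d1, d2)
        if not math.isnan(value):  # assumption: skip undefined permutations
            null.append(value)
    return sorted(null)

def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list."""
    idx = round(q * (len(sorted_values) - 1))
    return sorted_values[min(len(sorted_values) - 1, max(0, idx))]

locations = [(0, 0), (1, 0), (0, 1), (0.5, 0.5), (10, 10), (11, 10), (10, 11)]
serotypes = [1, 1, 1, 2, 2, 2, 2]

observed = tau_serotype(locations, serotypes, 0, 2)
null = permutation_null(locations, serotypes, 0, 2)
null_low, null_high = quantile(null, 0.025), quantile(null, 0.975)
# One-sided p-value: share of permuted values at least as large as observed.
p_value = (1 + sum(v >= observed - 1e-12 for v in null)) / (1 + len(null))
```

An observed value above the upper quantile of the null distribution suggests more clustering of same-serotype cases at that distance than expected if serotype were unrelated to location.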

We have developed and submitted to CRAN an R package,

Observed cases of disease usually come from multiple generations of transmission stemming from one or more transmission chains that share a common ancestor sometime in the (often very distant) past (

Using simulated epidemics, we can assess the performance of the τ-statistic in a population that is itself spatially clustered (population distribution shown in

Each transmission chain consisted of cases caused by a single strain, with all strains divided into four serotypes.

We are rarely able to detect all infections in an outbreak. Large numbers of asymptomatic infections and imperfect surveillance systems mean that only a minority of cases may be captured. The τ-statistic is robust even when 99% of cases are unobserved (

For the spatially random observation all cases had a 1% probability of being observed. For the partially observed observation, the probability of observation was 0.1xexp(-

Dengue is a mosquito-borne virus that exists in one of four genetically distinct forms called serotypes. The virus causes substantial morbidity and mortality in tropical and sub-tropical regions worldwide. Once a person is infected with a serotype they are immune to that serotype for life, but may still be infected with another serotype [

To characterize the spatial dependence between cases of dengue in Bangkok, a study used the geocoded home addresses of 1,912 patients that presented at a local hospital with confirmed dengue between 1995 and 1999, the date they were admitted and information on the infecting serotype [

HIV is endemic in sub-Saharan Africa and occurs there with higher incidence than anywhere else in the world; however, the mechanisms of HIV spread and persistence within African populations are poorly understood, particularly at local spatial scales. For instance, the relative frequency of direct transmission between sexual partners who live far apart, versus direct transmission between partners who reside in the same community, is unknown. Insights into the spatial scale of HIV transmission can be gained from global spatial clustering analyses. Such spatial analyses can inform basic HIV epidemiology, control efforts, and the design of community randomized clinical trials for HIV prevention, the latter of which often assume that the majority of direct transmission occurs within communities.

In a large population-based cohort in rural Rakai, Uganda, HIV incidence was measured in over 8000 geolocated households in 46 communities [

To assess evidence of spatial dependence using the τ-statistic in a directly transmitted disease, we used data from a well-studied outbreak of measles in Hagelloch, Germany in 1861–1862. In total, 188 children became sick with measles between October 1861 and January 1862. Information on the location of the case homes and when they presented with symptoms is available in the R package

Global clustering statistics are an important tool for spatial analytics that can be used to better understand the transmission of infectious disease. The τ-statistic presented here represents one approach to measuring global clustering that has an easy interpretation and overcomes many of the challenges encountered when analyzing infectious disease data. To catalyze use of this approach in empirical studies and methodological research, we have provided an open source R package implementing many of the methods presented here (to which all are welcome to contribute). Important areas for further work include the extension of these methods to better analyze particular types of space-time data, the development of statistics to derive transmission kernels from observed clustering data, and new techniques for identifying specific clusters building on the ideas presented here.

The τ-statistic provides a valuable tool to capture spatial dependence in epidemiologically relevant terms. However, it should be used alongside existing measures of spatial dependence, because it is qualitatively different from other approaches and its values may not be comparable with theirs on a quantitative level. Further, we describe the application of the τ-statistic specifically to infectious disease point pattern processes. In these processes, the clustering process for the underlying population is typically independent from the clustering due to disease transmission. Where the underlying population is not known, we expect independence in the clustering processes of the different pathogen types (e.g., independent clustering of homotypic and heterotypic cases of dengue). Such an approach may not be appropriate where we do not have any

This statistic has already proven useful in the analysis of the spatial dispersal of a wide range of infectious diseases, including dengue, chikungunya, influenza and HIV [

Confidence intervals for $\tau(d_1, d_2)$ estimates for cases separated by (A) four months and (B) 15 months. Results from 1,912 dengue cases that presented at a Bangkok hospital between 1995 and 1999.

(TIF)

Estimates of $\tau(d_1, d_2)$ when there is spatially biased observation dependent on the location of a pair of healthcare locations. (A) Locations of pairs of healthcare providers. (B) Estimates of $\tau(d_1, d_2)$ using locations in (A). The color of each line is the color of the corresponding locations in (A). The solid blue line is the estimate when all cases are observed.

(TIF)
