Conceived and designed the experiments: AGL LMB. Analyzed the data: AGL. Contributed reagents/materials/analysis tools: AGL HY LMB. Wrote the paper: AGL LMB.

The authors acknowledge partial support for this research from The Boeing Company. This does not alter their adherence to all the PLoS ONE policies on sharing data and materials.

Urban scaling relations characterizing how diverse properties of cities vary on average with their population size have recently been shown to be a general quantitative property of many urban systems around the world. However, in previous studies the statistics of urban indicators were not analyzed in detail, raising important questions about the full characterization of urban properties and how scaling relations may emerge in these larger contexts. Here, we build a self-consistent statistical framework that characterizes the joint probability distributions of urban indicators and city population sizes across an urban system. To develop this framework empirically we use one of the most granular and stochastic urban indicators available, specifically measuring homicides in cities of Brazil, Colombia and Mexico, three nations with high and fast changing rates of violent crime. We use these data to derive the conditional probability of the number of homicides per year given the population size of a city. To do this we use Bayes’ rule together with the estimated conditional probability of city size given their number of homicides and the distribution of total homicides. We then show that scaling laws emerge as expectation values of these conditional statistics. Knowledge of these distributions implies, in turn, a relationship between scaling and population size distribution exponents that can be used to predict Zipf’s exponent from urban indicator statistics. Our results also suggest how a general statistical theory of urban indicators may be constructed from the stochastic dynamics of social interaction processes in cities.

The search for a general multidisciplinary science of cities is a fundamental scientific problem with strong roots in economics

Cities should be regarded primarily as dynamical social networks, constantly changing in terms of their composition and interactions. Consequently, urban indicators, denoted by

In order to probe urban indicators that show granularity and a large level of temporal and geographic variation, we analyze here data on annual homicides in cities of three Latin American countries over a period of several years during which national homicide rates have varied substantially. The data come from Brazil, Colombia and Mexico, three of the largest nations in Latin America, which presently show some of the highest homicide rates in the world and for which data are available at the municipal level. The number of homicides is a quantity that is widely available at the local level in developed and developing nations. It is generally thought to be reliably reported, notwithstanding some important caveats

Homicides, as the ultimate expression of violence in human societies, are a widely investigated quantity

In the next section we discuss some of the characteristics of the data and our main formal objective, the estimation of the conditional probability density

We have recently shown

We have also observed that for US metropolitan areas many urban quantities vary only slowly, with most change being due to the temporal variation of

To address these points, we introduce here new extensive data sets for homicides in three fast evolving (and developing) nations: Brazil, Colombia and Mexico. These nations are presently among the most violent in the world with registered homicide rates greater than 15 per 100,000 inhabitants, see

In Brazil, Colombia and Mexico the smallest spatial unit for which data are available is the municipality (

We motivate the need for our statistical study by displaying in

Large cities are defined in terms of metropolitan areas which are aggregations of municipalities (red circles) while non-metropolitan municipalities are shown separately (green squares). The solid blue line fits only the scaling of homicides for metropolitan areas. Large variations, especially among the smaller population units, and the fact that many municipalities have

Equation (1) is an average statement that cannot be obeyed exactly in every instance. This is not only because all cities have specific local characteristics and urban indicators fluctuate over time but, more fundamentally, because a continuous scaling relation must break down in the limit of small discrete numbers. The correct statement must then be formulated in probabilistic terms. To do this we think of both

This problem is specified in terms of the conditional probability distribution function of

The reason to estimate

The distribution of total homicides in cities

In practice, we adopt a common procedure of plotting the complementary cumulative distribution function rather than the probability density function, which avoids the noisy character of the tail for large cities.
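The complementary cumulative distribution can be computed directly from a list of counts. The following is a minimal sketch, not the procedure used for the figures: the function name `empirical_ccdf`, the sample counts, and the convention of counting observations greater than or equal to each value are all assumptions made for illustration.

```python
from bisect import bisect_left


def empirical_ccdf(values):
    """Return (x, fraction of observations >= x) for each distinct x.

    The ">=" convention is an assumption of this sketch; a strict ">"
    convention shifts every point by one step.
    """
    data = sorted(values)
    n = len(data)
    points = []
    for x in sorted(set(data)):
        # bisect_left counts the observations strictly below x,
        # so n minus that count is the number of observations >= x.
        points.append((x, (n - bisect_left(data, x)) / n))
    return points


# Hypothetical annual homicide counts for eight municipalities.
counts = [1, 1, 2, 3, 3, 3, 8, 21]
ccdf = empirical_ccdf(counts)
for x, frac in ccdf:
    print(x, frac)
```

Because every point of the cumulative curve aggregates all observations at or above a given value, the sparse large-city tail appears much smoother than in a binned density estimate. Municipalities with zero homicides cannot be placed on logarithmic axes and would need separate treatment.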

The empirical cumulative distributions of homicides for the year 2007 in Colombia, Mexico and Brazil are shown in

Here we plot not the density function but the complementary cumulative distribution to attenuate the tail fluctuations and ease visual interpretation. Best fits (dashed red line) of the form

To calculate

Each column corresponds to a different country and each row, from top to bottom, corresponds to the values

The shape of this distribution, which we had more implicitly noted in

One drawback of the log-normal distribution is that both

The mean and variance are given by:

The maximum likelihood estimators of the log-normal parameters are:
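In the standard parametrization of the log-normal distribution, the maximum likelihood estimators are the sample mean and the sample variance (with divisor n) of the logarithms of the data. The sketch below illustrates this together with the textbook expressions for the mean and variance of a log-normal variable. The names `mu`, `sigma2`, `lognormal_mle` and `lognormal_moments`, as well as the sample values, are assumptions of this example and need not match the notation of the equations in this section.

```python
import math


def lognormal_mle(samples):
    """Maximum likelihood estimates (mu, sigma2) of a log-normal sample."""
    logs = [math.log(x) for x in samples]
    n = len(logs)
    mu = sum(logs) / n
    # The maximum likelihood variance divides by n, not n - 1.
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    return mu, sigma2


def lognormal_moments(mu, sigma2):
    """Mean and variance of a log-normal variable with parameters mu, sigma2."""
    mean = math.exp(mu + sigma2 / 2)
    variance = (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)
    return mean, variance


# Hypothetical populations of three cities with the same homicide count.
populations = [math.exp(9), math.exp(10), math.exp(11)]
mu_hat, sigma2_hat = lognormal_mle(populations)
mean_hat, var_hat = lognormal_moments(mu_hat, sigma2_hat)
print(mu_hat, sigma2_hat, mean_hat, var_hat)
```

Note that the mean of the distribution exceeds the exponential of `mu` whenever `sigma2` is positive, which is why the dispersion parameter enters expectation values and not only the spread.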

If the normal distribution holds in terms of the logarithmic variables of population given different values of

Log-normal probability density functions for the three nations are shown as solid red lines. This shows that power-law distributions describing total homicides in the urban systems have in fact more predictable statistics when conditioned on city population size.

We can now estimate the parameters of Eq.(5) using Eqs.(8) and (9), and plot

A different curve was constructed for every year of the analysis (see Methods). The plots show the average over several years. Error bars represent one standard deviation intervals (approximately 68% confidence level). The plots show no clear systematic

A different curve was constructed for every year of the analysis, and the points plotted are the averages over several years. The error bars represent one standard deviation intervals about the mean. Plots show a logarithmic dependence on

The behavior of

Finally, using Eq.(2), we derive the conditional probability function

Using Eq.(10) to replace

Expanding the squared terms and the logarithms, and grouping like terms, transforms this equation into:

Now, recall that both

These expressions connect the log-normal statistics of the conditional distribution

Determining these conditional distributions enables us to calculate their moments, such as the mean and variance. We take Eq.(10) and Eq.(16) to derive

Similarly, the standard deviations

From the preceding sections it should already begin to be clear how the log-normal distribution relates to Zipf’s law. We can show how a power-law distribution emerges by deriving the probability distribution of

If superlinear scaling (

In this manuscript we characterized the statistics of homicides, a highly variable and granular metric, in three fast changing urban systems in Latin America. This analysis allowed us to address the statistics of urban indicators under extreme conditions and investigate how urban scaling laws emerge for noisy and granular variables within a larger probabilistic context.

We have found that homicides

Much effort has traditionally been devoted to modeling the broad distributions and lack of characteristic scales displayed by urban systems

Furthermore, the consistency between log-normal statistics for individual cities and Zipfian distributions for the urban system, as well as scaling relations across city sizes, suggests that local indicators are the result of self-consistent urban system dynamics and that these indicators are naturally bounded. Consequently, when considering goals for urban planning it is important to think at once locally and at the level of the urban system. In this light, on the one hand, questions about particular cities and the magnitude of their metrics may not make much sense unless we take into account the whole urban system in which they are embedded. On the other hand, characterizing urban systems only through power-law distributions prevents us from observing finer quantitative patterns present locally. Several mechanisms have been proposed for the emergence of log-normal

Given the general implications of these results, a few remaining issues and some caveats are worth further discussion. First of all, we motivated the log-normal distribution as a good general description of the data. However, the data may be compatible with other statistical densities, specifically a Laplace distribution (which is also characterized by two parameters, a scaling mean and a fixed dispersion, see

Second, one of our main results is the observation of deviations from scaling in the limit

As they stand, the present results suggest several interesting new questions for future research. First, they provide a mesoscopic view of urban indicators and take a step in suggesting the form of a statistical mechanics approach to universal aggregate properties of cities, such as scaling laws and size distributions. Such an approach should lead to theory and methods to bridge scales of analysis from individuals, through social and economic organizations, to entire cities and urban systems.

This shows that a log-normal distribution is an excellent description of

Finally, it is interesting to briefly discuss the practical implications of the statistical treatment of urban indicators developed here. Quantitative knowledge of the distribution of indicators for a given population size allows us to make predictions, e.g. for the homicide rate of a particular place, with quantified levels of uncertainty. The approach developed here takes into account only data aggregated over a time period, usually a year. However, we also know that there is considerable predictability for urban indicators of the same city across time. Thus, we expect that the future combination of these two elements will yield a procedure to make better predictions of future indicators for specific places with quantified uncertainty. This ability will also allow the detection of exceptional events as statistical anomalies in urban indicators. We hope therefore that our growing quantitative understanding of cities and urban systems throughout the world will provide the basis for the development of a predictive science of cities that will help inform more effective policy in an increasingly urbanized world.

Homicides are defined as deaths caused by other persons, intentionally or not. Data for Colombia are available online at the National Institute of Legal Medicine and Forensic Sciences (

Best fits (dashed red line) of the form

We adopted standard definitions of metropolitan areas available at

Reference

Because we are interested in the regime of small numbers where the number of homicides

If we let

A numerical estimation of

A rigorous procedure to estimate these parameters from the data, estimate the error and determine its scaling properties is part of future work.

We test the log-normal distribution as a description of

Maximum likelihood estimations of

Error bars represent plus and minus one standard deviation about the average.

The fits were performed using a Levenberg-Marquardt algorithm, which minimizes the sum of squared residuals of a non-linear model, with every point weighted according to its error. The function to minimize with respect to the parameter vector

Here we give additional details of the calculations leading to Eq.(22). First, we write

For simplicity of notation, we drop the subscript in

We can now complete the square and rearrange terms to obtain:

Now, we see that the term inside the integral is a log-normal distribution, integrated over its entire domain. Consequently, the integral is a constant, regardless of the form of

This relationship can also be derived in a more straightforward way under the assumptions that i) a power-law distribution for

We thank Jose Lobo and Geoffrey West for discussions and for comments on the manuscript. We thank Diego Valle for data on homicides for Mexican municipalities and for useful discussions. We also acknowledge Jesse Taylor for helpful comments and suggestions, and Maria Jose Uribe for discussions and for providing us help with the Colombian data.