
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JA EB DP. Performed the experiments: JA. Analyzed the data: JA. Wrote the paper: JA EB DP.

Power laws are theoretically interesting probability distributions that are also frequently used to describe empirical data. In recent years, effective statistical methods for fitting power laws have been developed, but appropriate use of these techniques requires significant programming and statistical insight. In order to greatly decrease the barriers to using good statistical methods for fitting power law distributions, we developed the powerlaw Python package. This software package provides easy commands for basic fitting and statistical analysis of distributions. Notably, it also seeks to support a variety of user needs by being exhaustive in the options available to the user. The source code is publicly available and easily extensible.

Power laws are probability distributions with the form:

Power law probability distributions are theoretically interesting due to being “heavy-tailed”, meaning the right tails of the distributions still contain a great deal of probability. This heavy-tailedness can be so extreme that the standard deviation of the distribution can be undefined (for

In recent years several statistical methods for evaluating power law fits have been developed

In this report we describe the structure and use of powerlaw. Using powerlaw, we will give examples of fitting power laws and other distributions to data, and give guidance on which factors and fitting options to consider in the process.

Example data for power law fitting are a good fit (left column), medium fit (middle column) and poor fit (right column). Data and methods described in text. a) Visualizing data with probability density functions. A typical histogram on linear axes (insets) is not helpful for visualizing heavy-tailed distributions. On log-log axes, using logarithmically spaced bins is necessary to accurately represent data (blue line). Linearly spaced bins (red line) obscure the tail of the distribution (see text). b) Fitting to the tail of the distribution. The best fit power law may only cover a portion of the distribution's tail. Dotted green line: power law fit starting at

The powerlaw package will perform all of these steps automatically. Below is an example of basic usage of powerlaw, with explanation following. Using the populations affected by blackouts:

> import powerlaw

> fit = powerlaw.Fit(data)

Calculating best minimal value for power law fit

> fit.power_law.alpha

2.273

> fit.power_law.sigma

0.167

> fit.distribution_compare('power_law', 'exponential')

(12.755, 0.152)

An IPython Notebook and raw Python file of all examples is included in Supporting Information.

The design of powerlaw includes object-oriented and functional elements, both of which are available to the user. The object-oriented approach requires the fewest lines of code to use, and is shown here. The powerlaw package is organized around two types of objects, Fit and Distribution. The Fit object (fit above) is a wrapper around a dataset that creates a collection of Distribution objects fitted to that dataset. A Distribution object is a maximum likelihood fit to a specific distribution. In the above example, a power law Distribution has been created automatically (power_law), with the fitted

The powerlaw package supports easy plotting of the probability density function (PDF), the cumulative distribution function (CDF;

> powerlaw.plot_pdf(data, color = 'b')

PDFs require binning of the data, and when presenting a PDF on logarithmic axes the bins should have logarithmic spacing (exponentially increasing widths). Although linear bins maintain a high resolution over the entire value range, the greatly reduced probability of observing large values prevents a reliable estimate of their probability of occurrence. Logarithmic bins compensate for this: they increase the likelihood of observing a range of values in the tail of the distribution, and the counts are then normalized by the increased bin widths. Logarithmic binning is powerlaw's default behavior, but linearly spaced bins can also be dictated with the linear_bins = True option.

> powerlaw.plot_pdf(data, linear_bins = True, color = 'r')

As CDFs and CCDFs do not require binning considerations, CCDFs are frequently preferred for visualizing a heavy-tailed distribution. However, if the probability distribution has peaks in the tail this will be more obvious when visualized as a PDF than as a CDF or CCDF. PDFs and CDF/CCDFs also have different behavior if there is an upper bound on the distribution (see Identifying the Scaling Range, below).
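Since a CCDF requires no binning at all, it can be computed directly from the sorted data. A minimal sketch (again, not powerlaw's internal code):

```python
import numpy as np

def empirical_ccdf(data):
    """P(X >= x) at each observed value, computed by sorting: the i-th
    smallest of n observations has (n - i) values at or above it."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    ccdf = 1.0 - np.arange(n) / n
    return x, ccdf
```

Plotting `ccdf` against `x` on log-log axes gives the familiar straight-line signature of a power-law tail, with no binning parameters to tune.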

Individual Fit objects also include functions for pdf, plot_pdf, and their CDF and CCDF versions. The theoretical PDF, CDF, and CCDFs of the constituent Distribution objects inside the Fit can also be plotted. These are useful for visualizing just the portion of the data used for fitting the distribution (described below). To send multiple plots to the same figure, pass the matplotlib axes object with the keyword ax.

> fig2 = fit.plot_pdf(color = 'b', linewidth = 2)

> fit.power_law.plot_pdf(color = 'b', linestyle = '--', ax = fig2)

> fit.plot_ccdf(color = 'r', linewidth = 2, ax = fig2)

> fit.power_law.plot_ccdf(color = 'r', linestyle = '--', ax = fig2)

PDF, CDF, and CCDF information are also available outside of plotting. Fit objects return the probabilities of the fitted data and either the sorted data (cdf) or the bin edges (pdf). Distribution objects return just the probabilities of the data given. If no data is given, all the fitted data is used.

> x, y = fit.cdf()

> bin_edges, probability = fit.pdf()

> y = fit.lognormal.cdf(data = [300, 350])

> y = fit.lognormal.pdf()

The first step of fitting a power law is to determine what portion of the data to fit. A heavy-tailed distribution's interesting feature is the tail and its properties, so if the initial, small values of the data do not follow a power law distribution the user may opt to disregard them. The question is from what minimal value

As power laws are undefined for

> fit = powerlaw.Fit(data)

Calculating best minimal value for power law fit

> fit.xmin

230.000

> fit.fixed_xmin

False

> fit.power_law.alpha

2.273

> fit.power_law.D

0.061

> fit = powerlaw.Fit(data, xmin = 1.0)

> fit.xmin

1.0

> fit.fixed_xmin

True

> fit.power_law.alpha

1.220

> fit.power_law.D

0.376
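The optimization behind this automatic selection can be illustrated in a few lines. This is a simplified sketch of the Clauset et al. procedure (scan candidate xmin values, fit each tail by maximum likelihood, keep the xmin minimizing the Kolmogorov-Smirnov distance D), not powerlaw's actual implementation; the minimum tail size is an arbitrary guard:

```python
import numpy as np

def best_xmin(data, min_tail=50):
    """For each candidate xmin, fit alpha by continuous maximum likelihood
    on the tail, compute the KS distance D between the tail and its fit,
    and keep the xmin that minimizes D."""
    data = np.sort(np.asarray(data, dtype=float))
    best_D, best_x, best_alpha = np.inf, None, None
    for xmin in np.unique(data)[:-1]:
        tail = data[data >= xmin]
        if len(tail) < min_tail:
            break  # too few tail points for a stable estimate
        # Continuous MLE for the power-law exponent
        alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))
        # KS distance between empirical and fitted CDFs of the tail
        emp = np.arange(len(tail)) / len(tail)
        theo = 1.0 - (tail / xmin) ** (1.0 - alpha)
        D = np.abs(emp - theo).max()
        if D < best_D:
            best_D, best_x, best_alpha = D, xmin, alpha
    return best_x, best_alpha, best_D
```

On data drawn from a pure power law the scan recovers an xmin near the true lower bound and an alpha near the generating exponent.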

The search for the optimal

> fit = powerlaw.Fit(data, xmin = (250.0, 300.0))

Calculating best minimal value for power law fit

> fit.fixed_xmin

False

> fit.given_xmin

(250.000, 300.000)

> fit.xmin

272.0

In some domains there may also be an expectation that the distribution will have a precise upper bound,

> fit = powerlaw.Fit(data, xmax = 10000.0)

Calculating best minimal value for power law fit

> fit.xmax

10000.0

> fit.fixed_xmax

True

For calculating or plotting CDFs, CCDFs, and PDFs, by default Fit objects only use data above

When using an

Datasets are treated as continuous by default, and thus fit to continuous forms of power laws and other distributions. Many data are discrete, however. Discrete versions of probability distributions cannot be accurately fitted with continuous versions [5]. Discrete (integer) distributions, with proper normalizing, can be dictated at initialization:

> fit = powerlaw.Fit(data, xmin = 230.0)

> fit.discrete

False

> fit = powerlaw.Fit(data, xmin = 230.0, discrete = True)

> fit.discrete

True

Discrete forms of probability distributions are frequently more difficult to calculate than continuous forms, and so certain computations may be slower. However, there are faster estimations for some of these calculations. Such opportunities to estimate discrete probability distributions for a computational speed up are described in later sections.

From the created Fit object the user can readily access all the statistical analyses necessary for evaluation of a heavy-tailed distribution. Within the Fit object are individual Distribution objects for different possible distributions. Each Distribution has the best fit parameters for that distribution (calculated when called), accessible both by the parameter's name or the more generic “parameter1”. Using the blackout data:

> fit.power_law

<powerlaw.Power_Law at 0x301b7d0>

> fit.power_law.alpha

2.273

> fit.power_law.parameter1

2.273

> fit.power_law.parameter1_name

'alpha'

> fit.lognormal.mu

0.154

> fit.lognormal.parameter1_name

'mu'

> fit.lognormal.parameter2_name

'sigma'

> fit.lognormal.parameter3_name == None

True

The goodness of fit of these distributions must be evaluated before concluding that a power law is a good description of the data. The goodness of fit for each distribution can be considered individually or by comparison to the fit of other distributions (respectively, using bootstrapping and the Kolmogorov-Smirnov test to generate a p-value for an individual fit vs. using loglikelihood ratios to identify which of two fits is better)

Practically, bootstrapping is more computationally intensive and loglikelihood ratio tests are faster. Philosophically, it is frequently insufficient and unnecessary to answer the question of whether a distribution “really” follows a power law. Instead the question is whether a power law is the best description available. In such a case, the knowledge that a bootstrapping test has passed is insufficient; bootstrapping could indeed find that a power law distribution would produce a given dataset with sufficient likelihood, but a comparative test could identify that a lognormal fit could have produced it with even greater likelihood. On the other hand, the knowledge that a bootstrapping test has failed may be unnecessary; real world systems have noise, and so few empirical phenomena could be expected to follow a power law with the perfection of a theoretical distribution. Given enough data, an empirical dataset with any noise or imperfections will always fail a bootstrapping test for any theoretical distribution. If one keeps absolute adherence to the exact theoretical distribution, one can enter the tricky position of passing a bootstrapping test, but only with few enough data

Thus, it is generally more sound and useful to compare the fits of many candidate distributions, and identify which one fits the best.

> R, p = fit.distribution_compare('power_law', 'exponential', normalized_ratio = True)

> print R, p

1.431 0.152

R is the loglikelihood ratio between the two candidate distributions. This number will be positive if the data is more likely in the first distribution, and negative if the data is more likely in the second distribution. The significance value for that direction is p. The normalized_ratio option normalizes R by its standard deviation,
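The normalization can be sketched from per-data-point loglikelihoods under the two candidates. This mirrors the Vuong-style calculation; powerlaw's internals may differ in detail:

```python
import numpy as np
from scipy.stats import norm

def normalized_llr(ll1, ll2):
    """Loglikelihood ratio test from per-point loglikelihoods.

    R > 0 favors the first candidate; p is the two-sided significance
    obtained by normalizing R by its estimated standard deviation."""
    diff = np.asarray(ll1) - np.asarray(ll2)
    R = diff.sum()
    sigma = np.sqrt(len(diff) * diff.var())  # std. deviation of R
    R_norm = R / sigma
    p = 2.0 * norm.sf(abs(R_norm))
    return R_norm, p
```

The per-point differencing is what makes the variance estimate, and hence the significance level, meaningful.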

The exponential distribution is the absolute minimum alternative candidate for evaluating the heavy-tailedness of the distribution. The reason is definitional: the typical quantitative definition of a “heavy-tail” is that it is not exponentially bounded

However, the exponential distribution is, again, only the minimum alternative candidate distribution to consider when describing a probability distribution. The fit object contains a list of supported distributions in fit.supported_distributions. Any of these distribution names can be used by distribution_compare. Users who want to test unsupported distributions can write them into powerlaw in a straightforward manner described in the source code. Among the supported distributions is the exponentially truncated power law, which has the power law's scaling behavior over some range but is truncated by an exponentially bounded tail. There are also many other heavy-tailed distributions that are not power laws, such as the lognormal or the stretched exponential (Weibull) distributions. Given the infinite number of possible candidate distributions, one can again run into a problem similar to that faced by bootstrapping: There will always be another distribution that fits the data better, until one arrives at a distribution that describes only the exact values and frequencies observed in the dataset (overfitting). Indeed, this process of overfitting can begin even with very simple distributions; while the power law has only one parameter to serve as a degree of freedom for fitting, the truncated power law and the alternative heavy-tailed distributions have two parameters, and thus a fitting advantage. The overfitting scenario can be avoided by incorporating generative mechanisms into the candidate distribution selection process.

The observed data always come from a particular domain, and in that domain generative mechanisms created the observed data. If there is a plausible domain-specific mechanism for creating the data that would yield a particular candidate distribution, then that candidate distribution should be considered for fitting. If there is no such hypothesis for how a candidate distribution could be created there is much less reason to use it to describe the dataset.

As an example, the number of connections per neuron in the nematode worm

> fit.distribution_compare('power_law', 'exponential')

(16.384, 0.024)

However, the worm has a finite size and a limited number of neurons to connect to, so the rich cannot get richer forever. There could be a gradual upper bounding effect on the scaling of the power law. An exponentially truncated power law could reflect this bounding. To test this hypothesis we compare the power law and the truncated power law:

> fit.distribution_compare('power_law', 'truncated_power_law')

Assuming nested distributions

(-0.081, 0.687)

In fact, neither distribution is a significantly stronger fit (

The importance of considering generative mechanisms is even greater when examining other heavy-tailed distributions. Perhaps the simplest generative mechanism is the accumulation of independent random variables, the central limit theorem. When random variables are summed, the result is the normal distribution. However, when positive random variables are multiplied, the result is the lognormal distribution, which is quite heavy-tailed. If the generative mechanism for the lognormal is plausible for the domain, the lognormal is frequently just as good a fit as the power law, if not better.
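This generative mechanism is easy to demonstrate numerically. The factor distribution and sizes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# Multiply many positive random variables: the log of a product is a sum
# of logs, so by the central limit theorem the product is ~lognormal.
factors = rng.uniform(0.5, 1.5, size=(10000, 100))
products = factors.prod(axis=1)
logs = np.log(products)
# The log-products should be approximately normal: symmetric, bell-shaped
```

A histogram of `logs` is close to a Gaussian, i.e. `products` is close to lognormal, even though each individual factor is uniform.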

> fit.distribution_compare('power_law', 'lognormal')

(0.928, 0.426)

> fig4 = fit.plot_ccdf(linewidth = 3)

> fit.power_law.plot_ccdf(ax = fig4, color = 'r', linestyle = '--')

> fit.lognormal.plot_ccdf(ax = fig4, color = 'g', linestyle = '--')

There are domains in which the power law distribution is a superior fit to the lognormal (ex.

Creating simulated data drawn from a theoretical distribution is frequently useful for a variety of tasks, such as modeling. Individual Distribution objects can generate random data points with the function generate_random. These Distribution objects can be called from a Fit object or created manually.

> fit = powerlaw.Fit(empirical_data)

> simulated_data = fit.power_law.generate_random(10000)

> theoretical_distribution = powerlaw.Power_Law(xmin = 5.0, parameters = [2.5])

> simulated_data = theoretical_distribution.generate_random(10000)

Such simulated data can then be fit again, to validate the accuracy of fitting software such as powerlaw:

> fit = powerlaw.Fit(simulated_data)

Calculating best minimal value for power law fit

> fit.power_law.xmin, fit.power_law.alpha

(5.30, 2.50)

Validations of powerlaw's fitting of

While the maximum likelihood fit to a continuous power law for a given

> fit = powerlaw.Fit(data, discrete = True, estimate_discrete = True)

Calculating best minimal value for power law fit

> fit.power_law.alpha

2.26912

> fit.power_law.estimate_discrete

True

> fit = powerlaw.Fit(data, discrete = True, estimate_discrete = False)

Calculating best minimal value for power law fit

> fit.power_law.alpha

2.26914

> fit.power_law.estimate_discrete

False

Additionally, the discrete forms of some distributions are not analytically defined (ex. lognormal and stretched exponential). There are two available approximations of the discrete form. The first is discretization by brute force. The probabilities for all the discrete values between

> fit = powerlaw.Fit(data, discrete = True, xmin = 230.0, xmax = 9000, discrete_approximation = 'xmax')

> fit.lognormal.mu

-44.19

> fit = powerlaw.Fit(data, discrete_approximation = 100000, xmin = 230.0, discrete = True)

> fit.lognormal.mu

0.28

> fit = powerlaw.Fit(data, discrete_approximation = 'round', xmin = 230.0, discrete = True)

> fit.lognormal.mu

0.40
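The brute-force discretization can be sketched as follows, here for the lognormal. Note that scipy's parameterization (shape s = sigma, scale = exp(mu)) is an assumption of this sketch, not necessarily how powerlaw computes it:

```python
import numpy as np
from scipy.stats import lognorm

def brute_force_pmf(mu, sigma, xmin, xmax):
    """Discretize a continuous lognormal by evaluating its PDF at every
    integer in [xmin, xmax] and renormalizing the masses to sum to 1."""
    xs = np.arange(xmin, xmax + 1)
    # scipy's lognorm uses shape s = sigma and scale = exp(mu)
    masses = lognorm.pdf(xs, s=sigma, scale=np.exp(mu))
    return xs, masses / masses.sum()
```

The cost of this approach is linear in the width of the [xmin, xmax] range, which is why a large upper limit slows the fit.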

Generation of simulated data from a theoretical distribution has similar considerations for speed and accuracy. There is no rapid, exact calculation method for random data from discrete power law distributions. Generated data can be calculated with a fast approximation or with an exact search algorithm that can run several times slower

> theoretical_distribution = powerlaw.Power_Law(xmin = 5.0, parameters = [2.5], discrete = True)

> simulated_data = theoretical_distribution.generate_random(10000, estimate_discrete = True)

If the decision to use an estimation is not explicitly assigned when calling generate_random, the default is to inherit the setting of the Distribution object generating the data, which may have been created by the user or created inside a Fit object.

> theoretical_distribution = powerlaw.Power_Law(xmin = 5.0, parameters = [2.5], discrete = True, estimate_discrete = False)

> simulated_data = theoretical_distribution.generate_random(10000)

> fit = powerlaw.Fit(empirical_data, discrete = True, estimate_discrete = True)

Calculating best minimal value for power law fit

> simulated_data = fit.power_law.generate_random(10000)

The fast estimation of random data has an error that scales with the

Random data generation methods for discrete versions of other, non-power law distributions all presently use the slower, exact search algorithm. Estimates of rapid, exact calculations for other distributions can later be implemented by users as they are developed, as described below.
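The exact search idea can be sketched for the discrete power law itself: draw u uniformly, then double a candidate x until its CCDF falls below u, and bisect to the exact integer. This illustrates the inverse-transform search, not powerlaw's literal code:

```python
import numpy as np
from scipy.special import zeta  # Hurwitz zeta, zeta(s, q)

def exact_discrete_power_law(alpha, xmin, n, rng):
    """Exact inverse-transform sampling for a discrete power law."""
    norm = zeta(alpha, xmin)      # normalizing constant

    def ccdf(x):                  # P(X >= x)
        return zeta(alpha, x) / norm

    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        u = rng.random()
        x1, x2 = xmin, 2 * xmin
        while ccdf(x2) > u:       # expand the bracket by doubling
            x1, x2 = x2, 2 * x2
        while x2 - x1 > 1:        # bisect to the exact integer
            mid = (x1 + x2) // 2
            if ccdf(mid) > u:
                x1 = mid
            else:
                x2 = mid
        out[i] = x1
    return out
```

Each draw requires a logarithmic number of CCDF evaluations, which is what makes the exact method several times slower than the fast estimation.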

Comparing the likelihoods of distributions that are nested versions of each other requires a particular calculation for the resulting p-value

> fit.distribution_compare('power_law', 'truncated_power_law')

Assuming nested distributions

(-0.3818, 0.3821)

> fit.distribution_compare('exponential', 'stretched_exponential')

Assuming nested distributions

(-13.0240, 3.3303e-07)
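For nested pairs like these, with one extra parameter in the larger model, the statistic 2|R| follows a chi-squared distribution with one degree of freedom. The p-values printed above can be reproduced with a short sketch:

```python
from scipy.stats import chi2

def nested_p_value(R):
    """p-value for comparing nested distributions: 2|R| ~ chi-squared
    with df = 1 (one extra parameter in the larger model)."""
    return chi2.sf(2 * abs(R), df=1)
```

For example, `nested_p_value(-13.0240)` recovers the 3.3303e-07 significance shown for the exponential vs. stretched exponential comparison.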

Each Distribution has default restrictions on the range its parameters may take (ex.

> fit = powerlaw.Fit(data)

Calculating best minimal value for power law fit

> fit.power_law.alpha, fit.power_law.sigma, fit.xmin

(2.27, 0.17, 230.00)

> fit = powerlaw.Fit(data, sigma_threshold = .1)

Calculating best minimal value for power law fit

> fit.power_law.alpha, fit.power_law.sigma, fit.xmin

(1.78, 0.06, 50.00)

More extensive parameter ranges can be set with the keyword parameter_range, which accepts a dictionary of parameter names and a tuple of their lower and upper bounds. Instead of operating as selections on

> parameter_range = {'alpha': [2.3, None], 'sigma': [None, .2]}

> fit = powerlaw.Fit(data, parameter_range = parameter_range)

Calculating best minimal value for power law fit

> fit.power_law.alpha, fit.power_law.sigma, fit.xmin

(2.30, 0.17, 234.00)

Even more complex parameter ranges can be defined by instead passing parameter_range a function, to do arbitrary calculations on the parameters. To incorporate the custom parameter range in the optimizing of

> parameter_range = lambda self: self.sigma/self.alpha < .05

> fit = powerlaw.Fit(data, parameter_range = parameter_range)

Calculating best minimal value for power law fit

> fit.power_law.alpha, fit.power_law.sigma, fit.xmin

(1.88, 0.09, 124.00)

The other constituent Distribution objects can be individually given a new parameter range afterward with the parameter_range function, as shown later.

Changes in

> from matplotlib.pylab import plot

> plot(fit.xmins, fit.Ds)

> plot(fit.xmins, fit.sigmas)

> plot(fit.xmins, fit.sigmas/fit.alphas)

The second minimum may seem obviously optimal. However,

When fitting a distribution to data, there may be no valid fits. This would most typically arise from user-specified requirements, like a maximum threshold on

> fit = powerlaw.Fit(data, sigma_threshold = .001)

No valid fits found.

> fit.power_law.alpha, fit.power_law.sigma, fit.xmin, fit.noise_flag

(2.27, 0.17, 230.00, True)

User-specified parameter limits can also create calculation difficulties with other distributions. Most other distributions are determined numerically through searching the parameter space from an initial guess. The initial guess is calculated from the data using information about the distribution's form. If an extreme parameter range very far from the optimal fit with a standard parameter range is required, the initial guess may be too far away and the numerical search will not be able to find the solution. In such a case the initial guess will be returned and the noise_flag attribute will also be set to True. This difficulty can be overcome by also providing a set of initial parameters to search from, namely within the user-provided, extreme parameter range.

> fit.lognormal.mu, fit.lognormal.sigma

(0.15, 2.30)

> range_dict = {'mu': [11.0, None]}

> fit.lognormal.parameter_range(range_dict)

No valid fits found.

> fit.lognormal.mu, fit.lognormal.sigma, fit.lognormal.noise_flag

(6.22, 0.72, True)

> initial_parameters = (12, .7)

> fit.lognormal.parameter_range(range_dict, initial_parameters)

> fit.lognormal.mu, fit.lognormal.sigma, fit.lognormal.noise_flag

(11.00, 5.72, False)

A fundamental assumption of the maximum likelihood method used for fitting, as well as the loglikelihood ratio test for comparing the goodness of fit of different distributions, is that individual data points are independent

Depending on the nature of the correlation, some datasets can be “decorrelated” by selectively omitting portions of the data

An alternative to maximum likelihood estimation is minimum distance estimation, which fits the theoretical distribution to the data by minimizing the Kolmogorov-Smirnov distance between the data and the fit. This can be accomplished in the Fit object by using the keyword argument fit_method = 'KS' at initialization. However, the use of this option will not solve the problem of correlated data points for the loglikelihood ratio tests used in distribution_compare.
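The Kolmogorov-Smirnov distance at the heart of this method is simple to compute for any candidate CDF. A sketch, taking the theoretical CDF as an arbitrary callable:

```python
import numpy as np

def ks_distance(data, cdf):
    """Kolmogorov-Smirnov distance between an empirical sample and a
    theoretical CDF, checking both sides of each empirical step."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    theo = cdf(x)
    emp_hi = np.arange(1, n + 1) / n   # empirical CDF just after each point
    emp_lo = np.arange(0, n) / n       # ...and just before
    return max(np.abs(emp_hi - theo).max(), np.abs(emp_lo - theo).max())
```

Minimum distance estimation then searches the distribution's parameter space for the values that make this quantity smallest.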

The optimal

> fit = powerlaw.Fit(data, xmin_distance = 'D')

> fit = powerlaw.Fit(data, xmin_distance = 'V')

> fit = powerlaw.Fit(data, xmin_distance = 'Asquare')

Source code and Windows installers of powerlaw are available from the Python Package Index, PyPI, at

pip install powerlaw

Source code is also available on GitHub at

The powerlaw Python package is implemented solely in Python, and requires the packages NumPy, SciPy, matplotlib, and mpmath. NumPy, SciPy and matplotlib are very popular and stable open source Python packages useful for a wide variety of scientific programming needs. SciPy development is supported by Enthought, Inc. and all three are included in the Enthought Python Distribution. Mpmath is required only for the calculation of gamma functions in fitting to the gamma distribution and the discrete form of the exponentially truncated power law. If the user does not attempt fits to the distributions that use gamma functions, mpmath will not be required. The gamma function calculations in SciPy are not numerically accurate for negative numbers. If and when SciPy's implementations of the gamma, gammainc, and gammaincc functions become accurate for negative numbers, the dependence on mpmath may be removed.

There have been other freely-available software for fitting heavy-tailed distributions

As described in this paper, fitting heavy-tailed distributions involves several complex algorithms and requires keeping track of numerous options and features of the fitted dataset. powerlaw uses an integrated system of Fit and Distribution objects so that the user needs only a few lines of code to perform the full analysis pipeline. Other software lacks this integration, and requires much more elaborate code writing by the user in order to analyze a dataset completely.

In fitting data there are multiple families of distributions that the user may need or wish to consider: power law, exponential, lognormal, etc. And there are different flavors within each family: discrete vs. continuous, with or without an

Lastly, much existing software was not written for code maintenance or expansion. The code architecture of powerlaw was designed for easy navigation, maintenance and extensibility. As the source code is maintained in a git repository on GitHub, it is straightforward for users to submit issues, fork the code, and write patches. The most obvious extensions users may wish to write are additional candidate distributions for fitting to the data and comparing to a power law fit. All distributions are simple subclasses of the Distribution class, and so writing additional custom distributions requires only a few lines of code. Already users have submitted suggestions and written improvements to certain distributions, which were able to slot in seamlessly due to modularly-organized code. Such contributions will continue to be added to powerlaw in future versions.


The authors would like to thank Andreas Klaus, Mika Rubinov and Shan Yu for helpful discussions. The authors also thank Andreas Klaus and the authors of