The authors have declared that no competing interests exist.

Conceived and designed the experiments: CS NA DP MP. Performed the experiments: CS. Analyzed the data: CS NA DP MP. Contributed reagents/materials/analysis tools: NA MP. Wrote the paper: CS NA DP MP.

We propose a model that explains the reliable emergence of power laws (e.g., Zipf’s law) during the development of different human languages. The model incorporates the principle of least effort in communications, minimizing a combination of the information-theoretic communication inefficiency and direct signal cost. We prove a general relationship, for all optimal languages, between the signal cost distribution and the resulting distribution of signals. Zipf’s law then emerges for logarithmic signal cost distributions, which is the cost distribution expected for words constructed from letters or phonemes.

Zipf’s law [ states that the frequency f_{t} of the t-th most frequent word in a corpus is approximately proportional to 1/t.

Several papers [

If we reject the idea that Zipfian distributions are produced by a process that randomly generates words, then the next logical step is to ask which models can produce such distributions while agreeing with our basic assumptions about language. Mandelbrot [

An alternative model by Ferrer i Cancho and Solé [

Thus, to our knowledge, the question of how power laws in human language arise from the least effort principle has not yet been satisfactorily resolved. Nevertheless, the idea from [

We should also point out that a power law is often not the best fit to real data [

Another important consideration is that, in general, there may be multiple mechanisms generating power laws, and one cannot

The resulting insights may be of interest beyond the confines of power-law structures and offer an opportunity to study optimality conditions in other types of self-organizing coding systems, for instance in the case of the genetic code [

We will use a model, similar to that used by Ferrer i Cancho and Solé [

The model has a set of n signals and a set of m objects, connected by a binary matrix A with entries a_{i,j} = 1 if and only if signal s_{i} refers to object r_{j}.

This model allows one to represent both synonymy and polysemy. Here μ_{j} is the number of synonyms for object r_{j}, that is μ_{j} = ∑_{i} a_{i,j}. Thus, the probability of using a synonym is equally distributed over all synonyms referring to a particular object. Importantly, it is also assumed that a signal s_{i} leaves little ambiguity as to which object r_{j} is referenced, so there is little chance that the listener misunderstands what the speaker wanted to say. In the model of Ferrer i Cancho and Solé [ the effort for the listener upon receiving signal s_{i} is expressed by the conditional entropy:

The effort for the speaker is expressed by the entropy H_{S}, which is combined with the listener’s effort via a trade-off parameter λ into the cost function Ω_{λ} as follows: Ω_{λ} = λ H_{R∣S} + (1 − λ) H_{S}.

It can be shown that the cost function Ω_{λ} given above relates to the mutual information I = H_{R} − H_{R∣S}, which captures the communication efficiency, i.e. how much information the signals contain about the objects. This energy function better accounts for subtle communication efforts [: H_{S} is arguably both a source of effort for the speaker and the listener, because the word frequency affects not only word production but also recognition of spoken and written words [. One may also consider H_{S∣R} (a measure of the speaker’s effort of coding objects) and H_{R∣S} (i.e., a measure of the listener’s effort of decoding signals). It is easy to see that H_{S∣R} + H_{R∣S} = H_{S} + H_{R} − 2I, so that minimizing both conditional entropies maximizes the communication efficiency whenever H_{R} is constant, e.g. under the uniformity condition, in contrast to Ω_{λ}.
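The conditional entropies and the mutual information are linked by the identity H_{S∣R} + H_{R∣S} = H_{S} + H_{R} − 2I, so minimizing the two conditional entropies also maximizes the mutual information once H_{R} is fixed. A minimal numeric check (the joint distribution below is our own toy example, not taken from the data):

```python
import numpy as np

def H(q):
    """Shannon entropy in bits; zero entries are ignored."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# A toy joint distribution p(s_i, r_j) over 3 signals and 2 objects
# (the numbers are arbitrary and only serve to check the identity).
p = np.array([[0.25, 0.10],
              [0.05, 0.30],
              [0.20, 0.10]])
ps, pr = p.sum(axis=1), p.sum(axis=0)
H_SgR = H(p) - H(pr)            # speaker's coding effort H_{S|R}
H_RgS = H(p) - H(ps)            # listener's decoding effort H_{R|S}
I = H(ps) + H(pr) - H(p)        # mutual information I
# Identity: H_{S|R} + H_{R|S} = H_S + H_R - 2I
assert abs((H_SgR + H_RgS) - (H(ps) + H(pr) - 2 * I)) < 1e-12
```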

We propose instead another cost function that not only produces optimal languages exhibiting power laws, but also retains the clear intuition of generic energy functions, which typically reflect the global quality of a solution. Firstly, we represent the communication inefficiency by the information distance known as the Rokhlin metric, H_{S∣R} + H_{R∣S} [

Secondly, we define the signal usage effort by introducing an explicit cost function c(s_{i}), which assigns each signal a specific cost. The signal usage cost for a language is then the weighted average of this signal-specific cost: 〈c〉 = ∑_{i} p(s_{i}) c(s_{i}).

The overall cost function for a language is then Ω_{λ} = H_{S∣R} + H_{R∣S} + λ〈c〉, where p(s_{i}, r_{j}) is the joint probability of signal s_{i} and object r_{j}. A language can be optimized for different values of λ.
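As an illustration only (our own toy code; the weighting lam = 0.5 and the per-signal costs are arbitrary assumptions), the combined cost of Rokhlin distance plus weighted signal usage cost can be evaluated for small candidate languages, and the unambiguous, synonym-free language comes out cheaper:

```python
import numpy as np

def H(q):
    """Shannon entropy in bits; zero entries are ignored."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def omega(A, cost, lam):
    """Rokhlin distance H_{S|R} + H_{R|S} plus lam times the signal usage cost.

    A[i, j] = 1 iff signal s_i refers to object r_j.  Objects are used
    uniformly and each object's probability is split over its synonyms,
    which yields the joint distribution p(s_i, r_j)."""
    A = np.asarray(A, dtype=float)
    n, m = A.shape
    p = A / (A.sum(axis=0) * m)               # p(s_i, r_j) = a_ij / (mu_j * m)
    ps, pr = p.sum(axis=1), p.sum(axis=0)
    rokhlin = (H(p) - H(pr)) + (H(p) - H(ps))  # H_{S|R} + H_{R|S}
    return rokhlin + lam * float(ps @ cost)    # plus weighted <c>

cost = np.array([1.0, 2.0])                    # assumed per-signal costs
no_syn = omega([[1, 0], [0, 1]], cost, lam=0.5)  # one signal per object
mixed = omega([[1, 1], [1, 0]], cost, lam=0.5)   # polysemy and synonymy
assert no_syn < mixed            # the unambiguous language is cheaper
```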

First of all, we establish that all local minimizers, and hence all global minimizers, of the cost function correspond to languages without synonyms.

We need the following lemma as an intermediate step towards deriving the analytical relationship between the specific word cost c(s_{i}) and the optimal probability p(s_{i}) of the corresponding signal.

For every such minimizer, H_{S∣R} = 0, while μ_{j} = 1 for each object r_{j} under the uniformity constraint.

H_{R∣S} + H_{S} = H_{R}.

Using this lemma, and noting that each such solution can be represented as a function from objects to signals since H_{S∣R} = 0, we reduce the minimization of the overall cost to the minimization of −H_{S} + λ〈c〉 over the marginal distribution p(s_{i}), because H_{R} is constant under the uniformity constraint.

Varying with respect to p(s_{i}), under the constraint ∑_{i} p(s_{i}) = 1, yields the extremality condition
p(s_{i}) ∝ e^{−βc(s_{i})}. Note that only marginals p(s_{i}) = k_{i}/m, with non-negative integers k_{i} such that ∑_{i} k_{i} = m, correspond to functions from objects to signals, i.e. to joint distributions p(s_{i}, r_{j}) representing minimizer matrices without synonyms. In other words, the marginal probability p(s_{i}) must be realizable by a joint distribution p(s_{i}, r_{j}) that represents a minimizer matrix under the uniformity constraint.

Solving the extremality condition under the normalization constraint yields p(s_{i}) = (1/Z) e^{−βc(s_{i})}, with the normalization constant Z = ∑_{i} e^{−βc(s_{i})}, which would then allow for arbitrary cost functions c(s_{i}).

Interestingly, the shape of the optimal marginal probability distribution is determined by the cost function c(s_{i}), while the parameter β controls how strongly differences in cost are reflected in the resulting probabilities.

Let us now consider some special cases. For the case of equal effort, i.e. a constant cost c(s_{i}) = c for all signals, the optimal distribution p(s_{i}) is uniform.

Another important special case is given by the cost function c(s_{i}) = ln i, where i is the cost rank of symbol s_{i}. The optimal distribution then becomes p(s_{i}) = (1/Z) e^{−β ln i} = (1/Z) i^{−β}, i.e. a power law in the rank i.

Zipf’s law (a power law with exponent close to 1) is thus obtained for β ≈ 1.
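As a sanity check (a sketch with arbitrarily chosen n and β; the objective below is our rewriting of the reduced problem, consistent with the extremality condition), the Gibbs form p(s_{i}) ∝ e^{−βc(s_{i})} with logarithmic cost is exactly a power law in the rank and is not improved by random perturbations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50, 1.2                      # toy values chosen arbitrarily
c = np.log(np.arange(1, n + 1))        # logarithmic cost c(s_i) = ln i

def F(p):
    """Reduced objective -H_S + beta*<c> (natural-log entropy, up to constants)."""
    return float((p * np.log(p)).sum() + beta * float(p @ c))

ranks = np.arange(1, n + 1)
gibbs = np.exp(-beta * c)
gibbs /= gibbs.sum()                   # p(s_i) proportional to e^{-beta c(s_i)}

# The Gibbs form is exactly a power law i^{-beta} in the rank i ...
assert np.allclose(gibbs, ranks**(-beta) / (ranks**(-beta)).sum())
# ... and it beats random perturbations of itself under F.
for _ in range(200):
    q = gibbs * np.exp(0.1 * rng.standard_normal(n))
    q /= q.sum()
    assert F(gibbs) <= F(q) + 1e-12
```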

The assumption that the cost function is precisely logarithmic results in an exact power law. If, on the other hand, the cost function deviates from being precisely logarithmic, then the resulting dependency would only approximate a power law—this imprecision may in fact account for different degrees of success in fitting power laws to real data.

In summary, the derived relationship expresses the optimal probability of each signal directly in terms of that signal’s specific cost.

To explain the emergence of power laws for signal selection, we need to explain why the cost function of the signals would increase logarithmically if the signals are ordered by their cost rank. This can be motivated, across a number of languages, by assuming that signals are in fact words, which are made up of letters from a finite alphabet, or, in regard to spoken language, are made up from a finite set of phonemes. Compare [

Let us assume that each letter (or phoneme) has an inherent cost of approximately one unit, and that the cost of a word roughly equals the sum of its letter costs. A language with an alphabet of d letters then has d one-letter words with an approximate cost of one, d^{2} two-letter words with an approximate cost of two, d^{3} three-letter words with a cost of three, et cetera. If we rank these words by their cost, then their cost will increase approximately logarithmically with their cost rank. To illustrate,

Word cost is a sum of individual letter costs, and each letter cost is between 1.0 and 2.0 units.
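This argument can be simulated directly (a sketch assuming an alphabet of five letters and the 1.0 to 2.0 unit letter costs mentioned above; word lengths are capped at four):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d = 5                                    # assumed alphabet size
letter_cost = rng.uniform(1.0, 2.0, d)   # per-letter costs in [1.0, 2.0] units

# Enumerate all words of length 1..4; a word's cost is the sum of its letters.
word_costs = []
for length in range(1, 5):
    for word in itertools.product(range(d), repeat=length):
        word_costs.append(float(letter_cost[list(word)].sum()))
costs = np.sort(np.array(word_costs))    # order words by cost (cost rank)
ranks = np.arange(1, len(costs) + 1)

# Cost should grow roughly like the logarithm of the cost rank.
r = np.corrcoef(np.log(ranks), costs)[0, 1]
print(f"{len(costs)} words, corr(cost, log rank) = {r:.3f}")
```

The strong correlation between cost and log-rank is robust to the particular letter costs drawn, since the number of words with cost below c grows roughly exponentially in c.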

This signal usage cost can be interpreted in different ways. In spoken language it might simply be the time needed to utter a word, which makes it a cost both for the listener and the speaker. In written language it might be the effort to write a word, or the bandwidth needed to transmit it, in which case it is a speaker cost. On the other hand, if one is reading a written text, then the length of the words might translate into “listener” cost again. In general, the average signal usage cost corresponds to the effort of using a specific language to communicate for all involved parties. This differs from the original least effort idea, which balances listener and speaker effort [

We noted earlier that there are other options to produce power laws, which are insensitive to the relationship between objects and signals. Baek et al. [ propose the cost function Ω_{cost} = −H_{S} + 〈log i〉, where 〈log i〉 = ∑_{i} p(s_{i})log(i), and log(i) is interpreted as the logarithm of the index of signal s_{i} (specifically, its rank). Their argument that this cost function follows from a more general cost function via H_{R∣S} = −H_{S} + H_{R}, where H_{R} is constant, is undermined by their unconventional definition of conditional probability (cf. Appendix A [. Nevertheless, minimizing Ω_{cost} does produce a power-law distribution.

A very similar cost function was offered by Visser [, who maximized the entropy H_{S} subject to a constraint on 〈log i〉, which again yields a power-law distribution.

Finally, we would like to point out that the cost function −H_{S} + 〈log i〉 can be rewritten as H_{R∣S} − H_{S∣R} + 〈log i〉 − H_{R}. This expression reveals another important drawback of minimizing −H_{S} + 〈log i〉: while minimizing H_{R∣S} reduces the ambiguity of polysemy, minimizing −H_{S∣R} explicitly “rewards” the ambiguity of synonyms. In other words, languages obtained by minimizing such a cost directly do exhibit a power law, but mostly at the expense of potentially unnecessary synonyms.
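The sign pattern behind this observation, −H_{S} = H_{R∣S} − H_{S∣R} − H_{R}, holds for any joint distribution and can be checked numerically (toy code of ours):

```python
import numpy as np

def H(q):
    """Shannon entropy in bits; zero entries are ignored."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(2)
p = rng.random((4, 3))
p /= p.sum()                          # an arbitrary joint distribution p(s_i, r_j)
ps, pr = p.sum(axis=1), p.sum(axis=0)
H_RgS = H(p) - H(ps)                  # listener's term H_{R|S}
H_SgR = H(p) - H(pr)                  # speaker's term H_{S|R}
# -H_S equals H_{R|S} - H_{S|R} - H_R, so H_{S|R} enters with a minus sign:
assert abs(-H(ps) - (H_RgS - H_SgR - H(pr))) < 1e-12
```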

There may be a number of reasons for the avoidance of synonyms in real languages. While an analysis of synonymy dynamics in child language or in aphasia is outside the scope of this paper, it is worth pointing out that some studies have suggested that the learning of new words by children is driven by synonym avoidance [

Regarding synonyms, it should also be noted that while they exist, their number is usually comparatively low. In a natural language with, say, 100,000 words, we will not find a concept that has 95,000 synonyms; most concepts have synonyms in the single digits, if they have any. Models that consider only the output distribution could produce languages with such an excessive number of synonyms. In our model the ideal solution has no synonyms, but existing languages, which are constantly adapting, can be seen as close approximations: out of 100,000 possible synonyms, most concepts have only very few, if any. As noted earlier, while precisely logarithmic cost functions would produce perfect power-law distributions, natural languages do not fit Zipf’s law exactly but only approximately.

These observations support our conjecture that, as languages mature, the communicative efficiency and the balance between the speaker’s and listener’s efforts become a more significant driver, and so the simplistic cost function −H_{S} + 〈log i〉 becomes less adequate.

In contrast, the cost function proposed in this paper, H_{R∣S} + H_{S∣R} + 〈log i〉, differs from −H_{S} + 〈log i〉 precisely in the sign of the synonymy term H_{S∣R}: it penalizes rather than rewards synonyms, while still penalizing the polysemy captured by H_{R∣S}.

In conclusion, our paper addresses the long-held conjecture that the principle of least effort provides a plausible mechanism for generating power laws. In formalizing this principle, we interpret the effort in suitable information-theoretic terms and prove that its global minimum produces Zipf’s law. Our formalization enables a derivation of languages that are optimal with respect to both communication inefficiency and direct signal cost. The proposed combination of these two factors within a generic cost function is an intuitive and powerful way to capture the trade-offs intrinsic to least-effort communication.

In order to prove this theorem, we establish a few preliminary propositions (these results were obtained by Nihat Ay).

The extreme points of

_{i∣j})_{i,j} to the probability vector

Consider the set S = {s_{1}, …, s_{n}} of signals and the set R = {r_{1}, …, r_{m}} of objects, with joint probabilities p(s_{i}, r_{j}), 1 ≤ i ≤ n, 1 ≤ j ≤ m.

We consider the convexity of H_{R∣S} and H_{S∣R}, as well as the strict concavity of a restriction of H_{S∣R}.

(1) H_{R∣S}: expressing H_{R∣S} in terms of the relative entropy, the convexity of H_{R∣S} follows from the joint convexity of the relative entropy.

(2) H_{S∣R}: the convexity of H_{S∣R} follows by the same arguments as in (1). The strict concavity of its restriction then follows from the strict concavity of the Shannon entropy.

With a number 0 <

We have the following direct implication of Corollary 4.

Together with Proposition 2, this implies Theorem 1, our main result on minimizers of the restriction of

We finish this analysis by addressing the problem of minimizing the overall cost function Ω_{λ}.
