The authors have declared that no competing interests exist.

A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf's law). Here we address the complementary question of whether the rhythm of the text, characterized by the arrangement of the rare words in the text, can also be quantified mathematically. We find that the probability S_Q(r) that the distance between consecutive rare words (words with rank above a threshold Q) exceeds r follows a stretched exponential (Weibull) function, S_Q(r) ∼ exp[−b(r/R_Q)^β], where R_Q is the mean distance between rare words and β is below 1, and that the return intervals themselves are long-range correlated: their autocorrelation function decays as a power law, C_Q(s) ∼ s^{−γ}, with an exponent γ around 0.3.

Can literature be characterized by mathematical laws? According to Zipf […], the frequency f_i of the word with rank i in a text decays roughly as f_i ∝ 1/i, i.e. the frequency of a (rare) word is roughly inversely proportional to its rank. Zipf's law, however, describes only how often the words occur in a text, not how they are arranged in it.

For example, Ebeling and Neimann […] studied long-range correlations in literary texts by mapping the letters onto numbers and analyzing the resulting numerical records.

In this article, we apply the return-interval technique (also called the peak-over-threshold method) to single-authored texts, for analyzing the arrangement of the rare words in the text. The method itself has been rigorously established in the statistical physics domain, and has been effective in analyzing extremes in the natural and financial sciences (see, e.g., […]).

In the return-interval analysis of extreme events one considers, in records with N entries, the N_Q rarest events and investigates the statistics of the intervals between consecutive events. By definition, R_Q = N/N_Q is the mean length of the intervals.
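In practice, the intervals are obtained by recording the positions of the rare events and differencing them. A minimal sketch (the helper name is illustrative, not the authors' code), assuming the text is already given as a sequence of word ranks:

```python
import numpy as np

def return_intervals(ranks, Q):
    """Intervals between consecutive rare words, i.e. words with rank > Q."""
    pos = np.flatnonzero(np.asarray(ranks) > Q)  # positions of the rare words
    return np.diff(pos)                          # the N_Q - 1 intervals

# toy text of ranks: rare words (rank > 8) sit at positions 2, 5 and 6,
# so the two return intervals have lengths 3 and 1
ivals = return_intervals([1, 2, 9, 1, 3, 9, 9, 2], Q=8)
```

For a text of N words containing N_Q rare words, the mean of these intervals approaches R_Q = N/N_Q.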

Accordingly, in a text with N words, we consider the fraction N_Q/N of the rarest words, i.e. the words with rank above some threshold Q; the mean interval length R_Q = N/N_Q represents the characteristic length scale. Since the power-law relation between rank and frequency of a word observed by Zipf is not strictly universal and changes in different texts […], R_Q is not a universal function of Q; we therefore compare texts at fixed R_Q rather than at fixed Q. We like to note that our study complements and extends a previous study by Altmann et al. […], where the recurrences of individual words (each occurring N_s times in the text) have been considered. The mean distance characterizing such a word is accordingly N/N_s, while the R_Q considered here is the mean distance between all rare words with rank above Q.

In our study, we have analyzed the following 10 texts, with i_max denoting the maximum word rank (the vocabulary size): (i) Les Miserables by V. Hugo (French), i_max = 31659, (ii) Ulysses by J. Joyce (English), i_max = 34359, (iii) Phänomenologie des Geistes by G. Hegel (German), i_max = 9866, (iv) Hong Lou Meng by C. Xueqin (Chinese), i_max = 18311, (v) Dogra Magra by K. Yumeno (Japanese), i_max = 15883, (vi) Essais by M. Montaigne (French), i_max = 41235, (vii) The Great Boer War by A.C. Doyle (English), i_max = 13408, (viii) Die Traumdeutung by S. Freud (German), i_max = 28864, (ix) Journey to the West by C. Wu (Chinese), i_max = 14061, and (x) Daibosatsu Toge by K. Nakazato (Japanese), i_max = 49099. The Chinese and Japanese texts were preprocessed into words with ICTCLAS and MeCab, respectively, which are standard software packages for word segmentation.

As an illustration, consider a segment of Les Miserables in which the words with ranks above the thresholds corresponding to R_Q = 4, 8, and 16 are marked by bars. One can see that the rare words, in particular for R_Q = 16, are not homogeneously distributed, but tend to cluster. This means, short intervals have a tendency to follow short intervals, while long intervals have a tendency to follow long intervals.

(a) shows the word sequence of Les Miserables from word 31096 to word 31116. Punctuations are considered as words. The sequences beneath illustrate how the return intervals between rare words and their lengths are defined: for R_Q = 4, 8, and 16, only the words with ranks above the corresponding threshold are marked. For R_Q = 8 and 16, the words are not distributed homogeneously but tend to cluster.

For analyzing the statistics of the intervals at fixed R_Q, and for discovering the mathematical laws behind them, we have determined (i) how often an interval of length r occurs among the N_Q − 1 intervals of a text, and (ii) how intervals of different lengths are arranged along the text. (i) yields the probability distribution P_Q(r) of the interval lengths, from which we obtain the exceedance probability S_Q(r) = Σ_{r′>r} P_Q(r′), i.e. the probability that an interval is longer than r, with S_Q(0) = 1 and S_Q(r) decreasing monotonically with r.
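Both quantities can be estimated by simple counting. A sketch (the helper is hypothetical, assuming the intervals are given as a positive integer array):

```python
import numpy as np

def exceedance(intervals):
    """S_Q(r) = probability that an interval is longer than r, for r = 0, 1, ..."""
    counts = np.bincount(np.asarray(intervals))  # histogram of interval lengths
    p = counts / counts.sum()                    # P_Q(r)
    return 1.0 - np.cumsum(p)                    # S_Q(r); S_Q(0) = 1 since intervals >= 1

s = exceedance([1, 1, 2, 3])
```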

The figure below shows S_Q(r) for the 10 texts, for R_Q = 2, 4, 8, 16, 32 and 64. The dashed lines show S_Q for the shuffled texts. It is easy to show that in this case, S_Q(r) = (1 − 1/R_Q)^r ≡ exp(−|ln(1 − 1/R_Q)| r), which approaches the simple exponential exp(−r/R_Q) for R_Q ≫ 1. Accordingly, deviations from a simple exponential can be viewed as a measure of the complexity of a text. The figures show that for R_Q = 2, i.e. when half of the total words (with ranks above the median rank) are considered, S_Q is described, for most texts, by a simple exponential. This changes when we increase R_Q. For R_Q ≥ 4, in all texts S_Q(r) is well approximated by a stretched exponential, exp[−b(r/R_Q)^β] with β < 1. For R_Q above 4, β depends only weakly on R_Q and becomes nearly independent of R_Q for R_Q ≥ 16. Stretched exponential functions, sometimes also referred to as Weibull functions, appear in science in many contexts, e.g. in materials science […].
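The shuffled baseline S_Q(r) = (1 − 1/R_Q)^r is easy to verify numerically: in a shuffled text each position carries a rare word independently with probability 1/R_Q, so the interval lengths are geometric. A sketch with synthetic data (not one of the ten texts):

```python
import numpy as np

rng = np.random.default_rng(1)
R_Q = 8
# a "shuffled text": each word is rare independently with probability 1/R_Q
rare = rng.random(200_000) < 1.0 / R_Q
intervals = np.diff(np.flatnonzero(rare))

r = np.arange(1, 20)
s_emp = np.array([(intervals > k).mean() for k in r])  # empirical S_Q(r)
s_theo = (1.0 - 1.0 / R_Q) ** r                        # geometric prediction
```

Deviations of the real texts from this exponential baseline are then the signature of the complexity of the text.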

We consider R_Q = 64, 32, 16, 8, 4, and 2 (from top to bottom). By definition, S_Q(0) = 1. For transparency, we have multiplied S_Q for R_Q = 32, 16, 8, 4, and 2 by 10^{−2}, 10^{−4}, 10^{−6}, 10^{−8}, and 10^{−10}, respectively, and plotted S_Q as a function of r/R_Q. The dots are the numerical results. The gray lines are the best fits to exp[−b(r/R_Q)^β], with the parameters determined for R_Q ≥ 16 (see SI). For R_Q above 2, stretched exponentials (where β < 1) describe the data, with β nearly the same for R_Q ≥ 16. The exponent varies only slightly in the different texts: means and standard deviations were 1.1 and 0.13 for R_Q = 2, 0.86 and 0.067 for R_Q = 4, 0.85 and 0.059 for R_Q = 8, 0.77 and 0.037 for R_Q = 16, 0.76 and 0.037 for R_Q = 32, and 0.77 and 0.048 for R_Q = 64. The dashed straight lines are for the shuffled texts. For 20 shuffled texts of Les Miserables, the mean of β was 1.0 for all R_Q, with standard deviations of 0.0052, 0.013, 0.0080, 0.012, 0.010, and 0.028 for R_Q = 2, 4, 8, 16, 32, and 64, respectively.

The knowledge of S_Q(r) allows us to estimate the probability W_Q(r; Δr) that, when r words have already passed since the last rare word, a rare word occurs within the next Δr words,

W_Q(r; Δr) = [S_Q(r) − S_Q(r + Δr)] / S_Q(r).

Combining this with the stretched exponential form of S_Q(r) yields, for Δr ≪ R_Q,

W_Q(r; Δr) ≈ b β (Δr/R_Q) (r/R_Q)^{β−1}.

For β < 1, W_Q thus decreases with increasing elapsed distance r: the probability of encountering the next rare word is highest immediately after a rare word. For a purely random arrangement of rare words, in contrast, W_Q(r; Δr) does not depend on r; for Δr = 1 it simply equals 1/R_Q. Consider, for example, R_Q = 64 and Δr = 1, i.e. we ask what is the probability that directly after a rare word with return period 64 another rare word appears in the text. For a pure random arrangement we have W_Q = 1/64 ≈ 0.016, while the Weibull statistics yield a considerably enhanced probability, in line with the clustering of the rare words.
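Assuming the Weibull form S_Q(r) = exp[−b(r/R_Q)^β], this enhancement directly after a rare word can be evaluated numerically. A sketch with illustrative parameters (b = 1, β = 0.77, R_Q = 64 are assumptions for the demonstration, not fitted values from the texts):

```python
import numpy as np

def S(r, b=1.0, beta=0.77, R=64.0):
    """Assumed Weibull exceedance probability of the return intervals."""
    return np.exp(-b * (r / R) ** beta)

def W(r, dr=1, **kw):
    """Probability that the next rare word occurs within the next dr words,
    given that r words have already passed since the last rare word."""
    return (S(r, **kw) - S(r + dr, **kw)) / S(r, **kw)

w0 = W(0)          # probability directly after a rare word
baseline = 1 / 64  # value for a purely random arrangement
```

For β < 1 the probability W decays with r, so it is largest at r = 0, i.e. directly after a rare word.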

Next we consider the intrinsic reason for this clustering. We denote the lengths of the consecutive intervals in the text, for fixed R_Q, by r_i, i = 1, …, N_Q − 1, and ask if interval r_i and interval r_{i+s} are correlated. To this end, we study the autocorrelation function

C_Q(s) = ⟨(r_i − R_Q)(r_{i+s} − R_Q)⟩ / ⟨(r_i − R_Q)²⟩,

with C_Q(0) = 1. For randomly arranged words (for example, after shuffling the text or the intervals), C_Q(s) vanishes for s ≥ 1. If, on the other hand, C_Q(s) decays with a power law, C_Q(s) ∼ s^{−γ} with 0 < γ < 1, the interval sequence is long-range correlated.
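The autocorrelation can be estimated from the interval sequence as a sketch; for reference, memoryless (geometric) intervals, as obtained from a shuffled text, give C_Q(s) ≈ 0 for s ≥ 1:

```python
import numpy as np

def autocorr(x, s):
    """C(s) = <(x_i - <x>)(x_{i+s} - <x>)> / <(x_i - <x>)^2>, with C(0) = 1."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    if s == 0:
        return 1.0
    return float(np.mean(d[:-s] * d[s:]) / np.mean(d * d))

rng = np.random.default_rng(2)
memoryless = rng.geometric(1 / 8, size=50_000)  # shuffled-text reference intervals
c1 = autocorr(memoryless, 1)
```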

C_Q(s), for the same R_Q values as before, is shown in the figure below. For R_Q above 4, C_Q(s) decays with a power law, C_Q(s) ∼ s^{−γ}. The exponent γ is roughly independent of R_Q for the Chinese and Japanese texts; the means and standard deviations of γ were 0.31 and 0.040 for R_Q = 4, 0.34 and 0.035 for R_Q = 8, 0.34 and 0.049 for R_Q = 16, 0.33 and 0.084 for R_Q = 32, and 0.35 and 0.12 for R_Q = 64. For the English and the German texts, γ increases slightly with R_Q, while it decreases for the French texts. The long-range memory is the reason for the clustering of the rare words and for the stretched exponential decay of S_Q for large R_Q. Accordingly, literary texts have a more complex structure than purely long-term persistent records. As we show below, the return intervals also contain a large fraction of white noise, which effectively diminishes the long-term correlations, this way leading to a larger value of γ.

The figure shows the autocorrelation function C_Q(s) for the same R_Q values and the same texts as in the previous figure. For transparency, we have multiplied C_Q for R_Q = 16, 8, 4, and 2 by 10^{−1}, 10^{−2}, 10^{−3}, and 10^{−4}, respectively. Since autocorrelation functions are known to show strong finite-size effects […], we show C_Q(s) only up to s = (N_Q − 1)/100. For R_Q = 2, the first data point was negative for Ulysses, The Great Boer War, Die Traumdeutung, and Daibosatsu Toge. For R_Q ≥ 4, all texts show clear power-law correlations.

The prefactor C_Q(1) characterizes the strength of the long-range memory. For R_Q above 4, C_Q(1) is well above 0.1 and approximately text independent (see Table A in the SI). At the same time, the value of C_Q(1) obtained for the 10 texts is below the value expected for a purely long-range correlated record with the observed exponent γ.

Accordingly, for each threshold R_Q, the intervals r_i can be described as a superposition of a white-noise part r_wn(i) and a long-range correlated part r_lrm(i); the white noise lowers C_Q(1) without changing the power-law exponent γ. For R_Q between 8 and 64, this decomposition describes the data well.
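The effect of such a superposition can be illustrated with synthetic records: a long-range correlated series generated by Fourier filtering (power spectrum ∼ f^{−(1−γ)}, so that C(s) ∼ s^{−γ}), mixed with an equal amount of white noise, keeps the power-law decay of the correlations but with a reduced amplitude C(1). This is a sketch of the mechanism, not the authors' decomposition procedure:

```python
import numpy as np

def long_range_series(n, gamma, rng):
    """Gaussian series with C(s) ~ s^(-gamma), via Fourier filtering."""
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                                   # avoid division by zero at f = 0
    amp = f ** (-(1.0 - gamma) / 2.0)             # spectrum S(f) ~ f^(-(1-gamma))
    phases = rng.uniform(0.0, 2.0 * np.pi, f.size)
    x = np.fft.irfft(amp * np.exp(1j * phases), n)
    return (x - x.mean()) / x.std()

def autocorr(x, s):
    d = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.mean(d[:-s] * d[s:]) / np.mean(d * d))

rng = np.random.default_rng(3)
n = 2 ** 16
lrm = long_range_series(n, gamma=0.3, rng=rng)   # long-range correlated part
wn = rng.standard_normal(n)                      # white-noise part
mixed = (lrm + wn) / np.sqrt(2.0)                # superposition, unit variance

c1_lrm, c1_mix = autocorr(lrm, 1), autocorr(mixed, 1)
```

Since both parts have unit variance, C(1) of the mixture is about half that of the pure long-range correlated series, while the shape of the power-law decay is unchanged.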

To further quantify the clustering of the rare events, we follow […] and study conditional return intervals: for fixed R_Q, we first order the N_Q − 1 intervals according to their length. Then we distinguish between intervals below the median (short intervals) and above the median (long intervals) and determine the average length of the interval that immediately follows them. The figure shows this conditional average, in units of R_Q, for R_Q = 2, 8, and 32. Without memory, the conditional average is identical to R_Q. Due to the long-range memory, the conditional average after the short intervals (open circles) is well below 1, while it is well above 1 after the long intervals (full circles). The effect is enhanced when the segment length over which the preceding intervals are averaged is increased.

For each text, the left-hand graphs show the (conditional) average length of a return interval, in units of the mean interval length R_Q, for R_Q = 2, 8 and 32, after short intervals (open circles) and after long intervals (full circles); the right-hand graphs show the same quantities for R_Q = 32, 8, and 2, respectively. The figure shows that short (long) intervals are more likely followed by short (long) intervals, and quantifies the clustering of rare words for large R_Q that we observed above.
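The conditional averages can be computed as a sketch (the helper is hypothetical; the AR(1)-based intervals below are a synthetic stand-in for a correlated interval sequence, not text data):

```python
import numpy as np

def conditional_means(intervals):
    """Mean interval after a short (below-median) and after a long
    (above-median) interval, both in units of the overall mean."""
    x = np.asarray(intervals, dtype=float)
    med, mean = np.median(x), x.mean()
    prev, nxt = x[:-1], x[1:]
    return nxt[prev < med].mean() / mean, nxt[prev > med].mean() / mean

# synthetic positively correlated "intervals" from an AR(1) process
rng = np.random.default_rng(4)
n = 50_000
g = np.zeros(n)
for i in range(1, n):
    g[i] = 0.7 * g[i - 1] + rng.standard_normal()
intervals = np.exp(g)                  # positive, long-tailed, correlated

after_short, after_long = conditional_means(intervals)
shuffled = conditional_means(rng.permutation(intervals))
```

For the correlated series the average after short intervals falls below 1 and after long intervals above 1, while shuffling restores both to about 1.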

Finally, we like to discuss if the memory quantified for the return intervals can be found directly in the text when each word is substituted by its rank. To this end, we first followed […] and applied second-order detrended fluctuation analysis (DFA2): the rank record is divided into non-overlapping segments of size s, and in each segment i one determines the variance F_i²(s) of the profile (the cumulated, mean-subtracted record) around the best polynomial fit of order 2. After averaging F_i²(s) over all segments and taking the square root, one obtains the fluctuation function F(s).
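The procedure just described can be sketched in a few lines; for an uncorrelated record the fluctuation function grows as F(s) ∼ s^{1/2}, while long-range correlations give a larger exponent:

```python
import numpy as np

def dfa(x, s, order=2):
    """Fluctuation function F(s): rms deviation of the profile around
    polynomial fits of the given order in segments of length s."""
    y = np.cumsum(np.asarray(x, dtype=float) - np.mean(x))  # profile
    t = np.arange(s)
    f2 = []
    for i in range(len(y) // s):
        seg = y[i * s:(i + 1) * s]
        fit = np.polyval(np.polyfit(t, seg, order), t)      # local order-2 trend
        f2.append(np.mean((seg - fit) ** 2))
    return float(np.sqrt(np.mean(f2)))

rng = np.random.default_rng(5)
white = rng.standard_normal(2 ** 14)
# effective exponent between s = 64 and s = 256; ~0.5 for white noise
alpha = np.log(dfa(white, 256) / dfa(white, 64)) / np.log(256 / 64)
```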

Our results for the 10 texts considered (shown in the SI) indicate that for large s, F(s) increases faster than s^{1/2}, i.e. the rank records are not uncorrelated. Accordingly, we also studied the shape of the analogue of C_Q(s) for the rank record itself, where N_Q is substituted by the length of the text, r_i by the rank of the i-th word, and R_Q by the mean rank. It has been shown in […] that this kind of analysis can detect long-range memory directly in a record.

The figure shows the fluctuation functions F(s) of the rank records for the 10 texts considered.

As for the return intervals, the autocorrelation function of the rank record decays with a power law, s^{−γ}; its prefactor, however, is well below the value C_Q(1) found for the return intervals, again indicating a large admixture of white noise.

The figure shows the autocorrelation functions of the rank records for the 10 texts considered. For transparency, we have multiplied the curves by 10^{−1}, 10^{−2}, 10^{−3}, and 10^{−4}, respectively. Since autocorrelation functions are known to show strong finite-size effects […], we show them only up to one hundredth of the record length.

In this article we considered 10 long literary texts from England/Ireland, France, Germany, China, and Japan and studied systematically the occurrence of the rare words in a text. We used techniques from the studies of extreme events which do not require a particular mapping of the words to numbers. We considered the fraction N_Q/N of the rarest words and studied the statistics of the return intervals between them: the exceedance probability S_Q(r), which follows a Weibull function, and the autocorrelation function C_Q(s) of the interval sequence, which decays by a power law. From the magnitude of C_Q(1) we found that the return intervals are not purely long-range correlated, but can be described as a superposition of white noise and a long-range correlated part. The long-range correlated part is responsible for the pronounced clustering of the rare words in a literary text.

We found that the same laws (Weibull functions for the exceedance probability and power laws for the autocorrelation function of the return intervals) hold, with some variations in the parameters, for all languages considered, showing that the rhythm of a text, quantified by the return intervals between the rare words, is surprisingly universal. This is particularly remarkable since the languages considered belong to different families and vary greatly […].

We consider the two laws as important “stylized” facts in languages that complement Zipf's law. Like Zipf's law, both laws have been obtained empirically and lack a rigorous derivation from first principles. The results are universal in the sense that the same kind of functions describe the statistics of the return intervals, but the exponents are clearly not identical. For large thresholds (i.e. small fractions N_Q/N of rare words), the exponents become approximately independent of the threshold.

We concentrated on the arrangements of the rare words in single-authored literary texts. For the quality of the analysis, we had to consider large texts with more than 200,000 words. It would be interesting to see if the arrangement of the rare words in single-authored texts differs from the arrangement in speeches. But since typical speeches consist of only a few thousand words, a return-interval analysis as performed here may suffer from strong finite-size effects.

Further extensive work is needed to see to which extent the laws we find for single-author texts also hold for multi-author texts, and to which extent language engineering, where the properties of rare words are crucial, can benefit from our results. Preliminary work on 3 well-recognized newspapers (see Fig D in the SI) suggests that S_Q(r) and C_Q(s) show a similar behavior in multi-authored texts.
