The authors have declared that no competing interests exist.

Conceived and designed the experiments: AK CMS. Performed the experiments: AK. Analyzed the data: AK. Wrote the paper: AK CMS.

To demonstrate why it is important to correctly account for the (serially dependent) structure of temporal data, we document an apparently spectacular relationship between population size and lexical diversity: for five out of seven investigated languages, there is a strong relationship between the population size of a country and the lexical diversity of its primary language. We show that this relationship is the result of a misspecified model that does not consider the temporal aspect of the data by presenting a similar but nonsensical relationship between the global annual mean sea level and lexical diversity. Several recent studies present surprising links between economic, cultural, political and (socio-)demographical variables on the one hand and cultural or linguistic characteristics on the other hand, but seem to suffer from exactly this problem. We therefore explain the cause of the misspecification and show that it has profound consequences. We demonstrate how a simple transformation of the time series can often solve problems of this type and argue that evaluating the plausibility of a relationship is important in this context. We hope that our paper will help both researchers and reviewers to understand why it is important to use special models for the analysis of data with a natural temporal ordering.

In principle, we could start our paper like this: As complex adaptive systems [

The remainder of this paper is organized as follows: First, we document an apparently spectacular relationship between population size and lexical diversity. We then try to show that this relationship is the result of a misspecified model that does not correctly account for the (serially dependent) structure of temporal data by presenting a similar but nonsensical relationship between the global annual mean sea level and lexical diversity. On this basis, we explain the cause of the misspecification, show that it has profound consequences and demonstrate how a simple data transformation can often solve the problem, before arguing that checking for plausibility is important in this context. The paper ends with some concluding remarks.

We ‘investigated’ the correlation between population size and lexical diversity for American English, British English, Chinese (simplified), French, German, Italian, Russian and Spanish on the basis of data on population size and the type-token ratio based on the Google Books dataset (see

Orange circles: raw data. Blue line: linear prediction of the type-token ratio on the population size. Notes on the bottom right: Pearson correlation (all

Orange circles: raw data. Blue line: linear prediction of the type-token ratio on the population size. Notes on the bottom right: Pearson correlation (all

^{th} century. For the analysis of temporal data, this has important ramifications because the following statement is true

Orange lines: raw data. Blue lines: simple weighted moving average with an 11-year window centered on the current value.

One common approach to avoid spurious correlations is to transform the series prior to the analysis, for example by detrending the series (estimating the trend and subtracting it from the actual series). Another more general solution that often results in stationary series, that is, series in which the mean and the variance do not change as a function of time, is to correlate period-to-period changes instead of the actual levels of the two series [_{max} = .10) and insignificant at all common levels of significance (_{min} = .31).
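The two transformations just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used for the analyses reported here: the two series are synthetic, independent linear trends plus noise, and the helper names (`pearson`, `detrend_linear`, `first_differences`) are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def detrend_linear(series):
    """Subtract a least-squares linear time trend from the series."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


def first_differences(series):
    """Period-to-period changes: x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


# Assumption: two unrelated synthetic series that merely share an upward trend.
rng = random.Random(42)
n_years = 200
series_a = [0.5 * t + rng.gauss(0.0, 1.0) for t in range(n_years)]
series_b = [2.0 * t + rng.gauss(0.0, 1.0) for t in range(n_years)]

r_levels = pearson(series_a, series_b)
r_detrended = pearson(detrend_linear(series_a), detrend_linear(series_b))
r_changes = pearson(first_differences(series_a), first_differences(series_b))

print(f"levels: {r_levels:.2f}  detrended: {r_detrended:.2f}  changes: {r_changes:.2f}")
```

Subtracting a fitted line is only adequate when the trend is deterministic; for series that behave like random walks, differencing is usually the safer choice.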

To demonstrate why it is problematic to correlate two trending time-series, we have simulated 10,000 random walks with drift (cf. Materials and Methods). Each resulting time-series has an average upward trend, but otherwise behaves in a completely random manner. This means that the random walks serve as a proxy for time series with a general upward trend. All series are then correlated with the annual global mean sea level. _{mean} = .52). This result is, of course, far from what we should actually expect for the distribution of correlation coefficients where one variable is a random quantity _{mean} = .00).

Top: Histogram of the correlations between levels. Bottom: Histogram of the correlation between year-to-year changes. The height of the bars in both histograms represents the number of cases in the category. Blue lines: scaled normal density.
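The logic of this simulation can be sketched as follows. The sketch is illustrative only and is not the code used for the reported results: the reference series is a synthetic stand-in for the sea level data (a linear trend plus noise), the series length of 100 is hypothetical, and the noise term is read here as standard normal. The drift range follows the description in Materials and Methods.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def first_differences(series):
    """Period-to-period changes: x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def random_walk_with_drift(n_steps, rng):
    """Previous value + positive drift + white noise (assumed standard normal)."""
    drift = rng.uniform(0.02, 0.2)  # one drift value per series
    value, walk = 0.0, []
    for _ in range(n_steps):
        value += drift + rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


def simulate(n_walks=1000, n_steps=100, seed=1):
    """Return mean correlation with the reference series for levels and changes."""
    rng = random.Random(seed)
    # Assumption: synthetic trending reference series, NOT real sea level data.
    reference = [0.15 * t + rng.gauss(0.0, 0.5) for t in range(n_steps)]
    reference_changes = first_differences(reference)
    r_levels, r_changes = [], []
    for _ in range(n_walks):
        walk = random_walk_with_drift(n_steps, rng)
        r_levels.append(pearson(walk, reference))
        r_changes.append(pearson(first_differences(walk), reference_changes))
    return statistics.fmean(r_levels), statistics.fmean(r_changes)


mean_r_levels, mean_r_changes = simulate()
print(f"mean r (levels):  {mean_r_levels:.2f}")
print(f"mean r (changes): {mean_r_changes:.2f}")
```

Raising `n_walks` to 10,000 matches the number of simulated series mentioned in the text; the smaller default only keeps the run time short.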

This leaves little room for debate: Whenever two variables evolve through time, those variables will almost always look highly correlated even if they are not related in any substantial sense. The reason why standard statistical models fail when it comes to the analysis of time-series has to do with the fact that there is basically no such thing as a univariate time-series: analyzing univariate time series is always "the analysis of the bivariate relationship between the variable of interest and time." ([

To demonstrate why it is also important–especially for a spectacular and unexpected result–to remain skeptical and to carefully check plausibility, let us briefly give an example of what our initial “analysis” of the relationship between lexical diversity and population size would actually imply: if we regress the level of lexical diversity in the Spanish Google Books data on the population size of Spain, we obtain a coefficient of determination of R^{2} = .69. This means that almost 70% of the variance of the lexical diversity variable is “explained” by the population size (for the American English data, it would be even more than 95% of the variance). This model would also imply that every 10 new inhabitants of Spain correspond to 4.56 additional word types (per 1 million word tokens) in the Spanish Google Books data (which also include books written and published in Latin America). We believe that this would be an extraordinary result. In fact, this result would be so extraordinary that it seems wise to first ask: is this result plausible? Can we come up with any good theory regarding this relationship?
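As a rough check on what this coefficient would mean per person, the arithmetic can be written out. The per-inhabitant figures below are derived here from the 4.56 types per 10 inhabitants stated above and are not reported separately.

```latex
\frac{4.56\ \text{types}}{10\ \text{inhabitants}} = 0.456\ \text{types per inhabitant}
\qquad\Longrightarrow\qquad
\frac{1}{0.456} \approx 2.19\ \text{inhabitants per additional type}
```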

A few words on the Google Books data are in order here, as they are the basis of all but one [

To check the plausibility of this result, we would have to face the fact that we still do not have any reliable information about the books included in the corpora. According to the FAQs of the Culturomics project behind the GB data [

Returning to our research question–the correlation between population size and lexical diversity–population growth is affected by the birth of children and the influx of immigrants. Babies do not write books, and only a few immigrants publish books which are acquired by libraries shortly after immigration. So, the strong relationship between lexical diversity and population size would indicate that nearly every second new inhabitant (babies and immigrants alike) is "responsible" for one new word type

To drive home this point, if we regress the level of lexical diversity in the German Google Books data on the population size of China, we obtain a very strong correlation of

From a statistical point of view, this demonstrates why it can be a good idea to model a potential relationship between two trending time series with changes instead of levels. This is also important from a methodological point of view: the fact that two series are both trending does not necessarily imply any substantial relationship [

The general question concerning the Google Books data itself, namely whether the acquisition strategy of major libraries really can serve as a (temporally) unbiased proxy for the evolution of subjective or even latent cognitive traits, is an open research question. Again, we are rather skeptical. For example, a change in the acquisition strategy of one major library is not necessarily motivated by one of the factors we might be interested in; nevertheless, in the aggregated frequency counts of different n-grams, it might look like one. Once again, we want to refer to ([

“All empirical research stands on a foundation of measurement. Is the instrumentation actually capturing the theoretical construct of interest? Is measurement stable and comparable across cases and over time? Are measurement errors systematic?”

The outlined problems all have to do with the fact that–in making the data freely available (which is a fantastic thing)–Google wanted to avoid breaking any copyright laws, and it goes without saying that legal restrictions also have to be taken seriously in this case. However, while we are–like many other empirically-minded researchers–fascinated by the possibilities that the analysis of “big data” offers, we believe that the seemingly prevailing view that the size of the (Google Books Ngram) data will compensate for fundamental methodological problems is not justified.

None of the recently published studies that we mentioned in the introduction explicitly models the underlying temporal structure of the data [

While our analysis indicates that type-token ratios do not depend on population sizes, this does not imply, of course, that the increase of the type-token ratios over time is not interesting in itself. As Harald Baayen (personal communication) points out, this increase could reflect the fact that onomasiological needs increase with the complexity of modern societies [

^{th} century, except for Chinese, which is restricted to the time span 1950–2000 since the size of the Google Books base corpora is not sufficient (< 1,000,000 tokens) for earlier periods.

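Additionally, we simulated 10,000 random walks with drift of the form

$$y_{t,i} = y_{t-1,i} + \delta_{i} + \varepsilon_{t,i}$$

where $y_{t,i}$ is the value of the $i$-th series at time $t$, $\delta_{i}$ is randomly drawn from a uniformly distributed interval [0.02,0.2) and $\varepsilon_{t,i}$ is white noise, normally distributed over the interval [0,1).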

For each resulting series, this means that the current value of the series depends on its previous value plus a positive drift term and a white noise error term. At each point in time, the series takes one random step away from the last position, but as a result of the drift term, the series will have an upward trend in the long run.

All analyses were carried out using Stata/MP2 14.0 for Windows (64-bit version). To ensure maximal replicability,

From a statistical point of view, temporal autocorrelation is problematic because it biases our estimators. Suppose, for example, that we fit a simple time-series regression that can be written as:

$$y_{t} = \beta_{0} + \beta_{1} x_{1t} + \varepsilon_{t}$$

where $y_{t}$ represents the level of our outcome variable at time $t$, $x_{1t}$ is the level of the predictor variable, $\beta_{0}$ is the regression constant, $\beta_{1}$ is the regression coefficient and $\varepsilon_{t}$ is the error term. OLS analysis assumes that there is no autocorrelation between the residuals, that is, $\mathrm{Cov}(\varepsilon_{s}, \varepsilon_{t}) = 0$ for all $s \neq t$. With first-order autocorrelation, $\varepsilon_{t}$ can be written as:

$$\varepsilon_{t} = \rho\,\varepsilon_{t-1} + u_{t}$$

where $u_{t}$ is a white-noise process. In the presence of first-order autocorrelation, the OLS estimators are biased and lead to incorrect statistical inferences [_{mean} = .91), we obtain a normal distribution with a mean close to zero (_{mean} = -.02) for the regression residuals of year-to-year changes.

Top: Histogram for levels. Bottom: Histogram for year-to-year changes. The height of the bars in both histograms represents the number of cases in the category. Blue lines: scaled normal density.
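The contrast between residuals from a regression on levels and residuals from a regression on year-to-year changes can be illustrated with a small sketch. It is a hedged example rather than the analysis reported here: the two series are independent simulated random walks with drift, the series length (150) and drift (0.1) are hypothetical, and the first-order autocorrelation is estimated by a simple regression of each residual on its predecessor.

```python
import random


def random_walk_with_drift(n_steps, drift, rng):
    """Previous value + drift + standard normal noise (an assumption)."""
    value, walk = 0.0, []
    for _ in range(n_steps):
        value += drift + rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


def first_differences(series):
    """Period-to-period changes: x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def ols_residuals(y, x):
    """Residuals of the simple regression y_t = b0 + b1 * x_t + e_t."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]


def lag1_autocorrelation(residuals):
    """Estimate rho in e_t = rho * e_{t-1} + u_t (regression through the origin)."""
    numerator = sum(prev * cur for prev, cur in zip(residuals, residuals[1:]))
    denominator = sum(prev ** 2 for prev in residuals[:-1])
    return numerator / denominator


# Assumption: two unrelated simulated series, both trending upwards.
rng = random.Random(7)
y = random_walk_with_drift(150, 0.1, rng)
x = random_walk_with_drift(150, 0.1, rng)

rho_levels = lag1_autocorrelation(ols_residuals(y, x))
rho_changes = lag1_autocorrelation(
    ols_residuals(first_differences(y), first_differences(x))
)

print(f"rho (levels):  {rho_levels:.2f}")
print(f"rho (changes): {rho_changes:.2f}")
```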


We would like to thank Sascha Wolfer for valuable comments on earlier drafts of this article and Sarah Signer for proofreading. Also, we are grateful to an anonymous reviewer for helpful suggestions and to Harald Baayen for insightful comments and additional inputs on the interpretation of our analyses as mentioned in the text. The publication of this article was funded by the Open Access fund of the Leibniz Association.