The authors have declared that no competing interests exist.

Conceived and designed the experiments: MVJ SV CHE. Performed the experiments: MVJ CHE. Analyzed the data: MVJ SV. Contributed reagents/materials/analysis tools: MVJ CHE. Wrote the paper: MVJ SV CHE.

Computational methods have started to play a significant role in semantic analysis. One particularly accessible area for developing good computational methods for linguistic semantics is color naming, where perceptual dissimilarity measures provide a geometric setting for the analyses. This setting was first studied by Berlin & Kay in 1969, and later through a large data collection effort: the World Color Survey (WCS). From the WCS, a dataset on color naming by 2 616 speakers of 110 different languages has been made available for further research. In analyses of color naming from the WCS, however, the choice of analysis method matters greatly. We demonstrate concrete problems with the choice of metrics made in recent analyses of WCS data, and offer approaches for dealing with the problems we identify. Picking a metric for the space of color naming distributions that ignores perceptual distances between colors assumes a decorrelated system, where strong spatial correlations in fact exist. We demonstrate that the corresponding issues are significantly improved when using Earth Mover's Distance, or Quadratic

We study the effects of method choice for computational analyses of color semantics data sets.

Different languages have different numbers of color terms: English, for instance, has one main term for Blue:

Remarkably, Berlin & Kay discovered that while languages have different numbers of basic color terms, the borders between them tend to run in approximately the same areas. Working without computational support, Berlin & Kay articulated a hierarchy of seven stages of color systems, establishing an order in which a color area detaches and forms its own color term across languages. Thus, they claim that for a language with only two color terms, the terms would divide the entire color space into a Dark and a Light term.

Subsequently, Kay, Berlin, Maffi and Merrifield

This data set has subsequently been the subject of several computationally supported investigations into the properties of color naming schemes: Berlin & Kay

We agree with the basic point advanced by Lindsey & Brown. However, the methods they propose to deal with the problem still suffer from significant issues. In this article, we will detail these issues and propose a different analysis approach. We will also present a derivative dataset from the World Color Survey implementing our analysis approach. This will be freely available for future research through figshare

The World Color Survey

All color distribution figures in this paper use this chart as a reference map, with darker grays for higher distribution intensity values and lighter grays for lower.

Every word in the dataset has been used by some speakers for a collection of cells in this grid structure. We will call this distribution of usage in the structure a

The raw data of the responses in the World Color Survey has to be post-processed to enable effective statistical analysis. The method of post-processing is very important, as it determines the range of available statistical tools as well as their descriptive power.

We can note some approaches in use:

Regier and Kay

The work by Lindsey and Brown

In the study by Jäger

While Kay and Regier have used the CIE L*a*b* space and its perceptual dissimilarities, the problem, as pointed out by Lindsey and Brown, is that they used averages instead of the entire response distribution. The problem with Lindsey and Brown, and with Jäger, on the other hand, is that the statistical methods they use ignore the underlying color space and the perceptual dissimilarities, which can be measured using, for instance, the CIE L*a*b* space.

We propose to combine the best of these studies: an awareness of perceptual distances in the CIE L*a*b* space with methods for handling distances between response distributions.

All the statistical methods in use depend on some way to compute a numeric

An easily accessible metric choice for response distributions would be the Euclidean distance function, d(x, y) = sqrt( Σ_i (x_i − y_i)² ), computed on response vectors x and y indexed by color chip.

However, both the Euclidean distance and the Pearson coefficient assume that the chip responses (that is, the 0–1 values for speaker response vectors, or the response tallies for language response vectors) are independent. Both measures are invariant under permutations of the chips, meaning the responses can be re-ordered without influencing the dissimilarity measures. This cannot be right: it would imply that the structure, the relative position of each chip in the chart, is irrelevant, when the chips have in fact been arranged precisely to reflect the perceptual similarity between the colors they represent. This means that neither of these measures reflects that humans perceive Red as more similar to P
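This permutation invariance is easy to verify numerically. The sketch below is a toy check with numpy/scipy, using random stand-in vectors over the 330 WCS chips rather than actual survey data: re-ordering both vectors with the same chip permutation leaves the Euclidean distance and the Pearson coefficient unchanged.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.random(330)  # stand-in response vector over the 330 WCS chips
y = rng.random(330)

perm = rng.permutation(330)  # an arbitrary re-ordering of the chips

d_orig = np.linalg.norm(x - y)
d_perm = np.linalg.norm(x[perm] - y[perm])
r_orig = pearsonr(x, y)[0]
r_perm = pearsonr(x[perm], y[perm])[0]

# Both measures ignore the chart structure entirely:
assert np.isclose(d_orig, d_perm)
assert np.isclose(r_orig, r_perm)
```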

In the remainder of this paper we discuss the implications of the different metrics used to study the World Color Survey data. We highlight the assumptions that underlie specific metrics and show the implications they have for the results. Further, we suggest a new set of metrics for application to the World Color Survey, argue for their relative benefits, and show how the interpretation of the data changes when these metrics are used. As well as providing a basis for this paper, these representations are provided to other researchers as a foundation for future analysis of the data.

In this section of the paper we will focus on how to integrate the perceptual dissimilarities encoded in the CIE L*a*b* distance measure with the response data from the World Color Survey, in order to produce a similarity measure between different response charts. We illustrate the relationship between these layers in

At the top, we see a sample of four colors, as well as the CIE L*a*b*

We will focus on two different approaches to achieve this. In the first approach (Distribution Metrics), we leave the response vectors as is. We integrate the CIE L*a*b* space information in the algorithm that computes their similarity measures. In the second approach (De-correlation Mapping), we integrate the CIE L*a*b* space information into a transformation of the data itself, producing new vectors that can be further analyzed with classical statistical techniques.

Each chart can be interpreted as a discrete distribution. There exists a significant body of work defining distance functions that can be applied on discrete distributions

We will describe two candidates for a good distribution metric that takes the perceptual dissimilarity information into account: the Earth Mover's Distance and the Quadratic

Intuitively, if we consider the non-zero values of the distributions as the mass of the distribution, a relevant distance measure should reflect the distance between the masses in two charts and not just their relative overlap. The Earth Mover's Distance measure
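The intuition can be sketched in one dimension with scipy's Wasserstein distance, which coincides with the 1-D Earth Mover's Distance; the toy chip line below is an illustrative assumption, not the 2-D WCS chart used in the actual experiments.

```python
import numpy as np
from scipy.stats import wasserstein_distance

chips = np.arange(10.0)  # ten chips on a line, unit spacing

def emd(p, q):
    # 1-D Earth Mover's Distance, with chip positions as the ground metric
    return wasserstein_distance(chips, chips, u_weights=p, v_weights=q)

a = np.eye(10)[0]  # all mass on chip 0
b = np.eye(10)[1]  # all mass on the immediate neighbour
c = np.eye(10)[9]  # all mass on the far end of the line

# Euclidean distance cannot tell the two cases apart...
print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # both sqrt(2)
# ...but EMD reflects how far the mass actually moved.
print(emd(a, b), emd(a, c))  # 1.0 and 9.0
```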

A less computationally taxing method is the Quadratic
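Distances in this family share a common quadratic-form core, d_A(x, y) = sqrt((x − y)ᵀ A (x − y)), for a positive semidefinite chip-similarity matrix A. The sketch below illustrates only this core form with a hypothetical three-chip similarity matrix; the exact variant used in the experiments may differ in normalization.

```python
import numpy as np

def quadratic_form_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)); with A = I this is plain Euclidean."""
    d = x - y
    return float(np.sqrt(d @ A @ d))

# Hypothetical three-chip world: chips 0 and 1 are perceptually similar.
A = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])  # mass moved to the similar chip
z = np.array([0.0, 0.0, 1.0])  # mass moved to the dissimilar chip

print(quadratic_form_distance(x, y, A))  # ~0.447: small, chips are similar
print(quadratic_form_distance(x, z, A))  # ~1.414: large, chips are not
```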

In our experiments we will be comparing the performance of these metrics on the World Color Survey dataset.

The distribution metrics introduced above are computationally intense and require specialized statistical techniques that can work with arbitrary metrics. The de-correlation mapping method will let researchers use more familiar tools and techniques for their analyses and is also much faster.

With the de-correlation mapping we seek to create a transformation of the response vectors themselves that encodes the CIE L*a*b* distances in the transformation. Perceptually close color stimuli produce responses that correlate closely. The transformation we introduce here controls for perceptual similarity, leaving only those correlations that are not directly explained by the perception of stimuli. Classical statistical techniques assume that there is no underlying systematic interdependence in the data, and this transformation makes the WCS data amenable to such techniques.

One method for constructing such transformations comes from the kernel family of machine learning techniques: for any bilinear form that behaves like an inner product (is positive semidefinite on all data samples), a linear transformation can be computed such that the bilinear form is the canonical inner product after transforming the vectors

The matrix of distances between the color chips used for elicitation produces such a bilinear form that models the expected correlation behaviour of distributions over these color chips. Thus, by computing a factorization of the color chip distance matrix, we produce a de-correlating mapping directly applicable on all kinds of charts.
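A minimal sketch of this construction follows; the Gaussian kernel turning chip distances into similarities is an illustrative assumption, as are the random stand-in coordinates. Eigendecomposing the similarity matrix A = Q diag(w) Qᵀ gives the map x ↦ diag(√w) Qᵀ x, after which the canonical inner product of transformed vectors reproduces the bilinear form xᵀAy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                             # toy number of color chips
pos = rng.random((n, 3))          # stand-in CIE L*a*b* coordinates
dists = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
A = np.exp(-dists ** 2 / 2.0)     # similarity kernel (assumed form)

# Factor A = Q diag(w) Q^T; the de-correlating map is x -> diag(sqrt(w)) Q^T x
w, Q = np.linalg.eigh(A)
T = np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

x, y = rng.random(n), rng.random(n)  # two arbitrary response vectors

# After the map, the canonical inner product realizes the bilinear form:
assert np.isclose((T @ x) @ (T @ y), x @ A @ y)
```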

This means that by decomposing a single

One added benefit of this added focus on the role of the underlying metric is that we can start comparing datasets that result from different experimental regimes. Both the Earth Mover's Distance and the Quadratic

The original study by Berlin and Kay

On the one hand we will evaluate the family of available distribution metrics. On the other hand, we will demonstrate how to use the de-correlation mapping to revisit previous research on the data set.

We have criticized several preceding research efforts for using inappropriate statistical techniques. We will now present some examples that demonstrate potential pitfalls with the choice of analysis methods. The following example illustrates the problem well.

The language Amuzgo has 21 color words with registered responses in the World Color Survey dataset. We pick as a reference point one of the commonly used words,

The permutation invariance of the metric used by Lindsey & Brown is visible in how the yellow/green immediate neighbour to

A good distribution analysis method would rank perceptually similar color words closer than perceptually dissimilar words.

However, as can be seen in

For the correlation metric used by Lindsey & Brown

As for the Earth Mover's Distance, we find the adjacent color terms at ranks 4, 5, 12, 13, while the more remote words come in at ranks 17, 18, 20. This is the overall order we would consider appropriate, with a good separation between adjacent and remote distributions.

With the Quadratic

Finally, after our de-correlation transformation we find the adjacent color words at ranks 2, 3, 5, 6, while the more remote color words are at ranks 7, 8, 12. A similar penalization for small usage seems to be in effect for this approach as well.

To summarize: the color words of Amuzgo illustrate how distribution metrics that are insensitive to perceptual distances end up ranking diametrically opposed regions of the color space closer to a sample color distribution than its immediate neighbors. These ranking issues are especially important when the data is used for analysis methods with a global scope, such as the

Lindsey & Brown

We aggregate the WCS into speaker response vectors. For each speaker and word, there is a 0-1 vector picking out the color chips for which this was the word used by the speaker.
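Constructing these vectors is straightforward; the sketch below uses hypothetical raw records, and the word labels "kare" and "mola" are stand-ins rather than actual WCS forms.

```python
import numpy as np

N_CHIPS = 330  # the WCS stimulus chart contains 330 color chips

# Hypothetical raw records: (speaker id, chip index, word used)
responses = [
    ("s1", 0, "kare"), ("s1", 1, "kare"), ("s1", 2, "mola"),
    ("s2", 0, "kare"), ("s2", 2, "mola"), ("s2", 3, "mola"),
]

# One 0-1 vector per (speaker, word), picking out that speaker's chips
vectors = {}
for speaker, chip, word in responses:
    v = vectors.setdefault((speaker, word), np.zeros(N_CHIPS))
    v[chip] = 1.0

print(vectors[("s1", "kare")][:4])  # [1. 1. 0. 0.]
```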

We construct a color chip similarity matrix

After applying the operator

For the experiment performed by Lindsey & Brown, any response that includes any of the 10 achromatic chips is excluded from the analysis. This produces a dataset with 14 236 rows. We have performed a de-correlation transformation on this dataset, followed by a

For the full set of 21 992 speaker response vectors we show the results of

In each of the clustering experiments we have done, we repeated the

As we can see in

Where Lindsey & Brown

Lindsey & Brown state that

We expect that the insensitivity anticipated with their clustering technique is taken care of by the new metric handling choices we present here, and thus, for comparison, we also computed

Here, we can observe some concrete differences in the color progression. Again, the cluster centers found are stable: once they show up, they tend to remain more or less unchanged through the range of cluster counts. Furthermore, we see that the splitting and re-merging behaviour reported by Lindsey & Brown is absent in this approach as well.

For two clusters, the WCS data set produces a split that does not entirely follow from Berlin & Kay's models: one cluster is a

To summarize, we have repeated the

In the experiment performed with the same dataset as Lindsey & Brown, we recover a hierarchy of color clusters different from, but similar to, theirs. Lindsey & Brown observed a curious behaviour of clusters splitting and merging as the cluster count increases: this problem disappears with the use of the de-correlation transformation.

In the extended experiment, using the entire speaker response vector set including achromatic color words, we can confirm and amend the finding of Kay, Berlin & Merrifield

In this paper, we have presented concrete issues with commonly used approaches to the computational analysis of color naming data sets. As research into computational approaches to color semantics is done without a “cheat sheet” available to test results against, the soundness of the methods in use becomes all the more important for ensuring the reliability of any conclusions drawn from computational sources.

In particular, we have examined how the attention to distribution proposed by Lindsey & Brown can be combined with attention to the underlying perceptual metrics of color space.

We recommend against using color space agnostic methods – in particular PCA or clustering techniques directly on response distributions. Instead, we recommend techniques that integrate the structure of perceptual color dissimilarities into the analysis – such as is enabled by the distance datasets we publish or by a de-correlation transformation applied to the data.

These recommendations enable an approach to the analysis of color naming systems that can draw from all available data collection efforts – current or future – while maintaining comparability between different datasets.

We believe that the field of color linguistics is ripe for further fruitful interdisciplinary collaboration between field linguists, psycholinguists, and data scientists.