Giving computers many human languages with multilingual embeddings

Software has long struggled to understand human language. Large language models are rapidly advancing that capability, and high-quality multilingual embeddings are a key step in that direction.

Imagine waking up one day to a million unread messages in your inbox. How could you possibly find the time to read them, let alone respond to them appropriately? This is a reality for companies that receive large volumes of messages, emails, and social media comments.

What if we could plot our messages? Instead of reading them individually, let's first explore them visually.

Interactive visualization: 3,000 sentences in many languages. Hover over the points to see their text. Can you identify the topics of the various clusters? (Best viewed on a computer)

This visual immediately reveals clusters: clumps of messages with similar meanings and topics. Exploring the figure, we can find, for example, that the messages in one cluster are all questions about the weather.


Figure: Sentences in this one region of the figure tend to be about podcasts. Can you identify what other clusters are about?
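How is a map like this made? Here is a minimal sketch of the pipeline, assuming the open-source sentence-transformers and umap-learn libraries (the model name and the sample messages below are illustrative placeholders, not the exact setup behind these figures):

```python
# pip install sentence-transformers umap-learn matplotlib
import matplotlib.pyplot as plt
import umap
from sentence_transformers import SentenceTransformer

# Placeholder messages; in practice these come from your inbox or archive.
messages = [
    "What's the weather like tomorrow?",
    "Quel temps fera-t-il demain ?",               # French: weather question
    "Play the latest episode of my podcast",
    "Reproduce el último episodio de mi podcast",  # Spanish: podcast request
]

# One publicly available multilingual embedding model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(messages)  # shape: (n_messages, 384)

# Project the high-dimensional embeddings down to 2D for plotting.
# n_neighbors is tiny here only because this toy dataset is tiny.
coords = umap.UMAP(n_neighbors=2, n_components=2, random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), text in zip(coords, messages):
    plt.annotate(text, (x, y), fontsize=8)
plt.show()
```

With thousands of real messages instead of four toy sentences, the scatter plot develops the kinds of topical clusters shown in the figures above.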

Not only is this capability useful for exploring text archives, it's also a cornerstone of language understanding in software systems. Categorizing text by topic, tagging it automatically, and building search systems that match on meaning rather than just keywords all rely on a robust embedding model that translates text into a numeric representation capturing its meaning.
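The search use case, for instance, reduces to comparing embedding vectors. A hedged sketch, again with sentence-transformers (the documents and query are made up for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy document collection in mixed languages.
documents = [
    "How do I reset my account password?",
    "Comment réinitialiser mon mot de passe ?",  # French
    "Store opening hours for the holidays",
    "Öffnungszeiten während der Feiertage",      # German
]
query = "I forgot my password"

# Normalized embeddings make cosine similarity a simple dot product.
doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = doc_emb @ query_emb
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {documents[i]}")
```

Notice that the query is in English, yet the French password document can still rank highly: meaning, not keyword overlap, drives the match.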

Interactive visualization: Using Cohere’s multilingual model, English texts (in gray) and other languages (in colors) are arranged by topic regardless of their language. Only a subset of the supported languages are shown here.

Because we’ve used a high-quality multilingual model, the topics we’ve identified describe all our messages, regardless of their language.
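To put rough labels on those topics, one common approach is to cluster the embeddings: sentences land in the same cluster by meaning even when their languages differ. A small sketch using scikit-learn's KMeans (the cluster count and sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "What's the weather today?",
    "Wie wird das Wetter heute?",           # German: weather
    "Will it rain this afternoon?",
    "Play the latest podcast episode",
    "Joue le dernier épisode du podcast",   # French: podcast
    "Subscribe me to this podcast",
]

embeddings = model.encode(sentences)

# Two topics are expected in this toy data: weather and podcasts.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, sentence in sorted(zip(labels, sentences)):
    print(label, sentence)
```

If the model is genuinely multilingual, the German weather question groups with the English ones, and the French podcast request groups with the English podcast sentences.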

By contrast, an English model does not represent other languages well

Figure: Without multilingual support, representations of non-English texts are arbitrarily scattered even when using a cutting-edge English model. Interactive version

Compare the plot at the top with the one above, where the same text is embedded with a model that is not multilingual (but rather mainly English). The model does well for English, but other languages are scattered into separate islands, grouped by language rather than tied together by meaning.
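One way to see this gap numerically is to compare the similarity each model assigns to a sentence and its translation. A sketch using two publicly available checkpoints as stand-ins (an English-focused model versus a multilingual one; the exact scores will vary):

```python
from sentence_transformers import SentenceTransformer, util

pair = ["What's the weather like tomorrow?",
        "Quel temps fera-t-il demain ?"]           # French translation

for name in ["all-MiniLM-L6-v2",                       # mainly English
             "paraphrase-multilingual-MiniLM-L12-v2"]: # multilingual
    model = SentenceTransformer(name)
    emb = model.encode(pair)
    sim = util.cos_sim(emb[0], emb[1]).item()
    # Expect a noticeably higher score from the multilingual model.
    print(f"{name}: cosine similarity = {sim:.3f}")
```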


Dataset credit: MASSIVE and SLURP.