Giving computers many human languages with multilingual embeddings
Software has long struggled to understand human language. Large language models are rapidly advancing that capability, and high-quality multilingual embeddings are a key step in that direction.
Imagine waking up one day to a million unread messages in your inbox. How could you possibly find the time to read them, let alone respond to them appropriately? This is a reality for companies that receive large volumes of messages, emails, and social media comments.
What if we could plot our messages? Instead of reading them individually, let’s first explore them visually.
This visual immediately reveals clusters: clumps of messages with similar meanings and topics. Exploring such a figure, we might find, for example, that the messages in one cluster are all questions about the weather.
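A plot like this is typically produced by embedding each message into a high-dimensional vector and then projecting those vectors down to two dimensions. The sketch below illustrates the projection step with hand-made toy vectors standing in for real embeddings (in practice, each message would be embedded by a multilingual model into hundreds of dimensions); the 2D projection here uses plain PCA via SVD, one common choice among several (UMAP and t-SNE are popular alternatives):

```python
import numpy as np

# Toy stand-ins for real embeddings. In practice, each message would be
# embedded by a multilingual model into a vector of hundreds of dimensions.
embeddings = np.array([
    [0.9, 0.1, 0.0],   # "Will it rain tomorrow?"
    [0.8, 0.2, 0.1],   # "What's the weather like?"
    [0.1, 0.9, 0.0],   # "Play some jazz"
    [0.0, 0.8, 0.2],   # "Put on my workout playlist"
])

# Project to 2D with PCA (computed via SVD of the centered data) so the
# messages can be scattered on a plot. Similar messages land near each other.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T   # shape: (num_messages, 2)

print(coords_2d.shape)  # (4, 2)
```

Each row of `coords_2d` is one point on the plot; the two weather questions end up closer to each other than to the music requests, which is what makes the clusters visible.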
Not only is this capability useful for exploring text archives, it's also a cornerstone of language understanding in software systems. Categorizing text by topic, tagging it automatically, and building search systems that match by meaning rather than just keywords all rely on a robust embedding model that translates text into a numeric representation capturing its meaning.
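To make the search use case concrete, here is a minimal sketch of semantic search over embeddings: score each document against the query by cosine similarity and return the best match. The vectors and document names are illustrative placeholders, not real model outputs:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: the query and each document would come from
# the same embedding model in a real system.
query = np.array([0.9, 0.1, 0.0])        # e.g. "will it rain today?"
docs = {
    "weather_faq": np.array([0.85, 0.15, 0.05]),
    "billing_faq": np.array([0.05, 0.90, 0.30]),
}

# Rank documents by similarity to the query and take the top hit.
best = max(docs, key=lambda name: cosine_sim(query, docs[name]))
print(best)  # weather_faq
```

Because the match is computed on meaning vectors rather than surface words, a query phrased completely differently from the document text can still retrieve it.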
Because we’ve used a high-quality multilingual model, the topics we’ve identified describe all of our messages, regardless of their language.
By contrast, an English model does not represent other languages well
Compare the plot at the top of the post with the one above, where the same text is embedded with a model that is mainly English rather than multilingual. The model does well for English, but text in other languages is scattered into separate islands, grouped by language rather than by meaning.
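The difference between the two plots can be checked numerically: with a multilingual model, a sentence and its translation should get nearly identical embeddings, while an English-mainly model places the non-English sentence far away. The vectors below are toy illustrations of the two regimes, not outputs of any real model:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

en = "what's the weather"
de = "wie ist das wetter"   # German translation of the same question

# Toy vectors: a multilingual model maps translations close together...
multilingual = {en: np.array([0.90, 0.10]), de: np.array([0.88, 0.12])}
# ...while an English-mainly model pushes the German text into its own island.
english_only = {en: np.array([0.90, 0.10]), de: np.array([0.10, 0.95])}

print(cosine_sim(multilingual[en], multilingual[de]))  # close to 1.0
print(cosine_sim(english_only[en], english_only[de]))  # much lower
```

Measuring this translation similarity across many sentence pairs is one simple way to quantify how well an embedding model ties languages together by meaning.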
Dataset credit: MASSIVE and SLURP.