Cohere for AI's new open source project lets developers build AI applications that span over 100 languages

Ben Wodecki, Jr. Editor

February 21, 2024


At a Glance

  • Cohere for AI unveils Aya: A new model and dataset combo for powering multilingual AI workloads.

English is one of the most essential languages used in business. But to serve a global audience more effectively, companies need to be multilingual. Enter Aya, a new AI model that supports 101 different languages. It is from Cohere for AI, the nonprofit research subsidiary of AI startup Cohere.

The Aya model is open source and can be used commercially under its Apache 2.0 license. Aya is also designed to cover languages largely ignored by most advanced models.

Aya could power customer support chatbots or virtual agents. The model could also be used to support content translation or localization of business websites or product marketing.

Cohere claims the model serves double the number of languages covered by existing open source models such as BLOOMZ and mT0. The company also said the model outperforms rivals on natural language understanding, summarization and translation.

Cohere said Aya means "fern" in the Twi language of Ghana and is a symbol of "endurance and resourcefulness which captures the spirit of our own commitment to accelerate multilingual AI progress." The company pointed out that while only 5% of the world speaks English at home, 63.7% of the internet is in English. A lot of the data used to train AI models comes from the internet.


"Unless we address this disproportionate representation head-on, we risk perpetuating this divide and further widening the gap in language access of new technologies," Cohere said in a blog post.

Aya is available via Hugging Face, and you can experiment with the model in the Cohere Playground. To contribute to Cohere's efforts, join the Aya project's Discord server.
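As a rough sketch of what accessing the model from Hugging Face might look like, the snippet below uses the `transformers` library. The repo ID `CohereForAI/aya-101` and the seq2seq interface are assumptions based on the article; check the model card on Hugging Face for the exact identifier and usage.

```python
# Sketch: prompting Aya via Hugging Face transformers (model ID assumed).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "CohereForAI/aya-101"  # assumed Hugging Face repo name


def build_prompt(text: str) -> str:
    """Aya is instruction-tuned, so inputs are plain natural-language prompts."""
    return text.strip()


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Load the model (several GB; cached after first download) and generate."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(prompt), return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example multilingual use case from the article: translation.
    print(generate("Translate to Twi: Good morning, friends."))
```

The same pattern would apply to the other use cases the article names, such as customer-support prompts or website localization, by changing the instruction text.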

Massive multilingual dataset

Also made available is the underlying dataset used to train Aya. This dataset spans some 513 million prompts across 114 languages and includes annotations from native and fluent speakers.

The dataset includes dialect variations, which help Aya return responses that sound organic and natural.

The dataset can also be downloaded from Hugging Face and can power commercial applications.

Upon unveiling the project, Cohere said Aya and its dataset “can effectively serve a broad global audience that have had limited access to-date.”

Cohere joins other research labs trying to democratize AI to encompass underserved societal groups. Meta, for example, has its No Language Left Behind project to support low-resource language translation. And Google’s Universal Speech Model is powering multilingual capabilities in its product lines.



About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
