April 22, 2022
The set has over one million fully labeled data points.
Amazon has released a ‘massive’ open source dataset in a bid to encourage developers to create apps for its Alexa smart assistant.
The dataset, dubbed MASSIVE, is composed of one million labeled utterances spanning 51 languages. Amazon said the dataset would allow data practitioners to “re-create baseline results for intent classification.”
“We are very excited to share this large multilingual dataset with the worldwide language research community,” says Prem Natarajan, vice president of Alexa AI Natural Understanding.
“We hope that this dataset will enable researchers across the world to drive new advances in multilingual language understanding that expand the availability and reach of conversational-AI technologies.”
It’s kind of MASSIVE
The dataset is designed to enable cross-linguistic training on natural-language understanding (NLU) tasks. NLU is a subdiscipline of NLP (natural language processing) and is essentially a system’s ability to understand the meaning of a text and identify the relevant entities.
For instance, given the utterance “What is the temperature in New York?” an NLU model might classify the intent as “weather_query” and recognize relevant entities as “weather_descriptor: temperature” and “place_name: new york.”
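To make the intent-and-entity idea concrete, here is a minimal Python sketch of how such a labeled utterance could be represented and parsed. The inline "[slot_name : value]" annotation style, the NLUResult class and the parse_annotated helper are illustrative assumptions, not Amazon's tooling; only the intent and entity names come from the example above.

```python
import re
from dataclasses import dataclass, field

# Assumed annotation style (hypothetical): entities are marked inline
# as "[slot_name : value]" inside the utterance text.
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]+?)\s*\]")


@dataclass
class NLUResult:
    """Hypothetical container for one labeled utterance."""
    text: str                      # utterance with annotations stripped
    intent: str                    # e.g. "weather_query"
    slots: dict = field(default_factory=dict)  # entity name -> value


def parse_annotated(annotated: str, intent: str) -> NLUResult:
    """Extract entities from an inline-annotated utterance.

    If the same slot name appears twice, the later value wins;
    a fuller implementation would keep a list of (name, value) pairs.
    """
    slots = {name: value for name, value in SLOT_PATTERN.findall(annotated)}
    # Replace each "[name : value]" span with just the value.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return NLUResult(text=plain, intent=intent, slots=slots)


result = parse_annotated(
    "what is the [weather_descriptor : temperature] in [place_name : new york]",
    intent="weather_query",
)
print(result.text)
print(result.intent)
print(result.slots)
```

A trained NLU model would predict the intent and the entity spans from the plain text alone; the sketch only shows the shape of the labels such a model is trained to produce.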
Amazon’s focus on NLU stems from its role as a component of spoken-language understanding (SLU), in which audio is converted to text before NLU is performed.
Massively multilingual NLU models often lack labeled data for training. Amazon’s new open source dataset, however, contains one million labeled virtual-assistant text utterances. It was created by tasking professional translators with localizing the English-only SLURP dataset into 50 typologically diverse languages.
Tools for the new dataset, as well as the modeling code used for baseline results, are available via GitHub. MASSIVE is released under the CC BY 4.0 license, encouraging its broadest possible use across academia and industry.
Alongside the announcement of MASSIVE, Amazon announced a competition using the dataset: Massively Multilingual NLU 2022 (MMNLU-22).
The competition challenges researchers to build multilingual NLU models using MASSIVE, with the best allowed to present at the Empirical Methods in Natural Language Processing conference in Abu Dhabi this December.
There will be no limit on model size, and any data may be used for training so long as it is publicly available. Dev, test and eval splits of the MASSIVE dataset may not be used for model training.
In other Amazon news, the e-commerce giant revealed plans to invest $1 billion in AI and robotics startups working in the logistics and supply chain spaces.