The Perception Test is designed for multimodal models
DeepMind, the Alphabet-owned AI lab behind AlphaGo and AlphaFold, has unveiled the Perception Test, a new benchmark for evaluating an AI model’s multimodal perception capabilities.
There are plenty of benchmarks used to test AI models. The likes of the Turing test and ImageNet let researchers and developers measure performance and fine-tune their work.
Now DeepMind has come up with its own test focused on perception, a capability that is essential for systems such as robots and self-driving cars.
A handful of perception-focused benchmarks already exist, like Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking and VQA for image question-answering.
But according to DeepMind, each of these prior benchmarks “only targets restricted aspects of perception.”
The company contends that multimodal models such as Perceiver, Flamingo and BEiT-3 need to be evaluated across several different modalities, not on audio or visual data alone.
Training and evaluating those models currently means assembling multiple specialized datasets, because no dedicated multimodal benchmark exists, a process DeepMind says is “slow, expensive and provides incomplete coverage of general perception abilities like memory.”
Enter the company's Perception Test: a dataset designed to measure a model's abilities across six task types, including object tracking, point tracking and multiple-choice video question-answering.
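To make the multiple-choice video question-answering task concrete, here is a minimal sketch of how such a task can be scored as top-1 accuracy. The annotation fields, video IDs and file structure below are illustrative assumptions for this article, not the benchmark's actual schema.

from typing import Dict, List

# Hypothetical annotations: each video ID maps to a list of multiple-choice
# questions with the index of the correct option. Field names are assumed
# for illustration and do not reflect the benchmark's real file format.
annotations: Dict[str, List[dict]] = {
    "video_0001": [
        {"question": "Is the object still under the cup at the end?",
         "options": ["yes", "no", "cannot tell"],
         "correct_index": 0},
    ],
}

def score_mc_vqa(predictions: Dict[str, List[int]]) -> float:
    # `predictions` maps video IDs to the option index the model chose for
    # each question, in the same order as the annotations.
    correct, total = 0, 0
    for video_id, questions in annotations.items():
        preds = predictions.get(video_id, [])
        for question, pred in zip(questions, preds):
            correct += int(pred == question["correct_index"])
            total += 1
    return correct / total if total else 0.0

# A model that picks option 0 for the single question above scores 1.0.
print(score_mc_vqa({"video_0001": [0]}))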
DeepMind said its work was inspired by the way children’s perception is assessed in developmental psychology.
The videos that make up the benchmark’s dataset show games or daily activities, which allows developers to define tasks that require knowledge of semantics, an understanding of physics or abstraction abilities like pattern detection.
Crowd-sourced participants labeled the videos with spatial and temporal annotations. DeepMind's researchers then designed questions that probe a model’s ability to reason counterfactually. The corresponding answers for each video were again provided by crowd-sourced participants.
The crowd-sourced videos were obtained from 13 countries, including Brazil, Mexico and the Philippines. DeepMind said obtaining video data from a diverse set of participants was “a critical consideration when developing the benchmark.”
The video data was also selected to include participants of different ethnicities and genders, with the aim of achieving diverse representation within each type of video script.
Perception Test: How to access
DeepMind has made the Perception Test benchmark publicly available; it can be accessed via GitHub.
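As a rough sketch of getting started, the snippet below assumes the repository has been downloaded locally and that its annotations are stored as a JSON file; the path and structure shown are assumptions, so check the repository's documentation for the actual file names and schema.

import json
from pathlib import Path

# Assumed location of an annotation file inside a local copy of the
# Perception Test repository; the real path and format may differ.
annotation_path = Path("perception_test/annotations/validation.json")

with annotation_path.open() as f:
    data = json.load(f)

print(f"Loaded annotations covering {len(data)} entries")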
Further details on the benchmark test are available in DeepMind’s research paper.
DeepMind said it plans to publish a leaderboard ranking how well developers’ models perform on the test compared with others.