AI Business is part of the Informa Tech Division of Informa PLC


AI Practitioner

Machine learning benchmarks revealed to contain ‘pervasive’ labeling errors

by Ben Wodecki
Is it a bear? Is it a sloth? This algorithm won’t tell

Some of the most popular datasets used to test machine learning models are riddled with labeling errors, according to a new study by a trio of data scientists.

ChipBrain’s Curtis Northcutt, MIT’s Anish Athalye, and Jonas Mueller of Amazon conducted the study across 10 popular text, audio, and image benchmarks, unearthing an average error rate of more than 3 percent.

Errors found in the datasets ranged from almost 3,000 in one case to more than five million in another.

Image test sets in particular contained many humorously mislabeled images, including a brown bear identified as a ‘sloth bear,’ a member of the Queen’s Guard labeled an ‘assault rifle,’ and an otter mistaken for a ‘weasel.’

The study, titled Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, suggested that such errors could lead to incorrect assertions being made as to which is the best performing model.
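To illustrate how mislabeled test data could change which model looks best, here is a minimal Python sketch. It is not the study's method or data: the ten-item test set, the two models, and their predictions are all hypothetical, invented only to show the effect. One model reproduces the two labeling mistakes, the other does not.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Hypothetical 10-image test set (illustrative only, not from the study).
# Items 2 and 3 carry wrong labels in the original benchmark.
original_labels = ["bear", "otter", "sloth bear", "weasel", "bear",
                   "otter", "bear", "otter", "bear", "otter"]
corrected_labels = ["bear", "otter", "bear", "otter", "bear",
                    "otter", "bear", "otter", "bear", "otter"]

# Hypothetical model A repeats the two labeling mistakes and misses item 9.
model_a = ["bear", "otter", "sloth bear", "weasel", "bear",
           "otter", "bear", "otter", "bear", "bear"]
# Hypothetical model B gets items 2 and 3 right but misses item 8.
model_b = ["bear", "otter", "bear", "otter", "bear",
           "otter", "bear", "otter", "otter", "otter"]

# Score both models against both label sets.
scores = {
    name: {
        "original": accuracy(preds, original_labels),
        "corrected": accuracy(preds, corrected_labels),
    }
    for name, preds in {"A": model_a, "B": model_b}.items()
}

# Pick the apparent winner under each label set.
best_on_original = max(scores, key=lambda m: scores[m]["original"])
best_on_corrected = max(scores, key=lambda m: scores[m]["corrected"])

print(scores)
print("Best on original labels:", best_on_original)
print("Best on corrected labels:", best_on_corrected)
```

In this toy setup, the model that mimics the labeling mistakes ranks first on the original labels, while the other model ranks first once the labels are corrected.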

Not very intelligent

The Modified National Institute of Standards and Technology image database (MNIST), as well as text archives from 20news and IMDb, were among the datasets analyzed by the trio.

Describing the sheer number of erroneous labels as “pervasive,” the analysts estimated that an average of 3.4 percent of labels across the 10 datasets were incorrect.

The ImageNet validation set was found to contain 2,916 label errors, about 6 percent of its images, while the Amazon Reviews text dataset was estimated to contain around 390,000 label errors, around 4 percent of its examples.

The MNIST dataset, which the trio said was “assumed to be error-free,” contained just 15 human-validated label errors in its test set.

“Higher-capacity models (like NasNet) undesirably reflect the distribution of systematic label errors in their predictions to a far greater degree than models with fewer parameters (like ResNet-18), and this effect increases with the prevalence of mislabeled test data,” the study reads.

The errors found in the datasets were later fixed by humans, with the team hoping that future research will use this improved test data instead of the original erroneous labels.

Concluding their findings, the analysts suggested machine learning practitioners should correct their test set labels “to measure the real-world accuracy you care about in practice.”

Northcutt, Athalye, and Mueller recommended simpler models for datasets with noisy labels — especially for applications trained and evaluated with labeled data that may be noisier than gold-standard ML benchmark datasets.

The study was supported in part by funding from the MIT-IBM Watson AI Lab, MIT Quanta Lab, and the MIT Quest for Intelligence.


