Machine learning benchmarks revealed to contain ‘pervasive’ labeling errors

Is it a bear? Is it a sloth? This algorithm won’t tell

Ben Wodecki

March 30, 2021

Some of the most popular datasets used to test machine learning models are riddled with labeling errors, according to a new study by a trio of data scientists.

ChipBrain’s Curtis Northcutt, MIT’s Anish Athalye, and Jonas Mueller of Amazon conducted the study across 10 popular text, audio, and image benchmarks, unearthing an average error rate of more than 3 percent.

Error counts reached almost 3,000 in one dataset and more than five million in another.

Image test sets in particular contained a plethora of humorously mislabeled images, including a brown bear identified as a ‘sloth bear,’ a member of the Queen’s Guard labeled an ‘assault rifle,’ and an otter mistaken for a ‘weasel.’

The study, titled Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, suggested that such errors could lead to incorrect assertions being made as to which is the best performing model.

Not very intelligent

The Modified National Institute of Standards and Technology (MNIST) image database, as well as the 20news and IMDb text archives, were among the datasets analyzed by the trio.

Describing the sheer number of erroneous labels as “pervasive,” the analysts estimated that an average of 3.4 percent of labels across the 10 datasets were incorrect.

The ImageNet validation set was found to contain 2,916 label errors, about 6 percent of the set, while the Amazon Reviews text dataset was estimated to contain around 390,000 label errors, around 4 percent of its overall sample.

The MNIST dataset, which the trio said was “assumed to be error-free,” contained just 15 human-validated label errors in its test set.

“Higher-capacity models (like NasNet) undesirably reflect the distribution of systematic label errors in their predictions to a far greater degree than models with fewer parameters (like ResNet-18), and this effect increases with the prevalence of mislabeled test data,” the study reads.
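To see why this matters for benchmark rankings, consider a minimal sketch in Python. Everything in it is hypothetical: the ten-image test set, the labels, and the two models' predictions are invented for illustration and are not taken from the study. It assumes a higher-capacity model that has learned to reproduce a systematic labeling mistake (otters tagged as weasels) and a simpler model that has not, then scores both against the original and the corrected labels.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def rank_models(predictions_by_model, labels):
    """Model names sorted from highest to lowest accuracy on `labels`."""
    return sorted(
        predictions_by_model,
        key=lambda name: accuracy(predictions_by_model[name], labels),
        reverse=True,
    )


# Hypothetical test set: 4 otters and 6 weasels (corrected labels).
corrected = ["otter"] * 4 + ["weasel"] * 6
# Original labels: the first two otters are systematically mislabeled.
original = ["weasel"] * 2 + ["otter"] * 2 + ["weasel"] * 6

predictions = {
    # Assumed higher-capacity model: reproduces the systematic mislabeling.
    "high_capacity": ["weasel"] * 2 + ["otter"] * 2 + ["weasel"] * 6,
    # Assumed simpler model: gets the otters right, misses one weasel.
    "simple": ["otter"] * 4 + ["weasel"] * 5 + ["otter"],
}

for name, preds in predictions.items():
    print(
        f"{name}: original={accuracy(preds, original):.1f} "
        f"corrected={accuracy(preds, corrected):.1f}"
    )

print("ranking on original labels: ", rank_models(predictions, original))
print("ranking on corrected labels:", rank_models(predictions, corrected))
```

In this toy setup the higher-capacity model looks perfect on the original labels, yet the simpler model comes out ahead once the two mislabeled images are corrected. That is the kind of ranking reversal the researchers warn about.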

The errors found in the datasets were later fixed by humans, with the team hoping that future research will use this improved test data instead of the original erroneous labels.

Concluding their findings, the analysts suggested machine learning practitioners should correct their test set labels “to measure the real-world accuracy you care about in practice.”

Northcutt, Athalye, and Mueller recommended simpler models for datasets with noisy labels, especially for applications trained and evaluated with labeled data that may be noisier than gold-standard ML benchmark datasets.

The study was supported in part by funding from the MIT-IBM Watson AI Lab, MIT Quanta Lab, and the MIT Quest for Intelligence.

Below, you can see a selection of incorrect labels collected by Curtis Northcutt:

[Image: examples of incorrect labels. Credit: Curtis Northcutt]

About the Author(s)

Ben Wodecki

Assistant Editor
