Scale AI surveyed more than 1,300 machine learning teams
Data quality is the biggest challenge faced by machine learning (ML) teams when acquiring training data, according to a recent survey of more than 1,300 practitioners in the field.
A third of respondents said they encounter data quality problems, followed by issues with collection, analysis, storage and versioning, according to Zeitgeist: AI Readiness Report by Scale AI.
These problems must be addressed since they have a “significant downstream impact” on ML efforts and teams often cannot model effectively without quality data, the survey said.
In the report, ML teams said they struggle to manage data volume, complexity, and scarcity. Unstructured data poses a particular challenge. Practitioners find that curating data for their models affects how quickly they can deploy their ML projects. Without high-quality data, teams cannot create robust models.
Variety, volume and noise
Factors contributing to data quality include variety, volume and noise.
In the survey, 37% of respondents said they find it difficult to obtain the data variety they need to improve model performance. Those working with unstructured data face the biggest challenge in getting that variety.
Since most data today is unstructured, ML teams must have a strategy for managing this data to enhance data quality.
ML teams working with unstructured data are more likely than those working with semi-structured or structured data to have too little data.
Most respondents reported problems with their training data, with data noise the largest headache (67%), followed by data bias (47%) and domain gaps (47%). Only 9% reported no such issues.
The report offered these five tips for data-centric AI development from Andrew Ng, co-founder of Google Brain.
- Make labels consistent
- Use consensus labeling to spot inconsistencies
- Clarify labeling instructions
- Toss out noisy examples (because more data is not always better)
- Use error analysis to focus on a subset of data to improve
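As a rough illustration of the consensus-labeling tip above, the sketch below (with hypothetical example IDs and labels, not taken from the report) computes a majority label per example and flags those where annotators disagree:

```python
from collections import Counter

def consensus_label(labels, min_agreement=0.8):
    """Return (majority_label, agreement_ratio, is_consistent)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement, agreement >= min_agreement

# Each example was labeled by three annotators (hypothetical data).
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}

for example_id, labels in annotations.items():
    label, agreement, ok = consensus_label(labels)
    if not ok:
        # Inconsistent examples are candidates for relabeling or
        # for clarifying the labeling instructions.
        print(f"{example_id}: majority={label}, agreement={agreement:.0%}")
```

Examples that fail the agreement threshold are exactly the ones error analysis would prioritize: either the instructions are ambiguous or the example is genuinely noisy and may be worth tossing out.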
Preparing the data
When it comes to data preparation, curating data is the biggest challenge (33%) followed closely by annotation quality (30%).
Curating data – taking out corrupted data, tagging with metadata and identifying relevant data – is critical to prevent wasting time and money on annotating what could end up being unusable data.
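A minimal sketch of what such a pre-annotation curation pass might look like, assuming hypothetical record fields (`text`, `source`) and a caller-supplied relevance check:

```python
def curate(records, is_relevant):
    """Drop corrupted records, tag the rest with metadata,
    and keep only examples relevant to the task."""
    curated = []
    for record in records:
        text = record.get("text")
        if not text or not text.strip():   # corrupted / empty record
            continue
        if not is_relevant(text):          # not useful for this task
            continue
        record["meta"] = {
            "length": len(text),
            "source": record.get("source", "unknown"),
        }
        curated.append(record)
    return curated

# Hypothetical raw feed: one usable record, one corrupted, one irrelevant.
raw = [
    {"text": "order #123 arrived damaged", "source": "support"},
    {"text": "", "source": "web"},
    {"text": "weekly newsletter", "source": "email"},
]
kept = curate(raw, is_relevant=lambda t: "order" in t)
```

Only the records that survive this pass would be sent on for annotation, which is where the cost savings come from.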
Annotating data means adding context to raw data for ML models to generate predictions, and not annotating well leads to “poor” model performance, according to the survey.
One issue with sourcing data from external service providers is that their data feeds may not be of the highest quality, so manual auditing is often necessary.
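One simple way to make manual auditing tractable is to spot-check a random sample of an external feed before it enters the training set. A minimal sketch, with a hypothetical feed and an assumed sample fraction:

```python
import random

def audit_sample(records, fraction=0.05, seed=42):
    """Draw a reproducible random sample for human spot-checking."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

# Hypothetical external feed of pre-labeled records.
feed = [{"id": i, "label": "positive"} for i in range(200)]
to_review = audit_sample(feed)   # records routed to a human reviewer
```

The error rate found in the sample gives a rough estimate of the feed's overall quality before committing to it.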
Scaling is an issue
Scaling is a challenge for most ML teams, with 38% of those surveyed citing deployment as the biggest hurdle. Larger companies find it more difficult to identify issues in their models.
One key trend is that organizations that focused on data annotation infrastructure were able to retrain existing models, deploy new models, and transition to production at a faster pace.
Also, ML teams that collaborated with their data annotation partners were able to deploy models at an accelerated pace.
About 73% of those surveyed used synthetic data in their projects, either because real-world data offered too few examples of edge cases or because of legal and privacy concerns around real-world data.
After acquiring data, the next stage in the ML lifecycle is model development, deployment, and monitoring. A robust ML model needs data augmentation, multiple iterations on a dataset, comparative testing of model architectures, and production testing.
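As an illustration of the data augmentation step, the hypothetical sketch below generates extra training variants of a text example by randomly dropping words; the function name and parameters are assumptions, not drawn from the report:

```python
import random

def augment(sentence, n_variants=2, drop_prob=0.15, seed=0):
    """Produce variants of a sentence by randomly dropping words."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_variants):
        # Keep each word with probability (1 - drop_prob);
        # fall back to the full sentence if everything is dropped.
        kept = [w for w in words if rng.random() > drop_prob] or words
        variants.append(" ".join(kept))
    return variants

examples = augment("the delivery arrived two days late and the box was damaged")
```

Each variant feeds another iteration on the dataset, which is one cheap way to probe how robust a model architecture is to noisy input.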
Feature engineering is a big challenge in model development. It is used to create models on structured data for such things as recommendation systems and predictive models.
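A minimal sketch of feature engineering on structured data, assuming a hypothetical order record with `timestamp`, `quantity`, and `unit_price` fields, as might feed a recommendation system or predictive model:

```python
from datetime import datetime

def engineer_features(order):
    """Derive model-ready features from a raw structured record."""
    ts = datetime.fromisoformat(order["timestamp"])
    return {
        "total": order["quantity"] * order["unit_price"],
        "is_weekend": ts.weekday() >= 5,   # Saturday or Sunday
        "hour_of_day": ts.hour,
        "is_bulk": order["quantity"] >= 10,
    }

features = engineer_features({
    "timestamp": "2024-03-16T14:30:00",   # a Saturday
    "quantity": 12,
    "unit_price": 4.5,
})
```

The derived columns (totals, calendar flags, thresholds) are what the model actually consumes, which is why this step tends to dominate development time on structured data.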