An opinion piece by a VP Analyst at Gartner
Real-world data is seen as the gold standard for analysis, but it can be difficult to access, expensive to use and constrained by regulations. That’s where synthetic data comes into its own: although it is often dismissed as a low-quality substitute, it can offer real operational benefits.
Deployed correctly, synthetic data can help data and analytics leaders build more efficient AI models, taking their organizations’ AI applications to the next level. Gartner estimates that by 2030, synthetic data will overshadow real data in a wide range of AI models.
Using synthetic data
Real data generally provides the best insights. However, it can be expensive, biased or unavailable due to privacy regulations.
Here, synthetic data can be an effective alternative or supplement: it provides access to richer annotations that can be used to build accurate and extensible AI models. Correctly combined with whatever real data is available, synthetic data creates enhanced datasets that help alleviate the weaknesses of real data.
For example, organizations can employ synthetic data to test a new system where no live data exists, or where the available data is biased. It is also useful for supplementing small datasets that might otherwise be ignored, and for cases where real datasets cannot be used, shared or moved.
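To make the "supplementing small datasets" idea concrete, here is a minimal sketch in Python. It fits a single Gaussian to a small real sample and draws synthetic values from it; the dataset and function name are illustrative assumptions, and production synthetic-data generators model joint distributions rather than one variable.

```python
import random
import statistics

def augment_with_synthetic(real_values, n_synthetic, seed=0):
    """Fit a simple Gaussian to a small real sample and draw
    synthetic values from it. A minimal sketch only: real generators
    model the full joint distribution, not a single Gaussian."""
    rng = random.Random(seed)  # seeded for reproducibility
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n_synthetic)]

# Hypothetical example: a handful of observed transaction amounts,
# padded out with synthetic values for model training.
real = [102.5, 98.1, 110.3, 95.7, 104.2]
combined = real + augment_with_synthetic(real, 50)
```

The combined dataset keeps the real observations intact while adding statistically similar synthetic points, which is the "enhanced dataset" pattern described above.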
Synthetic data and the future of AI
Synthetic data is of paramount importance to the future of AI because it lets organizations explore the technology’s potential where real data falls short.
As discussed, there are many applications for synthetic data, including data pseudonymization and anonymization, which are must-haves for any modern data science team. Beyond that, data scientists can feed real observations into generative models and retrieve artificial data that is often more useful than direct observation alone.
Synthetic data is also beneficial for hackathons, product demonstrations and internal prototyping, where a dataset needs to be replicated with the right statistical attributes. For example, financial services institutions, such as banks, often use synthetic data when setting up multiagent simulations to better understand market behaviors, improve their lending decisions or fight fraud. Similarly, retailers use synthetic data when setting up cashier-free stores or when analyzing customer demographics.
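The "right statistical attributes" requirement can be sketched as follows: generate synthetic records whose class frequencies match proportions observed in a real dataset, so a demo or hackathon can run without exposing the real records. The field name, categories and proportions here are hypothetical.

```python
import random

def synthetic_records(categories, weights, n, seed=1):
    """Generate synthetic categorical records whose class frequencies
    approximate observed proportions -- e.g. for a product demo where
    the real customer table cannot be shared."""
    rng = random.Random(seed)  # seeded so demos are repeatable
    return [{"segment": rng.choices(categories, weights)[0]}
            for _ in range(n)]

# Hypothetical segment proportions measured on the real dataset.
demo = synthetic_records(["retail", "business", "premium"],
                         [0.6, 0.3, 0.1], 1000)
```

Each record is fake, but aggregate statistics over the demo dataset resemble those of the original, which is what demonstrations and prototypes usually need.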
An additional factor that makes synthetic data valuable is the accuracy it can bring to machine learning (ML) models. Real-world data is happenstance: it does not capture every permutation of conditions or possible events. Synthetic data counters this problem by generating data for conditions that have not yet been observed.
The breadth of synthetic data’s applicability makes it a critical accelerator for AI: it enables AI where real data is scarce or absent.
Risks of using synthetic data
Although synthetic data has advantages, it also presents significant risks and limitations.
For example, the quality of synthetic data depends on the quality of the model that generated it. Using synthetic data therefore requires additional verification steps, such as comparison against human-annotated, real-world data, to ensure its validity.
Additionally, synthetic data can be misleading: it may produce inferior results, and it is not guaranteed to preserve privacy.
Due to these challenges, synthetic data faces user skepticism, with many deeming it ‘inferior’ or ‘fake.’ As adoption widens, business leaders may also question how the data was generated, particularly the transparency and explainability of the techniques involved.