What is synthetic data?
Synthetic data is artificially generated, annotated information that computer algorithms or simulations create to stand in for real-world data when validating mathematical models or training machine learning (ML) models. Although it does not record real-world events directly, synthetic data is generated from real inputs, so it statistically reflects the real-world data it is based on.
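To make "statistically reflects the real-world data" concrete, here is a minimal sketch that assumes the simplest possible generator: a single normal distribution fitted to one numeric column. The measurements and the function names (fit_gaussian, sample_synthetic) are hypothetical and only for illustration. Real generators are usually far richer.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of the real observations."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from the fitted normal distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" measurements (illustrative numbers, not from any dataset).
real_heights_cm = [162.0, 175.5, 168.2, 181.1, 158.9, 171.4, 166.7, 177.3]

mean, stdev = fit_gaussian(real_heights_cm)
synthetic_heights_cm = sample_synthetic(mean, stdev, n=1000)

print(f"real mean={mean:.1f} stdev={stdev:.1f}")
print(f"synthetic mean={statistics.mean(synthetic_heights_cm):.1f} "
      f"stdev={statistics.stdev(synthetic_heights_cm):.1f}")
```

None of the synthetic values has to match a real measurement, yet the two sets share roughly the same mean and spread.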
Use cases for synthetic data in machine learning
Research suggests that synthetic data can often be as useful as real-world data for training ML models. Because it keeps that utility without corresponding to real people, processes, or events, artificially created data reduces privacy, sensitivity, and regulatory concerns. Synthetic data also allows datasets to be tailored to specific conditions that are unavailable or impossible to collect. This makes it useful for training models on edge cases that rarely occur but that can undermine the generalizability and adaptability of ML models if they are left out of training.
Synthetic data challenges
Despite its benefits, synthetic data comes with distinct challenges. Its malleability means that quality can vary widely, so it must be verified against human-annotated data or the original dataset. Synthetic data also often lacks the outliers that are sometimes needed to train more generalizable ML models. Further, if the data used to generate synthetic data is biased, the output can perpetuate or even magnify that bias. Finally, synthetic data does not eliminate the need for real-world data, which is still required for generation and verification.
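One way to approach the verification step is to compare the distribution of a synthetic column with the original. The sketch below is an assumption about how such a check might look, not a prescribed method. It computes a two-sample Kolmogorov-Smirnov statistic and simple summary gaps with the standard library only. All of the data are simulated for illustration, and the "too_narrow" set deliberately mimics a generator that loses outliers.

```python
import bisect
import random
import statistics

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical cumulative distribution functions (0.0 means identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def summary_gap(real, synthetic):
    """Absolute differences in mean and in standard deviation."""
    return (abs(statistics.mean(real) - statistics.mean(synthetic)),
            abs(statistics.stdev(real) - statistics.stdev(synthetic)))

rng = random.Random(1)
real = [rng.gauss(50, 10) for _ in range(500)]        # stand-in for the original data
faithful = [rng.gauss(50, 10) for _ in range(500)]    # generator that matches the spread
too_narrow = [rng.gauss(50, 3) for _ in range(500)]   # generator that drops the tails

for name, synthetic in [("faithful", faithful), ("too_narrow", too_narrow)]:
    mean_gap, stdev_gap = summary_gap(real, synthetic)
    print(f"{name}: KS={ks_statistic(real, synthetic):.3f} "
          f"mean gap={mean_gap:.2f} stdev gap={stdev_gap:.2f}")
```

The means of both synthetic sets are close to the real one, so a mean-only check would miss the problem. The larger KS statistic and standard deviation gap are what flag the set that has lost its outliers.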
Synthetic data generation methods
Synthetic data can be generated with several techniques, including:
- Generative modeling automatically learns patterns in the data and creates outputs that match the real-world data's distribution.
- Domain randomization rapidly produces variations of images with different colors, lighting, and poses, which increases the variety of the training data and can improve model accuracy.
- Agent-based modeling (ABM) creates individual agents that interact with one another. The simulation then produces data from those artificial interactions, which can be useful for studying how complex systems, processes, and people interact.
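Of the techniques listed above, domain randomization is perhaps the easiest to sketch in a few lines. The example below is a toy illustration under stated assumptions: the image is a tiny grid of RGB tuples, and a horizontal flip stands in for pose changes. Production pipelines typically randomize a rendered 3D scene instead. The function name randomize_image and the parameter ranges are hypothetical.

```python
import random

def randomize_image(pixels, rng):
    """Return one randomized variant of an image given as rows of (r, g, b) tuples."""
    brightness = rng.uniform(0.5, 1.5)                 # lighting variation
    tint = [rng.uniform(0.8, 1.2) for _ in range(3)]   # per-channel color variation
    flip = rng.random() < 0.5                          # crude stand-in for pose variation

    variant = []
    for row in pixels:
        new_row = [
            # Scale each channel, then clamp to the valid 0..255 range.
            tuple(min(255, max(0, round(c * brightness * t))) for c, t in zip(px, tint))
            for px in row
        ]
        variant.append(new_row[::-1] if flip else new_row)
    return variant

# Hypothetical 2x2 base image; its label would stay the same for every variant.
base_image = [[(200, 30, 30), (30, 200, 30)],
              [(30, 30, 200), (120, 120, 120)]]

rng = random.Random(42)
variants = [randomize_image(base_image, rng) for _ in range(5)]
print(f"generated {len(variants)} variants of one labeled image")
```

Because each variant comes from an image whose annotation is already known, the label can be reused. A single annotated example therefore yields many training examples.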
Although complex generative models are increasingly used in academic research, simulations remain popular because they support segmentation and classification tools.
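As a rough illustration of simulation-based generation, the following sketch runs a very small agent-based model in which agents meet at random and an infection may pass between them. Every interaction is logged as a synthetic record. The scenario, the function name simulate_contacts, and all parameters are assumptions chosen for illustration, not a reference model.

```python
import random

def simulate_contacts(n_agents=50, n_steps=20, p_transmit=0.3, seed=7):
    """Simulate random pairwise meetings and log each one as a synthetic record."""
    rng = random.Random(seed)
    infected = {0}   # agent 0 starts out infected
    records = []     # one dictionary per simulated interaction

    for step in range(n_steps):
        for _ in range(n_agents):                  # n_agents meetings per time step
            a, b = rng.sample(range(n_agents), 2)  # two distinct agents meet
            transmitted = False
            # Transmission is only possible when exactly one of the two is infected.
            if (a in infected) != (b in infected) and rng.random() < p_transmit:
                infected.update((a, b))
                transmitted = True
            records.append({"step": step, "agent_a": a, "agent_b": b,
                            "transmitted": transmitted})
    return records, infected

records, infected = simulate_contacts()
print(f"{len(records)} synthetic interaction records, "
      f"{len(infected)} agents infected at the end")
```

The records come with labels for free: the simulation knows exactly which meetings caused a transmission, something that would be hard to observe in real contact data.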