Synthetic data is artificially generated, annotated information that computer algorithms or simulations create to stand in for test data when validating mathematical models or training machine learning (ML) models. Although it does not record real-world events directly, synthetic data is generated from real inputs, so it statistically reflects the real-world data it is derived from.
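As a minimal sketch of what "statistically reflects the real-world data" can mean, the example below fits a simple Gaussian to a handful of real values and samples new values from it. The measurements, the function names, and the choice of a Gaussian are all illustrative assumptions; practical generators typically model far richer structure.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of the real observations."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real values."""
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" measurements (for example, heights in centimetres).
real_heights = [158.0, 162.5, 165.0, 170.2, 171.8, 175.5, 180.1, 183.4]

# The synthetic values are new draws that share the real data's mean and spread.
synthetic_heights = generate_synthetic(real_heights, n=1000)
print(statistics.mean(real_heights), statistics.mean(synthetic_heights))
```

The synthetic values are drawn from the fitted distribution rather than copied from the real set, while their mean and spread stay close to the original's.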
Research suggests that synthetic data can often be as effective as real-world data for training ML models. Because it does not describe real people, processes, or events, artificially created data also reduces privacy, sensitivity, and regulatory concerns without necessarily sacrificing utility. Synthetic datasets can also be tailored to specific conditions for which real data is unavailable or impossible to collect. This makes synthetic data useful for training models on edge cases that occur rarely but that, if left out of training, can undermine the generalizability and adaptability of ML models.
Despite its benefits, synthetic data comes with distinct challenges. Because it is so malleable, its quality can vary widely, so it must be verified against human-annotated data or the original dataset. Synthetic data also often lacks the outliers that are sometimes needed to train more generalizable ML models. Further, if the data used to generate it is biased, the synthetic output can perpetuate or even magnify that bias. Finally, synthetic data does not eliminate the need for real-world data, which is still required for generation and verification.
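One way to verify synthetic data against the original dataset is to compare their distributions. The sketch below computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distribution functions) using only the standard library. The sample values are hypothetical, and this is one possible check among many, not a prescribed procedure.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions.

    0.0 means the empirical distributions coincide; 1.0 means they do not overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # Both empirical CDFs are step functions, so checking every data point suffices.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical values: an original sample, a faithful synthetic copy, and a poor one.
original = [1.0, 2.0, 3.0, 4.0]
close_copy = [1.1, 2.1, 2.9, 4.2]
shifted = [11.0, 12.0, 13.0, 14.0]

print(ks_statistic(original, close_copy))  # small gap: distributions are similar
print(ks_statistic(original, shifted))     # large gap: distributions differ
```

A low statistic only shows that one-dimensional distributions match; it says nothing about bias inherited from the source data or about missing outliers, which need separate checks.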
Synthetic data can be generated through a number of techniques, including simulations and generative models.
Although complex generative models are increasingly used in academic research, simulations remain popular because they support segmentation and classification tools.
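A toy illustration of why simulations pair naturally with segmentation and classification: since the simulator places every object itself, it can emit a per-pixel mask and a class label alongside each generated image. This is a hedged sketch under that assumption; the function `simulate_scene`, the grid size, and the two labels are hypothetical and not taken from any particular tool.

```python
import random

def simulate_scene(width=8, height=8, seed=0):
    """Place one bright rectangle on a noisy grid; return (image, mask, label)."""
    rng = random.Random(seed)
    w, h = rng.randint(2, 4), rng.randint(2, 4)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    label = "square" if w == h else "rectangle"  # classification label

    # Dim background noise in [0, 0.2).
    image = [[rng.random() * 0.2 for _ in range(width)] for _ in range(height)]
    mask = [[0] * width for _ in range(height)]  # segmentation mask

    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = 0.8 + rng.random() * 0.2  # bright object pixel
            mask[y][x] = 1                          # annotation comes for free
    return image, mask, label

image, mask, label = simulate_scene(seed=1)
print(label, sum(sum(row) for row in mask))
```

Each call yields an image together with exact annotations, with no manual labeling step.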