Building and training machine learning models usually requires access to high-quality training data. But such data may not be available, for instance because of high costs or privacy concerns.
This is where synthetic data comes in. Synthetic data is generated by algorithms rather than collected from real-world events. It is important, however, that the synthetic data mirror the statistical properties of the original data, such as the means, spreads, and correlations of its variables. Otherwise, the algorithms will be trained on incorrect information. Or, in a line often attributed to Mark Twain, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
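To make "mirroring the statistical properties" concrete, here is a minimal sketch in Python. It is an illustration under strong assumptions, not how any particular synthetic data tool works: the "real" data are made up, the function names `fit`, `sample`, and `pearson` are hypothetical, and the generator is a plain bivariate normal distribution fitted to two columns. Practical generators are usually far more elaborate, but the check at the end is the same idea: compare summary statistics of the synthetic data against the original.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(xs, ys):
    """Summarize two columns by their means, standard deviations, and correlation."""
    return (statistics.fmean(xs), statistics.fmean(ys),
            statistics.stdev(xs), statistics.stdev(ys), pearson(xs, ys))


def sample(params, n, rng):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, r = params
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(mx + sx * z1)
        # Mixing z1 and z2 this way gives y the fitted correlation r with x.
        ys.append(my + sy * (r * z1 + math.sqrt(1 - r * r) * z2))
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for the original data (invented for this example): two correlated columns.
real_x = [rng.gauss(50, 5) for _ in range(5000)]
real_y = [0.5 * x + rng.gauss(0, 3) for x in real_x]

params = fit(real_x, real_y)
synth_x, synth_y = sample(params, 5000, rng)
synth_params = fit(synth_x, synth_y)

# Compare the statistics of the original and the synthetic data side by side.
labels = ("mean x", "mean y", "std x", "std y", "corr(x, y)")
for label, real_value, synth_value in zip(labels, params, synth_params):
    print(f"{label:>10}: real={real_value:.3f}  synthetic={synth_value:.3f}")
```

If the two columns of numbers in the printout are close, the synthetic data preserve these particular properties. Note that this only checks the statistics that were fitted in the first place; a model trained on such data would still miss anything the bivariate normal cannot express, such as skew or nonlinear relationships.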