Synthetic Data: unlocking innovation without compromising privacy
In today’s data-driven world, one of the biggest challenges we face is how to protect individual privacy without slowing down scientific and technological progress. How can we responsibly analyse, develop, and innovate while handling sensitive information? One possible answer could be synthetic data.
What is Synthetic Data?
Synthetic data is artificially generated information that imitates real personal data without revealing the details of any real person. It is produced by AI algorithms and models rather than collected from real-world events or human activity. Because it serves as a substitute for real data, researchers can work with realistic datasets while safeguarding privacy, for example to train machine learning models that predict disease development.
How is Synthetic Data generated?
Synthetic data is not simply made up: it is generated by algorithms that learn generative models from real data. There are several techniques for generating it, depending on the use case and the type of data needed. Some of the most common methods are:
- Statistical distribution: One of the earliest approaches involves analysing real data to determine its statistical properties and estimate the relationships between variables. Then, synthetic samples are generated to match these properties and follow the same logic and distributions.
- Model-based simulations: In some cases, especially in behavioural or epidemiological studies, researchers build simulations based on known rules, such as how a disease spreads in a population. These simulations create synthetic datasets that resemble real-world dynamics.
- Deep learning methods: Today, deep learning techniques, especially Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are at the forefront of synthetic data generation. These models are trained on real datasets and then generate new, realistic examples that are statistically similar to the originals without reproducing the records of real individuals. This makes them well suited to complex and multimodal synthetic data, including images, videos or text. In a research project such as HEREDITARY, they could be used to generate realistic genetic profiles or clinical trajectories.
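To make the first approach in the list above more concrete, here is a minimal sketch of the statistical-distribution idea. It assumes the data has just two numeric columns that are roughly jointly Gaussian. The column meanings (age, blood pressure), the toy values, and the function names `fit_gaussian_2d` and `sample_synthetic` are all invented for illustration and are not taken from any real dataset or from HEREDITARY.

```python
import math
import random
import statistics

# Toy "real" data: (age, systolic blood pressure). Invented values, not real records.
real_rows = [(34, 118), (45, 125), (52, 130), (61, 142),
             (29, 115), (70, 150), (48, 128), (56, 135)]

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    rho = max(-1.0, min(1.0, cov / (sx * sy)))  # clamp against rounding error
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": rho}

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from a bivariate normal with the fitted parameters."""
    rng = random.Random(seed)
    p = params
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = p["mx"] + p["sx"] * z1
        # Mixing z1 into y reproduces the fitted correlation between the columns.
        y = p["my"] + p["sy"] * (p["rho"] * z1 + math.sqrt(1 - p["rho"] ** 2) * z2)
        rows.append((x, y))
    return rows

params = fit_gaussian_2d(real_rows)
synthetic_rows = sample_synthetic(params, n=5000, seed=42)
```

None of the synthetic rows is copied from the input, yet refitting the model on them gives means and a correlation close to the originals. Real tabular data usually needs richer models than a single Gaussian, and a sketch like this offers no formal privacy guarantee on its own.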
The Massachusetts Case
The story of synthetic data begins with one of the most well-known privacy failures: the Massachusetts Group Insurance Commission case. In the 1990s, a privacy breach in Massachusetts showed that simply anonymizing health data wasn’t enough.
The Group Insurance Commission, an agency of the state of Massachusetts, decided to release the health records of state employees to the public, believing that removing obvious identifiers like names and addresses would be enough to protect individual privacy. Its goal was to enable researchers to analyze health trends without compromising personal information.
Is removing names really enough to keep data anonymous? Using only publicly available information, such as birth dates, zip codes, and gender, researchers re-identified the medical records of the Massachusetts governor by linking the “anonymized” health data with voter registration lists. This revealed a shocking truth: most people are uniquely identifiable from just a few data points, such as date of birth, zip code, and gender. The so-called anonymized data was far from anonymous and was vulnerable to re-identification attacks.
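The linkage described above can be sketched in a few lines. This is only an illustration: every record, name, and zip code below is invented, the function name `link_records` is hypothetical, and the actual re-identification involved far larger files.

```python
# Toy "anonymized" health records: names removed, quasi-identifiers kept.
health_rows = [
    {"birth_date": "1950-03-14", "zip": "00001", "gender": "M", "diagnosis": "hypertension"},
    {"birth_date": "1972-11-02", "zip": "00001", "gender": "F", "diagnosis": "asthma"},
    {"birth_date": "1985-06-30", "zip": "00002", "gender": "F", "diagnosis": "diabetes"},
]

# Toy public voter list: names present, same quasi-identifiers.
voter_rows = [
    {"name": "Alex Example", "birth_date": "1950-03-14", "zip": "00001", "gender": "M"},
    {"name": "Blake Sample", "birth_date": "1972-11-02", "zip": "00001", "gender": "F"},
    {"name": "Casey Placeholder", "birth_date": "1985-06-30", "zip": "00002", "gender": "F"},
    {"name": "Dana Placeholder", "birth_date": "1985-06-30", "zip": "00002", "gender": "F"},
]

def link_records(health, voters, keys=("birth_date", "zip", "gender")):
    """Return (name, diagnosis) pairs where the quasi-identifiers match exactly one voter."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[k] for k in keys), []).append(v)
    matches = []
    for h in health:
        candidates = index.get(tuple(h[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the record is re-identified
            matches.append((candidates[0]["name"], h["diagnosis"]))
    return matches

reidentified = link_records(health_rows, voter_rows)
# reidentified == [("Alex Example", "hypertension"), ("Blake Sample", "asthma")]
```

In this toy example the third health record stays ambiguous because two voters share the same birth date, zip code, and gender. The other two records are re-identified even though no names were ever released.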
Building on advances in data privacy research in the 2000s and 2010s, synthetic data emerged as a promising solution.
Benefits of Synthetic Data in Health Research
🔹 It protects patient privacy and reduces the risk of data linkage attacks.
🔹 It can be produced in any quantity, tailored to specific needs, and is often much cheaper and faster to generate than real data is to collect.
🔹 It can be engineered to correct imbalances or biases present in real datasets, improving the fairness and accuracy of AI models.
HEREDITARY's most recent contribution
These ideas took center stage at a workshop held by Daniele Dell’Aglio and Frederik Marinus Trudslev from Aalborg Universitet on Thursday, May 22nd, as part of the HEREDITARY project. There, the HEREDITARY consortium explored how synthetic data is created and why it plays a key role in shaping a privacy-aware, ethical future for data science.