Synthetic Data: unlocking innovation without compromising privacy

In today’s data-driven world, one of the biggest challenges we face is how to protect individual privacy without slowing down scientific and technological progress. How can we responsibly analyse, develop, and innovate while handling sensitive information? One possible answer is synthetic data.

What is Synthetic Data?

Synthetic data is artificially generated information that imitates real personal data without revealing any personal details. It is produced by AI algorithms and models rather than collected from real-world events or human activity. Because it serves as a substitute for real data, it allows researchers to work with realistic datasets while safeguarding privacy, for example to train machine learning models that predict disease development.

How is Synthetic Data generated?

Synthetic data is not simply made up: it is generated by advanced algorithms that learn generative models from real data. There are several techniques for generating synthetic data, depending on the use case and the type of data needed. Some of the most common methods are:

  • Statistical distribution: One of the earliest approaches involves analysing real data to determine its statistical properties and estimate the relationships between variables. Then, synthetic samples are generated to match these properties and follow the same logic and distributions.
  • Model-based simulations: In some cases, especially in behavioural or epidemiological studies, researchers build simulations based on known rules, such as how a disease spreads in a population. These simulations create synthetic datasets that resemble real-world dynamics.
  • Deep learning methods: Today, deep learning techniques, especially Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are at the forefront of synthetic data generation. These models are trained on real datasets and then generate new, realistic examples that are statistically similar to the originals but are designed not to correspond to any real individual. This is especially useful for creating complex, multimodal synthetic data, including images, videos or text. In a research project such as HEREDITARY, it could be used to generate realistic genetic profiles or clinical trajectories.
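
To make the first of these approaches concrete, here is a minimal sketch of the statistical route: estimate the means, spreads and correlation of two variables, then sample new rows from a distribution with the same properties. Everything in it is an assumption made for illustration: the "real" data (age and systolic blood pressure) is invented, the bivariate Gaussian is only the simplest possible model, and the function names are hypothetical. Sampling from a fitted distribution does not by itself guarantee privacy; it only shows how the statistical properties are carried over.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate the mean, standard deviation and correlation of two numeric columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "corr": cov / (std_x * std_y)}


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from the fitted bivariate Gaussian."""
    (mean_x, mean_y), (std_x, std_y) = params["mean"], params["std"]
    rho = params["corr"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))  # guard against rounding error
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y reproduces the estimated correlation between the columns.
        y = mean_y + std_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Invented stand-in for a real dataset: (age, systolic blood pressure).
rng = random.Random(0)
real = []
for _ in range(500):
    age = rng.uniform(20, 80)
    real.append((age, 100 + 0.5 * age + rng.gauss(0, 8)))

params = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(params, 500, rng)
synthetic_params = fit_bivariate_gaussian(synthetic)

print("real correlation:     ", round(params["corr"], 3))
print("synthetic correlation:", round(synthetic_params["corr"], 3))
```

The synthetic rows are new draws rather than copies of the input, yet their means and correlation stay close to those of the original data. Deep learning methods follow the same fit-then-sample idea with far more expressive models.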

The Massachusetts Case

The story of synthetic data begins with one of the most well-known privacy failures: the Massachusetts Group Insurance Commission case. In the 1990s, a privacy breach in Massachusetts showed that simply anonymizing health data wasn’t enough.

The Group Insurance Commission, an agency of the state of Massachusetts, decided to release the health records of state employees to the public, believing that simply removing obvious identifiers like names and addresses would be enough to protect individual privacy. Its goal was to enable researchers to analyze health trends without compromising personal information.

Is removing names really enough to keep data anonymous? Using only publicly available information, such as birth dates, zip codes, and gender, researchers managed to re-identify the medical records of the Massachusetts governor by linking the “anonymized” health data with voter registration lists. This revealed a shocking truth: most people are uniquely identifiable by just a few data points, such as date of birth, gender, and zip code. The so-called anonymized data was far from anonymous and was vulnerable to re-identification attacks.
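
The mechanics of such a linkage attack can be sketched in a few lines. The example below is a toy illustration, not a reconstruction of the actual case: every record, name and field name in it is invented, and the helper `link` is hypothetical. It joins a de-identified health table with a public list on the three quasi-identifiers and reports every person who matches exactly one health record.

```python
from collections import defaultdict

# Fields present in both tables that, combined, can single out a person.
QUASI_IDENTIFIERS = ("birth_date", "zip", "sex")

# Invented "anonymized" health records: names removed, quasi-identifiers kept.
health_records = [
    {"birth_date": "1950-03-14", "zip": "02139", "sex": "M", "diagnosis": "hypertension"},
    {"birth_date": "1972-11-02", "zip": "02139", "sex": "F", "diagnosis": "asthma"},
    {"birth_date": "1972-11-02", "zip": "02139", "sex": "F", "diagnosis": "diabetes"},
]

# Invented public list that contains names alongside the same quasi-identifiers.
voter_list = [
    {"name": "Person A", "birth_date": "1950-03-14", "zip": "02139", "sex": "M"},
    {"name": "Person B", "birth_date": "1972-11-02", "zip": "02139", "sex": "F"},
    {"name": "Person C", "birth_date": "1985-06-30", "zip": "02140", "sex": "F"},
]


def key(row):
    """Build the join key from the quasi-identifier fields."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def link(records, voters):
    """Return {name: diagnosis} for every unambiguous one-to-one match."""
    records_by_key = defaultdict(list)
    for record in records:
        records_by_key[key(record)].append(record)
    voters_by_key = defaultdict(list)
    for voter in voters:
        voters_by_key[key(voter)].append(voter)

    matches = {}
    for k, same_key_voters in voters_by_key.items():
        same_key_records = records_by_key.get(k, [])
        # Only a unique combination on both sides pins a record to one person.
        if len(same_key_voters) == 1 and len(same_key_records) == 1:
            matches[same_key_voters[0]["name"]] = same_key_records[0]["diagnosis"]
    return matches


reidentified = link(health_records, voter_list)
print(reidentified)
```

In this toy data only "Person A" is re-identified. "Person B" shares a birth date, zip code and sex with two health records, so the match is ambiguous, which is the intuition behind protections that require several people to share each combination of quasi-identifiers.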

Building on advances in data privacy research in the 2000s and 2010s, synthetic data emerged as a promising solution.

Benefits of Synthetic Data in Health Research

🔹 It protects patient privacy and helps prevent data linkage attacks.

🔹 It can be produced in any quantity and tailored to specific needs, and it is often much cheaper and faster to generate than real data is to collect.

🔹 It can be engineered to correct imbalances or biases present in real datasets, improving the fairness and accuracy of AI models.

HEREDITARY’s most recent contribution

These ideas took center stage at a workshop held by Daniele Dell’Aglio and Frederik Marinus Trudslev from Aalborg Universitet on Thursday, May 22nd, as part of the HEREDITARY project. There, the HEREDITARY consortium explored how synthetic data is created and why it plays a key role in shaping a privacy-aware, ethical future for data science.

The European Health Data Space takes off: more control for citizens, more data for science

On March 5, 2025, the Regulation on the European Health Data Space (EHDS) was published in the Official Journal of the EU. This pioneering initiative aims to create a secure and efficient digital environment for health data, benefiting EU citizens, healthcare professionals, researchers and policymakers.

It will make it easier to exchange and access health data at EU level. It promises to improve individuals’ access to and control over their personal electronic health data, while also enabling specific data to be reused for research and innovation purposes for the benefit of European patients. By fostering a more interconnected, patient-centred, and data-driven healthcare system, the EHDS will enhance efficiency, reduce administrative burdens, and support innovation and long-term sustainability of health services.

Trust is also fundamental to the EHDS. The framework builds on existing EU regulations, including the General Data Protection Regulation (GDPR), to provide a trustworthy setting ensuring data protection.

Primary use: citizens and individuals

The EHDS places citizens at the heart of healthcare by granting them better control over their personal health data. Key benefits include:

  • Fast and Free Access: Individuals will be able to access their electronic health data swiftly and free of charge, making it easy to share with healthcare professionals or family members across the EU when needed.
  • Enhanced Control: Citizens will have the ability to add personal health information, restrict access to specific parts of their records or to specific persons, view who accessed their data, and request corrections if errors are found.
  • Security and Privacy: The EHDS requires robust security and privacy protections by default, to align with the EU’s high data protection standards.

Secondary use: research and innovation

At the same time, researchers, public health authorities, and policymakers will be able to leverage health data in a secure and privacy-preserving way to accelerate the development of new treatments, improve disease prevention, and strengthen Europe’s crisis preparedness.

For research projects like HEREDITARY, the EHDS offers unprecedented opportunities:

  • Access to High-Quality Data: Researchers will be able to access large-scale health data, in anonymised or pseudonymised form, which is crucial for developing life-saving treatments and personalized medicines.
  • Structured data discovery: A clear and structured system allows researchers to discover available data, understand its location, and assess its quality, making research more efficient and impactful.
  • Ensuring interoperability of the data: The new regulation requires all electronic health record (EHR) systems to comply with the specifications of the European electronic health record exchange format, ensuring that they are interoperable at EU level. Interoperability is one of the FAIR principles that the HEREDITARY project pursues in its data management.
  • Cost-Efficiency: Streamlined access to high-quality health data reduces research costs, enabling more studies and innovations within available budgets.

Looking ahead

Following its signing by the Council and the European Parliament and its publication in the EU’s Official Journal, the EHDS Regulation will enter into force on 26 March 2025 and will become applicable in phases over the following years, with target dates of 2029 and 2031 for full implementation.

At HEREDITARY, we are enthusiastic about the possibilities the EHDS brings. By enabling secure and seamless data exchange, the EHDS transforms healthcare for everyone: patients, professionals, researchers, public health institutions and industry alike.

Stay tuned as we continue to explore the benefits of the EHDS for our research and the broader community. Together, we are stepping into a new era of healthcare innovation and citizen empowerment.
