Synthetic Data: unlocking innovation without compromising privacy

In today’s data-driven world, one of the biggest challenges we face is how to protect individual privacy without slowing down scientific and technological progress. How can we responsibly analyse, develop, and innovate while handling sensitive information? One possible answer could be synthetic data.

What is Synthetic Data?

Synthetic data is artificially generated information that imitates real personal data without revealing any individual’s details. Rather than being collected from real-world events or human activity, it is produced by AI algorithms and models. Because it serves as a substitute for real data, it allows researchers to work with realistic datasets while safeguarding privacy, for example when training machine learning models to predict disease development.

How is Synthetic Data generated?

Synthetic data is not simply made up: it is generated by algorithms that learn generative models from real data. There are several techniques for generating synthetic data, depending on the use case and the type of data needed. Some of the most common methods are:

  • Statistical distribution: One of the earliest approaches involves analysing real data to determine its statistical properties and estimate the relationships between variables. Then, synthetic samples are generated to match these properties and follow the same logic and distributions.
  • Model-based simulations: In some cases, especially in behavioural or epidemiological studies, researchers build simulations based on known rules, such as how a disease spreads in a population. These simulations create synthetic datasets that resemble real-world dynamics.
  • Deep learning methods: Today, deep learning techniques, especially Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are at the forefront of synthetic data generation. These models are trained on real datasets and then generate new, realistic examples that are statistically similar to the originals but are not meant to correspond to any real individual. They are particularly useful for creating complex, multimodal synthetic data, including images, videos or text; in a research project such as HEREDITARY, they could be used to generate realistic genetic profiles or clinical trajectories.
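
The statistical approach in the first bullet can be sketched in a few lines. This is only a minimal illustration under strong assumptions, not the method used in HEREDITARY: it handles just two numeric variables, treats them as roughly jointly normal, and the function names (fit_bivariate, sample_synthetic) and the toy values are invented for this example.

```python
import math
import random


def fit_bivariate(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # keeps the fitted correlation
        rows.append((x, y))
    return rows


# Hypothetical "real" data: (age, systolic blood pressure), invented for illustration.
real = [(34, 118), (45, 125), (52, 131), (61, 140),
        (29, 112), (70, 148), (48, 127), (56, 135)]

params = fit_bivariate(real)
synthetic = sample_synthetic(params, 1000, random.Random(0))
refit = fit_bivariate(synthetic)  # statistics of the synthetic sample
```

Comparing params with refit shows that the synthetic sample follows approximately the same means, spreads and correlation as the original rows, even though none of its rows is copied from them.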

The Massachusetts Case

The story of synthetic data begins with one of the most well-known privacy failures: the Massachusetts Group Insurance Commission case. In the 1990s, a privacy breach in Massachusetts showed that simply anonymizing health data wasn’t enough.

The Group Insurance Commission, an agency of the state of Massachusetts, decided to release health records of its employees to the public, believing that simply removing obvious identifiers like names and addresses would be enough to protect individual privacy. Their goal was to enable researchers to analyze health trends without compromising personal information.

Is removing names really enough to keep data anonymous? Using only publicly available information, such as birth dates, zip codes, and gender, researchers managed to re-identify the medical records of the Massachusetts governor by linking the “anonymized” health data with voter registration lists. This revealed a shocking truth: most people are uniquely identifiable by just a few data points, such as date of birth, zip code, and gender. The so-called anonymized data was therefore far from anonymous and was vulnerable to re-identification attacks.
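
The linkage described above can be illustrated with a small sketch. All records below are invented for this example (no real people or datasets), and the field names and the link_records function are assumptions made for illustration; the point is only that a join on a few shared attributes can attach a name to a supposedly anonymous record.

```python
from collections import defaultdict

# Attributes present in both datasets (quasi-identifiers).
QUASI_IDENTIFIERS = ("birth", "zip", "sex")


def link_records(health_rows, voter_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for health rows matching exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in voter_rows:
        voters_by_key[tuple(voter[k] for k in keys)].append(voter["name"])

    matches = []
    for row in health_rows:
        names = voters_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((names[0], row["diagnosis"]))
    return matches


# "Anonymized" health data: names removed, quasi-identifiers kept.
health = [
    {"birth": "1950-01-02", "zip": "00001", "sex": "M", "diagnosis": "condition A"},
    {"birth": "1960-05-06", "zip": "00002", "sex": "F", "diagnosis": "condition B"},
    {"birth": "1970-09-10", "zip": "00003", "sex": "F", "diagnosis": "condition C"},
]

# Public voter list: names together with the same quasi-identifiers.
voters = [
    {"name": "Person A", "birth": "1950-01-02", "zip": "00001", "sex": "M"},
    {"name": "Person B", "birth": "1960-05-06", "zip": "00002", "sex": "F"},
    {"name": "Person C", "birth": "1960-05-06", "zip": "00002", "sex": "F"},
    {"name": "Person D", "birth": "1980-11-12", "zip": "00004", "sex": "M"},
]

reidentified = link_records(health, voters)
```

In this toy data only the first health record is re-identified: the second matches two voters and stays ambiguous, and the third matches nobody on the list.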

Building on advances in data privacy research in the 2000s and 2010s, synthetic data emerged as a promising solution.

Benefits of Synthetic Data in Health Research

🔹 It protects patient privacy and reduces the risk of data linkage attacks.

🔹 Synthetic data can be produced in any quantity, tailored to specific needs, and is often much cheaper and faster to generate than collecting real data.

🔹 It can be engineered to correct imbalances or biases present in real datasets, improving the fairness and accuracy of AI models.

HEREDITARY’s most recent contribution

These ideas took center stage at a workshop held by Daniele Dell’Aglio and Frederik Marinus Trudslev from Aalborg Universitet on Thursday, May 22nd, as part of the HEREDITARY project. There, the HEREDITARY consortium explored how synthetic data is created and why it plays a key role in shaping a privacy-aware, ethical future for data science.

Strategic synergies to advance Health Data Innovation: Collaboration Agreement with DataGEMS

The HEREDITARY project is pleased to announce that it has entered into a collaboration agreement with DataGEMS (“Data Discovery Platform with Generalized Exploratory, Management, and Search Capabilities”), a project funded by the European Union’s Horizon Europe Research and Innovation Programme. DataGEMS also focuses on the world of data, aiming to offer a next-generation dataset discovery and management ecosystem that will provide algorithms to make datasets more accessible: discoverable, combinable and explorable.

DataGEMS is developing tools that use natural language and machine learning technology to give quick and easy access to data, making it possible to discover, link and analyse datasets of different modalities (such as tabular data, text documents, knowledge graphs, and images). The main goal is to gain new insights into vast amounts of complex and heterogeneous datasets by providing intuitive tools, as HEREDITARY intends to do with large volumes of multi-modal health data. DataGEMS also promotes data FAIRness in key areas such as education, meteorology and linguistics.

This agreement, effective from May 5, 2025, until December 31, 2027, establishes a framework for voluntary cooperation between the two projects. It aims to harness their complementary technical strengths to advance data discovery and integration in health research.

From a technical angle, the collaboration between HEREDITARY and DataGEMS will focus on improving data discovery methods, co-developing advanced data profiling methods, facilitating researcher exchanges, preparing joint publications, and promoting the development of use cases based on open data, all while ensuring compliance with GDPR. This collaboration reflects the shared commitment of both projects to leverage their strengths and push the boundaries of data discovery in health research.

The European Health Data Space takes off: more control for citizens, more data for science

On March 5, 2025, the Regulation on the European Health Data Space (EHDS) was published in the Official Journal of the EU. This pioneering initiative aims to create a secure and efficient digital environment specifically for health data, benefiting all EU citizens as well as healthcare professionals, researchers and policymakers.

It will make it easier to exchange and access health data at EU level. It promises to improve individuals’ access to and control over their personal electronic health data, while also enabling specific data to be reused for research and innovation purposes for the benefit of European patients. By fostering a more interconnected, patient-centred, and data-driven healthcare system, the EHDS will enhance efficiency, reduce administrative burdens, and support innovation and long-term sustainability of health services.

Trust is also fundamental to the EHDS. The framework builds on existing EU regulations, including the General Data Protection Regulation (GDPR), to provide a trustworthy setting ensuring data protection.

Primary use: citizens and individuals

The EHDS places citizens at the heart of healthcare by granting them better control over their personal health data. Key benefits include:

  • Fast and Free Access: Individuals will be able to swiftly access their electronic health data, facilitating seamless sharing with healthcare professionals or family members in case of need across the EU.
  • Enhanced Control: Citizens will have the ability to add personal health information, restrict access to specific parts of their records or to specific persons, view who accessed their data, and request corrections if errors are found.
  • Security and Privacy: The EHDS requires robust security and privacy protections by default, to align with the EU’s high data protection standards.

Learn more about the primary use of the health data in the EHDS by clicking here.

Secondary use: research and innovation

At the same time, researchers, public health authorities, and policymakers will be able to leverage health data in a secure and privacy-preserving way to accelerate the development of new treatments, improve disease prevention, and strengthen Europe’s crisis preparedness.

For research projects like HEREDITARY, the EHDS offers unprecedented opportunities:

  • Access to High-Quality Data: Researchers will be able to access large-scale health data, in anonymised or pseudonymised form, which is crucial for developing life-saving treatments and personalized medicines.
  • Structured data discovery: A clear and structured system allows researchers to discover available data, understand its location, and assess its quality, making research more efficient and impactful.
  • Ensuring interoperability of the data: The new regulation requires all electronic health record (EHR) systems to comply with the specifications of the European electronic health record exchange format, ensuring that they are interoperable at EU level. Interoperability is one of the FAIR principles that the HEREDITARY project pursues in its data management.
  • Cost-Efficiency: Streamlined access to high-quality health data reduces research costs, enabling more studies and innovations within available budgets.

Learn more about the secondary use of the health data in the EHDS by clicking here.

Looking ahead 

After the signing by the Council and the European Parliament and its publication in the EU’s Official Journal, the EHDS Regulation will enter into force on 26 March 2025 and will become applicable in different phases over the course of the following years, with target dates of 2029 and 2031 for full implementation.

At HEREDITARY, we are enthusiastic about the possibilities the EHDS brings. By enabling secure and seamless data exchange, the EHDS transforms healthcare for everyone: patients, professionals, researchers, public health institutions and industry alike.

Stay tuned as we continue to explore the benefits of the EHDS for our research and the broader community. Together, we are stepping into a new era of healthcare innovation and citizen empowerment.

Access more information on this promising regulation here.

JARDIN Hackathon on Health Data Federated Querying: an opportunity to contribute to the HEREDITARY project

The HEREDITARY consortium will take part in the upcoming JARDIN Hackathon on Health Data Federated Querying, an event organized by the European Commission. The Hackathon aims to tackle key challenges in integrating sensitive health data across multiple institutions while exploring innovative solutions. Its objectives align closely with our project’s goals, particularly in the fields of federated analytics and learning. A key focus will be enabling federated queries, allowing researchers to extract valuable insights without compromising patient privacy or data security.

This initiative brings together experts from diverse fields, fostering collaboration and knowledge exchange to address these complex issues effectively.

Key topics to be explored during the hackathon include:

  • Harmonizing data exports from healthcare provider systems.
  • Developing tools and methods for federated data querying.
  • Enhancing semantic representation and ensuring compliance with FAIR data principles.
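
To make the idea of federated querying more concrete, here is a minimal sketch of one possible pattern, in which each institution answers a query with aggregates only and never shares patient-level rows. It is not the hackathon’s tooling: the function names (local_aggregate, federated_mean), the site names and the records are all hypothetical, and in a real deployment the local step would run inside each institution’s own infrastructure.

```python
def local_aggregate(records, predicate, field):
    """Runs at one site: return only (count, sum) for the matching records."""
    selected = [r[field] for r in records if predicate(r)]
    return len(selected), sum(selected)


def federated_mean(sites, predicate, field):
    """Combine per-site aggregates into a global mean without pooling raw rows."""
    total_count = 0
    total_sum = 0
    for records in sites.values():
        count, subtotal = local_aggregate(records, predicate, field)
        total_count += count
        total_sum += subtotal
    return total_sum / total_count if total_count else None


# Hypothetical patient records held separately by two institutions.
sites = {
    "hospital_a": [
        {"age": 60, "has_condition": True},
        {"age": 40, "has_condition": False},
    ],
    "hospital_b": [
        {"age": 70, "has_condition": True},
        {"age": 80, "has_condition": True},
    ],
}

# Query: mean age of patients who have the condition, across all sites.
mean_age = federated_mean(sites, lambda r: r["has_condition"], "age")
```

Only two numbers per site cross institutional boundaries here, which is what allows the combined answer to be computed while the underlying records stay where they are.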

The event is open to professionals from various disciplines, including clinicians, data stewards, analysts, developers, and semantic web specialists, all of whom play a crucial role in advancing data harmonization and secure querying practices.

Although an official event date has not yet been set, the registration deadline for the hackathon is March 5, 2025. We invite all interested participants to seize this opportunity to contribute to the future of digital healthcare while gaining valuable insights. Check the preliminary agenda here!

HEREDITARY Project launches “Inside Hereditary with Gianmaria Silvello”, a video series about the project’s work

We are delighted to introduce our latest video series, made by our partner Observa, where we delve into the research of the HEREDITARY project, guided by our esteemed coordinator, Gianmaria Silvello from the University of Padua (UNIPD). Gianmaria Silvello is a computer science researcher at the Department of Information Engineering of the University of Padua. His research spans knowledge management, intelligent information systems, information access, algorithmic fairness, digital libraries, and data provenance and citation. Each video offers a closer look at the project’s objectives, methodology, and layers, helping you to understand the impact of this pioneering work on healthcare and data research. This video series is part of the HEREDITARY voices series.

Episode 1. The project approach

In this first episode, Gianmaria explains the focus of the project, the interaction between the gut and the brain, and its main challenge: integrating multilingual and multimodal data distributed across several centres. To meet this challenge, the project will rely on federated learning and federated analytics techniques.

Episode 2. Federated learning

Federated learning is a machine learning technique in which multiple entities collaborate to train a shared model while their data remains decentralized, reinforcing privacy and security.
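
As a rough illustration of the principle, the sketch below uses federated averaging, one common way of doing federated learning, on a deliberately tiny model (a single weight w in y ≈ w · x). It is not the project’s implementation: the function names, learning rate, number of rounds and client data are all assumptions chosen for this example.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Runs at one client: a few gradient-descent steps on its private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error of y ≈ w * x with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_averaging(clients, rounds=30, w=0.0):
    """Server loop: send w out, collect local weights, average them by data size."""
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in clients]
        w = sum(len(data) * lw for data, lw in zip(clients, local_weights)) / total
    return w


# Hypothetical private datasets of three centres; each follows y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0), (0.5, 1.0)],
]

w = federated_averaging(clients)  # converges towards 2.0
```

Only model weights travel between the clients and the server in this loop; the (x, y) pairs never leave the client that holds them.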

Episode 3. Semantic data integration

HEREDITARY aims to simplify the way we interact with data in order to achieve a better understanding of it. One of the project’s main goals is to make data accessible to everyone through a common, accessible language. This way, we’ll be able to treat multiple neurological diseases in a unified way, using the same terms or ideas and getting answers that everyone understands.

Episode 4. Expected results

Gianmaria talks about HEREDITARY’s approach to achieving its main holistic goal around gut-brain interplay. Connections between data from many different perspectives are processed and combined with previously obtained patient information and knowledge from the literature to illuminate aspects of the diseases that were not known before. This will provide a better picture to ultimately find new treatments and better diagnoses.

Episode 5. Artificial Intelligence

AI plays a central role in the HEREDITARY project. Deep learning algorithms are used to process and classify the different data elements and to establish relations between them. Generative AI is also used, both to extract information from texts and to generate new ones. AI is applied in many different ways and at different levels, bearing in mind that prior data management and processing is fundamental for the AI to work properly.