Introduction

Noor is a computational biologist studying insulin-dependent diabetes. They want to explore whether there are any genetic mutations associated with an increased likelihood of being diagnosed with this type of diabetes. Noor has found a popular human cohort dataset that could help answer this research question, and they are going to apply for access to the dataset. While they wait to hear back about their application, Noor would like to start developing a computational workflow to analyse genetic data from healthy individuals and individuals diagnosed with insulin-dependent diabetes. Noor has heard about “fake” or synthetic cohort datasets which model the characteristics of real datasets but can be accessed more easily. Noor is interested in exploring how a synthetic cohort dataset could be used to start developing their computational workflow so that it is ready to run when their application for accessing real data is approved.

Sam is a bioinformatics software developer who has built a new tool for aligning human genetic sequence data from millions of individuals across multiple studies to a reference genome. They would like to test this tool and demonstrate it to other researchers, in order to gather feedback and ensure it produces useful results. Sam has so far been unsuccessful in getting approval to access real human data, because the datasets Sam is interested in may only be used for disease-specific research projects. Sam has heard about synthetic cohort datasets, which are subject to fewer restrictions on research purpose, and is interested in exploring how they could be used to test the new federated analysis tool and to demonstrate it to potential users.

The pathway below will guide Noor and Sam through the process of finding and accessing the cohort data (real or synthetic) needed for their aims, and of carrying out a federated data analysis. Along the way, they will learn about the ethical, legal and societal implications (ELSI) of using human data for research, and how to implement the FAIR data principles in their work.

Human data for research is moving to a decentralised model in which the vast majority of genetic data will be generated by national-scale biobanks and healthcare initiatives. The traditional approach to genetic analysis is to apply for access to a dataset, download the data locally, and run a custom analysis. However, the sheer size of the data, increased security requirements, and jurisdictional restrictions on data export mean that this model is neither feasible nor scalable. In a federated model, the analysis instead travels to the data source, with data and tools accessed via standardised interfaces. This allows analyses to be carried out within the jurisdiction to which the data are restricted, and avoids moving or copying large amounts of data between locations.
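
The federated pattern described above can be sketched in a few lines: each data holder computes only aggregate statistics locally, and a coordinator combines them, so no record-level data ever leaves a node. The node names, data values and statistic below are invented for illustration; real federated platforms expose this idea through standardised service interfaces (e.g. GA4GH APIs) rather than in-process function calls.

```python
# Toy sketch of federated analysis: the computation travels to each
# "node" (data holder); only aggregates leave the node, never raw records.

# Each node holds its own cohort data locally (hypothetical values).
NODES = {
    "biobank_a": [5.1, 6.3, 5.8, 7.0],  # e.g. a biomarker measurement
    "biobank_b": [6.1, 5.5, 6.8],
}

def local_summary(values):
    """Runs *inside* a node: returns only aggregates, not raw records."""
    return {"n": len(values), "total": sum(values)}

def federated_mean(nodes):
    """Coordinator: combines per-node aggregates into a pooled mean."""
    summaries = [local_summary(data) for data in nodes.values()]
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n

print(round(federated_mean(NODES), 3))  # pooled mean across both nodes
```

The key design point is that `local_summary` is the only code that touches individual-level data, and its output is safe to transmit; the same split underlies more sophisticated federated statistics.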

Alternatively, in some situations the challenges of accessing data for research can be overcome by using synthetic data. Synthetic datasets model the same characteristics, statistical properties, patterns and data fields as real human data, but every value is computer-generated, with no link to any real individual and no identifiable data. These datasets are powerful tools for testing a system, piece of code or tool without having to access real data, which entails greater ethical and legal challenges.
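
As a minimal illustration of how a synthetic cohort can mimic a real one, the sketch below samples computer-generated records from summary statistics. All field names and parameter values here are hypothetical; real synthetic-data generators fit far richer models (correlations, genotype structure) to the dataset they imitate.

```python
# Minimal sketch of synthetic cohort generation: records are sampled from
# summary statistics, so they share the real cohort's data fields and
# statistical properties without describing any real individual.
import random

random.seed(42)  # reproducible toy example

# Hypothetical summary statistics standing in for the "real" cohort.
SCHEMA = {
    "age": {"mean": 54.0, "sd": 12.0},
    "bmi": {"mean": 27.5, "sd": 4.5},
}
CASE_FRACTION = 0.4  # hypothetical proportion of diagnosed individuals

def synthetic_record():
    """One computer-generated record with no link to a real individual."""
    record = {field: random.gauss(p["mean"], p["sd"])
              for field, p in SCHEMA.items()}
    record["diagnosed"] = random.random() < CASE_FRACTION
    return record

cohort = [synthetic_record() for _ in range(1000)]

# The synthetic cohort approximately reproduces the target statistics.
mean_age = sum(r["age"] for r in cohort) / len(cohort)
```

A workflow developed against `cohort` can later be pointed at the real dataset unchanged, because the synthetic records expose the same fields.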

The aim of this learning pathway is to explain and demonstrate how to perform federated analysis of human cohort data. For training purposes, synthetic datasets are used throughout the pathway.

Use the table of contents on your left to access the different sections of this pathway.