Blood, Big Data and Epigenetics

Artist's interpretation of the human epigenome. Credit: Spencer Phillips, EMBL-EBI

17 Nov 2016 - 12:56

The BLUEPRINT project’s massive push to understand the blood epigenome is changing research, and EMBL-EBI is making it happen.

17 November 2016 – Blood disease research has been taken to a new level with a suite of 41 scientific papers published in Cell and other high-profile journals by the International Human Epigenome Consortium (IHEC). To support biomedical research, IHEC and the EU-funded BLUEPRINT project have made the 1000+ associated datasets freely available to all.

The work published today is the result of coordinated efforts by scientists in Canada, Japan, Singapore, South Korea, the United States, the United Kingdom, Germany and other EU Member States. EMBL-EBI provided data coordination, analysis and infrastructure, and contributed directly to research findings.

The epigenome comes into focus

Explainer: What is an epigenome?

The human genome contains about 3 billion base pairs, or ‘letters’ of genetic code. This may seem big, but it is dwarfed by the staggering size and complexity of the machinery that controls it: the ‘epigenome’.

To understand how diseases develop, researchers need to have an intimate understanding of the mechanisms that control the identity, behaviour and fate of cells.

The IHEC and BLUEPRINT findings published today refine our view of the epigenome, from the general impression gained from studying cell lines to a detailed schematic based on human blood cells.

Today’s publications represent only a fraction of the project results, and in fact much of the data was made publicly available well in advance. The resource will continue to grow as new findings come in.

What do you mean, “blood”?

Video still: Laura Clarke in the Blueprint video: Big data and the first epigenetic atlas of blood cells

In this Blueprint video, EMBL-EBI's Laura Clarke talks about developing a common language in a new field of science, and the importance of clarity for preserving knowledge for the future.

“Blood isn’t just one thing,” says Laura Clarke of EMBL-EBI, Data Coordinator for BLUEPRINT. “There are many different types and subtypes of cells, being studied at different stages of development, following different experiments. If you want people to know what you’re talking about – now, or ten years from now – you have to describe it exactly, in a standard way.”

We learn about red blood cells, white blood cells and platelets, but there are actually many different kinds of each. White blood cells, for instance, include macrophages that swallow up pathogens, B-cells, T-cells and more. Each of these derives from a different kind of ‘parent’ cell, which in turn arose from a different kind of stem cell. And each has a whole host of names used by people working in different fields and languages.

How many cell types were studied in the BLUEPRINT project? The researchers gathered data on over 50 primary cell types from healthy individuals, the 50+ neoplastic counterparts of those cell types, and several more from patients suffering from type 1 diabetes.

Data wrangling, EMBL-EBI style

Today’s findings represent years of research by people who collaborate remotely. Anyone who’s strained to understand a meeting over a crackling phone line or make out blurry faces on a video conference screen knows that even the basics of collaborating across the miles can be challenging.

Add to that the extreme complexity of the epigenome, experiments on many cell types at different points in their development and encryption of personal data, and you know some serious expertise is required.

“One of the great mysteries in developmental biology is how the same genome can be read by cellular machinery to generate the plethora of different cell types required for eukaryotic life.”
H.G. Stunnenberg et al., Cell 2016

“Our challenge is to make sure that when you have thousands of samples and datasets, it’s possible to do an analysis all together,” says Paul Flicek, head of Genes, Genomes and Variation resources at EMBL-EBI and a PI on the BLUEPRINT project.

“It has to be possible for you to go and find which experiments are the same, to have confidence they’re following the same standards, that the samples are comparable. There are so many details involved in getting this right, particularly when so many people are working on it at once,” he adds.

Video still: Paul Flicek in the Blueprint: Introduction video

In a video introducing the Blueprint project, Paul Flicek explains how trying to understand the epigenome is like trying to figure out what's going on inside a house just by looking in some of the windows, some of which have curtains drawn.

EMBL-EBI, a publicly funded, intergovernmental organisation, has a mission to support data-driven research like BLUEPRINT.

“We bring a lot to the table: informatics capacity, the ability to store the data, and the knowledge needed to keep information updated and available into the future,” says Flicek. “This kind of science is not just about data storage – lots of companies do that perfectly well. What sets us apart is the biological insight we offer.”

“You can’t underestimate the importance of clarity and consistency,” says Clarke. “One of the things I am really happy about is the way the IHEC community came together to make and use common standards. Because we have consistent descriptions, pipelines and techniques, the results are comparable and we can all benefit from the work as a whole.”

She is a master of understatement.

Data sharing

Like most of the data at EMBL-EBI, much of the data and analysis results accompanying the IHEC papers are freely available without restriction. But some of the raw data behind it, which is also available for research, includes more detailed information about each individual’s genetics, and so is encrypted in the access-controlled European Genome-phenome Archive (EGA).

“From a data-sharing perspective, managing the sensitive data in BLUEPRINT securely is one of the more interesting aspects of the project,” says Clarke. “We rely heavily on the EGA to encrypt and protect the data, which introduces an application process for researchers who wish to access it. That’s different from what we usually do, which is to make data as open as possible.”

“It is so important to acknowledge the research participants, who are central to all of it.”
Paul Flicek, EMBL-EBI

The cells were donated by people in the Netherlands, Italy, France, Germany and the United Kingdom.

“It is so important to acknowledge the research participants, who are central to all of it,” adds Flicek. “Their consented contributions allow research into blood disorders to progress, in perpetuity.”

That last bit – ‘in perpetuity’ – matters a lot, as one question always leads to another. To get to the bottom of disease, researchers need to keep looking, and looking. When patients consent to have their donated cells used in future research, the impact of their participation skyrockets.

Rebuilding the build

As part of their efforts to ensure that the information obtained from research participants is useful not just in the immediate term but for many future studies, EMBL-EBI scientists with expertise in both epigenomics and big data analysis have fully redesigned the Ensembl Regulatory Build, from infrastructure to algorithms. The Regulatory Build is an integrative analysis that uses IHEC data to determine genome function.

It is now a regularly updated public resource for gene-regulation data that other researchers can use for further investigations.

“We expect a lot more epigenetic data to come in, from different tissues and disease states,” says Flicek. “We want to make sure that as new data arrive, they are comparable to what we have already. It needs to come to us in consistent ways, in its most useful form. Ultimately, this will maximise the impact of publicly funded research by making it possible to reuse, through Ensembl and other resources, well into the future.”

The research

EMBL-EBI researchers were directly involved in several of the studies presented today.

Bridging phenotype and genotype

An IHEC collaboration led by Nicole Soranzo at the Sanger Institute clarifies the relevance of epigenetic differences between individuals and links these differences to gene-expression readouts – measures of how active a gene is.

“This was a unique study about the interplay between the genome and the epigenome, in the context of molecular phenotypes and disease,” commented Oliver Stegle, group leader at EMBL-EBI and a contributor to the paper. “It takes into consideration both genetic and epigenetic differences within and between 200 people, in three different blood tissue types. It bridges a major gap in our understanding of how genotype and the epigenome interrelate, and affect phenotype.”

Making contact

A collaboration between EMBL-EBI researchers and the Fraser group at the Babraham Institute led to a new technique to identify parts of the genome that are in physical contact with one another and regulate genes. They used the technique to pinpoint hundreds of thousands of regions involved in switching genes on and off.

“Mapping the genome’s regulatory interactions establishes the missing link between a genetic change at one part of the genome and the gene it ultimately affects,” explained Mikhail Spivakov of the Babraham Institute, an EMBL-EBI alumnus.

Based on this information, scientists can take a fresh look at genetic changes that have been linked to disease, in more accurate detail. They can investigate whether such changes affect points where different parts of the genome usually come into physical contact with each other.

Forging ahead

IHEC researchers at University College London and EMBL-EBI developed eFORGE: a tool for identifying cell types associated with disease.

“eFORGE helps you identify the specific cell types underlying a particular disease mechanism,” explains Charles Breeze of UCL, an EMBL-EBI alumnus. “If you’re studying a cancer, you can use eFORGE to predict the cell type in which the cancer originated. That kind of information can steer you towards selecting the best treatment for the patient.”

Ready for the future

When the BLUEPRINT project was proposed in 2010, the collective was working from a blank slate. Financed under an FP7 “high-impact science” initiative, the project aimed to be large, risky and ambitious enough to make a real difference.

“IHEC and BLUEPRINT scientists have pulled together beautifully to contribute something of tangible utility for research, which will generate value for years to come.”
Ewan Birney, Director of EMBL-EBI

“To see it all come together is gratifying,” says Flicek. “This has been a very good example of a large project with many groups working in concert. All the way from groups with samples to collaborative analysis, BLUEPRINT and IHEC are showing what can be done when you work together to understand biology.”

“These projects are delivering a huge amount of high-quality information about how the human genome is controlled, which will drive the development of new therapeutics,” says Ewan Birney, Director of EMBL-EBI. “EMBL-EBI has been there every step of the way, providing data coordination and analysis, contributing biological insights and ensuring the data remains accessible for future generations. IHEC and BLUEPRINT scientists have pulled together beautifully to contribute something of tangible utility for research, which will generate value for years to come.”

Discover more

Source articles

A collection of 41 coordinated papers now published by scientists from across the International Human Epigenome Consortium (IHEC) takes research in the field of epigenomics a major step forward. A set of 24 manuscripts has been released as a package in Cell and Cell Press-associated journals, and an additional 17 papers have been published in other high-impact journals.

About IHEC

The International Human Epigenome Consortium (IHEC) is a global consortium that aims to provide free access to high-resolution reference human epigenome maps for normal and disease cell types. IHEC members support related projects to improve epigenomic technologies, investigate epigenetic regulation in disease processes, and explore broader gene-environment interactions in human health. Full members of IHEC include: AMED-CREST/IHEC Team Japan; DLR-PT for BMBF German Epigenome Programme DEEP; CIHR Canadian Epigenetics, Environment and Health Research Consortium; European Union FP7 BLUEPRINT Project; Hong Kong Epigenomics Project; KNIH Korea Epigenome Project; the NIH/NHGRI ENCODE Project; the NIH Roadmap Epigenomics Program; and the Singapore Epigenome Project.


About BLUEPRINT

BLUEPRINT – A BLUEPRINT of Haematopoietic Epigenomes is a large-scale research project receiving close to 30 million euros in funding from the EU. Forty-two leading European universities, research institutes and industry entrepreneurs participate in what is one of the first two ‘high impact research initiatives’ to receive funding from the EU. The 42 partner organisations represent 33 academic groups and 9 companies from 12 countries.


Comprehensive sets of reference epigenome data relevant to health and disease are available freely through BLUEPRINT and the International Human Epigenome Consortium (IHEC).


For questions about IHEC and its projects, contact Stephanie Weber, Head of Communications & Marketing, European Research and Project Office, Germany.

For questions about BLUEPRINT data coordination and to speak with EMBL-EBI contributors, contact the EMBL-EBI External Relations team.

