Biobanks: genetic data in demand

Biobanks supply the physical samples and associated data that form the basis of many experiments into human disease.

20 Jul 2017 - 13:14

Feature story

Millions of people generously give their time, blood and information to biobanks – all with the goal of improving research into human disease. Some biobanks relate to a specific disease area, and some build on studies that span decades (e.g. the Framingham Heart Study in the US has been running since 1948, and the Lothian Birth Cohort follows people born in 1936).

Biobanks collect information and samples from millions of people so that researchers can run new studies (without recruiting subjects) and deposit the results back into the bank. Over time, that collective effort produces a staggering amount of complex data, all of which needs to be quality-controlled. The data also needs to be collected from many locations, in many languages, and served efficiently to approved scientists throughout the world.

The European Bioinformatics Institute (EMBL-EBI) is working with its global partners to resolve data challenges in biobanking, applying its strengths in genomics and data integration so that these resources may become truly transformative for biomedicine.

Every experiment starts with a sample

Biobanks supply the physical samples that form the basis of, for example, a rare-disease study or an experiment for the Human Cell Atlas. Very often, those samples have been gathered in a healthcare setting. Biobanks are usually built in close collaboration with a healthcare provider or, like the Cambridge BioResource, they are co-located with a provider such as a hospital.

Contemporary collections like the UK Biobank collect detailed information about individuals, their health and environment in a standardised way. Informed donors consent to having their material used in a broad range of research. That way, the samples they give will help advance science as new, previously unimagined technologies come into regular use.

Because of efforts like Europe’s Biobanking and Biomolecular Resources Research Infrastructure (BBMRI), the samples held in contemporary biobanks are very well described, so the accompanying data itself is also of great value. Thanks to ‘return of results’ rules, which users of many biobanks must follow, researchers who use the samples and data deposit new information into the biobank. Sometimes this information includes DNA sequencing, proteomics and other molecular data.

Scaling up

The most obvious technical challenge for biobanks is the sheer scale and diversity of the data. BBMRI has consolidated a huge amount of information about biobanks, and published a directory of 100 million samples.

This brings considerable overheads for long-running epidemiological studies. For example:

  • Data maintenance is as essential as adding new findings. But maintenance goes beyond straightforward updates. If a collection is to have value beyond the confines of a single study, it also needs to conform to data standards – which may change over time – so that its contents can be compared to other collections.
  • Genetic data adds another layer of complexity, especially for population-level studies. Acquiring the data in ‘real time’ and sharing it with approved researchers quickly is a major undertaking that takes robust infrastructure, and it can easily become too much for a single organisation to manage. Coordinated data-sharing infrastructure such as ELIXIR is vital to success in this area.

Delivering biobank data

A leader in large-scale genomic data sharing, EMBL-EBI is a foundational member of ELIXIR. Its European Genome–phenome Archive (EGA), developed jointly with ELIXIR Spain’s Centre for Genomic Regulation (CRG) in Barcelona, provides the infrastructure to manage controlled-access genomic data. In a new, mutually beneficial partnership, the EGA will now serve genetic data for the UK Biobank.

The partnership allows the UK Biobank, which has 500,000 well-described samples, to combine its own sample and phenotype infrastructure with EMBL-EBI’s data storage, management and integration infrastructure. The result is high-quality, sustainable, large-scale biomolecular data that presents a host of new opportunities for data mining and advancing data-sharing technology.

Data types in the UK Biobank

Figure: UK Biobank data types as of July 2017, based on a presentation by Naomi Allen.


Use of the sample collections (and associated data) in biobanks is monitored by Data Access Committees (DACs). As more datasets become available and individual datasets become heavily used, sustainability challenges arise for both DACs and users of the data. If a project generates a very large dataset that ends up being in high demand, the DAC will get a lot of access requests. Five years after the project finishes, people will still want access to the data, and those requests may become difficult to accommodate.

Make it snappy

As with all things, better data organisation means better search and more efficient research. Contemporary biobanks are collecting data more consistently, which facilitates large-scale analysis. To ensure these datasets can be mined and interpreted, EMBL-EBI and BBMRI ‘Nodes’ are exploring pilot projects for shared access, metadata standards and harmonisation. These include activities in the CORBEL project, which is coordinated by ELIXIR.

EMBL-EBI is also working on tools that reduce the technical barriers to interacting with biobank data for the ‘average’ biologist or clinician. For example, a very large dataset might have the data you are looking for, but to find out you need to apply for access and download the whole set – perhaps only to be disappointed. In the Beacons project, an ELIXIR–Global Alliance for Genomics and Health (GA4GH) collaboration, the EGA has provided a tool that will give a yes/no answer to questions like these. For instance, it could tell you instantly whether a dataset has information on a particular variant of interest at a specific position in the genome. It lets commercial entities find out quickly whether or not they can access a particular dataset or sample, without having to apply for it.
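As a rough illustration of the yes/no idea, the sketch below models a Beacon-style lookup over an in-memory set of variants. It is not the Beacon API or the EGA's implementation: the `ToyBeacon` class, its `has_variant` method and the variant coordinates are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single variant: chromosome, position and alternate allele."""
    chromosome: str
    position: int
    alternate: str


class ToyBeacon:
    """Hypothetical stand-in for a Beacon: answers only 'yes' or 'no'."""

    def __init__(self, variants):
        # The underlying records stay private; callers never see them.
        self._variants = frozenset(variants)

    def has_variant(self, chromosome, position, alternate):
        """Return True if the dataset holds this variant, False otherwise."""
        return Variant(chromosome, position, alternate) in self._variants


# Invented example data, not real genomic coordinates of interest.
beacon = ToyBeacon([
    Variant("17", 1000, "T"),
    Variant("7", 2500, "A"),
])

print(beacon.has_variant("17", 1000, "T"))  # True
print(beacon.has_variant("1", 42, "G"))     # False
```

The point of the design is that the caller learns whether the variant is present, and nothing else about the dataset, before deciding whether an access application is worthwhile.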

Simplifying access

Not everyone who gives their material agrees to sharing their information beyond a narrow set of studies. For instance, if a donor gives consent to have their samples used for non-commercial research into a specific disease, and nothing else, the DAC ensures that only authenticated academic researchers who submit a project on that disease can gain access.
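The consent rule in this example can be written down as a simple check. This is a minimal sketch under stated assumptions, not a description of how any real DAC operates: the `Consent` and `AccessRequest` types, their fields and the `is_permitted` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Consent:
    """What a donor agreed to (hypothetical, simplified model)."""
    allow_commercial: bool
    disease: Optional[str]  # None means no disease restriction


@dataclass(frozen=True)
class AccessRequest:
    """A researcher's application (hypothetical, simplified model)."""
    authenticated: bool
    commercial: bool
    disease: str


def is_permitted(consent: Consent, request: AccessRequest) -> bool:
    """Return True only if the request stays within the donor's consent."""
    if not request.authenticated:
        return False
    if request.commercial and not consent.allow_commercial:
        return False
    if consent.disease is not None and consent.disease != request.disease:
        return False
    return True


# A donor who consented to non-commercial research into one disease only.
narrow = Consent(allow_commercial=False, disease="asthma")

print(is_permitted(narrow, AccessRequest(True, False, "asthma")))    # True
print(is_permitted(narrow, AccessRequest(True, True, "asthma")))     # False
print(is_permitted(narrow, AccessRequest(True, False, "diabetes")))  # False
```

In practice a committee weighs far more than three fields, but encoding consent as structured data rather than free text is what makes this kind of check possible at all.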

Once a researcher finds out that the dataset does indeed contain relevant information, the next step is to gain access to it. The relevant DAC will read each proposal and decide whether to grant access. This process can introduce bottlenecks, particularly when a single study requires access to a large number of datasets. EMBL-EBI is working with its partners to develop resources and tools to streamline this process.

To make this possible, ELIXIR has developed an Authentication and Authorisation Infrastructure (AAI) to authenticate bona fide researchers based on their organisation’s credentials. ELIXIR AAI services can connect data archives to authorised cloud services, making it easier for DACs to grant and control access to sensitive data. This makes it simpler for researchers to gain access to a very large number of datasets in one go.

UK Biobank: selected approved studies as of July 2017

This image from the UK Biobank website shows all international approved research projects. The interactive version on the UK Biobank website lets you zoom in on the pins to find out about research taking place.

Translating standards

Anyone who has been seriously ill and used a healthcare system in another language will understand the value of translation. Healthcare happens in the language of the country, so samples are also described in that language. But research is most commonly carried out in English, so if the information is going to be searchable and comparable in the wider sense, bridges need to be built.

Efforts are underway to translate standards used to describe the data. The Ontology Lookup Service makes it possible to query in different languages, and the Human Phenotype Ontology is producing a layman’s translation in Japanese and several European languages. This makes the relevant terms easier for clinicians to find.
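One way to picture multilingual querying is a single term identifier that carries labels in several languages, so that a search in any of them lands on the same concept. The sketch below is a toy model only and does not reflect how the Ontology Lookup Service or the Human Phenotype Ontology are implemented: the `EX:` identifiers, the `LABELS` table and the `find_term` and `label_for` helpers are invented for illustration.

```python
# Hypothetical term identifiers, each with labels in several languages.
LABELS = {
    "EX:0001": {"en": "fever", "de": "Fieber", "fr": "fièvre", "es": "fiebre"},
    "EX:0002": {"en": "headache", "de": "Kopfschmerzen", "es": "dolor de cabeza"},
}


def build_index(labels):
    """Map every label, in every language, back to its term identifier."""
    index = {}
    for term_id, translations in labels.items():
        for label in translations.values():
            index[label.casefold()] = term_id
    return index


INDEX = build_index(LABELS)


def find_term(query):
    """Return the term identifier for a label in any language, or None."""
    return INDEX.get(query.strip().casefold())


def label_for(term_id, language="en"):
    """Return the label for a term in the requested language."""
    return LABELS[term_id][language]


# A sample described in German resolves to the same term as an English query.
term = find_term("Fieber")
print(term)                   # EX:0001
print(label_for(term, "en"))  # fever
```

Because the identifier, not the label, is what gets stored with the sample, records described in different languages remain directly comparable.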

Such translational efforts also pave the way for patients of all nationalities to enter data directly. This could be invaluable, as patients (or their families) are often extremely well informed about their disease. Direct submission by patients (e.g. the UK Biobank uses iPad-like tablets) mitigates some of the bias and bureaucracy introduced by entering metadata that reflects billing concerns or healthcare systems.

Describing data well and consistently will always be a challenge for biobanks. There are boundless opportunities to develop innovative ways to extract information from medical systems.

Biobanking: room for innovation

Scaling up biobanks presents fascinating challenges and opportunities for commercial innovation, notably in machine learning and text mining. To enable drug development and drug repurposing, we need new methods in text mining, ontology building and machine learning.

On the public infrastructure side, EMBL-EBI is expanding its capacity to accommodate the ever-rising demand for genomic information, and this requires sustained funding from national bodies. It is also deeply involved in international initiatives to streamline access to controlled-access data, notably in the GA4GH and ELIXIR.

The GA4GH is establishing standards and frameworks for improving security and assuring quality. Its Security Working Group is a focal point for the technology aspects of data security, user access control and audit functions, and is developing standards for data security, privacy protection and user/owner access control. In addition, the work carried out by EMBL-EBI in CORBEL on interoperability standards will optimise both open data and access to secure data for life-science research.

Patient involvement is essential to the success of genomics in biomedicine. Wellcome’s Understanding Patient Data is engaging with patients and donors to ensure they understand what people are doing with their data, and why.

What’s next?

EMBL-EBI plays an important role in making biobank data useful, interoperable and accessible to researchers, all within a secure framework. Our new partnership with the UK Biobank and on-going collaborations with the BBMRI are firmly rooted in our tradition of data-driven biology, and inform our participation in the GA4GH.

As biological science becomes increasingly data driven and public biobanks become more data-rich, ever-more rare and complex conditions are becoming easier to study, with greater statistical power. The impact of these resources can hardly be overstated. They enable research that leads to quicker diagnoses for children suffering from a rare disease, more targeted cancer treatments and repurposed medicines, all of which are good for people and society.
