Data Coordination: Challenges and Solutions

Laura Clarke of EMBL-EBI, Data Coordinator for the BLUEPRINT project, shares some of the common pitfalls in large-scale research projects and how to manage them.

Map of the world, with RADAR/rainbow overlay — Communication must centre on being absolutely clear about intent, for people from very different backgrounds who may have very different goals.

Opinion | Laura Clarke, Resequencing Informatics Coordinator, EMBL-EBI

Over my eight years running data coordination for large, international projects, the complexity and scope of the biology being studied by these projects have grown. Here, I share some of the lessons I've learned about how to coordinate these projects well.

Good data coordination must tackle social and technical challenges alike, during consortium creation and throughout the lifetime of the project. Recognising and proposing solutions to the social and technical issues is essential for building fruitful relationships within any large-scale collaboration.

Social challenges centre on communication and intent. If it is to conduct ‘good science’, a consortium needs to establish a shared understanding of the scientific and technical challenges. Developing that common perception will help it move forward effectively.

Building and maintaining good social interactions

Most of the social challenges centre on ‘intent’. Communication must centre on being absolutely clear about this for people from very different backgrounds who may have very different goals.

From the beginning of a project, the goals of communication are to:

Gain an understanding;
Find a common language; and
Set expectations.

The basic structure of communication requires being consistent and agreeing on how you will share information, for example:

In-person meetings;
Phone calls and videoconferences; and
Email-based ticketing systems.

Within that structure, you need to address all of the details involved in:

Creating and supporting standards – these will drive the success of the project beyond its funding period, and can be considered the I-beams of data science;
Facilitating specific collaborations, for example through maintaining an open dialogue and proactively connecting people with matching interests; and
Supporting interactions with key services, such as submission of data to the ENA and ensuring the data flows to Ensembl for use in their annotation pipelines.

Using communication to maximise uptake

In the Functional Annotation of Animal Genomes (FAANG) Project, a very large, distributed consortium, we put a huge amount of effort into establishing data standards and rules. To maximise their adoption, we took communication very seriously, and incorporated user-experience research into our standard operating procedures.

This example shows the visual approach we took to communicating standards in an international project (with members from many language groups), addressing every detail of a typical experience employing data standards.

Standards and rules: Start by being clear (and brief) about the purpose, what is expected, and why.
This screenshot shows an example from the FAANG project, namely what you might see after uploading your metadata:

Screen shot from FAANG: Validation result

Presentation: When people are producing a lot of data, there are so many details to consider that users can get frustrated. To avoid introducing barriers, stick with your visual approach and ‘show, don’t tell’: Anyone in your project, from any language group, should be able to see at a glance how well they are doing. Rather than presenting a potentially overwhelming wall of text, give users the opportunity to click for details.

This example, a summary results page from a FAANG metadata upload, shows it better than I can tell it:

Screen shot from FAANG: Results list

Clicking through gives you more detail about what happened, and why:

Screen shot from FAANG: Error detail

Microcopy: Error messages are a perfect example of microcopy that is often neglected. What do your error messages mean? Giving proper attention to this content really pays off. When something goes wrong, anyone in your project should be able to take in what it was, why it happened and how to avoid it in future. For this service, we provide an explanation for our error messages on a web page.
Incentives: People will be going to a lot of trouble – often changing the way they work – to comply with data standards. Reward them when they get it right by giving them something they want, like a quick-and-easy conversion tool (n.b. do your research!).

One example is FAANG’s tailored template, which matches user needs and offers to convert data into SampleTab, the expected submission format for BioSamples.
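The ideas above (a summary at a glance, details on request, and error messages that say what happened, why, and how to avoid it) can be sketched in a few lines of Python. This is a minimal illustration, not the FAANG validation service: the field names, the allowed values and the wording of the messages are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical rules for illustration; the real FAANG rule set is not reproduced here.
REQUIRED_FIELDS = ("sample_name", "species", "material")
ALLOWED_MATERIALS = {"organism", "specimen", "cell line"}


@dataclass
class Issue:
    field: str
    what: str  # what went wrong
    why: str   # why the rule exists
    fix: str   # how to avoid it next time


def validate_record(record: dict) -> list[Issue]:
    """Return the issues found in one metadata record (empty list means it passed)."""
    issues = []
    for name in REQUIRED_FIELDS:
        if not str(record.get(name, "")).strip():
            issues.append(Issue(
                field=name,
                what=f"'{name}' is missing or empty.",
                why="Required fields let others find and reuse the sample.",
                fix=f"Add a value for '{name}' and upload again.",
            ))
    material = str(record.get("material", "")).strip()
    if material and material not in ALLOWED_MATERIALS:
        issues.append(Issue(
            field="material",
            what=f"'{material}' is not a recognised material.",
            why="A controlled vocabulary keeps records comparable.",
            fix="Use one of: " + ", ".join(sorted(ALLOWED_MATERIALS)) + ".",
        ))
    return issues


def summarise(records: list[dict]) -> dict:
    """Counts come first for the at-a-glance view; details are kept per record."""
    details = {i: validate_record(r) for i, r in enumerate(records)}
    failed = sum(1 for issues in details.values() if issues)
    return {"passed": len(records) - failed, "failed": failed, "details": details}
```

A results page could show only the "passed" and "failed" counts, and reveal the what, why and fix text for a record when the user clicks on it.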

Technical challenges: effective solutions

Technical challenges may seem endlessly diverse, but a core set is common to all large consortium projects. These are:

Data life cycles, for example ensuring the raw data for each sample can be connected to the final analysis product (e.g. the 1000 Genomes variant call set);
Storage and access, such as ensuring the BLUEPRINT consortium could get in-progress analysis results through private FTP dropboxes before a public release; and
Discoverability – there are so many aspects to the data, and ways people might be looking for it, that you need to be constantly adapting your approach. The challenges get increasingly complex as the datasets grow.
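The data life cycle point above comes down to recording, for every derived file, which files it was made from. As a rough sketch under that assumption, a simple lineage table is enough to trace an analysis product back to its raw data. The file names and the table layout here are invented for illustration and do not describe how any of the projects mentioned store this information.

```python
# Hypothetical lineage table: each derived file lists the files it was made from.
# File names are invented for illustration.
DERIVED_FROM = {
    "calls.vcf.gz": ["sampleA.bam", "sampleB.bam"],
    "sampleA.bam": ["sampleA_1.fastq.gz", "sampleA_2.fastq.gz"],
    "sampleB.bam": ["sampleB_1.fastq.gz"],
}


def raw_inputs(product: str, derived_from: dict[str, list[str]]) -> set[str]:
    """Walk the lineage table back to files with no recorded parents (raw data).

    A file that has no entry in the table is treated as raw, so asking for
    the raw inputs of a raw file returns that file itself.
    """
    raw, stack, seen = set(), [product], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against revisiting shared or cyclic entries
        seen.add(current)
        parents = derived_from.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            raw.add(current)
    return raw
```

Calling raw_inputs("calls.vcf.gz", DERIVED_FROM) returns the three FASTQ file names, which is the connection between final product and raw data that the first bullet describes.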

Done, rebuild, reuse

Finding solutions to these challenges, which are legion, is easiest when you stick to three basic working practices:

Reuse tools: Use existing tools wherever possible; there is no point reinventing the wheel. In FAANG, the ReseqTrack API and database, as well as the Data Portal Toolkit, served us well. Both already existed, having been built and used for other projects such as the 1000 Genomes Project, BLUEPRINT and HipSci.

Make small tools for small jobs: We develop focused tools to fill gaps that could block our mission if left untended. For example, in FAANG we developed a file manifest script to help our collaborators when they upload ad hoc analysis files, and a process to produce a file index (with timestamps) of our FTP site, which enables easy mirroring by other groups. These tools should be simple (e.g. our preferred output format is a TSV file) and open (e.g. code on GitHub, exposed APIs).
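To give a flavour of how small such a tool can be, here is a minimal sketch of a file index written as TSV. It is not the FAANG script: the column names are hypothetical, and the MD5 column is an assumption added because checksums are commonly used when mirroring.

```python
import csv
import hashlib
import os
from datetime import datetime, timezone


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_index(root: str, index_path: str) -> int:
    """Write one TSV row per file under root; return the number of files indexed.

    Column names are hypothetical. Keep index_path outside root, otherwise
    the index will list itself on the next run.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stat = os.stat(full)
            modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
            rows.append([
                os.path.relpath(full, root).replace(os.sep, "/"),
                stat.st_size,
                md5sum(full),
                modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
            ])
    rows.sort()  # paths are unique, so this orders rows by path
    with open(index_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(["path", "size_bytes", "md5", "modified_utc"])
        writer.writerows(rows)
    return len(rows)
```

A group mirroring the site could compare two such indexes and fetch only the files whose timestamp or checksum has changed.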

Retire old tools and move on: If a tool is no longer useful, show it no mercy. In BLUEPRINT, we switched from the ReseqTrack event system to Ensembl’s eHive system. This gave us much more flexibility and power in how we run our pipelines.

The Data Portal Toolkit

Our Data Portal Toolkit is a great example of reuse. The complex datasets produced by projects like BLUEPRINT, HipSci and FAANG can be difficult to explore, which makes it a challenge to discover the specific information you want. For the HipSci project, which was creating iPSCs and producing a baseline catalogue of genomic, proteomic and cellular phenotype data on those iPSCs, making all these data sets accessible was very important.

We built the first of our data portals with Elasticsearch’s easily searchable document store as the anchor for this toolkit. Metadata from the archives and other sources is transformed into JSON and stored in the Elasticsearch database. We can then build views on that data, so that our users can issue complex queries and find specific pieces of the extensive data collected by HipSci.
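The transformation step can be illustrated with a short sketch that turns tab-separated metadata into the newline-delimited JSON body accepted by Elasticsearch’s bulk indexing endpoint. It is only an illustration of the general idea and does not reproduce the toolkit’s own loaders; the function name, index name and column names are hypothetical.

```python
import csv
import io
import json


def rows_to_bulk(tsv_text: str, index_name: str, id_field: str) -> str:
    """Convert TSV metadata into a newline-delimited bulk-indexing body.

    Each record becomes two lines: an action line naming the target index
    and document id, then the JSON document itself. Every row must have a
    value in id_field, otherwise a KeyError is raised.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        # Drop empty cells (and any cells beyond the header) so the stored
        # JSON only carries real, named values.
        doc = {key: value for key, value in row.items() if key is not None and value}
        action = {"index": {"_index": index_name, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string could be sent to the bulk endpoint with any HTTP client. Once the documents are stored, the views and complex queries described above are built on top of them.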

This issue is not unique to HipSci, so we extended the toolkit to create a portal for the International Genome Sample Resource (IGSR), which is providing ongoing support for the 1000 Genomes Project data. The portal makes this data easier to search, as the data files were previously only discoverable through the FTP site.

We are currently preparing a data portal for the FAANG consortium, which will make the public collection of functional data for livestock species much more discoverable.

Data Portal Toolkit: diagram (Laura Clarke, EMBL-EBI)

Best practice

Who can really say what ‘best practice’ is? When an approach works really well, every time, that’s a pretty good indication. Any consortium will need to establish this together – it’s not a top-down thing. For the projects I’ve coordinated, finding a common language and applying it in the best possible way, following user-experience research principles, has been effective and rewarding.

Ensuring that the complex data these high-throughput projects create are accessible relies on getting the metadata correct, both for the samples they collect and for the experiments they do.

This involves working with a consortium to set appropriate standards, supporting it in adopting those standards, and establishing reusable technology to help the wider community benefit from the work. The better structured your data, the easier it will be to spot confounding factors when you do your analysis, and the easier it will be for you and others to reuse it when new questions arise.

Discover more

BLUEPRINT: Blood, Big Data and Epigenetics
The 1000 Genomes Project: data management and community access (Nature Methods 2012)
BLUEPRINT Data Coordination Centre
BLUEPRINT data on Trackhubs (Ensembl)
FAANG: Functional Annotation of Animal Genomes
HipSci: Human Induced-pluripotent Stem Cells
