Basic principles of integration

Datasets can only be integrated if they describe shared entities and concepts in the same way, using the same identifiers, vocabulary and formats. If the same entity or concept is described differently in each dataset, the datasets cannot be integrated as they stand. Data integration therefore invariably requires some preparation.

Tools to help integrate data

There are tools available to help. For example, UniProt ID Mapping and Ensembl BioMart allow you to convert a set of identifiers from one format to another. There are also mappings between different controlled vocabularies, but take care that you don’t lose data in the process: a term in one ontology might be mapped to a less granular term in another, so you lose specificity. At EMBL-EBI we use application ontologies, the archetypal example of which is the Experimental Factor Ontology, to solve this problem.
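The risk of losing data during a vocabulary mapping can be checked before you rely on the mapped terms. The sketch below is a minimal illustration in plain Python, not the method used by any of the tools named above. The function name `summarise_mapping` and the `SRC:`/`TGT:` terms are hypothetical placeholders, and the mapping is assumed to be a simple one-to-one lookup table that you have already downloaded or built. It reports source terms that have no mapping at all, and target terms that absorb several distinct source terms, which is where specificity is lost.

```python
from collections import defaultdict


def summarise_mapping(source_terms, term_mapping):
    """Apply a source-to-target mapping and report where information is lost.

    Returns (mapped, unmapped, collapsed):
      mapped    - dict of source term -> target term
      unmapped  - source terms with no entry in the mapping
      collapsed - target terms that absorb more than one distinct source term
    """
    mapped = {}
    unmapped = []
    sources_by_target = defaultdict(list)

    for term in source_terms:
        target = term_mapping.get(term)
        if target is None:
            # No equivalent in the target vocabulary: this term would be dropped.
            unmapped.append(term)
            continue
        mapped[term] = target
        if term not in sources_by_target[target]:
            sources_by_target[target].append(term)

    # A target with several distinct sources is less granular than the source terms.
    collapsed = {
        target: sources
        for target, sources in sources_by_target.items()
        if len(sources) > 1
    }
    return mapped, unmapped, collapsed


# Hypothetical terms for illustration only; not taken from any real ontology.
example_mapping = {
    "SRC:left ventricle": "TGT:heart",
    "SRC:right ventricle": "TGT:heart",
    "SRC:liver": "TGT:liver",
}
example_terms = [
    "SRC:left ventricle",
    "SRC:right ventricle",
    "SRC:liver",
    "SRC:spleen",
]

mapped, unmapped, collapsed = summarise_mapping(example_terms, example_mapping)
print("unmapped:", unmapped)
print("collapsed:", collapsed)
```

In this made-up example, the two ventricle terms both end up as the single target term for heart, so any analysis that depends on the distinction between them should keep the original annotation alongside the mapped one.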

If you and your collaborators submit data to public repositories, the data will be put into a standard format and the data integration will essentially be done for you. If you work in a commercial environment, you may have your own in-house databases, or you may use private instances of the public databases.

EMBL-EBI’s Embassy Cloud provides EMBL-EBI’s collaborators with direct access to their datasets hosted at EMBL-EBI, and to the institute’s powerful computing resources. This shared, high-performance workspace allows project partners in many locations to analyse their data alongside public offerings, using their own approaches. Access to the Embassy Cloud is available to collaborators working on projects with EMBL-EBI. The service has been successfully piloted with Europe PMC (partners in Manchester, London and EMBL-EBI) and Tara Oceans (EMBL and global collaborators), and is now more widely available.

Where does the data come from?

It’s important to understand the origin of the data that you are integrating, and to be able to check the evidence for the involvement of each entity in the bigger picture. If, for example, you are integrating different types of omics data to understand the regulation of a pathway and its dysregulation in disease, you need to have a good understanding of the pathway in question and the disease that you’re studying, whilst remaining open and unbiased about what the data might be telling you.

Data integration requires that all the data are annotated in a consistent way. You also need to be absolutely sure that you’re comparing like with like. The BioSamples database allows you to find all the experiments performed on the same sample. To learn more about BioSamples, take a look at our BioSamples: Quick tour.
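To make "comparing like with like" concrete, the sketch below groups experiment records by a shared sample identifier and flags samples whose annotations disagree between records. It is a minimal illustration that works on local dictionaries and does not call the BioSamples service. The record layout, the field names (`sample_id`, `experiment`, `organism`, `tissue`) and the `SAMPLE_1`/`SAMPLE_2` identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case and collapse whitespace so trivially different labels compare equal."""
    return " ".join(str(value).split()).lower()


def group_by_sample(records, annotation_keys=("organism", "tissue")):
    """Group experiment records by sample ID and flag conflicting annotations."""
    grouped = defaultdict(list)
    seen_values = defaultdict(lambda: defaultdict(set))

    for record in records:
        sample_id = record["sample_id"]
        grouped[sample_id].append(record["experiment"])
        for key in annotation_keys:
            if key in record:
                seen_values[sample_id][key].add(normalise(record[key]))

    # A sample is in conflict if any annotation has more than one normalised value.
    conflicts = {}
    for sample_id, by_key in seen_values.items():
        clashing = {
            key: sorted(values)
            for key, values in by_key.items()
            if len(values) > 1
        }
        if clashing:
            conflicts[sample_id] = clashing
    return dict(grouped), conflicts


# Hypothetical records; the IDs and field names are placeholders.
records = [
    {"sample_id": "SAMPLE_1", "experiment": "rna-seq",
     "organism": "Homo sapiens", "tissue": "Liver"},
    {"sample_id": "SAMPLE_1", "experiment": "proteomics",
     "organism": "homo  sapiens", "tissue": "liver"},
    {"sample_id": "SAMPLE_2", "experiment": "rna-seq",
     "organism": "Homo sapiens", "tissue": "heart"},
    {"sample_id": "SAMPLE_2", "experiment": "metabolomics",
     "organism": "Homo sapiens", "tissue": "cardiac muscle"},
]

experiments, conflicts = group_by_sample(records)
print(experiments)
print(conflicts)
```

Differences in case and spacing are treated as the same annotation here, whereas "heart" and "cardiac muscle" are reported as a conflict. Simple string normalisation cannot tell whether two different labels mean the same thing; resolving that needs a controlled vocabulary or manual review.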

If you’re performing your own experiments, bear in mind that others may want to integrate your data with data from other sources in the future. Providing adequate metadata, and formatting your data in a re-usable way, should become second nature to you.

Toni Kazic’s guide for data provenance (8) is a good place to start. If you’re using other people’s data, check it as though it were your own.

Pre-canned data integration

The good news is that a growing number of resources have already done a lot of the hard work for you. We have already used one service, EBI Search, that does a lot of the mapping of related entities for you. Another service that integrates a huge amount of public data relevant to discovery is Open Targets. Open Targets is designed to enable exploration and visualisation of drug targets associated with disease. You can learn more about Open Targets in our webinar Open Targets: Mining gene and disease associations for improved drug target identification.