Omics Discovery Index boosts FAIR data sharing
Finding innovative ways to store, organise and access different types of biomedical data, each hosted on a different platform, is a huge challenge, largely because the falling cost of genome sequencing and other ‘omics methods has led to a surge in data. The open-source Omics Discovery Index (Omics DI) offers a solution, making it easier for researchers everywhere to discover, access and share ‘omics datasets.
- The Omics Discovery Index makes data from publicly funded research more discoverable and reusable;
- The index simplifies access to genomics, proteomics, metabolomics and other large-scale datasets, for both individuals and data service providers;
- A single interface to search 81,000 datasets from 11 member repositories represents a major achievement in interoperability.
What’s the challenge?
Currently, scientists need to search many different repositories and sift through a huge number of publications to gain a clear picture of what is known about their particular research question. There is a clear demand for more centralised access to these distributed datasets, much like the coherent view of the literature offered by PubMed Central partners.
How Omics DI provides a solution
Omics DI is the first open-source platform to integrate datasets from many different ‘omics databases into a single framework and interface. It makes life-science data more findable, accessible and interoperable for both people and machines, which in turn supports their reuse (the FAIR Guiding Principles explain why this is so important).
“Omics DI provides protocols and tools for finding and linking datasets,” says bioinformatician Yasset Perez-Riverol of EMBL-EBI, who played a key role in the international collaboration that delivered Omics DI. “This helps make a huge amount of diverse data searchable through a single interface, whether it’s from genomics, transcriptomics, proteomics or metabolomics experiments in human, plant, bacterial or other species.”
Omics DI doesn’t just deliver isolated search results: it integrates datasets from 11 member data repositories (see Box) without replicating them. This works because all of the participants adhere to a common standard for metadata and a common exchange format.
Flexible, durable, interoperable
“The number of data repositories indexed by Omics DI will continue to grow, if only because it is so flexible,” says Henning Hermjakob, Head of Molecular Systems Services at EMBL-EBI. The platform’s flexibility allows it to accommodate many data models, metadata representations and identifiers. “It addresses interoperability problems by offering harmonisation, for example ontology-based tools for resolving awkward things like different terms being used for the same concept – for example ‘protein’ and ‘gene product,’” he adds.
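To make the idea of term harmonisation concrete, here is a minimal sketch in Python. It is not the ontology tooling Omics DI actually uses: the synonym table, the field names and the function names are all hypothetical, and a real system would look terms up in a curated ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table: each free-text label maps to one canonical concept.
# A real harmonisation service would consult an ontology instead.
SYNONYMS = {
    "protein": "gene product",
    "gene product": "gene product",
    "homo sapiens": "human",
    "human": "human",
}


def harmonise(term: str) -> str:
    """Return the canonical concept for a term, or the cleaned term if unknown."""
    key = " ".join(term.lower().split())  # lower-case and collapse whitespace
    return SYNONYMS.get(key, key)


def harmonise_record(record: dict) -> dict:
    """Harmonise every value of every metadata field and drop duplicates."""
    return {
        field: sorted({harmonise(value) for value in values})
        for field, values in record.items()
    }
```

With this table, two repositories that annotate the same experiment with "Protein" and "gene product" end up with one shared label, so a single query matches both records.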
How it works
Omics DI offers a flexible exchange system based on an XML format and application programming interfaces (APIs). It links datasets in three major ways:
- Datasets are linked via explicit mentions in the metadata. The relationship between original and reanalysed datasets is defined with a cross-reference in the Omics DI XML, which provides a direct link between datasets in different repositories.
- Datasets from one multi-omics experiment that are deposited in different repositories are linked by the associated publication. At the end of 2016 there were 4,476 such datasets.
- ‘Similar datasets’ are approached much in the way ‘related articles’ are in PubMed Central repositories. Omics DI computes similarity at metadata and biological-entity levels to identify, for example, datasets that use similar software or share similar biological entities.
The ‘similar datasets’ feature is the first of its kind, boosting the discoverability of related datasets in different repositories.
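The three linking routes described above can be sketched on toy records. This is an illustration under stated assumptions, not the Omics DI implementation: the records, accessions and field names are made up, and Jaccard overlap of metadata terms stands in for the similarity measure, which the description above does not specify.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; every accession and field name here is hypothetical.
DATASETS = [
    {"id": "A1", "pubmed": "111", "reanalysis_of": None,
     "terms": {"human", "liver", "maxquant"}},
    {"id": "B7", "pubmed": "111", "reanalysis_of": None,
     "terms": {"human", "liver", "rna-seq"}},
    {"id": "C3", "pubmed": "222", "reanalysis_of": "A1",
     "terms": {"human", "liver", "maxquant", "reanalysis"}},
]


def explicit_links(datasets):
    """Route 1: cross-references stated in the metadata (reanalysis -> original)."""
    return [(d["id"], d["reanalysis_of"]) for d in datasets if d["reanalysis_of"]]


def publication_links(datasets):
    """Route 2: datasets that cite the same publication."""
    by_publication = defaultdict(list)
    for d in datasets:
        if d["pubmed"]:
            by_publication[d["pubmed"]].append(d["id"])
    return {pub: ids for pub, ids in by_publication.items() if len(ids) > 1}


def jaccard(a, b):
    """Share of terms two datasets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(datasets, threshold=0.5):
    """Route 3: pairs whose term overlap reaches the (assumed) threshold."""
    pairs = []
    for x, y in combinations(datasets, 2):
        score = jaccard(x["terms"], y["terms"])
        if score >= threshold:
            pairs.append((x["id"], y["id"], round(score, 2)))
    return pairs
```

On these records, `explicit_links` connects C3 to A1, `publication_links` groups A1 with B7, and `similar_pairs` reports A1 with B7 and A1 with C3, while B7 and C3 fall below the threshold.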
A new view on multi-omics data
The Omics DI website provides an easy way to search, filter and browse multi-omics datasets by species, tissue, disease or other aspects, returning relevant results based on a weighted scoring function. For those who want to get straight to the data, Omics DI also offers a RESTful API for programmatic access.
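As a rough illustration of what a weighted scoring function does, the sketch below ranks toy records by giving each metadata field its own weight. The fields, the weights and the simple substring matching are assumptions made for this example; they are not the scoring function used by the Omics DI search.

```python
# Hypothetical field weights: a hit in the title counts more than one in the description.
WEIGHTS = {"title": 3.0, "species": 2.0, "description": 1.0}


def score(dataset: dict, query: str) -> float:
    """Sum the weights of every field that contains each query term."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in WEIGHTS.items():
        text = dataset.get(field, "").lower()
        # Plain substring matching; a real search engine would tokenise and stem.
        total += weight * sum(1 for term in terms if term in text)
    return total


def search(datasets, query):
    """Return ids of matching datasets, best score first (ties broken by id)."""
    scored = [(score(d, query), d["id"]) for d in datasets]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [ident for value, ident in ranked if value > 0]
```

A dataset that mentions the query term in both its title and its description therefore outranks one that mentions it only in the description, and datasets with no match are left out of the results.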
Omics DI is a sophisticated platform that integrates large, complex datasets and makes them more discoverable, accessible and reusable for scientists throughout the world. It is the perfect interoperability tool for international data-sharing endeavours like ELIXIR and Big Data to Knowledge, which are working to solve some of society’s greatest challenges in health, food supply and environment.
OmicsDI: Facts and Figures
>81,116 omics data sets (as of December 2016)
4 ‘omics data types (67,361 transcriptomics, 6,281 proteomics, 8,093 genomics and 847 metabolomics)
Human, model-organism and non-model-organism datasets
11 different repositories, hosted on 4 continents, including:
- The European Genome-Phenome Archive (EGA)
- Expression Atlas
- The Proteomics Identifications (PRIDE) database
- The PeptideAtlas
- The Mass Spectrometry Interactive Virtual Environment (MassIVE)
- The Global Proteome Machine Database (GPMDB)
- The Global Natural Products Social Molecular Networking project (GNPS)
- The Metabolomics Workbench
Perez-Riverol Y, et al. (2017) Discovering and linking public omics data sets using the Omics Discovery Index. Nature Biotechnology 35, 406–409; doi:10.1038/nbt.3790
Wilkinson MD, et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018; doi:10.1038/sdata.2016.18
ProteomeXchange: A platform for the globally coordinated submission of mass spectrometry proteomics data
MetabolomeXchange: An international data aggregation and notification service for metabolomics
ELIXIR: the pan-European research infrastructure for life-science data
Big Data to Knowledge (BD2K): An NIH Common Fund programme