BioStudies: The Data Bento Box
5 Sep 2016 - 12:12
- BioStudies makes it easier for journals and researchers to collate data held in many different archives.
- BioStudies provides links to all of the data supporting an article, study, or project, and archives unstructured data for which there is no recommended public repository.
- The new repository helps identify emerging trends in data-driven science.
BioStudies, a new data service at EMBL-EBI, packages all the data supporting a study, giving a home to unstructured data and linking to datasets in established repositories. As a data ‘container’ separate from the published article, a BioStudies record can be updated over time, adding flexibility and value to the published record. BioStudies is also built to help data managers support life-science research, making it easier to identify emerging trends and community requirements.
“When all the data supporting a paper is grouped in one place, it simplifies things for journals and authors – but also other researchers who want to re-examine the data in new contexts,” says Jo McEntyre, head of Literature Services at EMBL-EBI. “It’s easier to cite the data, and easier for authors to add important information after publication. Authors also have a permanent location for their supplemental data, where it’s linked to the relevant articles and data in standard repositories, and available for re-use and discovery.”
Many data types, one record
Scientific discoveries are often based on new experimental approaches, which produce data in new formats. This data can include, for example, high-resolution microscopy images, toxicology data (typically held in spreadsheets), or novel integrations of genomic and proteomic data. A single article might cite the accession numbers of several such datasets, archived in multiple resources. These references are often scattered throughout the text, with additional supplementary materials appended as files.
This ad-hoc approach makes it difficult to explore the data, and to gauge the uptake of new technologies.
“Say you’re submitting a paper about your new study, and it’s based on sequence data in the ENA, a metabolomics dataset in MetaboLights, a spreadsheet with data that doesn’t fit anywhere and a dozen images,” says Jo. “Keeping it in one place makes it coherent: all the links and the unstructured data are archived and presented as one BioStudies record. This makes it much easier to take advantage of the emerging, more robust methods for citing the data in an article.”
Making new data discoverable
BioStudies ensures that datasets submitted in new formats carry enough descriptive data (metadata) to be found. It also provides a way to archive data types that are changing rapidly, such as imaging data.
One of the largest datasets in BioStudies comes from a high-throughput imaging study by scientists from EMBL and DKFZ Heidelberg, which was used to create a map of interactions between genes and small molecules in cancer cells. Because this technology is so new, EMBL-EBI is collaborating with EuroBioImaging and BioImagingUK (with the University of Dundee) to establish data standards. Setting those standards will become possible once a critical mass of such imaging datasets is available and the research community’s needs are clear. At that point, a dedicated image repository can be established, and BioStudies records can link to it.
“BioStudies is like an early-warning system, alerting us to emerging formats and technologies for which clear standards have not yet been established,” says Ugis Sarkans, Team Leader at EMBL-EBI. “It gives us an opportunity to ask the community whether a technology is still in flux, or if the time is right to start giving the data some structure. I am really interested to see what kind of datasets people begin to deposit, and how that changes over time.”
Easy data submission
BioStudies is integrated with the Europe PMC literature database, which generates BioStudies records based on text-mined accession numbers and supplemental data for any article in the Europe PMC archive. BioStudies also offers an easy-to-use submission tool for dataset authors.
“A BioStudies record is a data mirror of a paper, but it doesn’t have to have a paper,” adds Ugis. “This is perfect for projects that publish few papers, but which have generated lots of data that could be reanalysed by others.”
McEntyre J, Sarkans U, Brazma A (2015) The BioStudies database. Mol. Syst. Biol. 11(12):847.
Editorial (2016) Where are the data? Nature 537:138. Published online 8 September.
Taylor S (2016) Making data discoverable with figshare. Royal Society blog, published online 7 September.