EMBL-EBI data repository organisation for array and sequencing platforms
This text has been prepared as part of the European Sequencing and Genotyping Initiative (ESGI) to lay out the nature of the different EMBL-EBI data resources and their supporting services for generators and consumers of genotyping and sequencing information. The document targets those involved in sequencing and array-based genotyping work and covers only those resources, and those parts of those resources, that are directly relevant to this work. Further information on the resources covered is available from EMBL-EBI and linked where possible in this document.
The three main sections of the text cover the core data repositories that provide long-term storage of data, the submission tools and services that support the flow of data from providers into the core data repositories, and the data access routes that present data from the core repositories. Some submission and presentation services are tied to a specific core data repository; others operate independently of the particular repository that holds the data and are re-used in different instances.
In the final section, examples are given of hypothetical data sets, submission routing, appropriate repositories and access routes.
Core data repositories
SRA: The Sequence Read Archive (raw and early analysis data from next-generation sequencing platforms)
The Sequence Read Archive (SRA) is part of the broader European Nucleotide Archive (ENA), which provides globally comprehensive coverage of nucleic acid sequences and associated information and operates as part of the International Nucleotide Sequence Database Collaboration. SRA accepts data from all next-generation platforms and for all applications of sequencing, from classical genomic sequencing assembly projects to functional genomics.
ArrayExpress: The ArrayExpress Experiment Archive
The ArrayExpress Archive (ArrayExpress) covers raw and analysis data from array platforms. ArrayExpress focuses on functional genomics data, including those derived from quantitative transcriptomics and epigenomics experiments. Typical ArrayExpress records comprise array design annotations, experimental design descriptions and analysis files.
EGA: The European Genome-Phenome Archive
The European Genome-Phenome Archive (EGA) provides a secure repository for data and experimental design information for studies in the area of human molecular medicine, where donor consent requires privacy and authorised access. EGA covers all assay platforms, including those that are array- and sequence-based. Internally, beyond the security layer, much of the technology closely mirrors that of the public archives.
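The division of labour between the three repositories described above can be summarised as a small decision rule. The following is a minimal sketch only: the `Study` type, its field names and the `choose_repository` function are hypothetical names introduced for illustration, and real studies may need a case-by-case decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """Hypothetical description of a data set (not an EMBL-EBI data model)."""
    platform: str      # "array" or "sequencing"
    restricted: bool   # True when donor consent requires authorised access


def choose_repository(study: Study) -> str:
    """Return the repository suggested by the descriptions in this section."""
    if study.platform not in ("array", "sequencing"):
        raise ValueError(f"unknown platform: {study.platform!r}")
    # EGA covers all assay platforms when access must be controlled.
    if study.restricted:
        return "EGA"
    # Unrestricted array data belongs in ArrayExpress.
    if study.platform == "array":
        return "ArrayExpress"
    # Unrestricted next-generation sequencing data belongs in SRA.
    return "SRA"


if __name__ == "__main__":
    print(choose_repository(Study(platform="sequencing", restricted=False)))
    print(choose_repository(Study(platform="array", restricted=True)))
```

The restricted-access check comes first because EGA is defined by its access model rather than by assay platform.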
Submission tools and services
A number of submission applications and services have been developed to suit the varying needs of submitters.
Unrestricted sequence data
Two systems are in place to support smaller-scale periodic submissions and larger-scale automated submissions.
Interactive submissions using SRA Webin
The SRA Webin web application offers guided and intuitive spreadsheet-based entry of sample and experimental configuration information and supports the upload of data files into a dropbox system. Users should request a submission account from firstname.lastname@example.org.
Programmatic submissions using the REST/dropbox system
SRA provides direct submission services that are appropriate for fully automated submissions. These are based on a private dropbox system and an associated REST service through which transactions such as 'validate', 'load', 'modify', 'release' and 'report' can be requested. New users should request a submission account from email@example.com.
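As an illustration of how an automated client might guard the transaction names listed above, the following minimal sketch builds a request URL without contacting any server. Only the five transaction names come from this text. The base URL, the URL layout, the `submission` parameter and the function name are hypothetical placeholders; the actual endpoint and request format should be taken from the SRA submission documentation.

```python
from urllib.parse import urlencode

# Transaction names as listed in this document.
TRANSACTIONS = ("validate", "load", "modify", "release", "report")

# Placeholder only: NOT the real SRA service address.
BASE_URL = "https://submission.example.org/rest"


def build_transaction_url(action: str, submission_alias: str) -> str:
    """Build a (hypothetical) request URL for one submission transaction."""
    if action not in TRANSACTIONS:
        raise ValueError(f"unsupported transaction: {action!r}")
    # urlencode escapes the alias so that it is safe inside a query string.
    query = urlencode({"submission": submission_alias})
    return f"{BASE_URL}/{action}?{query}"


if __name__ == "__main__":
    print(build_transaction_url("validate", "mouse-run-01"))
```

Rejecting unknown transaction names on the client side keeps typos from reaching the service in an unattended pipeline.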
Unrestricted array data
Interactive submissions through Annotare
Spreadsheet-based submissions through MAGE-TAB
ArrayExpress offers a spreadsheet-based submission framework for large-scale studies. Full details are provided here.
Restricted sequence and array data
Interactive sequence submissions using EGA Webin
The EGA Webin web application offers guided and intuitive spreadsheet-based entry of sample and experimental configuration information and supports the secure upload of data files into a dropbox system. Users should request a submission account from firstname.lastname@example.org.
Programmatic sequence submissions using the REST/dropbox system
EGA can provide direct submission services that are appropriate for fully automated submissions. These are based on a private dropbox system and an associated REST service through which transactions such as 'validate', 'load', 'modify', 'release' and 'report' can be requested. New users should request a submission account from email@example.com.
Array-based data submissions
EGA supports submissions of array-based study data through a template-based system. Contact firstname.lastname@example.org for a submission package, which contains metadata templates and template completion guides.
Data access services
Example data sets
|Study type||Recommended submissions route(s)||Data repository/ies||Recommended retrieval route(s)|
|Array-based mouse genotyping||MAGE-TAB||ArrayExpress||ArrayExpress|
|Small-scale sequence-based mouse genotyping||MAGE-TAB
|Human (restricted access) genotyping||EGA