EMBL-EBI data repository organisation for array and sequencing platforms

ESGI

This text has been prepared as part of the European Sequencing and Genotyping Initiative (ESGI) to describe the EMBL-EBI data resources and their supporting services for generators and consumers of genotyping and sequencing information. The document targets those involved in sequencing and array-based genotyping work and covers only those resources, and those parts of resources, that are directly relevant to this work. Further information on the resources covered is available from EMBL-EBI and is linked where possible in this document.

The three main sections of the text cover the core data repositories, which provide long-term storage of data; the submission tools and services, which support the flow of data from providers into the core data repositories; and the data access routes, which present data from the core repositories. Some submission and presentation services are tied to a specific core data repository, while others operate independently of the repository that holds the data and are re-used across repositories.

In the final section, examples are given of hypothetical data sets, submission routing, appropriate repositories and access routes.

Core data repositories

SRA: The Sequence Read Archive (raw and early analysis data from next-generation sequencing platforms)

The Sequence Read Archive (SRA) is part of the broader European Nucleotide Archive (ENA), which provides globally comprehensive coverage of nucleic acid sequences and associated information and operates as part of the International Nucleotide Sequence Database Collaboration. SRA accepts data from all next-generation platforms and for all applications of sequencing, from classical genomic sequencing assembly projects to functional genomics.

ArrayExpress: The ArrayExpress Experiment Archive

The ArrayExpress Archive (ArrayExpress) covers raw and analysis data from array platforms. ArrayExpress focuses on functional genomics data, including those derived from quantitative transcriptomics and epigenomics experiments. Typical ArrayExpress records comprise array design annotations, experimental design descriptions and analysis files.

EGA: The European Genome-Phenome Archive

The European Genome-Phenome Archive (EGA) provides a secure repository for data and experimental design information for studies in the area of human molecular medicine, where donor consent requires privacy and authorised access. EGA covers all assay platforms, including those that are array- and sequence-based. Internally, beyond the security layer, much of the technology closely mirrors that of the public archives.

Submission tools and services

A number of submission applications and services have been developed to suit the varying needs of submitters.

Unrestricted sequence data

Two systems are in place to support smaller-scale periodic submissions and larger-scale automated submissions.

Interactive submissions using SRA Webin

The SRA Webin web application offers guided and intuitive spreadsheet-based entry of sample and experimental configuration information and supports the upload of data files into a dropbox system. Users should request a submission account from datasubs@ebi.ac.uk.

Programmatic submissions using the REST/dropbox system

SRA provides direct submission services that are appropriate for fully automated submissions. These are based on a private dropbox system and an associated REST service through which transactions such as 'validate', 'load', 'modify', 'release' and 'report' can be requested. New users should request a submission account from datasubs@ebi.ac.uk.
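The transaction model described above can be pictured on the client side with a minimal sketch such as the one below. Only the five transaction names come from the text. The `SubmissionRequest` class, its fields, the `parse_transaction` helper and the example account and file names are hypothetical and do not reflect the actual SRA REST interface. The code makes no network call; it only records which transaction a submitter intends to request for files already placed in the dropbox.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Transaction(Enum):
    """Transaction names as listed in the text (not necessarily exhaustive)."""
    VALIDATE = "validate"
    LOAD = "load"
    MODIFY = "modify"
    RELEASE = "release"
    REPORT = "report"


@dataclass
class SubmissionRequest:
    """Hypothetical client-side record of one requested transaction."""
    account: str                                     # placeholder account name
    transaction: Transaction
    files: List[str] = field(default_factory=list)   # files already in the dropbox

    def describe(self) -> str:
        """Return a human-readable summary; performs no network call."""
        names = ", ".join(self.files) if self.files else "no files"
        return f"{self.account}: {self.transaction.value} ({names})"


def parse_transaction(name: str) -> Transaction:
    """Map a user-supplied name onto a known transaction, ignoring case."""
    try:
        return Transaction(name.strip().lower())
    except ValueError:
        raise ValueError(f"unknown transaction: {name!r}") from None
```

A submitter's script might, for example, build a `SubmissionRequest` with `parse_transaction("validate")` before requesting 'load', so that a mistyped transaction name fails locally rather than at the service.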

Unrestricted array data

Interactive submissions through Annotare

ArrayExpress offers an online submission tool: Annotare. Full details are provided here.

Spreadsheet-based submissions through MAGE-TAB

ArrayExpress offers a spreadsheet-based submission framework for large-scale studies. Full details are provided here.

Restricted sequence and array data

Interactive sequence submissions using EGA Webin

The EGA Webin web application offers guided and intuitive spreadsheet-based entry of sample and experimental configuration information and supports the secure upload of data files into a dropbox system. Users should request a submission account from ega-helpdesk@ebi.ac.uk.

Programmatic sequence submissions using the REST/dropbox system

EGA can provide direct submission services that are appropriate for fully automated submissions. These are based on a private dropbox system and an associated REST service through which transactions such as 'validate', 'load', 'modify', 'release' and 'report' can be requested. New users should request a submission account from ega-helpdesk@ebi.ac.uk.

Array-based data submissions

EGA supports submissions of array-based study data through a template-based system. Contact ega-helpdesk@ebi.ac.uk for a submission package that contains metadata templates and template completion guides.

Data access services

SRA

SRA data are available from the ENA browser and through REST services (details provided here).

ArrayExpress

ArrayExpress data are available from the website and programmatically (details provided here).

EGA

EGA data can be discovered through the browser; requests for access should be sent to ega-helpdesk@ebi.ac.uk.

Example data sets

Study type                                    Recommended submission route(s)   Data repository/ies   Recommended retrieval route(s)
Array-based mouse genotyping                  MAGE-TAB                          ArrayExpress          ArrayExpress
Small-scale sequence-based mouse genotyping   MAGE-TAB, SRA Webin               SRA                   ArrayExpress, SRA
Human (restricted access) genotyping          EGA                               EGA                   EGA
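The routing described in the sections above can be summarised as a small decision function. This is a simplified sketch, not an official EMBL-EBI rule set: the function name `recommend_route` and its parameters are hypothetical, and it returns a single route per case, whereas the examples above may list more than one route for a study.

```python
from typing import Tuple


def recommend_route(data_type: str, restricted: bool,
                    large_scale: bool = False) -> Tuple[str, str]:
    """Return (repository, submission route) for a data set.

    data_type:   "array" or "sequence".
    restricted:  True when donor consent requires authorised access.
    large_scale: True for large-scale or fully automated submissions.
    """
    if data_type not in ("array", "sequence"):
        raise ValueError(f"unsupported data type: {data_type!r}")

    if restricted:
        # Restricted human data go to EGA regardless of assay platform.
        if data_type == "array":
            return ("EGA", "template-based submission package")
        return ("EGA", "REST/dropbox" if large_scale else "EGA Webin")

    if data_type == "array":
        # Unrestricted array data go to ArrayExpress.
        return ("ArrayExpress", "MAGE-TAB" if large_scale else "Annotare")

    # Unrestricted sequence data go to SRA.
    return ("SRA", "REST/dropbox" if large_scale else "SRA Webin")
```

For instance, `recommend_route("sequence", restricted=False)` returns `("SRA", "SRA Webin")`, matching the interactive route described for smaller-scale unrestricted sequence submissions.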
