Pathogen genome-scale sequence data
This page provides instructions for submitters of genome-scale pathogen sequence data to the European Nucleotide Archive (ENA). It includes a minimal checklist of sample metadata to be reported with sequence data generated in high-throughput genome-scale pathogen surveys or research studies of clinical, organismal and environmental samples. The sample metadata reporting standard, called MDM (Minimal Data for Mapping), has been developed in collaboration with the Global Microbial Identifier initiative (GMI). This page also links to instructions for the different categories of submission.
Checklist of required fields
The checklist presented here is intended as practical help for those preparing their data for submission to the ENA. We do not propose that the information described as mandatory here is necessarily sufficient for successful reproduction of experimental findings; the broader reporting standard framework, MIxS, exists to serve that purpose.
Broadly, the components of a submission are the sequence data themselves (raw sequence reads are mandatory, while assembly information is optional) and contextual data. The figure below lays out the fields, highlighting those that are mandatory and those that are recommended. Please note that information reported in these fields, with the exception of the sequence_reads, taxon and organism_name fields, should be placed in the extended sample attribute fields of sample records (as TAG:VALUE pairs), using the field names given in the figure. Instructions specific to each submission route are given in the submission instructions.
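As an illustration of the TAG:VALUE layout described above, the following minimal sketch builds a block of sample attributes with Python's standard library. The element names SAMPLE_ATTRIBUTES and SAMPLE_ATTRIBUTE are an assumption modelled on the project attribute XML shown later on this page, and the field names and values are hypothetical placeholders; the real field names are those given in the checklist figure.

```python
import xml.etree.ElementTree as ET


def build_sample_attributes(pairs):
    """Return a SAMPLE_ATTRIBUTES element with one TAG/VALUE pair per entry.

    Element names are an assumption modelled on the PROJECT_ATTRIBUTES
    example on this page; check them against the submission schema.
    """
    root = ET.Element("SAMPLE_ATTRIBUTES")
    for tag, value in pairs.items():
        attribute = ET.SubElement(root, "SAMPLE_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = value
    return root


# Hypothetical field names and values, for illustration only.
example_pairs = {
    "example_field_1": "example value 1",
    "example_field_2": "example value 2",
}

sample_attributes = build_sample_attributes(example_pairs)
print(ET.tostring(sample_attributes, encoding="unicode"))
```

Keeping the checklist fields in a plain mapping makes it straightforward to fill the same structure from a spreadsheet or tabulated file before serialising it.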
We present this checklist as a living document that we expect to edit and update over time in response to emerging methods and practices and to community feedback, which we welcome at firstname.lastname@example.org.
Both interactive and programmatic tools (Webin) are available to aid the submission of data to ENA. General instructions on submitting raw data alone are here, instructions on programmatic XML submissions are here, and details on programmatic tabulated submissions are here. If you would like to use the ENA interactive submission tool, please refer to the page here. For submissions of assembly information together with raw read data, please refer here. We welcome submission enquiries and requests for assistance at email@example.com.
Project registration and submission of reads and samples
The read domain Webin interactive submission tool (see here) should be used to register a project and submit both raw sequencing reads and samples. Genome-scale pathogen studies of interest to the Global Microbial Identifier initiative should be flagged with the study keyword ‘GMI:part of GMI’. This can be done either via the interactive submission tool or programmatically. The screenshot below shows where on the study page this can be done:
When submitting programmatically, add the following tag-value attribute pair to the submitted XML files:
<PROJECT_ATTRIBUTES>
  <PROJECT_ATTRIBUTE>
    <TAG>study keyword</TAG>
    <VALUE>GMI:part of GMI</VALUE>
  </PROJECT_ATTRIBUTE>
</PROJECT_ATTRIBUTES>
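For submitters who generate their XML with a script, the attribute block above could be produced and checked along the following lines. This is a minimal sketch using only Python's standard library; the function names are hypothetical, and the surrounding project XML is not shown here and must follow the programmatic submission instructions.

```python
import xml.etree.ElementTree as ET

GMI_TAG = "study keyword"
GMI_VALUE = "GMI:part of GMI"


def gmi_project_attributes():
    """Build the PROJECT_ATTRIBUTES block carrying the GMI study keyword."""
    root = ET.Element("PROJECT_ATTRIBUTES")
    attribute = ET.SubElement(root, "PROJECT_ATTRIBUTE")
    ET.SubElement(attribute, "TAG").text = GMI_TAG
    ET.SubElement(attribute, "VALUE").text = GMI_VALUE
    return root


def has_gmi_keyword(xml_text):
    """Return True if any PROJECT_ATTRIBUTE in xml_text carries the GMI keyword."""
    root = ET.fromstring(xml_text)
    for attribute in root.iter("PROJECT_ATTRIBUTE"):
        tag = (attribute.findtext("TAG") or "").strip()
        value = (attribute.findtext("VALUE") or "").strip()
        if tag == GMI_TAG and value == GMI_VALUE:
            return True
    return False


block = ET.tostring(gmi_project_attributes(), encoding="unicode")
print(block)
print(has_gmi_keyword(block))
```

A check such as has_gmi_keyword can be run over prepared project files before submission, so that studies intended for GMI are not submitted without the keyword.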
A minimal sample description using the GMI:MDM sample standard checklist should be provided at the time of sample reporting. The project registration and sample reporting steps can be completed before reads or assembled sequences are submitted.
Genome assembly submissions of clinical, organismal and environmental samples
The European Nucleotide Archive (ENA) offers a genome assembly pipeline through the Webin framework. Although Webin is an interactive tool, the system is designed to support transfers of large data sets, and submissions of several hundred assemblies are well supported. If you are submitting genome assemblies using the GMI sample reporting standard, you can complete all the required steps in one go: registering a project (mandatory), reporting a minimal set of sample metadata (mandatory) and submitting raw reads (recommended). Alternatively, you can register your project and report the samples first, then return at a later date to submit your genome assembly sequences. Detailed instructions on genome assembly submissions and FAQs are here and here respectively.
Discovery and retrieval
Public data submitted as part of pathogen studies under the Global Microbial Identifier initiative and labelled with the keyword 'GMI:part of GMI' in ENA study records (see above) can be retrieved through the ENA search services. Where samples are reported using the GMI sample checklist, the associated sample records carry the checklist ID (ERC000029) and can also be retrieved through the ENA search services. For descriptions and IDs of the sample reporting standards currently supported by ENA, please refer to here.
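As a rough illustration of retrieval by checklist ID, the sketch below only constructs a search URL and does not contact any service. The endpoint address, the parameter names and the `checklist` query field are assumptions about the ENA search services and should be verified against their current documentation before use; the function name and the default field list are hypothetical.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify against the ENA search documentation.
ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_checklist_query_url(checklist_id="ERC000029",
                              fields=("sample_accession",),
                              limit=10):
    """Build a URL that asks for sample records carrying a checklist ID.

    The parameter names and the `checklist` query field are assumptions.
    """
    params = {
        "result": "sample",
        "query": f'checklist="{checklist_id}"',
        "fields": ",".join(fields),
        "format": "tsv",
        "limit": str(limit),
    }
    return f"{ENA_SEARCH_URL}?{urlencode(params)}"


print(build_checklist_query_url())
```

Building the URL separately from fetching it makes the query easy to inspect and adjust before any request is sent.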