Standards for genome assembly submission

Genome assemblies comprise several possible layers of information, including reads, contigs, scaffolds and chromosomes (see Figure I). This document sets out the requirements for submitting genome assembly information to ENA. For details of the submission mechanisms, please refer to the ENA submission documentation.

In the figure below, three typical assembly processes are illustrated, along with the layers of information that they each yield: A) clone-based assembly with scaffolding and finishing steps; B) shotgun assembly direct to chromosomes; and C) partial assembly to contigs only.

[Figure I: Assembly layers]

New genome assembly submissions

Requirements for new genome assembly submissions are listed in the table below.

| Component | Level | Comment |
| --- | --- | --- |
| Reads | Recommended | Complete read and quality data |
| Read to contig mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to contigs |
| Read to chromosome mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to new chromosome |
| Contigs | At least one layer mandatory | |
| Scaffolds | At least one layer mandatory | |
| Chromosomes | At least one layer mandatory | |
| Scaffold to chromosome mapping | Mandatory if both layers are present | e.g. AGP file |
| Contig to scaffold mapping | Mandatory if both layers are present | e.g. AGP file |
| Assembly description | Mandatory | Brief information relating to assembly and future plans |
| Functional annotation | Optional | |
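The mandatory rows of the table above can be read as a small set of checks. The following is a minimal sketch of that reading, not an ENA tool: the `Submission` class, its field names and the `check_new_submission` function are all hypothetical and only mirror the table rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    # Hypothetical flags, one per relevant row of the requirements table.
    contigs: bool = False
    scaffolds: bool = False
    chromosomes: bool = False
    contig_to_scaffold_map: bool = False
    scaffold_to_chromosome_map: bool = False
    description: bool = False


def check_new_submission(s: Submission) -> List[str]:
    """Return the unmet mandatory requirements (empty list if none)."""
    problems = []
    # At least one of the three sequence layers must be present.
    if not (s.contigs or s.scaffolds or s.chromosomes):
        problems.append("at least one of contigs, scaffolds or chromosomes is required")
    # A mapping is mandatory only when both of the layers it connects are present.
    if s.contigs and s.scaffolds and not s.contig_to_scaffold_map:
        problems.append("contig to scaffold mapping is required when both layers are present")
    if s.scaffolds and s.chromosomes and not s.scaffold_to_chromosome_map:
        problems.append("scaffold to chromosome mapping is required when both layers are present")
    # The assembly description is always mandatory.
    if not s.description:
        problems.append("assembly description is required")
    return problems
```

For example, `check_new_submission(Submission(chromosomes=True, description=True))` returns an empty list, since a single layer needs no mapping. Reads, read mappings and functional annotation are left out because the table marks them as recommended or optional.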

 

Updating existing genome assemblies

Requirements for updates to existing genome assemblies are listed in the following table.

| Component | Level | Comment |
| --- | --- | --- |
| Reads | Recommended | Complete read and quality data |
| Read to contig mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to contigs |
| Read to chromosome mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to new chromosome |
| Contigs | At least one layer mandatory, with highest layer no lower than for existing assembly | |
| Scaffolds | At least one layer mandatory, with highest layer no lower than for existing assembly | |
| Chromosomes | At least one layer mandatory, with highest layer no lower than for existing assembly | |
| Scaffold to chromosome mapping | Mandatory if both layers are present | e.g. AGP file |
| Contig to scaffold mapping | Mandatory if both layers are present | e.g. AGP file |
| Assembly description | Mandatory | Brief information relating to assembly and future plans |
| Regenerated (or lifted-over) functional annotation | Recommended | If associated with existing assembly |
| Coding annotation mappings between old and new assemblies | Recommended where functional annotation is provided for the updated assembly | Typically through INSDC protein ID mappings |
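The extra condition for updates, that the highest layer must be no lower than for the existing assembly, can be sketched as a comparison of layer ranks. This is an illustration only: the ordering contigs < scaffolds < chromosomes is an assumption drawn from the layers described in this document, and `LAYER_ORDER`, `highest_layer` and `update_layers_ok` are hypothetical names.

```python
# Assumed ordering of layers, from lowest to highest.
LAYER_ORDER = ["contigs", "scaffolds", "chromosomes"]


def highest_layer(layers):
    """Return the highest layer name found in `layers`, or None if none is present."""
    present = [name for name in LAYER_ORDER if name in layers]
    return present[-1] if present else None


def update_layers_ok(existing, updated):
    """Check the layer rule for an update against the existing assembly."""
    new_top = highest_layer(updated)
    if new_top is None:
        # At least one layer is mandatory in any submission.
        return False
    old_top = highest_layer(existing)
    if old_top is None:
        # No known layers on the existing assembly, so nothing to compare against.
        return True
    # The update's highest layer must be no lower than the existing one.
    return LAYER_ORDER.index(new_top) >= LAYER_ORDER.index(old_top)
```

Under this reading, an assembly that currently has scaffolds could be updated with scaffolds or chromosomes as its highest layer, but not with contigs alone.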

Third party genome assemblies

Third party genome assembly submissions and updates, in which the submitting group does not hold complete ownership of the data, are subject to the existing third party data rules. These include the requirement that the new or updated genome assembly be presented in a peer-reviewed publication prior to public release from ENA.
