Standards for genome assembly submission

Genome assemblies comprise a number of possible layers of information, including reads, contigs, scaffolds and chromosomes (see figure I). This document lays out the requirements for submission of genome assembly information to ENA. For details of the mechanisms of submission, please refer here.

In the figure below, three typical assembly processes are illustrated, along with the layers of information that they each yield: A) clone-based assembly with scaffolding and finishing steps; B) shotgun assembly direct to chromosomes; and C) partial assembly to contigs only.

Figure I. Assembly layers

New genome assembly submissions

Requirements for new genome assembly submissions are listed in the table below.

Component | Level | Comment
Reads | Recommended | Complete read and quality data
Read to contig mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to contigs
Read to chromosome mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to new chromosome
Contigs | At least one layer mandatory |
Scaffolds | At least one layer mandatory |
Chromosomes | At least one layer mandatory |
Scaffold to chromosome mapping | Mandatory if both layers are present | e.g. AGP file
Contig to scaffold mapping | Mandatory if both layers are present | e.g. AGP file
Assembly description | Mandatory | Brief information relating to assembly and future plans
Functional annotation | Optional |
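The mandatory rules for a new submission can be expressed as a small pre-submission check. The following is a minimal sketch only, not an ENA tool: the Submission class, its field names and the check_new_submission function are hypothetical names introduced here for illustration, and the check covers only the mandatory and conditionally mandatory rows, not the recommended or optional ones.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """Hypothetical record of which components a submission contains."""
    contigs: bool = False
    scaffolds: bool = False
    chromosomes: bool = False
    contig_to_scaffold_mapping: bool = False
    scaffold_to_chromosome_mapping: bool = False
    assembly_description: bool = False


def check_new_submission(sub: Submission) -> List[str]:
    """Return a list of unmet mandatory rules; an empty list means none were found."""
    problems = []
    # At least one of the three sequence layers is mandatory.
    if not (sub.contigs or sub.scaffolds or sub.chromosomes):
        problems.append("at least one of contigs, scaffolds or chromosomes is required")
    # A mapping is mandatory only when both of the layers it connects are present.
    if sub.contigs and sub.scaffolds and not sub.contig_to_scaffold_mapping:
        problems.append("contig to scaffold mapping is required when both layers are present")
    if sub.scaffolds and sub.chromosomes and not sub.scaffold_to_chromosome_mapping:
        problems.append("scaffold to chromosome mapping is required when both layers are present")
    # The assembly description is always mandatory.
    if not sub.assembly_description:
        problems.append("assembly description is required")
    return problems
```

For example, a contig-only assembly with a description passes, while a submission with contigs and scaffolds but no contig to scaffold mapping is reported as incomplete.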

 

Updating existing genome assemblies

Requirements for updates to existing genome assemblies are listed in the following table.

Component | Level | Comment
Reads | Recommended | Complete read and quality data
Read to contig mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to contigs
Read to chromosome mapping | One of, as appropriate, optional | e.g. BAM alignment of reads to new chromosome
Contigs | At least one layer mandatory, with highest layer no lower than for existing assembly |
Scaffolds | At least one layer mandatory, with highest layer no lower than for existing assembly |
Chromosomes | At least one layer mandatory, with highest layer no lower than for existing assembly |
Scaffold to chromosome mapping | Mandatory if both layers are present | e.g. AGP file
Contig to scaffold mapping | Mandatory if both layers are present | e.g. AGP file
Assembly description | Mandatory | Brief information relating to assembly and future plans
Regenerated (or lifted-over) functional annotation | Recommended | If associated with existing assembly
Coding annotation mappings between old and new assemblies | Recommended where functional annotation is provided for the updated assembly | Typically through INSDC protein ID mappings
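The additional rule for updates, that the highest layer must be no lower than for the existing assembly, can be sketched as a comparison of layer ranks. This is an illustrative sketch only: the ordering contigs < scaffolds < chromosomes is an assumption based on the order in which the layers are listed in this document, and LAYER_RANK, highest_layer and update_keeps_highest_layer are hypothetical names, not part of any ENA interface.

```python
# Assumed ordering of layers, from lowest to highest (an interpretation, not an ENA definition).
LAYER_RANK = {"contigs": 1, "scaffolds": 2, "chromosomes": 3}


def highest_layer(layers):
    """Return the highest-ranked known layer name in `layers`, or None if there is none."""
    present = [name for name in layers if name in LAYER_RANK]
    if not present:
        return None
    return max(present, key=lambda name: LAYER_RANK[name])


def update_keeps_highest_layer(existing_layers, updated_layers):
    """Check that an update has at least one layer and does not drop below the existing highest layer."""
    new_top = highest_layer(updated_layers)
    if new_top is None:
        return False  # at least one layer is mandatory
    old_top = highest_layer(existing_layers)
    if old_top is None:
        return True  # nothing to compare against
    return LAYER_RANK[new_top] >= LAYER_RANK[old_top]
```

Under these assumptions, updating a chromosome-level assembly with only contigs and scaffolds would fail the check, while replacing a contig and scaffold assembly with a scaffold-only update would pass it.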

Third party genome assemblies

Third party genome assembly submissions and updates, in which the submitting group does not hold complete ownership of the data, are subject to the existing third party data rules. These include the requirement that the new or updated genome assembly be presented in a peer-reviewed publication prior to public release from ENA.
