Contig, scaffold and chromosome file formats

A genome assembly submission can contain contigs, scaffolds, chromosomes*1 and assembly description files. The goal of this document is to provide sufficient information for submitters to prepare contig, scaffold and chromosome files for submission.

*1 Chromosomes also include organelle (e.g. mitochondrion and chloroplast) and plasmid sequences.

Please contact datasubs@ebi.ac.uk if you have any questions.

Contig file formats

File format       Description
contig flat file  Contig sequences with or without functional annotation.
fasta file        Contig sequences without functional annotation.

Unlocalised contigs: An unlocalised contig is associated with a specific chromosome, but its order and orientation are unknown. An unlocalised contigs file can be submitted only if the assembly has a chromosome defined; submitting it is not mandatory.

Scaffold file formats

File format         Description
AGP file            Assembly instructions of contigs into scaffolds.
scaffold flat file  Scaffold sequences with or without functional annotation, or functional annotation for scaffolds submitted using an AGP file.
fasta file          Scaffold sequences without functional annotation.

Unlocalised scaffolds: An unlocalised scaffold is associated with a specific chromosome, but its order and orientation are unknown. An unlocalised scaffolds file can be submitted only if the assembly has a chromosome defined; submitting it is not mandatory.

Chromosome file formats

File format           Description
AGP file              Assembly instructions of contigs or scaffolds into chromosomes.
chromosome flat file  Chromosome sequences with or without functional annotation, or functional annotation for chromosomes submitted using an AGP file.
fasta file            Chromosome sequences without functional annotation.

A chromosome list file must also be submitted.

Entry name

All submitted sequences must be identified by a unique, short entry name. The entry name must not include any spaces or pipe characters ('|').
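As an illustration of the rule above, the following minimal sketch checks a list of entry names for spaces, pipe characters and duplicates before submission. The function name check_entry_names is hypothetical and not part of any ENA tool. This document does not quantify "short", so the sketch does not enforce a length limit.

```python
from collections import Counter


def check_entry_names(names):
    """Return a list of problems found in the given entry names.

    Hypothetical helper for illustration; not part of any ENA tool.
    """
    problems = []
    for name in names:
        if not name:
            problems.append("empty entry name")
        elif " " in name or "|" in name:
            problems.append(f"{name!r} contains a space or pipe character")
    # Every entry name must be unique within the submission.
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"{name!r} is used {count} times")
    return problems


if __name__ == "__main__":
    for problem in check_entry_names(["contig1", "contig 2", "scaf|3", "contig1"]):
        print(problem)
```

An empty list means that none of these particular checks failed; it does not replace validation by the submission pipeline.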

Entry name parsing rules by file format:

AGP file: The entry name is extracted from the 1st (object) column.

Fasta file: The entry name is extracted from the line starting with '>' up to but not including the first space, pipe character ('|') or end of line character.

Example:

>entry_name

Flat file: The entry name is extracted from the AC * line up to but not including the first space or end of line character. The entry name must be prefixed with a '_' when using the flat file format.

Example:

AC * _entry_name
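
The parsing rules above can be sketched as three small functions. This is only an illustration under stated assumptions, not the parser used by the submission pipeline: the function names are hypothetical, AGP columns are assumed to be tab-delimited, and the spacing around the '*' on the AC * line is treated as flexible. The flat file function returns the name as written, including the leading '_'.

```python
import re


def entry_name_from_fasta(header):
    """Text after '>' up to the first space, pipe or end of line."""
    if not header.startswith(">"):
        raise ValueError("not a fasta header line")
    name = header[1:].rstrip("\r\n")
    for stop in (" ", "|"):
        name = name.split(stop, 1)[0]
    return name


def entry_name_from_agp(line):
    """First (object) column of an AGP line (assumed tab-delimited)."""
    return line.rstrip("\r\n").split("\t", 1)[0]


def entry_name_from_flat_file(line):
    """Text after 'AC *' up to the first space or end of line."""
    # Assumption: any amount of whitespace may surround the '*'.
    match = re.match(r"AC\s+\*\s+([^ \r\n]+)", line)
    if match is None:
        raise ValueError("not an AC * line")
    name = match.group(1)
    if not name.startswith("_"):
        raise ValueError("flat file entry names must be prefixed with '_'")
    return name


if __name__ == "__main__":
    print(entry_name_from_fasta(">entry_name some description"))
    print(entry_name_from_agp("chr1\t1\t5000\t1\tW\tcontig1\t1\t5000\t+"))
    print(entry_name_from_flat_file("AC * _entry_name"))
```

The AGP line in the usage example is made up for illustration. Comment lines in a real AGP file would need to be skipped before calling entry_name_from_agp.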

Functional annotation

Functional annotation must be submitted using flat files that conform to the INSDC feature table format. We recommend that functional annotation is prepared using Artemis.

Further information about features and qualifiers is available here.

Flat file validation

Before uploading or submitting your flat files, please validate them using the ENA validator. Please note that this tool validates the feature table (annotation) section of your flat files. The flat file header, however, must be generated following the instructions in the documentation here before submitting through the genome assembly pipeline. The FAQs also detail the header format requirements of the genome pipeline.

AGP file validation

Before uploading or submitting your AGP files, please validate them using the NCBI AGP validator.

More information

Read about assembly description files or go back to the genome assembly submissions main page.