Genome assemblies

A genome assembly is a collection of data describing how multiple sequence records are assembled into a genome. The collection is currently limited mostly to eukaryota, bacteria and archaea, although some metagenomic and viral assemblies are included.

In January 2015, ENA introduced a new XML format for accessing assembly information, with the schema available here. Before this, the records were available only in the browser in HTML format. Work is ongoing to expand access to information for these records beyond what the XML currently provides. This page will be updated as new options become available.

Assembly XML

In contrast to all other ENA data types, a genome assembly can have multiple versions public simultaneously. Therefore a single XML record describes one specific version.

The root assembly element contains three attributes. These are the (versioned) accession, alias and center name. The alias is the name of the assembly provided by the submitters and the center name refers to the main submitting body. This could be a single listed organisation, the primary organisation from a supplied list of collaborators, or a consortium.
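
As a rough illustration of reading the three root attributes, the sketch below parses a made-up record with Python's standard xml.etree.ElementTree module. The element name ASSEMBLY, the attribute names accession, alias and center_name, and all values shown are assumptions for illustration only; check them against the published schema before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record. Element and attribute names are
# assumptions, and the accession and names are invented for this sketch.
SAMPLE_XML = """
<ASSEMBLY accession="GCA_000000000.1" alias="ExampleAsm_v1"
          center_name="Example Centre">
</ASSEMBLY>
"""


def read_root_attributes(xml_text):
    """Return the root attributes, splitting the versioned accession."""
    root = ET.fromstring(xml_text.strip())
    accession = root.get("accession")
    # A versioned accession is assumed to look like "<base>.<version>".
    base, _, version = accession.partition(".")
    return {
        "accession": accession,
        "base_accession": base,
        "version": int(version) if version else None,
        "alias": root.get("alias"),
        "center_name": root.get("center_name"),
    }


info = read_root_attributes(SAMPLE_XML)
print(info["base_accession"], info["version"], info["alias"])
```

Splitting the accession on the dot is useful because several versions of one assembly can be public at once, so the base accession alone does not identify a record.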

Fields

Field Description
Identifiers The identifiers block contains the primary ID (the versioned accession) and the submitter ID. The submitter ID is the assembly name provided by the submitter (also known as the alias) and the namespace is set to the center name.
Title A short description of the assembly akin to an article title.
Description A detailed description of the assembly akin to an article abstract. This is usually taken from the description of the project associated with the assembly.
Name The submitter's name for the assembly.
Assembly level The highest level of assembly for any object in the assembly, as described in the Assembly Levels section below.
Genome representation Whether the goal for the assembly was to represent the whole genome or only part of it. This is described further in the Genome representation section below.
Taxon The taxon block contains taxonomic information about the sequenced organism, comprising the NCBI taxonomic identifier (e.g. 9606 for human), scientific name and common name.
Sample ref A reference to the sample, with the biosample accession provided as the primary ID of this block.
Study ref A reference to the project, with the project accession provided as the primary ID of this block.
WGS set The WGS block provides the prefix and version (also known as build) of any Whole Genome Shotgun (WGS) set contained in the assembly.
Chromosomes A list of assembled pseudomolecules, each representing a biological replicon (chromosome, organelle, plasmid). Most of each chromosome is expected to be represented by sequenced bases, although some gaps may still be present. For each chromosome the accession, name and type are provided. The replicon types currently used by genome assemblies are:
  • Plastid
  • Kinetoplast
  • Segment
  • Apicoplast
  • Virus
  • Mitochondrial Miscellaneous
  • Plasmid
  • Nucleomorph
  • Macronucleus
  • Chloroplast
  • Mitochondrion
  • Virus Chromosome
  • Extrachromosomal Element
  • Miscellaneous
  • Provirus
  • Chromosome
  • Non-nuclear Miscellaneous
  • Chromatophore
  • Provirus Chromosome
  • Mitochondrial Plasmid
  • Linkage Group
  • Cyanelle
Assembly links A list of links to data external to the XML. This is not yet being used in the public assembly XML as of January 2015.
Assembly attributes An attribute comprises a tag element and a value element. The tag gives a standardised name for the type of data represented by the attribute, and the value gives the data itself. As of January 2015, only assembly statistics are included as assembly attributes.
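
The tag and value pairs described above map naturally onto a dictionary. The following is a minimal sketch under stated assumptions: the element names ASSEMBLY_ATTRIBUTES, ASSEMBLY_ATTRIBUTE, TAG and VALUE are guesses at the XML layout, and the record and its numbers are invented. Only the tag names (total-length, ungapped-length, n50) come from the statistics listed on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names and numbers are assumptions.
SAMPLE_XML = """
<ASSEMBLY accession="GCA_000000000.1">
  <ASSEMBLY_ATTRIBUTES>
    <ASSEMBLY_ATTRIBUTE><TAG>total-length</TAG><VALUE>1200</VALUE></ASSEMBLY_ATTRIBUTE>
    <ASSEMBLY_ATTRIBUTE><TAG>ungapped-length</TAG><VALUE>1150</VALUE></ASSEMBLY_ATTRIBUTE>
    <ASSEMBLY_ATTRIBUTE><TAG>n50</TAG><VALUE>400</VALUE></ASSEMBLY_ATTRIBUTE>
  </ASSEMBLY_ATTRIBUTES>
</ASSEMBLY>
"""


def attributes_to_dict(xml_text):
    """Collect every tag/value pair, converting whole numbers to int."""
    root = ET.fromstring(xml_text.strip())
    stats = {}
    for attribute in root.iter("ASSEMBLY_ATTRIBUTE"):
        tag = attribute.findtext("TAG")
        value = attribute.findtext("VALUE")
        if tag is None or value is None:
            continue  # skip incomplete attributes rather than guess
        value = value.strip()
        stats[tag] = int(value) if value.isdigit() else value
    return stats


stats = attributes_to_dict(SAMPLE_XML)
# Bases lost to gaps: total length minus ungapped length.
gap_bases = stats["total-length"] - stats["ungapped-length"]
print(stats, gap_bases)
```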

Assembly Levels

Level Description
contig The highest level of the primary assembly unit consists of contigs.
The contigs are available from the gc_unlocalised, gc_unplaced, gc_placed and gc_wgs_set tables. Please note that:
  • some contigs appear in the gc_wgs_set table and also in the gc_unlocalised, gc_unplaced or gc_placed tables
  • some contigs appear only in the gc_wgs_set table
  • some contigs appear only in the gc_unlocalised, gc_unplaced or gc_placed tables
  • the gc_wgs_set table contains only the WGS set prefix rather than every individual contig
scaffold The highest level of primary assembly unit consists of gapped contigs (scaffolds).
The scaffolds are available from gc_replicon, gc_unplaced and gc_unlocalised tables. Please note that scaffolds in gc_replicon may or may not have sequence accession numbers associated with them. Only scaffolds with accession numbers in the gc_replicon table are available as sequence entries.
chromosome The highest level of primary assembly unit contains chromosomes. The assembly could consist of a mixture of assembled chromosomes, unlocalised and unplaced scaffolds, or it could contain only gapless chromosomes.
complete genome Every chromosome in the assembly must be gapless, there must be no unlocalised or unplaced sequences, and the genome representation must be "full". Plasmid sequences are the exception: they can have gaps and unlocalised sequences.
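
The "complete genome" rule can be written as a small predicate. This is a sketch of the rule as worded above, not ENA's implementation: the Replicon class, its field names and the function signature are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replicon:
    """Hypothetical summary of one replicon; field names are assumptions."""
    name: str
    replicon_type: str          # e.g. "Chromosome", "Plasmid"
    gap_count: int = 0
    unlocalised_count: int = 0


def is_complete_genome(replicons, unplaced_count, representation):
    """Apply the 'complete genome' rule as described in the table above."""
    if representation != "full" or unplaced_count > 0:
        return False
    for replicon in replicons:
        if replicon.replicon_type == "Plasmid":
            continue  # plasmids may have gaps and unlocalised sequences
        if replicon.gap_count > 0 or replicon.unlocalised_count > 0:
            return False
    return True


gapless = [Replicon("1", "Chromosome"), Replicon("p1", "Plasmid", gap_count=2)]
gapped = [Replicon("1", "Chromosome", gap_count=1)]
print(is_complete_genome(gapless, 0, "full"))  # True
print(is_complete_genome(gapped, 0, "full"))   # False
```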

Genome representation

Representation Definition
full The data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly.
partial The data used to generate the assembly came from only part of the genome. Most assemblies have full genome representation; a minority have partial representation. Reasons for setting the genome representation to partial include:
  • the assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome
  • the chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes and the small sex chromosome (Y for mammals, W for birds)
  • the genome coverage in a WGS assembly is less than 1
  • the ungapped sequence length of the assembly is less than half the average for other assemblies from the same species
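
Three of the four reasons listed above are numeric and can be checked mechanically; the first depends on reading the assembly description and is left out. The sketch below is illustrative only. The function and parameter names are hypothetical, and the caller is assumed to have already excluded plasmids, organelle chromosomes and the small sex chromosome from both chromosome counts.

```python
from statistics import mean


def partial_reasons(chromosome_count, expected_chromosome_count,
                    wgs_coverage, ungapped_length, species_ungapped_lengths):
    """Return the numeric reasons (if any) that suggest 'partial'."""
    reasons = []
    if chromosome_count < expected_chromosome_count:
        reasons.append("fewer chromosomes than the expected complement")
    # wgs_coverage is None when the assembly is not a WGS assembly.
    if wgs_coverage is not None and wgs_coverage < 1:
        reasons.append("WGS genome coverage below 1")
    if species_ungapped_lengths:
        half_average = 0.5 * mean(species_ungapped_lengths)
        if ungapped_length < half_average:
            reasons.append("ungapped length below half the species average")
    return reasons


# Invented numbers, chosen so that every check fires.
flagged = partial_reasons(20, 23, 0.8, 100, [300, 500])
# Invented numbers for an assembly that passes every check.
clean = partial_reasons(23, 23, 30.0, 390, [400, 420])
print(len(flagged), len(clean))
```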

Assembly statistics

Tag Definition
total-length Total length of all top-level sequences
ungapped-length Total length of sequenced bases (minus gaps) for all top-level sequences
spanned-gaps Number of gaps within scaffolds
unspanned-gaps Number of gaps between scaffolds
replicon-count Total number of chromosomes, organelles, and plasmids in the assembly
scaffold-count Number of scaffolds, including placed, unlocalised, unplaced, alternate loci and patch scaffolds. Please note that scaffolds may include gaps and may be represented as WGS sequences, CON/AGP scaffolds or chromosomes.
n50 Scaffold N50: length such that scaffolds of this length or longer include half the bases of the assembly
scaf-n75 Scaffold N75: length such that scaffolds of this length or longer include 75% of the bases of the assembly
scaf-n90 Scaffold N90: length such that scaffolds of this length or longer include 90% of the bases of the assembly
scaf-L50 Scaffold L50: the number of scaffolds that comprise half of the bases of the assembly
count-contig Total number of contigs in the assembly. Please note that contigs do not include gaps and are represented as WGS sequences. Please also note that if the WGS sequences contain gaps then the WGS sequences themselves are scaffolds and the number of contigs will be higher than the number of WGS sequences.
contig-n50 Contig N50: length such that contigs of this length or longer include half the bases of the assembly
contig-n75 Contig N75: length such that contigs of this length or longer include 75% of the bases of the assembly
contig-n90 Contig N90: length such that contigs of this length or longer include 90% of the bases of the assembly
contig-L50 Contig L50: the number of contigs that comprise half of the bases of the assembly
count-regions Number of genomic regions that contain one or more alternate loci or patch scaffolds
count-alt-loci-units Number of alternate loci units within the assembly
count-patches Number of patch scaffolds within the assembly
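
The N50, N75, N90 and L50 definitions above all follow one pattern: sort the sequences from longest to shortest and accumulate lengths until the chosen fraction of the total is reached. The sketch below works through that on invented lengths. The function name nx_and_lx is hypothetical, and this is a reading of the definitions on this page rather than the code ENA uses to produce the statistics.

```python
def nx_and_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for sequence lengths and a fraction such as 0.5.

    Nx is the length such that sequences of this length or longer
    include the given fraction of all bases; Lx is how many sequences
    that takes.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count
    # Unreachable for fractions between 0 and 1; kept for safety.
    return ordered[-1], len(ordered)


# Invented scaffold lengths totalling 300 bases.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_and_lx(scaffold_lengths, 0.5)
n75, l75 = nx_and_lx(scaffold_lengths, 0.75)
print(n50, l50)  # 70 2
print(n75, l75)  # 40 4
```

Here the two longest scaffolds (80 + 70 = 150) already cover half of the 300 bases, so the N50 is 70 and the L50 is 2. The same function gives the contig statistics when passed contig lengths instead of scaffold lengths.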
