A genome assembly is a collection of data describing how multiple sequence records are assembled into a genome. The scope of this collection is currently limited mostly to eukaryota, bacteria and archaea, although some metagenomic and viral assemblies are included.
In January 2015, ENA introduced a new XML format for accessing assembly information, with the schema available here. Prior to this, the records were only available within the browser in HTML format. Work is ongoing to expand access to information for these records beyond what is currently available within the XML. This page will be updated as these new options become available.
In contrast to all other ENA data types, a genome assembly can have multiple versions public at the same time. A single XML record therefore describes one specific version.
The root assembly element has three attributes: the (versioned) accession, the alias and the center name. The alias is the name of the assembly provided by the submitters, and the center name refers to the main submitting body. The submitting body could be a single listed organisation, the primary organisation from a supplied list of collaborators, or a consortium.
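As a rough illustration of reading the three root attributes, the sketch below parses a made-up record with Python's standard library. The element name `ASSEMBLY`, the attribute names `accession`, `alias` and `center_name`, and the sample values are assumptions chosen for illustration; check them against the published schema. It also assumes that the version is the number after the final dot of the accession.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element name, attribute names and values are
# illustrative assumptions, not copied from a real ENA entry.
SAMPLE_XML = (
    '<ASSEMBLY accession="GCA_000000000.2" '
    'alias="ExampleAsm_v2" center_name="Example Centre"/>'
)


def read_root_attributes(xml_text):
    """Return the root attributes, splitting the versioned accession."""
    root = ET.fromstring(xml_text)
    accession = root.get("accession")
    if not accession or "." not in accession:
        raise ValueError("expected a versioned accession such as 'XXX.1'")
    # Assumption: the version is the integer after the last dot.
    base, _, version = accession.rpartition(".")
    return {
        "accession": accession,
        "base_accession": base,
        "version": int(version),
        "alias": root.get("alias"),
        "center_name": root.get("center_name"),
    }


info = read_root_attributes(SAMPLE_XML)
print(info["base_accession"], info["version"], info["alias"])
```

Because each XML record describes exactly one version, keeping the base accession and the version separate makes it easier to group the records that belong to the same assembly.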
|Identifiers||The identifiers block contains the primary ID (the versioned accession) and the submitter ID. The submitter ID is the assembly name provided by the submitter (also known as the alias) and the namespace is set to the center name.|
|Title||A short description of the assembly akin to an article title.|
|Description||A detailed description of the assembly, akin to an article abstract. This is usually obtained from the description of the project associated with the assembly.|
|Name||The submitter's name for the assembly.|
|Assembly level||The highest level of assembly for any object in the assembly as described here|
|Genome representation||Whether the goal for the assembly was to represent the whole genome or only part of it. This is further described here.|
|Taxon||The taxon block contains taxonomic information about the sequenced organism, comprising the NCBI taxonomic identifier (e.g. 9606 for human), scientific name and common name.|
|Sample ref||A reference to the sample with the biosample accession provided as the primary ID of this block|
|Study ref||A reference to the project with the project accession provided as the primary ID of this block|
|WGS set||The WGS block provides the prefix and version (also known as build) of any Whole Genome Shotgun (WGS) set contained in the assembly.|
|Chromosomes||A list of assembled pseudomolecules, each of which represents a biological replicon (chromosome, organelle, plasmid). Most of the chromosome is expected to be represented by sequenced bases, although some gaps may still be present. For each chromosome the accession, name and type are provided. The replicon types currently used by genome assemblies are:
|Assembly links||A list of links to data external to the XML. This is not yet being used in the public assembly XML as of January 2015.|
|Assembly attributes||An attribute comprises a tag element and a value element. The tag gives a standardised name for the type of data represented by the attribute, and the value gives the data itself. As of January 2015, only assembly statistics are included as assembly attributes.|
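The tag/value structure of assembly attributes maps naturally onto a dictionary. The following is a minimal sketch, assuming that each attribute is an `ASSEMBLY_ATTRIBUTE` element containing `TAG` and `VALUE` children; those element names and the sample record are assumptions for illustration, so verify them against the schema before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names and values are illustrative assumptions.
SAMPLE_XML = """
<ASSEMBLY accession="GCA_000000000.1">
  <ASSEMBLY_ATTRIBUTES>
    <ASSEMBLY_ATTRIBUTE><TAG>total-length</TAG><VALUE>1200</VALUE></ASSEMBLY_ATTRIBUTE>
    <ASSEMBLY_ATTRIBUTE><TAG>n50</TAG><VALUE>400</VALUE></ASSEMBLY_ATTRIBUTE>
  </ASSEMBLY_ATTRIBUTES>
</ASSEMBLY>
"""


def read_attributes(xml_text):
    """Collect every tag/value pair into a dict keyed by tag name."""
    root = ET.fromstring(xml_text)
    stats = {}
    for attribute in root.iter("ASSEMBLY_ATTRIBUTE"):
        tag = attribute.findtext("TAG")
        value = attribute.findtext("VALUE")
        if tag is None:
            continue  # skip malformed attributes that have no tag
        # Values are kept as strings; the caller decides how to convert them.
        stats[tag.strip()] = value.strip() if value is not None else None
    return stats


print(read_attributes(SAMPLE_XML))
```

Values are left as strings here because the source does not guarantee that every attribute is numeric.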
|contig||The highest level of the primary assembly unit consists of contigs.
The contigs are available from gc_unlocalised, gc_unplaced, gc_placed and gc_wgs_set tables. Please note that:
|scaffold||The highest level of primary assembly unit consists of gapped contigs (scaffolds).
The scaffolds are available from gc_replicon, gc_unplaced and gc_unlocalised tables. Please note that scaffolds in gc_replicon may or may not have sequence accession numbers associated with them. Only scaffolds with accession numbers in the gc_replicon table are available as sequence entries.
|chromosome||The highest level of primary assembly unit contains chromosomes. The assembly could consist of a mixture of assembled chromosomes, unlocalised and unplaced scaffolds, or it could contain only gapless chromosomes.|
|complete genome||Every chromosome in the assembly must be gapless, there must be no unlocalised or unplaced sequences, and the genome representation must be "full". Plasmid sequences are the exception: they can have gaps and unlocalised sequences.|
|full||The data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly.|
|partial||The data used to generate the assembly came from only part of the genome. Most assemblies have full genome representation with a minority being partial genome representation. Reasons for the genome representation being set to partial include:
|total-length||Total length of all top-level sequences|
|ungapped-length||Total length of sequenced bases (minus gaps) for all top-level sequences|
|spanned-gaps||Number of gaps within scaffolds|
|unspanned-gaps||Number of gaps between scaffolds|
|replicon-count||Total number of chromosomes, organelles, and plasmids in the assembly|
|scaffold-count||Number of scaffolds including placed, unlocalised, unplaced, alternate loci and patch scaffolds. Please note that scaffolds may include gaps and may be represented as WGS sequences, CON/AGP scaffolds or chromosomes.|
|n50||Scaffold N50: length such that scaffolds of this length or longer include half the bases of the assembly|
|scaf-n75||Scaffold N75: length such that scaffolds of this length or longer include 75% of the bases of the assembly|
|scaf-n90||Scaffold N90: length such that scaffolds of this length or longer include 90% of the bases of the assembly|
|scaf-L50||Scaffold L50: the number of scaffolds that comprise half of the bases of the assembly|
|count-contig||Total number of contigs in the assembly. Please note that contigs do not include gaps and are represented as WGS sequences. Please also note that if the WGS sequences contain gaps then the WGS sequences themselves are scaffolds and the number of contigs will be higher than the number of WGS sequences.|
|contig-n50||Contig N50: length such that contigs of this length or longer include half the bases of the assembly|
|contig-n75||Contig N75: length such that contigs of this length or longer include 75% of the bases of the assembly|
|contig-n90||Contig N90: length such that contigs of this length or longer include 90% of the bases of the assembly|
|contig-L50||Contig L50: the number of contigs that comprise half of the bases of the assembly|
|count-regions||Number of genomic regions that contain one or more alternate loci or patch scaffolds|
|count-alt-loci-units||Number of alternate loci units within the assembly|
|count-patches||Number of patch scaffolds within the assembly|
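The N50, N75, N90 and L50 definitions above all follow one pattern: sort the sequence lengths from longest to shortest and accumulate them until the chosen fraction of the total is reached. The sketch below shows that calculation on made-up lengths. The function name `nx_and_lx` is hypothetical, and this is not necessarily how ENA computes the published values.

```python
def nx_and_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length such that sequences of this length or longer include
    `fraction` of all bases; Lx is how many sequences that takes.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    ordered = sorted(lengths, reverse=True)  # longest first
    threshold = sum(ordered) * fraction
    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count
    # Not reachable for positive lengths, kept as a defensive fallback.
    return ordered[-1], len(ordered)


# Made-up scaffold lengths totalling 300 bases.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_and_lx(scaffold_lengths, 0.5)
n75, l75 = nx_and_lx(scaffold_lengths, 0.75)
print(n50, l50)  # 70 2
print(n75, l75)  # 40 4
```

The same function applies to contig statistics when given contig lengths instead of scaffold lengths.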