ENA data formats

ENA organises its data into three tiers. The tiers provide a level of abstraction from the underlying infrastructure, which results from the integration of three legacy databases: the EMBL Nucleotide Sequence Database (EMBL-Bank), the Trace Archive and the Sequence Read Archive (SRA). The three ENA data tiers are:

  • Reads: sequencing machine output including base and colour calls, call qualities and signals.

  • Assembly: information relating overlapping fragmented sequence reads to contigs and higher order structures representing complete biological molecules, such as chromosomes.

  • Annotation: interpretations of biological function projected onto coordinate-defined regions of assembled sequence.

This page describes how these three data tiers are further broken down into domains and data classes, and gives examples of each data format.

ENA data domains

Data from the ENA tiers are organised into domains. Each domain typically belongs to a single data tier, but in some cases is included in multiple tiers. The data domains within ENA are:

  • Assembly: information describing the construction of reads and sequence contigs into higher order scaffolds and chromosomes.

  • Sequence: assembled reads, optionally annotated.

  • Coding: a virtual domain* comprising sequence regions reported by data providers as being protein-coding regions.

  • Non-coding: a virtual domain* comprising sequence regions reported by data providers as representing non-protein-coding (RNA) genes.

  • Marker: a virtual domain* comprising information relating to phylogenetic, identification and molecular ecology marker data.

  • Analysis: derived data forms, such as recalibrated aligned reads and metabarcoding identifications.

  • Read: raw sequencing reads from next generation platforms.

  • Trace: raw sequencing data from capillary platforms.

  • Taxon: information relating to the organism that was the source of the sequenced biological sample.

  • Sample: information relating to the biological sample studied in the sequencing experiment.

  • Study: information relating to the scope of the sequencing effort. Also known as 'Project', a study primarily serves to unite content otherwise dispersed across the ENA domains.

  • (Submission): an accessory domain that serves to package submitted data; while useful for submitter-ENA communications, this domain has no lasting use beyond a submission transaction.

*Virtual domains represent searchable and retrievable views of ENA data. Data in these domains are submitted as part of other domains from which the views are ultimately created.
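
As an illustration only, the domain list above could be modelled as a small lookup structure. The following is a minimal Python sketch; the `Domain` class, its `virtual` and `accessory` flags and the helper function are hypothetical names chosen for this example and are not part of any ENA interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """One ENA data domain (hypothetical model, for illustration only)."""
    name: str
    virtual: bool = False    # a view built from data submitted in other domains
    accessory: bool = False  # packaging domain with no lasting use (Submission)


# Domains in the order they are listed on this page.
DOMAINS = (
    Domain("Assembly"),
    Domain("Sequence"),
    Domain("Coding", virtual=True),
    Domain("Non-coding", virtual=True),
    Domain("Marker", virtual=True),
    Domain("Analysis"),
    Domain("Read"),
    Domain("Trace"),
    Domain("Taxon"),
    Domain("Sample"),
    Domain("Study"),
    Domain("Submission", accessory=True),
)


def virtual_domain_names(domains=DOMAINS):
    """Return the names of the domains marked with * (virtual domains)."""
    return [d.name for d in domains if d.virtual]


if __name__ == "__main__":
    print(virtual_domain_names())
```

Keeping the virtual flag on the domain itself makes it easy to separate domains that accept submissions directly from those that are only searchable views.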

ENA data classes

Some data domains are further subdivided into data classes. Within a data class, data are presented uniformly. Please refer to the table below for a summary of ENA data classes and supported formats. The 'Example' column contains example entries retrieved from the ENA Browser in HTML, XML, Fasta, Fastq and flatfile formats. The ENA Browser provides retrieval and visualisation functionality over ENA data and metadata, and uses REST URLs to support both interactive and programmatic access. The 'Schema' column points to the XML Schemas that describe the XML format specific to each data class. Please note that all XML documents returned by the ENA Browser are included in and validate against the ENA.root.xsd XML Schema.

Data domain | Data class | Data tier | Definition | Example | Schema
Assembly | assembly | All | A record detailing the construction of reads and sequence contigs into higher order scaffolds and chromosomes. | |
Sequence | EST | Reads | A record representing raw expressed sequence tag sequence data (no qualities) and sample/library information | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | WGS | Assembly, Annotation | A record representing data from ongoing whole genome shotgun sequencing projects with optional annotation, typically showing an intermediate level of assembly | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | GSS | Reads | A record representing genome survey sequence; single pass, single direction sequence | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | HTC | Annotation | A record representing high throughput assembled transcriptomic sequence and optional annotation | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | HTG | Annotation | A record representing high throughput assembled genomic sequence and optional annotation | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | STD | Annotation | A record representing standard annotated assembled sequence | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | CON | Assembly, Annotation | A record representing high level (contig upwards) assembly information, constructed sequence and optional annotation | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | STS | Reads | A record representing a sequence tagged site | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | PAT | Annotation | A record representing a sequence associated with a patent process | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | TPA | Assembly, Annotation | A Third PArty sequence data record | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Sequence | MGA | Annotation | Mass genome annotation, typically CAGE tag data | | Not available
Sequence | TSA | Assembly, Annotation | A Transcriptome Shotgun Assembly record | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Coding | CDS | Annotation | A record representing an annotated coding region derived from assembled sequences | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Non-coding | non-coding | Annotation | A record representing an annotated non-protein-coding region derived from assembled sequences | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
Marker | marker | Annotation | A record representing an annotated phylogenetic, identification or molecular ecology marker locus derived from assembled sequences | | ENA.embl.xsd
Analysis | Analysis | Assembly, Annotation | An Analysis contains secondary analysis results computed from the primary sequencing results. | | SRA.analysis.xsd
Read | Experiment | Reads | A record containing information about a next generation sequencing data set, covering for example library and platform information. | HTML, XML | SRA.experiment.xsd
Read | Run | Reads | A record pointing to and describing a 'Run-file' record. | HTML, XML | SRA.run.xsd
Read | Run-file | Reads | A record containing raw next generation sequence data including, for example, base calls and per-base quality scores. | Fastq, CRAM, BAM | Not available
Trace | Trace info | Reads | A record providing sequenced sample, library and machine configuration for capillary sequencing data | HTML, XML | Not available
Trace | | Reads | A record containing capillary sequence reads data, including base calls and quality scores. | Fasta, Fastq | Not available
Sample | Sample | Reads, Annotation | A Sample contains information about the sample upon which the next-generation sequencing experiments are based. | HTML, XML | SRA.sample.xsd
Taxon | Taxon | All | Information relating to the organism that served as the source of material sequenced and its classification | HTML, XML | ENA.taxonomy.xsd
Study | Study | All | Record that serves to unite content otherwise dispersed across ENA, typically into read, assembly, transcriptome and targeted locus studies, etc. | HTML, XML | ENA.project.xsd, SRA.study.xsd
Submission | Submission | Reads | A record containing submission and update transaction details for the use of submitters during communication with ENA. | HTML, XML | SRA.submission.xsd
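
To show how the table above might be used programmatically, here is a minimal Python sketch that records the 'Example' formats for a handful of data classes and builds a retrieval URL for a record. The format lists are copied from the table. Everything else is an assumption made for illustration: the base URL and path segments in `ASSUMED_BASE_URL` and `ASSUMED_PATH_SEGMENT` are not given on this page and should be checked against the current ENA Browser documentation, the function names are hypothetical, and `ACC000001` is a placeholder rather than a real accession.

```python
from urllib.parse import quote

# Formats listed in the 'Example' column for a subset of data classes.
EXAMPLE_FORMATS = {
    "EST": ("Fasta", "Flatfile", "HTML", "XML"),
    "CDS": ("Fasta", "Flatfile", "HTML", "XML"),
    "Experiment": ("HTML", "XML"),
    "Run": ("HTML", "XML"),
    "Run-file": ("Fastq", "CRAM", "BAM"),
    "Sample": ("HTML", "XML"),
    "Study": ("HTML", "XML"),
}

# ASSUMPTION: this URL layout is not taken from this page; verify it
# against the ENA Browser documentation before relying on it.
ASSUMED_BASE_URL = "https://www.ebi.ac.uk/ena/browser/api"
ASSUMED_PATH_SEGMENT = {"fasta": "fasta", "flatfile": "embl", "xml": "xml"}


def supports(data_class, fmt):
    """Return True if the table lists `fmt` as an example format for `data_class`."""
    listed = EXAMPLE_FORMATS.get(data_class, ())
    return fmt.lower() in {f.lower() for f in listed}


def build_record_url(accession, fmt, base_url=ASSUMED_BASE_URL):
    """Build a retrieval URL for one record under the assumed URL layout."""
    try:
        segment = ASSUMED_PATH_SEGMENT[fmt.lower()]
    except KeyError:
        raise ValueError(f"no assumed URL pattern for format {fmt!r}") from None
    # Percent-encode the accession so it is safe to place in a URL path.
    return f"{base_url.rstrip('/')}/{segment}/{quote(accession, safe='')}"


if __name__ == "__main__":
    # ACC000001 is a placeholder accession, not a real record.
    if supports("EST", "Fasta"):
        print(build_record_url("ACC000001", "Fasta"))
```

Checking `supports` before building a URL avoids requesting a format that the table does not list for that data class, for example Fasta for an Experiment record.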
