ENA data formats
ENA data tiers
Data tiers within ENA provide a level of abstraction from the underlying infrastructure that has resulted from the integration of three databases: the EMBL Nucleotide Sequence Database (EMBL-Bank), the Trace Archive and the Sequence Read Archive (SRA). The three ENA data tiers are:
- Reads: sequencing machine output including base and colour calls, call qualities and signals. Next-generation sequencing reads are submitted to the Sequence Read Archive (SRA).
- Assembly: information relating overlapping fragmented sequence reads to contigs and higher order structures representing complete biological molecules, such as chromosomes.
- Annotation: interpretations of biological function projected onto coordinate-defined regions of assembled sequence.
ENA data types
Data from the ENA tiers are organised into data types, which are further subdivided into data classes. Each data class typically belongs to a single data tier, but some data classes belong to multiple tiers. Also associated with the data tiers are a number of auxiliary data types that provide integration across ENA and expand the information content of the records they are linked to. These auxiliary data types are:
- Sample: information relating to the biological sample studied in the sequencing experiment.
- Taxon: information relating to the organism that was the source of the sequenced biological sample.
- Project: information relating to the scope of the sequencing effort. The primary use of projects is to unite content otherwise dispersed across the ENA data classes.
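The relationship described above, where data types are subdivided into data classes and each class belongs to one or more tiers, can be sketched as a small data model. This is only an illustration: the names `Tier`, `DataClass`, `CLASSES` and `classes_in_tier` are hypothetical and not part of any ENA software, and the entries are a handful of rows copied from the data class table in this document.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Tier(Flag):
    """The three ENA data tiers; a Flag so one class can sit in several tiers."""
    READS = auto()
    ASSEMBLY = auto()
    ANNOTATION = auto()


@dataclass(frozen=True)
class DataClass:
    data_type: str  # e.g. "SRA", "Trace", "EMBL-Bank"
    name: str       # e.g. "Experiment", "WGS"
    tiers: Tier     # one or more tiers combined with |


# A few rows taken from the data class table (illustrative subset only).
CLASSES = [
    DataClass("SRA", "Experiment", Tier.READS),
    DataClass("SRA", "Analysis", Tier.ASSEMBLY | Tier.ANNOTATION),
    DataClass("EMBL-Bank", "EST", Tier.READS),
    DataClass("EMBL-Bank", "WGS", Tier.ASSEMBLY | Tier.ANNOTATION),
    DataClass("EMBL-Bank", "STD", Tier.ANNOTATION),
]


def classes_in_tier(tier: Tier) -> list[str]:
    """Return the names of data classes that belong to the given tier."""
    return [c.name for c in CLASSES if tier & c.tiers]


print(classes_in_tier(Tier.ANNOTATION))
print(classes_in_tier(Tier.READS))
```

Using a flag enumeration keeps multi-tier classes such as WGS in a single record instead of duplicating them once per tier.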
ENA data classes
Data is presented uniformly within each ENA data class. Please refer to the table below for a summary of ENA data classes and supported formats. The 'Example' column contains example entries retrieved from the ENA Browser in HTML, XML, Fasta, Fastq and flatfile formats. The ENA Browser provides retrieval and visualisation functionality over ENA data and metadata and uses REST URLs to support both interactive and programmatic access. The 'Schema' column points to the XML Schemas that describe the data class specific XML formats. Please note that all XML documents returned by the ENA Browser are included in and validate against the ENA.root.xsd XML Schema.
| Data type | Data class | Data tier | Definition | Example | Schema |
|---|---|---|---|---|---|
| SRA | Experiment | Reads | An SRA Experiment contains information about a next-generation sequencing experiment. | HTML, XML | SRA.experiment.xsd |
| SRA | Study | Reads | An SRA Study contains information about the next-generation sequencing project. | HTML, XML | SRA.study.xsd |
| SRA | Run | Reads | An SRA Run contains the next-generation sequencing results from sequencing experiments and is associated with SRA data. | HTML, XML | SRA.run.xsd |
| SRA | Submission | Reads | An SRA Submission contains submission actions to be performed by the archive. | HTML, XML | SRA.submission.xsd |
| SRA | Analysis | Assembly, Annotation | An SRA Analysis contains secondary analysis results computed from the primary sequencing results. | | SRA.analysis.xsd |
| SRA | Sample | Reads, Annotation | An SRA Sample contains information about the sample upon which the next-generation sequencing experiments are based. | HTML, XML | SRA.sample.xsd |
| SRA | Reads | Reads | Next-generation sequence reads that must include base and colour calls and may include call qualities and signals. | Fastq | Not available |
| Trace | Trace info | Reads | Details of sequenced sample, library and machine configuration for capillary sequencing data. | HTML, XML | Not available |
| Trace | Reads | Reads | Capillary sequence reads that include base calls and quality scores. | Fasta, Fastq | Not available |
| EMBL-Bank | EST | Reads | Raw expressed sequence tag sequence data (no qualities) and sample/library information. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | WGS | Assembly, Annotation | Data from ongoing whole genome shotgun sequencing projects with optional annotation, typically showing an intermediate level of assembly. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | GSS | Reads | Genome survey sequence; single pass, single direction sequence. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | HTC | Annotation | High throughput assembled transcriptomic sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | HTG | Annotation | High throughput assembled genomic sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | STD | Annotation | Standard annotated assembled sequence. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | CON | Assembly, Annotation | High level (contig upwards) assembly information, constructed sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | STS | Reads | Sequence tagged site. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | PAT | Annotation | Patent records. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | TPA | Assembly, Annotation | Third Party Annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | TSA | Assembly, Annotation | Transcriptome Shotgun Assembly. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | CDS | Annotation | Annotated coding region derived from the STD, WGS, HTC, HTG, CON, PAT and TPA data classes. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd |
| EMBL-Bank | MGA | Annotation | Mass genome annotation, typically CAGE tag data. | | Not available |
| Taxon | Taxon | All | Information relating to the organism that served as the source of material sequenced and its classification. | HTML, XML | ENA.taxonomy.xsd |
| Project | Project | All | Record that serves to unite content otherwise dispersed across ENA, typically into genome, transcriptome, targeted locus studies, etc. | HTML, XML | ENA.project.xsd |
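For programmatic use, the table above can be treated as a lookup from data class to supported formats and XML Schema. The following is a minimal sketch under stated assumptions: `FORMATS`, `classes_supporting` and `schema_for` are hypothetical names introduced here for illustration, only a few rows are transcribed, and `None` stands for a schema listed as "Not available".

```python
# Supported retrieval formats and schema per (data type, data class),
# transcribed from a subset of the rows in the table above.
# None means the table lists the schema as "Not available".
FORMATS = {
    ("SRA", "Experiment"): ({"HTML", "XML"}, "SRA.experiment.xsd"),
    ("SRA", "Reads"): ({"Fastq"}, None),
    ("Trace", "Reads"): ({"Fasta", "Fastq"}, None),
    ("EMBL-Bank", "EST"): ({"Fasta", "Flatfile", "HTML", "XML"}, "ENA.embl.xsd"),
    ("Taxon", "Taxon"): ({"HTML", "XML"}, "ENA.taxonomy.xsd"),
}


def classes_supporting(fmt):
    """Return sorted (data type, data class) pairs that offer the given format."""
    return sorted(key for key, (formats, _schema) in FORMATS.items() if fmt in formats)


def schema_for(data_type, data_class):
    """Return the XML Schema name for a data class, or None if not available."""
    return FORMATS[(data_type, data_class)][1]


print(classes_supporting("Fastq"))
print(schema_for("EMBL-Bank", "EST"))
```

Keying on the (data type, data class) pair matters because the class name "Reads" appears under both the SRA and Trace data types with different supported formats.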

