ENA data formats

ENA data tiers

Data tiers within ENA provide a level of abstraction from the underlying infrastructure that has resulted from the integration of three databases: the EMBL Nucleotide Sequence Database (EMBL-Bank), the Trace Archive and the Sequence Read Archive (SRA). The three ENA data tiers are:

  • Reads: sequencing machine output including base and colour calls, call qualities and signals. Next-generation sequencing reads are submitted to the Sequence Read Archive (SRA).

  • Assembly: information relating overlapping fragmented sequence reads to contigs and higher order structures representing complete biological molecules, such as chromosomes.

  • Annotation: interpretations of biological function projected onto coordinate-defined regions of assembled sequence.
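As an illustration of what Reads tier content looks like in text form, the sketch below parses Fastq records, in which each read occupies four lines: a header starting with '@', the base calls, a separator starting with '+', and one quality character per base. It assumes the common Phred+33 quality encoding and single-line sequences; the names Read and parse_fastq are illustrative only and are not part of any ENA tool.

```python
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    bases: str
    qualities: List[int]  # one Phred score per base call


def parse_fastq(text: str, offset: int = 33) -> Iterator[Read]:
    """Yield reads from four-line Fastq records.

    Assumes Phred+33 quality encoding (offset=33) and that each sequence
    and quality string sits on a single line.
    """
    # Drop blank lines so trailing newlines do not break the 4-line grouping.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("Fastq input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, bases, plus, quals = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"{header}: base and quality lengths differ")
        # Convert each quality character to its numeric Phred score.
        yield Read(header[1:], bases, [ord(c) - offset for c in quals])


# Made-up record for demonstration only.
example = "@read1\nACGT\n+\nII5!\n"
reads = list(parse_fastq(example))
```

Each yielded Read pairs the base calls with their per-base qualities, which is the minimum information the Reads tier is described as carrying.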

ENA data types

Data from the ENA tiers are organised into data types, which are further subdivided into data classes. Each data class typically belongs to a single data tier, but some data classes are included in multiple tiers. Also associated with the data tiers are a number of auxiliary data types that provide integration across ENA and serve to expand the information content of their particular parts. These auxiliary data types are:

  • Sample: information relating to the biological sample studied in the sequencing experiment.

  • Taxon: information relating to the organism that was the source of the sequenced biological sample.

  • Project: information relating to the scope of the sequencing effort. The primary use of the projects is to unite content otherwise dispersed across the ENA data classes.

ENA data classes

Data is presented uniformly within each ENA data class. Please refer to the table below for a summary of ENA data classes and supported formats. The 'Example' column contains example entries retrieved from the ENA Browser in HTML, XML, Fasta, Fastq and flatfile formats. The ENA Browser provides retrieval and visualisation functionality over ENA data and metadata and uses REST URLs to support both interactive and programmatic access. The 'Schema' column points to the XML Schemas that describe the data class specific XML formats. Please note that all XML documents returned by the ENA Browser are included in and validate against the ENA.root.xsd XML Schema.
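For programmatic access, a REST URL can be assembled from an accession and the desired output format. The sketch below shows one way to do this. The URL template, the host browser.example.org, the display parameter name, the lowercase format tokens and the accession ACC0001 are all placeholders assumed for illustration; the actual ENA Browser address and parameters are not specified here and should be substituted.

```python
from urllib.parse import quote

# Hypothetical template: the real ENA Browser address and parameter
# names are not given in this document and must be substituted.
URL_TEMPLATE = "https://browser.example.org/view/{accession}?display={display}"

# Formats named in this section; the lowercase tokens are assumptions.
DISPLAY_FORMATS = {"html", "xml", "fasta", "fastq", "flatfile"}


def build_record_url(accession: str, display: str = "xml",
                     template: str = URL_TEMPLATE) -> str:
    """Return a retrieval URL for one record in the requested format."""
    display = display.lower()
    if display not in DISPLAY_FORMATS:
        raise ValueError(f"unsupported display format: {display!r}")
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    # Percent-encode the accession so it is safe inside a URL path.
    return template.format(accession=quote(accession, safe=""),
                           display=display)


# "ACC0001" is a made-up accession used only to show the call.
url = build_record_url("ACC0001", "fasta")
```

Keeping the template as a parameter means the same function works unchanged once the real base address is known.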

 

Data type | Data class | Data tier | Definition | Example | Schema
SRA | Experiment | Reads | An SRA Experiment contains information about a next-generation sequencing experiment. | HTML, XML | SRA.experiment.xsd
SRA | Study | Reads | An SRA Study contains information about the next-generation sequencing project. | HTML, XML | SRA.study.xsd
SRA | Run | Reads | An SRA Run contains the next-generation sequencing results from sequencing experiments and is associated with SRA data. | HTML, XML | SRA.run.xsd
SRA | Submission | Reads | An SRA Submission contains submission actions to be performed by the archive. | HTML, XML | SRA.submission.xsd
SRA | Analysis | Assembly, Annotation | An SRA Analysis contains secondary analysis results computed from the primary sequencing results. |  | SRA.analysis.xsd
SRA | Sample | Reads, Annotation | An SRA Sample contains information about the sample upon which the next-generation sequencing experiments are based. | HTML, XML | SRA.sample.xsd
SRA | Reads | Reads | Next-generation sequence reads that must include base and colour calls and may include call qualities and signals. | Fastq | Not available
Trace | Trace info | Reads | Details of sequenced sample, library and machine configuration for capillary sequencing data. | HTML, XML | Not available
Trace | Reads | Reads | Capillary sequence reads that include base calls and quality scores. | Fasta, Fastq | Not available
EMBL-Bank | EST | Reads | Raw expressed sequence tag sequence data (no qualities) and sample/library information. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | WGS | Assembly, Annotation | Data from ongoing whole genome shotgun sequencing projects with optional annotation, typically showing an intermediate level of assembly. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | GSS | Reads | Genome survey sequence; single pass, single direction sequence. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | HTC | Annotation | High throughput assembled transcriptomic sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | HTG | Annotation | High throughput assembled genomic sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | STD | Annotation | Standard annotated assembled sequence. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | CON | Assembly, Annotation | High level (contig upwards) assembly information, constructed sequence and optional annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | STS | Reads | Sequence tagged site. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | PAT | Annotation | Patent records. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | TPA | Assembly, Annotation | Third Party Annotation. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | TSA | Assembly, Annotation | Transcriptome Shotgun Assembly. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | CDS | Annotation | Annotated coding region derived from the STD, WGS, HTC, HTG, CON, PAT and TPA data classes. | Fasta, Flatfile, HTML, XML | ENA.embl.xsd
EMBL-Bank | MGA | Annotation | Mass genome annotation, typically CAGE tag data. |  | Not available
Taxon | Taxon | All | Information relating to the organism that served as the source of material sequenced and its classification. | HTML, XML | ENA.taxonomy.xsd
Project | Project | All | Record that serves to unite content otherwise dispersed across ENA, typically into genome, transcriptome, targeted locus studies, etc. | HTML, XML | ENA.project.xsd
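The relationship between data classes and tiers in the table above can be captured as a small lookup. The sketch below transcribes only a handful of EMBL-Bank rows and shows how to query which classes fall in a tier and which span more than one. The names DATA_CLASS_TIERS, classes_in_tier and multi_tier_classes are illustrative and are not part of any ENA software.

```python
from typing import Dict, FrozenSet, List

# A few rows transcribed from the table above (data class -> data tiers).
# This is a partial, illustrative subset, not the full table.
DATA_CLASS_TIERS: Dict[str, FrozenSet[str]] = {
    "EST": frozenset({"Reads"}),
    "GSS": frozenset({"Reads"}),
    "STD": frozenset({"Annotation"}),
    "WGS": frozenset({"Assembly", "Annotation"}),
    "CON": frozenset({"Assembly", "Annotation"}),
    "TSA": frozenset({"Assembly", "Annotation"}),
}


def classes_in_tier(tier: str) -> List[str]:
    """Return the data classes (sorted by name) that belong to a tier."""
    return sorted(name for name, tiers in DATA_CLASS_TIERS.items()
                  if tier in tiers)


def multi_tier_classes() -> List[str]:
    """Return the data classes that are included in more than one tier."""
    return sorted(name for name, tiers in DATA_CLASS_TIERS.items()
                  if len(tiers) > 1)


assembly_classes = classes_in_tier("Assembly")
spanning_classes = multi_tier_classes()
```

Storing the tiers as a set per class keeps the common single-tier case and the multi-tier case in the same shape, so both queries are one-line filters.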