What is the European Nucleotide Archive?

The European Nucleotide Archive (ENA) provides a comprehensive, accessible and publicly available repository for nucleotide sequence data. The ENA attracts users from a multitude of research disciplines and serves as an underlying data infrastructure for other EBI services, including Ensembl, Ensembl Genomes, UniProt and ArrayExpress. Data submitted to the ENA are validated by automated quality checking and, where possible, manual inspection and curation.

The ENA grew out of the EMBL Data Library, which was established at EMBL Heidelberg in the early 1980s and later renamed the EMBL Nucleotide Sequence Database (EMBL-Bank). Originally a primary database for assembled and annotated sequences, the ENA’s remit has expanded enormously in response to advances in sequencing technology and the broad applications of sequence data. The ENA now also incorporates raw data from electrophoresis-based sequencing machines and raw reads from next-generation sequencing platforms. By consolidating these three tiers (raw reads, assemblies and annotation), the ENA provides access to the full range of sequencing information: from raw data, through the assembly and mapping information that relates highly fragmented raw sequence reads to contigs and higher-order structures such as scaffolds and chromosomes, to high-level functional annotation (see Figure 1).
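As a rough illustration of how the three tiers relate, the following Python sketch models reads, an assembled contig with its read alignments, and an annotated feature. It is a toy model written for this explanation only: the class names, fields and the `reads_supporting` helper are all hypothetical and do not reflect the ENA's actual schema or any ENA API.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    """Tier 1: a raw sequence read from a sequencing machine."""
    read_id: str
    bases: str


@dataclass
class Alignment:
    """Places one read at an offset within an assembled sequence."""
    read: Read
    offset: int  # 0-based start of the read within the contig


@dataclass
class Feature:
    """Tier 3: functional annotation over a span of an assembly."""
    kind: str   # e.g. "gene" or "regulatory region"
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


@dataclass
class Contig:
    """Tier 2: an assembled sequence plus the reads aligned to it."""
    contig_id: str
    sequence: str
    alignments: list[Alignment] = field(default_factory=list)
    features: list[Feature] = field(default_factory=list)

    def reads_supporting(self, feature: Feature) -> list[Read]:
        """Return the reads whose aligned span overlaps the feature."""
        return [
            a.read
            for a in self.alignments
            if a.offset < feature.end
            and a.offset + len(a.read.bases) > feature.start
        ]


# Invented example data: three overlapping reads assembled into one contig.
reads = [Read("r1", "ACGTAC"), Read("r2", "TACGGA"), Read("r3", "GGATTC")]
contig = Contig(
    contig_id="c1",
    sequence="ACGTACGGATTC",
    alignments=[
        Alignment(reads[0], 0),
        Alignment(reads[1], 3),
        Alignment(reads[2], 6),
    ],
    features=[Feature("gene", 7, 12)],
)

# Walk from the annotation tier back down to the raw-read tier.
supporting = [r.read_id for r in contig.reads_supporting(contig.features[0])]
print(supporting)
```

The point of the sketch is the direction of the links: an annotation refers to a span of an assembly, and the assembly's alignments lead back to the individual reads that support that span.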

 

Why do we need the ENA?

Nucleotide sequence information is crucial to our understanding of biology, from genetics and molecular interactions through to organism-wide processes. Free access to nucleotide sequence data is therefore essential for life science research, even for basic tasks such as designing primers, comparing sequences with those in the public domain and analysing gene expression. As large-scale sequencing becomes faster and cheaper, the need to deposit, search and analyse information in a central archive that is publicly available and easily accessible continues to grow.

Figure 1. The ENA’s three-tiered data architecture. Individual sequence reads are represented with submitted assemblies and read alignments. Assemblies are annotated with features such as genes and regulatory regions.

Adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).