Where does the data come from

Sharing data - the INSDC agreement

All nucleotide sequences, including both assembled and raw data, come from direct submissions. However, ENA is not the only resource to accept nucleotide sequence data. In total, there are three major nucleotide sequence resources:

ENA (provided by EBI)
GenBank + the US Trace and Sequence Read Archives (provided by NCBI)
DDBJ + the Japanese Trace and Sequence Read Archives (provided by the National Institute of Genetics)

It is important to have all nucleotide sequence data available within each of these three resources, regardless of where it has been submitted. Therefore, the three partners formed the International Nucleotide Sequence Database Collaboration (INSDC) and agreed to exchange all sequence data on a daily basis and to provide free, unrestricted access to the data (Figure 3) (4). As a result, it does not matter to which database a sequence is submitted: all three INSDC databases will obtain the same sequence data.

Figure 3. Daily exchange of data between INSDC partners.

Even though the INSDC resources contain the same sequence data, they differ in how they organise the data, the tools they provide to analyse the data, and their links to external databases that provide supplementary information.

Sources of submitted sequence

The ENA resource accepts sequence submissions generated using any type of sequencing technology, whether it is raw sequence reads or assembled data, and with or without annotation. The data is submitted by independent researchers, large sequencing consortia and patent offices.

Figure 4. Sequencing is like making a DNA puzzle: the chromosome is fragmented into short segments (library) that can be sequenced (reads), then the data is re-assembled and annotated.

ENA contains sequence in the form of raw sequence reads, assembled data and data annotated with biological information (Figure 4). Therefore, sequence data can differ widely in both length and quality. Ideally, there should be deep coverage where each base is read several times, but this is not always possible.
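The idea of coverage can be made concrete with a little arithmetic. The sketch below uses the common approximation that mean coverage equals the total number of sequenced bases divided by the length of the target sequence. The function name and the example numbers are illustrative assumptions and are not taken from ENA.

```python
def mean_coverage(read_count: int, read_length: int, genome_length: int) -> float:
    """Approximate mean coverage: (number of reads x read length) / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_count * read_length / genome_length


# Hypothetical run: 2 million reads of 150 bases over a 5 million base genome.
print(mean_coverage(2_000_000, 150, 5_000_000))  # 60.0
```

A value of 60 means that each base is read about 60 times on average; individual regions can still be covered far less deeply than the mean suggests.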

It is important to note that there is no filtering of the data; therefore, all submitted sequence is represented, even if it is identical to that in an existing entry. This means that a certain level of sequence redundancy exists in ENA.
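If redundancy matters for an analysis, identical sequences can be grouped after download. The following is a minimal sketch, assuming the records have already been read into (identifier, sequence) pairs; the identifiers and the function name are hypothetical, and the sketch only detects exact matches, not near-identical sequences.

```python
import hashlib
from collections import defaultdict


def find_identical_sequences(records):
    """Return lists of identifiers whose sequences are exactly identical (ignoring case)."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        # Hash the normalised sequence so that long sequences are cheap to compare.
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        groups[digest].append(identifier)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records: the first two carry the same sequence.
records = [
    ("entry_1", "ATGGCCTAA"),
    ("entry_2", "atggcctaa"),
    ("entry_3", "ATGGCGTAA"),
]
print(find_identical_sequences(records))  # [['entry_1', 'entry_2']]
```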

Data quality

Some sequence and annotation validation is performed by ENA, including checking taxonomy, describing features and providing tools for identifying any vector contamination. The ENA curation team contact authors to amend data where necessary. It is important to note that, because ENA contains original sequence data, the sequence records can only be updated by the submitter (author). If an author does not correct the data, then errors can persist in the database.

UniProt, the protein sequence archive, contains useful information about the accuracy of ENA coding sequences (CDS). Most UniProt protein sequence data is derived from translations of CDS in ENA. When creating a curated UniProtKB/SwissProt protein sequence entry, UniProt curators must review all the CDS information available for a gene product and record this information in the entry (Figure 5).

Figure 5. Sections from a UniProtKB/SwissProt entry containing information on CDS in ENA.

Notes

[A] Cross-references section contains a list of entries in ENA that code for a gene product.

[B] ENA sequence entries are listed with notes on the accuracy of each sequence; these notes are compiled by UniProt curators.

[C] General annotation section contains comments about a gene product, including any cautions regarding the translated sequences.

[D] Sequence caution details any errors found by the UniProt curator in each translated sequence.
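To illustrate how a protein sequence can be derived from a CDS, the sketch below translates a nucleotide coding sequence codon by codon. It assumes the standard genetic code and a CDS that starts in frame; real entries may specify a different translation table or contain exceptions, which this sketch does not handle. The function and variable names are illustrative only.

```python
# Standard genetic code, listed in TCAG order for each of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame CDS, stopping at the first stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for start in range(0, len(cds), 3):
        # Codons containing ambiguous bases are reported as X.
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))  # MAK
```

A single wrong base in the nucleotide record can change an amino acid or introduce a premature stop, which is why curator notes on the accuracy of each CDS are valuable.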