Why should I submit my metagenomics data to ENA?


If you stored your sequence data only on your own computer, the wider scientific community would not be able to access it, so it could not be used for data mining and discovery. This is why most scientific journals, and many funding bodies, require you to submit your sequence data to a public repository. By doing so, you ensure that the sequences are archived and receive recognised accession numbers.

In addition, Next-Generation Sequencing (NGS) output files have become very large, and many institutions cannot ensure safe long-term storage of such large datasets. The large file sizes also complicate data sharing (e.g. with collaborators), as it can take a very long time to transfer data.

How long would it take to share your data?

Imagine that for your project on the human gut metagenome, you sequenced 130 samples and obtained 130 FASTQ files. Each of these files contained on average 28,000,000 sequences and had a compressed size of ~2.5 GB.

The full data set would require ~320 GB of storage in compressed form. This is equivalent to about 400 CDs, 68 DVDs or a third of the storage capacity of a 1 TB hard drive.
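The storage estimate can be reproduced with a few lines of Python. This is only a rough sketch: the media capacities (800 MB per CD, 4.7 GB per DVD, 1 TB per hard drive) are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Figures taken from the worked example in the text.
N_FILES = 130        # one FASTQ file per sample
GB_PER_FILE = 2.5    # average compressed size of one file, in GB

# Nominal capacities in GB (decimal units). These are assumptions.
MEDIA_GB = {
    "CDs (800 MB each)": 0.8,
    "DVDs (4.7 GB each)": 4.7,
    "hard drives (1 TB each)": 1000.0,
}


def total_size_gb(n_files, gb_per_file):
    """Total size of the data set in GB."""
    return n_files * gb_per_file


def media_needed(total_gb, capacity_gb):
    """Number of media units (possibly fractional) needed to hold total_gb."""
    return total_gb / capacity_gb


total = total_size_gb(N_FILES, GB_PER_FILE)
print(f"Total size: {total:.0f} GB")
for name, capacity in MEDIA_GB.items():
    print(f"  {media_needed(total, capacity):.2f} {name}")
```

The exact product of 130 files and 2.5 GB is slightly above the rounded figure quoted in the text, so the media counts printed here come out marginally higher than the rounded ones.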

From your computer, it would take more than seven hours to transfer the data even at the full line rate of a Fast Ethernet connection (100 Mbit/s), and longer in practice, but less than 2 hours to download it from ENA using their FTP server.
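As a back-of-the-envelope check of the transfer time, the sketch below converts the data set size to megabits and divides by the link speed. The `efficiency` parameter is an assumption introduced here to represent protocol overhead and network contention; a value of 1.0 gives the theoretical lower bound, and 0.8 is only an illustrative guess. The function names are hypothetical.

```python
def transfer_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to move size_gb gigabytes over a link.

    efficiency is the assumed fraction of the nominal bandwidth that is
    actually available for payload (1.0 means the theoretical line rate).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    size_mbit = size_gb * 1000 * 8  # decimal units: 1 GB = 8000 Mbit
    seconds = size_mbit / (link_mbit_per_s * efficiency)
    return seconds / 3600


def required_mbit_per_s(size_gb, hours):
    """Average throughput needed to move size_gb gigabytes in the given hours."""
    return size_gb * 1000 * 8 / (hours * 3600)


DATASET_GB = 320   # compressed size from the text
LINK_MBIT = 100    # Fast Ethernet

for eff in (1.0, 0.8):
    hours = transfer_hours(DATASET_GB, LINK_MBIT, eff)
    print(f"efficiency {eff:.0%}: {hours:.1f} hours")

rate = required_mbit_per_s(DATASET_GB, 2)
print(f"Finishing in 2 hours needs about {rate:.0f} Mbit/s on average")
```

The last line shows the average throughput implied by a download of under two hours, which is well above what a 100 Mbit/s link can deliver.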

Submitting pre-publication data

The ENA is a specialist resource with the capacity to store and share sequencing data securely. Pre-publication data can be kept confidential until it is published. Once public, the data is made accessible for downloading and mining, benefiting the whole scientific community.