Forthcoming changes to WGS and TSA sequences

19 Jan 2018 - 15:48

Over the 35 years that ENA, in all its incarnations, has been in operation, we have seen a massive increase in the volume of public sequence data. We have therefore been making changes at ENA both to ensure that we can continue providing access to this data well into the future and to improve access to sequence data for the international community.

Whole genome shotgun (WGS) and transcriptome shotgun assembly (TSA) sequences make up the bulk of all public assembled and annotated sequences. In the last release (R134), there were over 700 million WGS sequences and over 200 million TSA sequences. In contrast, the remainder of the release consisted of approximately 230 million sequences across all data classes.

Twelve months ago, we improved access to WGS and TSA sequences by introducing a collated FTP site containing the latest copy of each set, updated daily:
ftp.ebi.ac.uk/pub/databases/ena/wgs/public
ftp.ebi.ac.uk/pub/databases/ena/tsa/public

We also made suppressed WGS and TSA builds available for the first time:
ftp.ebi.ac.uk/pub/databases/ena/wgs/suppressed
ftp.ebi.ac.uk/pub/databases/ena/tsa/suppressed

The directory structure of these sites is consistent with the latest WGS release directory: the first two letters of the set prefix are used as the subdirectory name. For example,
ftp.ebi.ac.uk/pub/databases/ena/wgs/public/aa
ftp.ebi.ac.uk/pub/databases/ena/tsa/public/ga

For each set, three files are available:

  1. EMBL flat file for the set (LLLLVV.dat.gz)
  2. FASTA file for the set (LLLLVV.fasta.gz)
  3. EMBL flat file for the set master (LLLLVV.master.dat)
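
As an illustration of the layout described above, the following Python sketch builds the three expected file locations for a set. The function name set_file_paths is hypothetical. The sketch assumes that a set ID has the form LLLLVV (four letters followed by a two-digit version), that file names use the upper-case set ID, and that the subdirectory is the lower-cased first two letters of the prefix, as in the examples above.

```python
import re

# FTP root taken from the locations listed in this announcement.
FTP_ROOT = "ftp.ebi.ac.uk/pub/databases/ena"


def set_file_paths(set_id, data_class="wgs", status="public"):
    """Return the three expected file locations for a WGS or TSA set.

    Hypothetical helper: the naming rules below are assumptions drawn
    from the LLLLVV pattern and the example directories in the text.
    """
    if data_class not in ("wgs", "tsa"):
        raise ValueError("data_class must be 'wgs' or 'tsa'")
    if status not in ("public", "suppressed"):
        raise ValueError("status must be 'public' or 'suppressed'")
    # Assumption: set IDs are four letters plus a two-digit version.
    if not re.fullmatch(r"[A-Za-z]{4}\d{2}", set_id):
        raise ValueError("set_id must look like LLLLVV, e.g. CAAE01")

    set_id = set_id.upper()
    # The first two letters of the prefix name the subdirectory.
    subdir = set_id[:2].lower()
    base = f"{FTP_ROOT}/{data_class}/{status}/{subdir}/{set_id}"
    return {
        "flatfile": base + ".dat.gz",     # EMBL flat file for the set
        "fasta": base + ".fasta.gz",      # FASTA file for the set
        "master": base + ".master.dat",   # EMBL flat file for the set master
    }


for name, path in set_file_paths("CAAE01").items():
    print(name, path)
```

Passing data_class="tsa" or status="suppressed" switches to the other FTP locations listed above.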

From the next release (R135; Feb/Mar 2018), WGS and TSA sequences will be excluded from the release, and you should fetch these sequences from the FTP locations given above. This change will make WGS and TSA sequences easier to retrieve, because a single consistent location holds the latest version of each set, and it will also drastically reduce the time required to build the release.

In addition, on Monday 5th March 2018 we will be removing all individual WGS and TSA sequences from the browser. We will keep the master record and add links to the three set files described above. We have also added direct URLs that you can use if you would rather not build the FTP path yourself. For example,

  1. To get all sequences in the set in EMBL flat file format:
    https://www.ebi.ac.uk/ena/data/view/CAAE01&set=true&display=text
  2. To get all sequences in the set in FASTA format:
    https://www.ebi.ac.uk/ena/data/view/CAAE01&set=true&display=fasta

Note that these URLs will always give you a compressed (gzip) file.
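
The direct URLs above can also be used from a script. The following Python sketch builds the URL for a set and streams the compressed file to disk. It assumes that the URL pattern shown above is being served and that the response body is the gzip file itself; the function names set_url and download_set and the default output file names are hypothetical.

```python
import shutil
import urllib.request

# Base of the direct set URLs shown in this announcement.
BASE_URL = "https://www.ebi.ac.uk/ena/data/view"


def set_url(set_id, display="fasta"):
    """Build the direct URL for a whole set ('text' = EMBL flat file)."""
    if display not in ("text", "fasta"):
        raise ValueError("display must be 'text' or 'fasta'")
    return f"{BASE_URL}/{set_id}&set=true&display={display}"


def download_set(set_id, display="fasta", dest=None):
    """Stream the gzip-compressed set file to disk and return its path.

    Assumption: the server returns the compressed file as the response
    body, so it is written unchanged rather than decompressed in memory.
    """
    if dest is None:
        # Hypothetical default name, mirroring the FTP file names.
        suffix = "fasta" if display == "fasta" else "dat"
        dest = f"{set_id}.{suffix}.gz"
    with urllib.request.urlopen(set_url(set_id, display)) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Building the URL needs no network access.
print(set_url("CAAE01", "text"))
```

Calling download_set("CAAE01") would save the set as CAAE01.fasta.gz, which can then be read with Python's gzip.open(path, "rt"). Streaming to disk avoids holding a large set in memory.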

Previously you had to include the full range of the sequences within the set (e.g., https://www.ebi.ac.uk/ena/data/view/CAAE01000001-CAAE01025773&display=fa...). These new URLs will fetch the full set file directly from our FTP site, rather than fetching each sequence in the set individually as was done with the range URL. Therefore, WGS/TSA sets will be significantly faster to download using these new URLs.

Initially, only public WGS and TSA sets will be available from the browser, but suppressed sets will be added later in 2018.

Both public and suppressed WGS sets can be easily downloaded using the ENA Browser scripts available for download from GitHub:
https://github.com/enasequence/enaBrowserTools

If you have any questions or concerns regarding these upcoming changes, please contact us at datasubs@ebi.ac.uk.
