Supported read file formats

Read data can be submitted in several standard and platform-specific formats. We recommend that read data is submitted in either BAM or CRAM format. Please note that tar archives are only accepted for Oxford Nanopore native data and that reads must always be de-multiplexed into separate files prior to submission.

If you have any questions please contact datasubs@ebi.ac.uk.

Standard formats

Format         File suffix   Made available as standard Fastq
CRAM format    .cram         Yes
BAM format     .bam          Yes
Fastq format   .fastq.gz     Yes
               .fastq.bz2
               .fq.gz
               .fq.bz2
               .txt.gz
               .txt.bz2
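
As a quick pre-submission check, the suffixes listed for the standard formats can be tested in a few lines of Python. This is only a sketch: the names `STANDARD_SUFFIXES` and `standard_format_of` are hypothetical, and the check looks at the file name only, not at the file contents.

```python
# Suffixes taken from the standard formats table.
STANDARD_SUFFIXES = {
    "CRAM": (".cram",),
    "BAM": (".bam",),
    "Fastq": (".fastq.gz", ".fastq.bz2", ".fq.gz", ".fq.bz2",
              ".txt.gz", ".txt.bz2"),
}


def standard_format_of(file_name):
    """Return 'CRAM', 'BAM' or 'Fastq' based on the file suffix, else None."""
    for format_name, suffixes in STANDARD_SUFFIXES.items():
        # str.endswith accepts a tuple of candidate suffixes.
        if file_name.endswith(suffixes):
            return format_name
    return None


print(standard_format_of("a.cram"))
print(standard_format_of("reads_1.fq.gz"))
print(standard_format_of("reads.fastq"))  # uncompressed Fastq is not listed
```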

CRAM format

Submitted CRAM files must be readable with SAMtools and CRAMToolkit and the reference sequences must exist in the CRAM Reference Registry.

CRAM file names are required to end with the .cram suffix (e.g. 'a.cram').

A CRAM index (CRAI) file is created by the archive for each submitted CRAM file and is available in the same directory as the CRAM file from which it was created.

CRAM index file names start with the CRAM file name and end with the .crai suffix (e.g. 'a.cram.crai' for CRAM file 'a.cram').

The ArchiveCRAM specification outlines the requirements for BAM and CRAM submissions.

BAM format

Submitted BAM files must be readable with SAMtools and Picard.

BAM file names are required to end with the .bam suffix (e.g. 'a.bam').

The ArchiveCRAM specification outlines the requirements for BAM and CRAM submissions.

Fastq format

We recommend that read data is submitted in either BAM or CRAM format. However, single and paired reads are accepted as Fastq files that meet the following requirements:

  • Quality scores must be in Phred scale. Both ASCII and space-delimited decimal encodings of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
  • No technical reads (adapters, linkers, barcodes) are allowed.
  • Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
  • Paired reads must be split and submitted using either one or two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2'. Casava 1.8 style read names are also accepted (regular expression for these reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").
  • The first line for each read must start with '@'.
  • The base calls and quality scores must be separated by a line starting with '+'.
  • The Fastq files must be compressed using gzip or bzip2.
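
To illustrate how a Phred offset of 33 or 64 might be told apart for ASCII-encoded qualities, the Python sketch below applies a common heuristic based on the lowest character code seen. It is not the archive's own detection logic; the function name `guess_phred_offset` and the thresholds are assumptions made for illustration.

```python
def guess_phred_offset(quality_lines):
    """Guess the Phred offset (33 or 64) from ASCII quality strings.

    Heuristic (an assumption, not the archive's method): characters below
    ';' (code 59) cannot occur in offset-64 data, so their presence implies
    offset 33. If every character is '@' (code 64) or above, offset 64 is
    assumed. Anything else is reported as ambiguous (None).
    """
    lowest = min(
        (ord(ch) for line in quality_lines for ch in line.strip()),
        default=None,
    )
    if lowest is None:
        return None  # no quality characters seen
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return None


print(guess_phred_offset(["!''*((((***+))"]))
print(guess_phred_offset(["hhhhhhhhggff"]))
```

A file in which every read has very high qualities can be misclassified by this kind of check, so a practical detector would normally inspect a large number of reads before deciding.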

Example of Fastq file containing single reads:

@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
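
The structural rules for a record (an '@' header line and a '+' separator line) can be checked with a short Python sketch such as the one below. It assumes unwrapped four-line records with ASCII-encoded qualities, and it also applies the general Fastq convention that bases and qualities have the same length. The function name `check_fastq_records` is hypothetical.

```python
def check_fastq_records(lines):
    """Yield (name, bases, qualities) for each four-line Fastq record.

    Raises ValueError when a record breaks a structural rule: the header
    must start with '@', the separator must start with '+', and bases and
    qualities must have equal length (ASCII-encoded qualities assumed).
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("incomplete record: line count is not a multiple of 4")
    for i in range(0, len(rows), 4):
        header, bases, separator, qualities = rows[i:i + 4]
        number = i // 4 + 1
        if not header.startswith("@"):
            raise ValueError(f"record {number}: header must start with '@'")
        if not separator.startswith("+"):
            raise ValueError(f"record {number}: separator must start with '+'")
        if len(bases) != len(qualities):
            raise ValueError(f"record {number}: bases and qualities differ in length")
        yield header[1:], bases, qualities


sample = ["@read_name\n", "GATTA\n", "+\n", "!''*(\n"]
for record in check_fastq_records(sample):
    print(record)
```

Because the function is a generator, errors are raised while iterating over the records, not when the function is first called.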

 

Example of Fastq file containing paired reads (prior to Casava 1.8):

@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
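
For read names that use the '/1' and '/2' suffixes, a Python sketch like the following could be used to confirm that every read has both mates before submission. The helper names `split_pair_suffix` and `unpaired_names` are hypothetical, and only the '/1' and '/2' convention is handled.

```python
def split_pair_suffix(read_name):
    """Return (base_name, mate) for a read name ending in '/1' or '/2'."""
    if read_name.endswith("/1") or read_name.endswith("/2"):
        return read_name[:-2], int(read_name[-1])
    raise ValueError(f"no '/1' or '/2' suffix on read name: {read_name!r}")


def unpaired_names(read_names):
    """Return the base names that lack either the '/1' or the '/2' read."""
    mates = {}
    for name in read_names:
        base, mate = split_pair_suffix(name)
        mates.setdefault(base, set()).add(mate)
    return sorted(base for base, seen in mates.items() if seen != {1, 2})


print(unpaired_names(["read_a/1", "read_a/2", "read_b/1"]))
```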

 

With Casava 1.8 the format of the '@' line has changed and we accept this pattern too:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
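
The regular expression quoted in the Fastq requirements can be tried against a header line like the one above. The Python sketch below compiles that expression unchanged; `CASAVA_18_HEADER` and `parse_casava_header` are hypothetical names used only for this illustration.

```python
import re

# Regular expression copied from the Fastq requirements for paired reads.
CASAVA_18_HEADER = re.compile(
    r"^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) "
    r"([12]):[YN]:[0-9]*[02468]:[ACGTN]+$"
)


def parse_casava_header(line):
    """Return (spot_name, mate_number), or None if the line does not match."""
    match = CASAVA_18_HEADER.match(line.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_casava_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```

The first captured group is the part of the name shared by both reads of a pair, and the second is the read number (1 or 2).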

 

Platform specific formats

Format                      File suffix     Made available as standard Fastq   Notes
SFF format                  .sff            Yes                                Spot descriptor is required.
PacBio format               .metadata.xml   No
                            .bas.h5
                            .bax.h5
Oxford Nanopore format                      No
Complete Genomics format                    No
SOLiD csfasta/qual format   .csfasta        Yes                                Support for this format is planned to be deprecated in 2015.
                            .csfasta.gz
                            .csfasta.bz2
                            .qual
                            .qual.gz
                            .qual.bz2
Illumina qseq format                        No                                 Support for this format is planned to be deprecated in 2015.
Illumina scarf format                       No                                 Support for this format is planned to be deprecated in 2015.
SRF format (Illumina)       .srf            Yes                                Support for this format is planned to be deprecated in 2015.

SFF format

The SFF format is supported for the 454 and Ion Torrent platforms.

PacBio format

PacBio data submissions are supported in the platform-specific native format.

One run consists of *.bax.h5, *.bas.h5 and xml files. Please note that these files must not be tarred.

Oxford Nanopore format

Oxford Nanopore native data must be submitted as a single tar.gz archive containing basecalled fast5 files downloaded from Metrichor. An example directory structure for a run named XYZ:

XYZ/reads/downloads/fail/
XYZ/reads/downloads/pass/

To archive all files in the XYZ downloads directory from a Linux command line:

cd <directory containing XYZ directory>
tar -cvzf XYZ.tar.gz XYZ/reads/downloads/
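
Where a shell is not available, an equivalent archive could be produced with Python's standard tarfile module. This is a minimal sketch that assumes the directory layout shown above; the function name `archive_downloads` is hypothetical.

```python
import tarfile
from pathlib import Path


def archive_downloads(parent_dir, run_name):
    """Create RUN.tar.gz in parent_dir from RUN/reads/downloads/."""
    parent = Path(parent_dir)
    archive_path = parent / f"{run_name}.tar.gz"
    source = parent / run_name / "reads" / "downloads"
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps stored paths relative, e.g. XYZ/reads/downloads/pass/...
        archive.add(str(source), arcname=f"{run_name}/reads/downloads")
    return archive_path
```

Calling `archive_downloads(".", "XYZ")` from the directory containing XYZ writes XYZ.tar.gz next to it, with the pass and fail subdirectories stored under XYZ/reads/downloads/.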

Complete Genomics format

Complete Genomics data submissions are supported in the platform-specific native format.

The full Complete Genomics data package should be submitted including the ASM, LIB and MAP subfolders. Each data package should be submitted as a single experiment and run. Please note the data package must not be tarred or gzipped prior to submission.

SOLiD csfasta/qual format

Support for this format is planned to be deprecated in 2015.

Please note that paired reads require the title to be identical in both csfasta files in order to associate the reads into pairs. Both csfasta and qual files should be compressed using gzip or bzip2.

Illumina qseq format

Support for this format is planned to be deprecated in 2015.

Illumina qseq data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from qseq format to Fastq format prior to submission. If submitted, qseq files should be compressed using gzip or bzip2.

Illumina scarf format

Support for this format is planned to be deprecated in 2015.

Illumina scarf data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from scarf format to Fastq format prior to submission. The scarf format typically uses log-odds qualities that should be converted into Phred qualities when preparing the Fastq files. If submitted, scarf files should be compressed using gzip or bzip2.
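
The conversion from log-odds qualities to Phred qualities follows the standard relation Q_phred = 10 * log10(10^(Q_logodds / 10) + 1). The Python sketch below applies it to individual values and encodes the result with an offset of 33. The function names are hypothetical, rounding to the nearest integer is an assumption, and parsing of the scarf file itself is not shown.

```python
import math


def solexa_to_phred(solexa_quality):
    """Convert a log-odds quality to the nearest integer Phred quality."""
    phred = 10 * math.log10(10 ** (solexa_quality / 10) + 1)
    return round(phred)


def solexa_to_phred33(solexa_qualities):
    """Encode a list of log-odds qualities as a Phred+33 ASCII string."""
    return "".join(chr(solexa_to_phred(q) + 33) for q in solexa_qualities)


for q in (40, 10, 0, -5):
    print(q, solexa_to_phred(q))
```

The two scales agree closely for high qualities and diverge for low ones, which is why the conversion matters mainly for poor-quality bases.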

SRF format 

Support for this format is planned to be deprecated in 2015.

The *_seq.txt files can be converted into SRF files using the illumina2srf utility available from the DNA Sequence Read Toolkit.

Each Illumina lane should be submitted as a separate SRF file and runs should be demultiplexed prior to SRF file generation.

To produce an SRF submission file for a non-paired lane, change the working directory to the run folder and run:

illumina2srf -R -P -N <run>::: -n : -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt

 

The -R and -P options are used to exclude intensity, noise and signal data from the generated SRF files. These data series are no longer supported for new data submissions.

The recommended format for the SRF file names is <center_name>_<run>_<lane>.srf, where <center_name> is the center name abbreviation assigned to all submitters, and the <run> and <lane> are the run and the lane identifiers.

To produce an SRF submission file for a paired lane, change the working directory to the run folder and run:

illumina2srf -R -P -N <run>::: -n : -2 <cycle> -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt