Read file formats

Introduction

Sequencing reads can be submitted in several standard and platform-specific data formats ... more information.

Please note that tar archives are not accepted and that multiplexed reads must be de-multiplexed prior to submitting data to ENA.

If you have any questions regarding the submission data formats, please contact datasubs@ebi.ac.uk.

Read file formats

Generic formats

Format                          Made available as standard Fastq
CRAM format (all platforms)     Yes
BAM format (all platforms)      Yes
Fastq format (all platforms)    Yes

Platform-specific formats

Format                              Recommended                     Made available as standard Fastq
SFF format (454 and Ion Torrent)    Yes                             Yes
SOLiD csfasta/qual format           Yes                             Yes
Complete Genomics format            Yes                             No
PacBio HDF5 format                  Yes                             No
Illumina qseq format                No (please convert to Fastq)    No
Illumina scarf format               No (please convert to Fastq)    No
SRF format (Illumina)               No                              Yes


BAM format (all platforms)

The BAM format is our recommended primary sequence data submission format. All submitted BAM files must be readable with SAMtools and Picard. Currently, BAM files must be de-multiplexed prior to submission; however, we plan to accept submissions of BAM files containing reads from multiple samples shortly.

Please note that color space BAM submissions are not supported.

The ArchiveCRAM specification outlines the requirements for BAM and CRAM submissions ... more information.

Fastq format (all platforms)

Primary sequence data submissions of single and paired reads are accepted as Fastq files that meet the following requirements:

  • Quality scores must be on the Phred scale. For example, quality scores from early Solexa pipelines must be converted to use this scale. Both ASCII and space-delimited decimal encodings of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
  • No technical reads (adapters, linkers, barcodes) are allowed.
  • Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
  • Paired reads must be split and submitted using either one or two Fastq files. The read names must have a suffix identifying the first and second read of the pair, for example '/1' and '/2', or use the Casava 1.8 naming described below (regular expression for these reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").
  • The first line for each read must start with '@'.
  • The base calls and quality scores must be separated by a line starting with '+'.
  • The Fastq files must be compressed using gzip or bzip2.
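
The structural requirements above can be checked before upload with a short script. The following is a minimal Python sketch, not the validation that ENA itself runs. The function names open_fastq and check_fastq are hypothetical, and the sketch assumes four lines per record and ASCII-encoded qualities, so space-delimited decimal qualities would fail its length check.

```python
import bz2
import gzip


def open_fastq(path):
    """Open a gzip- or bzip2-compressed Fastq file as text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    raise ValueError("Fastq files must be compressed using gzip or bzip2")


def check_fastq(handle):
    """Return the number of records, raising ValueError on the first malformed one.

    Assumption: each record spans exactly four lines and qualities are ASCII encoded.
    """
    count = 0
    while True:
        header = handle.readline()
        if not header:
            return count  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        sep = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"record {count + 1}: first line must start with '@'")
        if not sep.startswith("+"):
            raise ValueError(f"record {count + 1}: separator line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
        count += 1
```

A typical call would be `with open_fastq("reads.fastq.gz") as handle: check_fastq(handle)`, where the file name is only a placeholder.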

Example of a Fastq file containing single reads:

@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
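
The quality line in the example above contains '!' (ASCII 33), which is only a valid Phred score when the offset is 33. The Python sketch below illustrates one way such an offset could be guessed from ASCII-encoded quality lines. It is an illustration under that assumption and not a description of the detection ENA performs; the function name guess_phred_offset is hypothetical.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII quality offset from the lowest character seen.

    Returns 33 when any character is below '@' (ASCII 64), because such a
    character would be a negative score under offset 64. Returns None when
    every character is '@' or above, since both offsets remain possible.
    """
    lowest = min((ord(ch) for line in quality_lines for ch in line), default=None)
    if lowest is None:
        return None  # no quality characters were supplied
    if lowest < 33:
        raise ValueError("character below '!' is not a valid ASCII quality")
    if lowest < 64:
        return 33
    return None  # ambiguous: offset 64, or offset 33 with high qualities only


example = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
print(guess_phred_offset([example]))
```

Returning None for the ambiguous case is a deliberate choice here: a file that contains only high-quality reads with offset 33 looks the same as a file with offset 64.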

Example of a Fastq file containing paired reads (prior to Casava 1.8):

@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...

where <cycle> indicates the cycle number that starts the second read.

With Casava 1.8 the format of the '@' line has changed and we accept this pattern too:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
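
To show how the two naming styles can be told apart, the Python sketch below applies the regular expression quoted in the Fastq requirements to the '@' line above, and falls back to the '/1' and '/2' suffixes otherwise. The function name read_pair_key is hypothetical, and the handling of the suffix style is a simplifying assumption.

```python
import re

# Regular expression for Casava 1.8 read names, copied from the requirements list.
CASAVA_18 = re.compile(
    r"^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) "
    r"([12]):[YN]:[0-9]*[02468]:[ACGTN]+$"
)


def read_pair_key(header):
    """Return (read name without pair marker, read index) or None if unrecognised."""
    header = header.rstrip("\r\n")
    match = CASAVA_18.match(header)
    if match:
        return match.group(1), int(match.group(2))
    # Assumption: the older style ends in exactly '/1' or '/2'.
    if header.startswith("@") and header[-2:] in ("/1", "/2"):
        return header[1:-2], int(header[-1])
    return None


print(read_pair_key("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
print(read_pair_key("@read_name/2"))
```

Two reads belong to the same pair when the first element of the returned tuple is identical and the read indexes are 1 and 2.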

SFF format (454 and Ion Torrent)

The SFF format is the recommended primary data submission format for the 454 and Ion Torrent platforms.

SOLiD csfasta/qual format

The csfasta/qual format is supported as a primary data submission format for the SOLiD platform. Please note that paired reads require the title to be identical in both csfasta files in order to associate the reads into pairs. Both csfasta and qual files should be compressed using gzip or bzip2.
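
The title requirement can be checked before submission. The Python sketch below is a rough illustration that assumes titles are the lines starting with '>' (FASTA-style) and that all other lines can be ignored. The function names and the sample titles are hypothetical.

```python
def csfasta_titles(lines):
    """Collect read titles, assumed to be the lines that start with '>'."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def unmatched_titles(first_lines, second_lines):
    """Return the sorted titles that appear in only one of the two csfasta files."""
    return sorted(csfasta_titles(first_lines) ^ csfasta_titles(second_lines))


# Hypothetical content of two csfasta files holding the two ends of each pair.
first = [">1_1_1", "T0123", ">1_1_2", "T3210"]
second = [">1_1_1", "T0000", ">1_1_3", "T1111"]
print(unmatched_titles(first, second))
```

An empty result means every title in one file has an identical title in the other, so the reads can be associated into pairs.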

PacBio HDF5 format

PacBio data submissions are supported in the PacBio HDF5 format. 

Complete Genomics format

Complete Genomics data should be submitted as the full Complete Genomics data package containing the ASM, LIB and MAP subfolders. Each data package should be submitted as a single experiment. Please note that the data package must not be tarred or gzipped prior to submission.
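
A quick completeness check of a data package can be scripted. The Python sketch below only tests that the three subfolders named above exist directly under the package folder. The function name missing_subfolders is hypothetical, and the sketch makes no claim about the contents of each subfolder.

```python
from pathlib import Path

# Subfolders that the full data package is expected to contain.
REQUIRED_SUBFOLDERS = ("ASM", "LIB", "MAP")


def missing_subfolders(package_dir):
    """Return the required subfolder names that are absent from package_dir."""
    root = Path(package_dir)
    return [name for name in REQUIRED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty list means all three subfolders are present; any names returned indicate an incomplete package.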

Illumina qseq format

We accept but do not recommend primary data submissions in Illumina qseq format. Currently, qseq data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from qseq format to Fastq format prior to submission. If submitted, qseq files should be compressed using gzip or bzip2.

Illumina scarf format

We accept but do not recommend primary data submissions in Illumina scarf format. Currently, scarf data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from scarf format to Fastq format prior to submission. Please note that scarf format typically uses log-odds qualities that should be converted into Phred qualities when preparing the Fastq files. If submitted, scarf files should be compressed using gzip or bzip2.
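
The conversion from log-odds (Solexa) qualities to Phred qualities follows the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The Python sketch below applies it to a list of integer scores and encodes the result with an offset of 33. The function names are hypothetical, and parsing of the scarf file itself is left out.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa (log-odds) quality score to the nearest integer Phred score."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def solexa_to_phred33(scores):
    """Encode Solexa scores as an ASCII Phred quality string with offset 33."""
    return "".join(chr(solexa_to_phred(q) + 33) for q in scores)


print([solexa_to_phred(q) for q in (-5, 0, 10, 40)])
```

The two scales differ most at low qualities and agree closely above a score of about 10, which is why the conversion matters mainly for poor base calls.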

SRF format (Illumina)

The SRF format continues to be supported as a historical primary data submission format for existing submitters only.

Preparing SRF files

The *_seq.txt files can be converted into SRF files using the illumina2srf utility available from the DNA Sequence Read Toolkit.

Each Illumina lane should be submitted as a separate SRF file and runs should be demultiplexed prior to SRF file generation.

To produce an SRF submission file for a non-paired lane, change the working directory to the run folder and run:

illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt

The -R and -P options exclude intensity, noise and signal data from the generated SRF files. These data series are no longer supported for new data submissions.

The recommended format for the SRF file names is <center_name>_<run>_<lane>.srf, where <center_name> is the center name abbreviation assigned to all submitters, and the <run> and <lane> are the run and the lane identifiers.

To produce an SRF submission file for a paired lane, change the working directory to the run folder and run the following command, where <cycle> is the cycle number that starts the second read:

illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -2 <cycle> -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt
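
The two commands differ only in the -2 option, so a submission script can build both from the same helper. The Python sketch below assembles the argument list using only the options shown above. The function name illumina2srf_args and the sample values in the example call are hypothetical, and the sketch does not run the tool.

```python
def illumina2srf_args(center_name, run, lane, seq_files, second_read_cycle=None):
    """Build the illumina2srf argument list for one lane.

    Pass second_read_cycle for a paired lane; leave it as None for a non-paired lane.
    """
    args = ["illumina2srf", "-R", "-P", "-N", f"{run}:%l:%t:", "-n", "%x:%y"]
    if second_read_cycle is not None:
        args += ["-2", str(second_read_cycle)]
    # Recommended output name: <center_name>_<run>_<lane>.srf
    args += ["-o", f"{center_name}_{run}_{lane}.srf"]
    args += sorted(seq_files)
    return args


# Hypothetical center name, run identifier and *_seq.txt files for lane 3.
print(illumina2srf_args("XYZ", "run7", 3, ["s_3_0002_seq.txt", "s_3_0001_seq.txt"], 37))
```

In the run folder, the file list could come from glob.glob(f"s_{lane}_*_seq.txt") and the resulting list could be passed to subprocess.run(args, check=True).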