ArchiveCRAM 1.0 specification
The ArchiveCRAM specification defines a set of additional rules on top of the BAM and CRAM formats, aimed at those submitting BAM or CRAM files to the Sequence Read Archive (SRA). Version 1.0 is the first version of the ArchiveCRAM format specification.
The SRA is operated as part of the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC (http://www.insdc.org) sets policies and goals for the partners. This document is intended to be compatible with INSDC policies.
This specification is intended to:
- Specify how the BAM and CRAM formats are supported for submitting sequence and alignment data to SRA.
- Enable submitters to validate and prepare data prior to submission to avoid unnecessary data transfers.
- Encourage technology providers to support CRAM as an output format for their analysis software.
- Improve the speed of submission processing at SRA.
- Reduce the probability of failed submissions to SRA.
- Improve other services provided by SRA by freeing up time otherwise spent correcting and transforming data.
Compressed: data will be stored in the efficiently compressed CRAM format.
Readable: submitted BAM files must be readable with SAMtools (samtools.sourceforge.net/) and Picard (picard.sourceforge.net/). Submitted CRAM files must be readable with CRAMtools.
Unindexed: submitted files need not be associated with alignment indices. SRA will build indices when appropriate.
Metadata: submitted BAM and CRAM files must be unambiguously associated with samples and reference sequences. Minimally, the reference sequences must be made retrievable using MD5 checksums through the CRAM reference repository. Optimally, the reference sequences would be submitted with related assembly and annotation information into the INSDC archives.
Bases and quality scores required: Reads must be submitted with base calls and quality scores.
Reference sequences required: With the exception of small de novo local assemblies that may be generated for reads that cannot be mapped to a reference sequence, reference sequences to which reads have been aligned must be cited using MD5 checksums and must be made available through the CRAM reference repository. Optimally, these sequences would also be submitted with related assembly and annotation information into the INSDC archives.
Mapped and unmapped reads accepted: The submitted BAM and CRAM files may contain both mapped and unmapped reads.
Ordering of mapped reads required: Mapped reads must be ordered using coordinate sort.
Quality filtered reads accepted: Reads that fail quality filtering thresholds may be submitted, but they must have the appropriate bitwise flag set.
Mixed platforms/libraries discouraged: Merging data from different platforms or libraries into a single BAM or CRAM file is discouraged.
Mixed samples discouraged: Including data from different samples in a single submitted file is discouraged.
Mixed reference assemblies discouraged: Including data from different reference assemblies in a single submitted file is discouraged.
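The ordering and quality-filter requirements above can be checked on SAM text records before submission. The following is a minimal sketch, not part of this specification. It assumes that records are available as tab-separated SAM lines (for example, as printed by `samtools view -h`), that `ref_order` lists the reference names in @SQ order, and that the "appropriate bitwise flag" for quality-filtered reads is the SAM FLAG bit 0x200. The function names are hypothetical.

```python
QC_FAIL = 0x200  # assumed SAM FLAG bit for reads not passing quality controls


def is_qc_fail(flag: int) -> bool:
    """Return True if the QC-fail bit is set in a SAM FLAG value."""
    return bool(flag & QC_FAIL)


def is_coordinate_sorted(sam_lines, ref_order):
    """Check that records are ordered by (reference index, position).

    Records whose RNAME is '*' or not listed in ref_order are expected
    to come after all records placed on a listed reference.
    """
    rank = {name: i for i, name in enumerate(ref_order)}
    previous = (-1, 0)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        rname, pos = fields[2], int(fields[3])
        key = (rank.get(rname, len(rank)), pos)
        if key < previous:
            return False
        previous = key
    return True
```

A file for which `is_coordinate_sorted` returns False would need to be re-sorted before submission.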
Submission requirements for the BAM/CRAM header section are described below.
Table 1. Submission requirements for BAM/CRAM header section.
|@HD/SO||Alignment sort order.||Must be coordinate sort for mapped reads.|
|@SQ/SN||Unique reference sequence identifier.||Mandatory for mapped reads.|
|@SQ/M5||MD5 checksum of the reference sequence.||Mandatory for mapped reads.|
|@SQ/UR||URI of the reference sequence.||Recommended for mapped reads.|
|@RG/SM||Sample name.||Mandatory for files with reads from multiple samples.|
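The requirements in Table 1 can be checked against the header text before a file is transferred. The sketch below is illustrative only: it works on SAM-style header text (for example, the output of `samtools view -H`), and the function names and the `has_mapped_reads` and `multiple_samples` parameters are hypothetical, since neither property can be determined from the header alone.

```python
def parse_header(header_text):
    """Parse SAM header text into a list of (record_type, {tag: value})."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@") or line.startswith("@CO"):
            continue  # skip blank lines and free-text comment lines
        parts = line.split("\t")
        tags = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
        records.append((parts[0][1:], tags))
    return records


def check_header(header_text, has_mapped_reads=True, multiple_samples=False):
    """Return a list of problems found; the list is empty if none are found."""
    problems = []
    records = parse_header(header_text)
    if has_mapped_reads:
        hd = [tags for kind, tags in records if kind == "HD"]
        if not hd or hd[0].get("SO") != "coordinate":
            problems.append("@HD/SO must be 'coordinate'")
        seen = set()
        for kind, tags in records:
            if kind != "SQ":
                continue
            name = tags.get("SN")
            if not name:
                problems.append("@SQ line without SN")
            elif name in seen:
                problems.append(f"duplicate @SQ/SN: {name}")
            else:
                seen.add(name)
            if "M5" not in tags:
                problems.append(f"@SQ {name}: missing M5")
    if multiple_samples:
        for kind, tags in records:
            if kind == "RG" and "SM" not in tags:
                problems.append(f"@RG {tags.get('ID')}: missing SM")
    return problems
```

The recommended @SQ/UR field is deliberately not checked here, because Table 1 does not make it mandatory.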
The following header fields are set by the archive. All other fields are preserved exactly as they were submitted.
Table 2. BAM/CRAM header section fields set by the archive.
|@SQ/LN||Reference sequence length.||Set by archive.|
|@SQ/M5||MD5 checksum of the reference sequence.||Set by archive if not provided by submitter (legacy BAM files only).|
|@SQ/UR||URI of the reference sequence.||Set by archive.|
|@PG||Program information.||Actions added by the archive.|
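Submitters who want to supply the @SQ/M5 value themselves can compute it from the reference sequence. The sketch below assumes the convention described for the M5 tag in the SAM specification: characters outside the printable ASCII range '!' to '~' are removed, the remainder is uppercased, and the MD5 digest is written as lowercase hexadecimal. That the archive applies exactly this convention is an assumption, and the function name is hypothetical. The input is the sequence only, without the FASTA header line.

```python
import hashlib


def reference_md5(sequence: str) -> str:
    """MD5 of a reference sequence, normalised as assumed for @SQ/M5."""
    # Drop whitespace, line breaks and any other character outside '!'..'~',
    # then uppercase so that soft-masked (lowercase) bases hash identically.
    cleaned = "".join(c for c in sequence if "!" <= c <= "~").upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()
```

With this normalisation, line wrapping and soft-masking in a FASTA file do not change the checksum.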
The following BAM CIGAR operations are converted to M:
- CIGAR ‘=’: Sequence match
- CIGAR ‘X’: Sequence mismatch
This information can be restored from the CRAM file.
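The conversion described above can be illustrated on CIGAR strings. This is a sketch of the general idea rather than the archive's actual implementation, and the function name is hypothetical. It rewrites '=' and 'X' as 'M' and merges neighbouring operations of the same type.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_match_ops(cigar: str) -> str:
    """Rewrite '=' and 'X' as 'M' and merge adjacent identical operations."""
    if cigar == "*":
        return cigar  # CIGAR unavailable, nothing to convert
    merged = []
    for length, op in CIGAR_OP.findall(cigar):
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)  # extend the previous run
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)
```

For example, `collapse_match_ops("5=1X4=2I3=")` returns `"10M2I3M"`. Whether a given base matched or mismatched can still be recovered by comparing the read with the reference sequence.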
All tags will be preserved by default. The archive will monitor the use of tags, and based on the cost of storage, a fair tag usage policy may be introduced resulting in some tags not being preserved by the archive.
The following tags are computable when reading the CRAM format.
Table 4. Computable tags.
|MD||String for mismatching positions.|
|NM||Edit distance to the reference.|
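To show why such tags are computable, the sketch below derives an NM-style edit distance from a read, its CIGAR string and the reference sequence. It is a simplified illustration with a hypothetical function name: it counts mismatched, inserted and deleted bases, compares bases case-insensitively, and does not give ambiguity codes any special treatment. Only NM is shown.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def edit_distance(read: str, reference: str, pos: int, cigar: str) -> int:
    """Count mismatches plus inserted and deleted bases.

    pos is the 1-based leftmost reference position of the alignment.
    """
    nm = 0
    r = 0        # 0-based offset into the read
    g = pos - 1  # 0-based offset into the reference
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            pairs = zip(read[r:r + n], reference[g:g + n])
            nm += sum(a.upper() != b.upper() for a, b in pairs)
            r += n
            g += n
        elif op == "I":
            nm += n  # inserted bases count towards the distance
            r += n
        elif op == "D":
            nm += n  # deleted bases count towards the distance
            g += n
        elif op == "S":
            r += n   # soft-clipped bases are skipped in the read
        elif op == "N":
            g += n   # skipped reference region is not counted
        # 'H' and 'P' consume neither read nor reference bases
    return nm
```

For example, with the reference `ACGTACGTAC`, a read `ACGTCGTA` aligned at position 1 with CIGAR `4M1D4M` gives an edit distance of 1, from the single deleted base.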