CRAM development principles

The efficient compression of raw sequence data is essential for the global bioinformatics technology infrastructure. Here we present our code of practice for nurturing the optimal ongoing delivery of CRAM, a mature format and software toolkit that presents sophisticated solutions to challenges in raw data compression.

Open

The more the world’s bioinformatics software tools comply with the CRAM format, the earlier and more broadly the scientific community can roll out data compression in a scientifically judicious way. CRAM is available under the permissive Apache 2.0 license, which maximises opportunities to embed and to re-use CRAM code in third-party academic and commercial applications.

We encourage developers to centre their work on the authoritative central code repository for the CRAM software toolkit: GitHub. We expect to see many re-distributions of CRAM code, and will ourselves endeavour to push new releases to major repositories (e.g. SAM-JDK and SAMTOOLS) in order to maximise their impact. By working closely with external developers, we hope to integrate and commit new CRAM code in a timely way so as to achieve the broadest possible utility.

Stable

Those who use CRAM, including EMBL-EBI developers, can engineer their diverse informatics platforms to rely on CRAM functionality. Certainly there must be a stable codebase, and changes to it must be well planned and publicised. However, CRAM serves the rapidly moving world of sequencing technology, in which stability is often challenged by new functionality.

We plan to deploy a three-level code release system:

  1. Minor point releases – These can be rolled out comparatively quickly. Functionality and format changes will be backwards-compatible and announced / documented as early as possible.
  2. Step releases – These involve significant format extensions and new functionality, but retain backwards compatibility of both the format and the software functionality. They will be announced, with candidate specifications, no later than one month in advance of release.
  3. Major step releases – New functionalities are deployed. Some elements of backwards compatibility may, unfortunately, be lost. These releases will be announced no later than two months in advance. During the ‘release incubation period’, candidate specifications will be made available for iterative rounds of feedback and updates.
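
As an illustration only, the three release levels above can be written down as a small lookup that a downstream project might use to plan around CRAM releases. This is a hypothetical sketch, not part of the CRAM toolkit: the names ReleaseLevel, MINOR, STEP, MAJOR and latest_announcement_date are invented here, and "one month" and "two months" are approximated as 30 and 60 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class ReleaseLevel:
    """One level of the release scheme described above (hypothetical model)."""
    name: str
    compatibility_guaranteed: bool   # is backwards compatibility retained?
    min_notice_days: Optional[int]   # None means "as early as possible"


# Assumption: one month ~ 30 days, two months ~ 60 days.
MINOR = ReleaseLevel("minor point release", True, None)
STEP = ReleaseLevel("step release", True, 30)
MAJOR = ReleaseLevel("major step release", False, 60)


def latest_announcement_date(level: ReleaseLevel, release_date: date) -> Optional[date]:
    """Return the last day on which the release may still be announced.

    Returns None for levels that have no fixed notice period.
    """
    if level.min_notice_days is None:
        return None
    return release_date - timedelta(days=level.min_notice_days)
```

For example, latest_announcement_date(MAJOR, date(2015, 3, 31)) gives the date by which a major step release planned for 31 March 2015 would need to be announced under these assumptions.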

Accessible

We will strive to make CRAM a useful and broadly applicable technology by enabling its integration with the wide spectrum of bioinformatics tools used by the scientific community. CRAM offers aggressive levels of data compression with minimal loss of impactful signal, but we recognise that maximal compression can come at the price of utility. There is a clear need to carefully balance support for utility (e.g. streaming, indexing, direct computational access) with sufficiently deep compression to cater for the world’s future needs.

Support

While we will try to provide CRAM user support, the reality of finite resources for the project tells us that this will at times challenge us. Our approach, therefore, will be to nurture a self-supporting user community. We will do our utmost to seed this community with hands-on training when possible and, of course, freely available training materials and documentation.

CRAM is an open tool for the bioinformatics community and we warmly encourage open discussion on our own CRAM developers mailing list, an online public community forum.
