CRAM development principles
The efficient compression of raw sequence data is essential for the global bioinformatics technology infrastructure. Here we present our code of practice for nurturing the optimal ongoing delivery of CRAM, a mature format and software toolkit that presents sophisticated solutions to challenges in raw data compression.
The more the world’s bioinformatics software tools comply with the CRAM format, the earlier and more broadly the scientific community can roll out data compression in a scientifically judicious way. CRAM is available under the permissive Apache 2.0 license, which maximises opportunities to embed and to re-use CRAM code in third-party academic and commercial applications.
We encourage developers to centre their work on the authoritative central code repository for the CRAM software toolkit: GitHub. We expect to see many re-distributions of CRAM code, and will ourselves endeavour to push new releases to major repositories (e.g. SAM-JDK and SAMTOOLS) in order to maximise their impact. By working closely with external developers, we hope to integrate and commit new CRAM code in a timely way so as to achieve the broadest possible utility.
Those who use CRAM, including EMBL-EBI developers, can engineer their diverse informatics platforms to rely on CRAM functionality. Certainly there must be a stable codebase, and changes to it must be well planned and publicised. However, CRAM serves the rapidly moving world of sequencing technology, in which stability is often challenged by new functionality.
We plan to deploy a three-level code release system:
Minor point releases – These can be rolled out comparatively quickly. Functionality and format changes will be backwards-compatible and announced / documented as early as possible.
Step releases – These involve significant format extensions and new functionalities, but retain backwards format and software functional compatibility. They will be announced, with candidate specifications, no later than one month in advance of release.
Major step releases – New functionalities are deployed. Some elements of backwards compatibility may, unfortunately, be lost. These releases will be announced no later than two months in advance. During the ‘release incubation period’, candidate specifications will be made available for iterative rounds of feedback and updates.
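For developers who want to track these commitments in their own tooling, the three levels can be written down as a small table of notice periods and compatibility guarantees. The Python sketch below is purely illustrative and is not part of the CRAM toolkit: the names ReleaseLevel, add_months, earliest_release and notice_is_sufficient are hypothetical, "one month" is read as one calendar month (an assumption), and a notice period of zero stands in for "as early as possible".

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReleaseLevel:
    """One level of the release policy (hypothetical helper type)."""
    name: str
    notice_months: int          # minimum advance notice; 0 = "as early as possible"
    backwards_compatible: bool  # False = some compatibility may be lost


# The three levels described above, encoded as data.
MINOR = ReleaseLevel("minor point release", 0, True)
STEP = ReleaseLevel("step release", 1, True)
MAJOR = ReleaseLevel("major step release", 2, False)


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_release(level: ReleaseLevel, announced: date) -> date:
    """Earliest release date that still honours the level's notice period."""
    return add_months(announced, level.notice_months)


def notice_is_sufficient(level: ReleaseLevel, announced: date, planned: date) -> bool:
    """True if the planned date leaves at least the required notice."""
    return planned >= earliest_release(level, announced)


if __name__ == "__main__":
    announced = date(2024, 1, 31)
    for level in (MINOR, STEP, MAJOR):
        print(level.name, earliest_release(level, announced).isoformat())
```

A downstream project could use a check like notice_is_sufficient to decide whether an announced release date leaves enough time to test against the candidate specification.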
We will strive to make CRAM a useful and broadly applicable technology by enabling its integration with the wide spectrum of bioinformatics tools used by the scientific community. CRAM offers aggressive levels of data compression with minimal loss of impactful signal, but we recognise that maximal compression can come at the price of utility. There is a clear need to carefully balance support for utility (e.g. streaming, indexing, direct computational access) with sufficiently deep compression to cater for the world’s future needs.
While we will try to provide CRAM user support, the project's finite resources mean that this will at times be a challenge. Our approach, therefore, will be to nurture a self-supporting user community. We will do our utmost to seed this community with hands-on training when possible and, of course, freely available training materials and documentation.
CRAM is an open tool for the bioinformatics community, and we warmly encourage open discussion on our own CRAM developers mailing list and on public community fora.