1. Types of data that can be submitted
Potentially identifiable human data
2. What data files to submit, and how
Raw data files
MD5 checksum of raw files
Sending raw files
Processed data files
3. BAM file requirements
4. Modifying/cancelling a sequencing experiment in ArrayExpress and ENA
ArrayExpress accepts submissions of functional genomics data generated using high throughput sequencing (HTS) assays like RNA-seq and ChIP-seq, mostly from non-human and human non-identifiable samples, with the following exceptions:
- Metagenomic/metatranscriptomic data: please submit to the EBI Metagenomics service for optimised organisation of your meta-data (e.g. sample annotation).
- De novo assembly of transcriptome: the raw RNA-seq reads should still be submitted to ArrayExpress. Once we have finished processing your raw reads, submit the assembled transcriptome file (often in fasta format) directly to the European Nucleotide Archive.
If you're submitting potentially identifiable human data, please see below.
To submit to ArrayExpress, all you need to do is send us meta-data for your experiment (e.g. experiment description, samples and their attributes, all protocols used) and the raw data files; see the submission guide below. Submissions without raw data files will not be accepted except in exceptional circumstances.
The meta-data about your experiment will be stored at ArrayExpress, and the raw data files (e.g. fastq files) are eventually stored at the Sequence Read Archive (SRA) of the European Nucleotide Archive (ENA). ArrayExpress will transfer the raw data files to the ENA for you so you do not need to submit those files separately to the ENA. You can also send us processed data (i.e. processed from the raw reads, e.g. BAM alignment files, differential expression data, expression values linked to genome coordinates, etc). Depending on the file format, it will either be stored at ArrayExpress or the ENA. Given the split of meta-data and data files between ArrayExpress and ENA, once your submission is fully processed, it is a lengthy process to modify/update it. Some changes (e.g. cancelling an ENA record which has been released to the public) will not be possible. Please take a look at our sequencing experiment update/cancellation policy before proceeding.
Data from human samples and individuals that can potentially lead to the identification of the donors (e.g. genomic DNA sequences) can be submitted to ArrayExpress if consent has been given for public release of the data. Such approval would typically be given by the relevant ethics committee, and it is the submitter's responsibility to ensure that it is in place.
Identifiable data approved for controlled access should be submitted directly to the European Genome-phenome Archive (EGA), not ArrayExpress. Cases are possible where identifiable data (e.g. raw sequences) are submitted to the EGA, while the related processed data (e.g. RPKM values) are submitted to ArrayExpress, but it is up to the submitter to ensure that such a submission complies with the respective ethics requirements. To submit processed data to ArrayExpress, please begin by emailing us at firstname.lastname@example.org with the EGA study accession number. We will import non-human-identifiable meta-data from EGA in a spreadsheet (which the submitter will have the chance to review), and then match the meta-data with processed data.
The following diagram summarises the typical data flow:
Diagram of the submission of different sequencing data types to ArrayExpress or the European Genome-phenome archive.
Raw data files: Please provide unprocessed files as raw data files. After demultiplexing and trimming of adapter sequences, do not remove entire sequence reads or trim by quality score. Prepare your files according to ENA specifications. This is a developing field, so please check the specifications every time you submit a new experiment. Data files which do not satisfy ENA's requirements will not be accepted.
For fastq files, each file must be individually compressed by gzip or bzip2. Do not bundle multiple fastq files into one archive. For paired-end experiments, if the mate pairs are in two separate files (one file for the forward strand, one for the reverse strand), the two files must be named with the same root and end with extensions such as _1.fq.gz and _2.fq.gz. Examples of naming styles supported by the ENA:
- sampleA_1.fq.gz / sampleA_2.fq.gz
- sampleA_F.fq.gz / sampleA_R.fq.gz
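The naming rule above can be checked before upload. The following is a minimal sketch, not an ENA or ArrayExpress tool: the function name `group_mates` and the regular expression are illustrative assumptions, and the pattern covers only the two suffix styles listed above (plus the `.fastq` and `.bz2` spellings, which is also an assumption). Please check the ENA specifications for the full set of accepted names.

```python
import re
from collections import defaultdict

# Assumption: only the two styles shown above (_1/_2 and _F/_R) are recognised.
MATE_PATTERN = re.compile(
    r"^(?P<root>.+)_(?P<mate>[12]|[FR])\.(?:fq|fastq)\.(?:gz|bz2)$"
)


def group_mates(file_names):
    """Group paired-end file names by shared root.

    Returns (pairs, problems): pairs maps each root to its (forward, reverse)
    file names; problems lists names that do not match the pattern or that
    lack a matching mate.
    """
    by_root = defaultdict(dict)
    problems = []
    for name in file_names:
        match = MATE_PATTERN.match(name)
        if match is None:
            problems.append(name)
            continue
        by_root[match["root"]][match["mate"]] = name

    pairs = {}
    for root, mates in by_root.items():
        labels = set(mates)
        if labels == {"1", "2"}:
            pairs[root] = (mates["1"], mates["2"])
        elif labels == {"F", "R"}:
            pairs[root] = (mates["F"], mates["R"])
        else:
            # A lone mate, or a mix of the two styles under one root.
            problems.extend(sorted(mates.values()))
    return pairs, problems
```

Any name returned in `problems` would need renaming, or its missing mate supplied, before the files are transferred.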
If you are submitting BAM files as raw data files, please read this important documentation on BAM file specification.
MD5 checksum of raw files: As sequence files tend to be very large, we implement a file integrity validation step before sending them to the ENA on your behalf. For each raw file, please calculate its MD5 checksum. The checksum is hexadecimal and expressed as a long string of letters and numbers, which looks something like this: eef75461035fb66d9173799d4e26ea97. MD5 checksums are like the files' digital "fingerprints" and it is very unlikely that two non-identical files would generate the same "fingerprint", thus allowing us to verify that each file has not been corrupted during FTP transfer. Remember to calculate the checksum from the actual compressed file (e.g. fastq.gz or fastq.bz2) that is sent to us, not from the uncompressed fastq file or a re-compressed version created at a later time.
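One way to calculate the checksum is sketched below with Python's standard `hashlib` module. The function name `md5_of_file` and the chunk size are illustrative choices, not part of any ArrayExpress tooling. The file is read in chunks so that a large fastq.gz file does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 checksum of the file at `path`.

    Pass the path of the compressed file exactly as it will be transferred
    (e.g. the .fastq.gz file), not the uncompressed fastq file.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned value is a 32-character hexadecimal string of the same form as the example above.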
Sending raw files: You must send the raw files to ArrayExpress by FTP (see FTP transfer instructions). Please transfer the compressed files one by one (do not bundle multiple fastq.gz files in one tar.gz archive) to avoid time-out issues and to allow us to process your files promptly. For Annotare to associate the transferred files with your experiment submission, please go to the Samples and Data --> Upload and assign data files section in Annotare, click FTP Upload..., and follow the on-screen instructions to fill in file names and their corresponding MD5 checksums. Annotare will then verify the presence of the files on our FTP site and the MD5 checksums. If verification passes, you will be able to assign data files to each of your samples.
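For submitters who script their transfers, the one-file-at-a-time rule might look like the sketch below, which uses the `storbinary` method of Python's standard `ftplib.FTP` class. The function name `upload_one_by_one` and the list of rejected archive suffixes are illustrative assumptions. The host, login details and target directory are not given here and must be taken from the FTP transfer instructions; the values in the commented usage are placeholders.

```python
import os

# Assumption: these suffixes indicate a bundle of several files.
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".zip")


def upload_one_by_one(ftp, paths):
    """Send each compressed file with its own STOR command.

    `ftp` is a connected, logged-in ftplib.FTP object (or anything with a
    compatible storbinary method). Returns the uploaded file names.
    """
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        if name.lower().endswith(ARCHIVE_SUFFIXES):
            raise ValueError(f"{name} looks like a bundle; send files individually")
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


# Usage sketch (placeholder host, credentials and directory):
# from ftplib import FTP
# with FTP("ftp.example.org") as ftp:
#     ftp.login("username", "password")
#     ftp.cwd("my-upload-directory")
#     upload_one_by_one(ftp, ["sampleA_1.fq.gz", "sampleA_2.fq.gz"])
```

The file names returned are the ones to enter in Annotare alongside their MD5 checksums.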
Processed data files: If they are in spreadsheets, e.g. a table of FPKM values with genes in rows and samples in columns, please save them in tab-delimited text (*.txt) format (not Excel). We also accept BAM alignment files. There is no need to compress or zip up the processed files, either one by one or as a bundle. Upload them by FTP and assign them to your samples in the same way as you would for raw files.
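If the expression values are produced by a script rather than a spreadsheet program, a tab-delimited file in the layout described above can be written directly. This is a minimal sketch using Python's standard `csv` module. The function name `write_expression_table` and the `gene_id` header label are illustrative assumptions, not a required format.

```python
import csv


def write_expression_table(path, sample_names, values_by_gene):
    """Write a tab-delimited table with genes in rows and samples in columns.

    `values_by_gene` maps each gene identifier to a list of values, one per
    sample, in the same order as `sample_names`.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id", *sample_names])
        for gene_id, values in values_by_gene.items():
            if len(values) != len(sample_names):
                raise ValueError(f"{gene_id}: expected {len(sample_names)} values")
            writer.writerow([gene_id, *values])
```

Saving the result with a .txt extension matches the format requested above.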
If you're submitting raw data files in BAM format, please make sure they satisfy ENA specifications as well as the following conditions:
- Each file contains all reads from the sequencing machine, regardless of whether the reads mapped to the reference genome. The reason for this is that we would expect the BAM file to be used to regenerate all the sequencing reads.
- The Phred quality score for each base should be included in the file.
- If you have data from paired-end sequencing libraries, for each sequencing run, include data for both mate reads in one single BAM file.
If your BAM files contain only mapped reads, then please either create "full" (unfiltered) BAM files, or send us the original read files (e.g. fastq.gz files) as raw data files (again, check ENA specifications).
BAM files containing only mapped reads can be included in your submission as processed files, as long as they satisfy ENA's specification and the reference genome used for alignment has been accessioned in the International Nucleotide Sequence Database Collaboration (INSDC, involving DDBJ, ENA, and GenBank).
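A rough way to tell a "full" BAM file from a mapped-only one is to look at the FLAG field of its records. The sketch below works on plain integer FLAG values, for example the second column of the SAM text printed by `samtools view`; extracting them from the BAM file is not shown. The bit values come from the SAM format specification, while the function names and warning texts are illustrative assumptions. This is a heuristic, not a substitute for ENA's own validation, and it does not check quality scores.

```python
# FLAG bits defined by the SAM format specification.
PAIRED = 0x1
UNMAPPED = 0x4
FIRST_MATE = 0x40
SECOND_MATE = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarise_flags(flags):
    """Count reads, unmapped reads and mates from an iterable of FLAG integers."""
    counts = {"reads": 0, "unmapped": 0, "first_mate": 0, "second_mate": 0}
    for flag in flags:
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra alignment lines for a read that is counted elsewhere
        counts["reads"] += 1
        if flag & UNMAPPED:
            counts["unmapped"] += 1
        if flag & PAIRED:
            if flag & FIRST_MATE:
                counts["first_mate"] += 1
            if flag & SECOND_MATE:
                counts["second_mate"] += 1
    return counts


def raw_bam_warnings(counts):
    """Return warnings suggesting the file may not qualify as a raw data file."""
    warnings = []
    if counts["reads"] and counts["unmapped"] == 0:
        warnings.append("no unmapped reads: file may contain mapped reads only")
    if counts["first_mate"] != counts["second_mate"]:
        warnings.append("unequal mate counts: both mates may not be in this file")
    return warnings
```

A file with no unmapped reads is not necessarily filtered, so a warning here is a prompt to check how the BAM file was produced rather than proof of a problem.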
Release date changes are possible as long as the experiment remains private. Citation/publication update is possible at any time. Modification/addition of metadata and/or data files for a private experiment is possible, but often lengthy and tedious. It is therefore important that you get the submission right before you submit.
Private experiments: If your experiment is still private and an update is required, please email us at email@example.com quoting the experiment accession and explaining what changes are needed. We will advise on the next step accordingly. Depending on your needs, we may append your new data to the ENA records (e.g. addition/removal of samples, sequencing libraries and associated data files), or, in more complicated cases, we may cancel the previously brokered record at ENA and broker your experiment as a brand new submission to ENA (which will generate a new ENA Study accession). In either case, we will keep the same ArrayExpress accession for your data set, and we'll make sure the ArrayExpress record links to the correct ENA submission. Please allow up to 15 working days for the update.
Public experiments: It is ENA's policy not to make a public experiment private again or cancel it unless there is an exceptional reason. (See ENA's Data availability policy for further information.) Modifying the meta-data and/or data files of a public experiment is very tedious, because they will already have been mirrored by the other INSDC partners (namely, GenBank and DDBJ). Please contact ENA directly to sort out the existing record at ENA which contains deprecated/incorrect information. If you would like to keep the ArrayExpress and ENA records synchronised, please also inform us of any changes which have been agreed and actioned by the ENA, so we can advise further on how to update the ArrayExpress record.