- Types of data that can be submitted
- What data files to submit, and how
ArrayExpress accepts submissions of functional genomics data generated using high throughput sequencing (HTS) assays like RNA-seq and ChIP-seq, mostly from non-human and human non-identifiable samples, with the following exceptions:
- Metagenomic/metatranscriptomic data: please submit to the EBI Metagenomics service for optimised organisation of your meta-data (e.g. sample annotation).
- De novo assembly of transcriptome: the raw RNA-seq reads should be submitted to ArrayExpress. Once we have finished processing your raw reads, submit the assembled transcriptome file (often in fasta format) directly to the European Nucleotide Archive.
If you have potentially identifiable human data, please see below.
To submit to ArrayExpress, all you need to do is send us meta-data for your experiment (e.g. experiment description, samples and their attributes, all protocols used) and the raw data files; see the submission guide below. Submissions without raw data files will not be accepted unless there are exceptional circumstances.
The meta-data about your experiment will be stored at ArrayExpress, and the raw data files (e.g. fastq files) are stored at the Sequence Read Archive (SRA) of the European Nucleotide Archive (ENA). ArrayExpress will transfer the raw data files to the ENA for you so you do not need to submit those files separately to the ENA. You can also send us processed data (i.e. processed from the raw reads, e.g. BAM alignment files, differential expression data, expression values linked to genome coordinates, etc). Depending on the file format, it will either be stored at ArrayExpress or the ENA. Given the split of meta-data and data files between ArrayExpress and ENA, once your submission is fully processed, it is a lengthy process to modify/update it. Some changes (e.g. cancelling an ENA record which has been released to the public) will not be possible. Please take a look at our sequencing experiment update/cancellation policy before proceeding.
Data from human samples and individuals that can potentially lead to the identification of the donors (e.g. genomic DNA sequences) can be submitted to ArrayExpress if consent for public release of the data has been given. Such approval would typically be given by the relevant ethics committee, and ensuring that it is in place is the responsibility of the submitter.
Identifiable data approved for controlled access should be submitted directly to the European Genome-phenome Archive (EGA), not ArrayExpress. Cases are possible where identifiable data (e.g. raw sequences) are submitted to the EGA, while the related processed data (e.g. RPKM values) are submitted to ArrayExpress, but it is up to the submitter to ensure that such a submission complies with the respective ethics requirements. To submit processed data to ArrayExpress, please begin by emailing us at email@example.com with the EGA study accession number. We will import non-human-identifiable meta-data from EGA in a spreadsheet (which the submitter will have the chance to review), and then match the meta-data with processed data.
The following diagram summarises the typical data flow:
Diagram of the submission of different sequencing data types to ArrayExpress or the European Genome-Phenome Archive.
2. What data files to submit, and how
To start your submission go to the Annotare webform submission tool and create a new sequencing submission.
Apart from the experiment description and sample annotation, sequencing experiments require further details describing the sequencing library (as they are needed for ENA submission). Please see this guide for more information about the library specifications.
Raw data file requirements
Please provide individual unprocessed raw data files for each sample, in FASTQ or BAM format, and prepare your files according to ENA specifications. This is a developing field so please do check the specifications every time you submit a new experiment. Data files which do not satisfy ENA's requirements will not be accepted.
- Each file must be compressed by gzip or bzip2.
- Submit individual files per sample and lane (if applicable). Do not bundle multiple FASTQ files into one archive, or split a file into smaller sized chunks.
- Multiplexed libraries should be demultiplexed into separate files.
- Files must not contain technical adapter sequences. However, do not remove entire sequence reads or trim reads by quality score.
- For paired-end experiments, if the mate pairs are in two separate files (one file for the forward strand, one for the reverse strand), the two files must be named with the same root and end with extensions such as _1.fq.gz and _2.fq.gz. Examples of naming styles supported by the ENA:
- sampleA_R1.fq.gz / sampleA_R2.fq.gz
- sampleA_1.fq.gz / sampleA_2.fq.gz
- sampleA_F.fq.gz / sampleA_R.fq.gz
- Check ENA specifications for additional information about the accepted FASTQ format.
- Each BAM file must contain all reads from the sequencing machine, and all reads must be unaligned. The reason for this is that we expect the BAM file to be used to regenerate all the sequencing reads.
- The phred quality score for each base should be included in the file.
- If you have data from paired-end sequencing libraries, for each sequencing run, include data for both mate reads in one single BAM file.
- Check ENA specifications for additional information about the accepted BAM format.
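Some of the file naming and packaging rules above can be checked before upload. The following is a minimal Python sketch, not an official ArrayExpress or ENA tool: the function name `check_fastq_names`, the list of archive extensions and the regular expression are assumptions made for illustration, and the ENA specifications remain the authoritative reference.

```python
import re

# Extensions treated as "compressed" and as "bundled archive".
# These lists are illustrative assumptions, not the full ENA rules.
COMPRESSED = (".gz", ".bz2")
ARCHIVES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".zip")

# Paired-end naming styles listed above: _R1/_R2, _1/_2 and _F/_R.
PAIR_SUFFIX = re.compile(
    r"^(?P<root>.+)_(?P<mate>R1|R2|1|2|F|R)\.(fq|fastq)\.(gz|bz2)$"
)
MATES = {"R1": "R2", "1": "2", "F": "R"}


def check_fastq_names(filenames):
    """Return a list of human-readable problems found in the file names."""
    problems = []
    mates_by_root = {}  # file name root -> set of mate labels seen
    for name in filenames:
        lower = name.lower()
        if lower.endswith(ARCHIVES):
            problems.append(f"{name}: bundled archive, submit files individually")
            continue
        if not lower.endswith(COMPRESSED):
            problems.append(f"{name}: not compressed with gzip or bzip2")
            continue
        match = PAIR_SUFFIX.match(name)
        if match:
            mates_by_root.setdefault(match["root"], set()).add(match["mate"])

    # A file that looks like one mate of a pair should have its partner
    # with the same root.
    for root, labels in sorted(mates_by_root.items()):
        for first, second in MATES.items():
            if (first in labels) != (second in labels):
                problems.append(f"{root}: found only one of _{first} / _{second}")
    return problems


if __name__ == "__main__":
    for problem in check_fastq_names(["sampleA_R1.fq.gz", "sampleA_R2.fq.gz"]):
        print(problem)
```

The check only looks at file names. It cannot tell whether a file is demultiplexed or free of adapter sequences, and a single-end file that happens to end in _1 would be flagged as missing its mate.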
To ensure your BAM files contain unaligned reads, you can run the following commands:
samtools view -c -F 4 bam_file (counts how many reads are aligned and should return 0)
samtools view -c -f 4 bam_file (counts how many reads are unaligned and should return at least 1)
If your BAM files contain mapped reads, then please either create unmapped BAM files, or send us the original read files (e.g. fastq.gz files) as raw data files (again, check ENA specifications). BAM files containing mapped reads can be included in your submission as processed files, as long as they satisfy ENA's specifications and the reference genome used for alignment has been accessioned in the International Nucleotide Sequence Database Collaboration (INSDC, involving DDBJ, ENA, and GenBank).
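If you have many BAM files, the two samtools checks above can be scripted. The sketch below assumes that samtools is installed and on your PATH; the function names `count_reads`, `is_unaligned_bam` and `check_bam` are hypothetical helpers introduced for illustration and are not part of samtools or Annotare.

```python
import subprocess


def count_reads(bam_path, flag_option):
    """Run `samtools view -c <flag_option> 4 <bam_path>` and return the count.

    flag_option is "-F" (reads without the unmapped flag, i.e. aligned reads)
    or "-f" (reads with the unmapped flag, i.e. unaligned reads).
    """
    result = subprocess.run(
        ["samtools", "view", "-c", flag_option, "4", bam_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def is_unaligned_bam(aligned_count, unaligned_count):
    """Apply the rule above: 0 aligned reads and at least 1 unaligned read."""
    return aligned_count == 0 and unaligned_count >= 1


def check_bam(bam_path):
    """Return True if the BAM file looks suitable as a raw data file."""
    aligned = count_reads(bam_path, "-F")
    unaligned = count_reads(bam_path, "-f")
    return is_unaligned_bam(aligned, unaligned)
```

Calling `check_bam("sample1.bam")` (a made-up file name) returns False for a file that contains mapped reads. Such a file would need to be converted to an unmapped BAM file or submitted as a processed file instead.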
Processed data files
If your processed data are in spreadsheets, e.g. a table of FPKM values for genes with genes in rows and samples in columns, please save them in tab-delimited text (*.txt) format (not Excel). We also accept BAM alignment files. There is no need to compress or zip up the processed files one by one or as a bundle. Upload them in Annotare and assign to your samples in the same way as you would for raw files (see below).
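If your expression values are produced by a script rather than kept in a spreadsheet, they can be written straight to tab-delimited text. This is a minimal sketch using Python's standard csv module; the function names, the `gene_id` column header and the example values are placeholders invented for illustration, not a layout prescribed by ArrayExpress.

```python
import csv
import io


def format_expression_table(sample_names, values_by_gene):
    """Return a tab-delimited table with genes in rows and samples in columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row: one column for the gene identifier, then one per sample.
    writer.writerow(["gene_id"] + list(sample_names))
    for gene_id, values in values_by_gene.items():
        writer.writerow([gene_id] + list(values))
    return buffer.getvalue()


def write_expression_table(path, sample_names, values_by_gene):
    """Write the table to an uncompressed *.txt file."""
    with open(path, "w", newline="") as handle:
        handle.write(format_expression_table(sample_names, values_by_gene))


if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = {"geneA": [1.5, 2.0], "geneB": [0.0, 3.25]}
    write_expression_table("fpkm_values.txt", ["sample1", "sample2"], example)
```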
UPDATE (17th March 2016):
New direct upload feature in Annotare and new FTP path
You can upload raw and processed sequencing files directly in Annotare. Use "drag-and-drop" to place files in the upload frame or click Upload files... to select the files to be transferred.
If you prefer, you may still use the FTP upload function, but note that the private FTP directory is unique for each submission. Before starting to transfer files, go to the Samples and Data > Upload and assign data files section in Annotare and click FTP Upload.... The dialogue will show you the FTP directory for your submission (e.g. ftp-private.ebi.ac.uk/ibtd1rmo-20r7k3g747sup/). Copy the data files to this directory following the FTP transfer instructions. Please transfer the compressed files one by one (do not bundle multiple fastq.gz files into one tar.gz archive) to avoid time-out issues and to allow us to process your files promptly.
To associate the transferred files with your experiment submission and ensure the file integrity, follow the on-screen instructions of the dialogue, and fill in file names and their corresponding MD5 checksums (remember to use the checksum of the actual compressed file that is sent to us). Annotare will then verify the presence of the files on our FTP site and the MD5 checksums. If verification passes, you will be able to assign data files to each of your samples. Here are some examples of how to calculate MD5 checksums: Windows user example, Mac user example, Linux user example.
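As a platform-independent alternative, the checksum can also be calculated with Python's standard hashlib module. This is a minimal sketch; the function name `md5_checksum` and the example file name are assumptions for illustration. The file is read in chunks so that large fastq.gz files do not need to fit in memory.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 checksum of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical file name; use the compressed file you actually transfer.
    print(md5_checksum("sampleA_1.fq.gz"))
```

Run it on the compressed file exactly as it will be transferred, since the checksum of the uncompressed file will not match.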