Welcome to RNASeq-er API - a gateway to systematically updated analysis of public RNA-Seq data
1. Get Started
The RNASeq-er REST API provides easy access to the results of the systematically updated and continually growing analysis of public RNA-seq data in the European Nucleotide Archive (ENA). The analysis of each sequencing run is performed by EMBL-EBI's Gene Expression Team using the iRAP pipeline. Try the following examples for a quick overview of the kinds of queries you can perform with the RNASeq-er API:
- Retrieve the mean and the standard deviation of mapping quality for each organism available in the API - via wget command, to standard output:
wget -q --output-document /dev/stdout http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getOrganismsMappingQuality
- Retrieve analysed data for human hepatocellular carcinoma samples (cram files, bedgraph files and BigWig files for all studies in ENA with samples annotated as hepatocellular carcinoma in Homo sapiens) - via curl command, to standard output:
curl -H 'Accept: application/json' -X GET http://www.ebi.ac.uk/fg/rnaseq/api/json/0/getRunsByOrganismCondition/homo_sapiens/hepatocellular%20carcinoma
- Retrieve analysed data for study SRP009123 (cram files, bedgraph files and BigWig files for this particular study) - via the Python bioservices package:
from bioservices import RNASEQ_EBI
r = RNASEQ_EBI()
results = r.get_run_by_study("SRP009123", mapping_quality=0, frmt='tsv')
- Retrieve the annotations for all runs in study SRP009123 (sample attributes and their ontology annotations) - via the Perl Bio-EBI-RNAseqAPI package on CPAN:
use 5.10.0;
use Bio::EBI::RNAseqAPI;
my $rnaseqAPI = Bio::EBI::RNAseqAPI->new;
my $sampleAttributes = $rnaseqAPI->get_sample_attributes_per_run_by_study( study => "SRP009123" );
- Retrieve analysed data for study SRP009123 (gene/exon quantification as raw counts, FPKM and TPM for all runs for this particular study) - via plain HTTP:
http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getStudy/SRP009123
2. What does the RNASeq-er pipeline do?
The RNASeq-er pipeline automatically discovers new public RNA-seq runs in the European Nucleotide Archive (ENA) for over 270 species on a daily basis, analyses them with the iRAP pipeline, retrieves metadata from ArrayExpress and BioSamples, and automatically annotates that metadata to the Experimental Factor Ontology (EFO) using the mapping tool Zooma.
3. How is the RNASeq-er analysis performed?
The analysis of each sequencing run is performed using the iRAP pipeline. The main steps followed by iRAP during the RNA-seq analysis are the following ones:
- Quality control: Raw reads (FASTQ files) undergo quality assessment and filtering using FASTQ QC. This step involves processing the data to remove adaptor sequences (adaptor trimming), low-quality reads, uncalled bases and to filter out contaminants (sequences which don't derive from the source organism).
- Alignment: Quality-filtered reads are aligned to the latest version of the reference genome from Ensembl using TopHat2, or STAR for large genomes such as wheat.
- Conversion of the BAM file (output of TopHat2) into CRAM format and generation of bedGraph and bigWig files.
- Quantification of gene/exon expression: The mapped reads are summarized and aggregated over genes and exons via HTSeq or DEXSeq, respectively. As a result, raw counts, FPKM (fragments per kilobase of exon per million fragments mapped) and TPM (transcripts per million) are provided.
iRAP pipeline. Representation of the main steps followed by iRAP in the analysis of each sequencing run. The RNASeq-er API provides the FTP locations for CRAM, bigWig and bedGraph files per ENA run and the gene and exon quantification matrices (raw counts, FPKM, TPM) per ENA study.
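To make the two normalised units concrete, here is a minimal sketch of how FPKM and TPM are commonly derived from raw counts. It uses the standard textbook definitions and is not taken from iRAP itself, so the pipeline's exact implementation may differ. The counts and gene lengths are made-up illustrative values.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of exon per million mapped fragments."""
    total_fragments = sum(counts)
    # count / (length in kb) / (total fragments in millions)
    return [c * 1e9 / (length * total_fragments)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalised rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    rate_sum = sum(rates)
    return [r * 1e6 / rate_sum for r in rates]


# Hypothetical raw counts and exon lengths (bp) for three genes in one run.
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

for gene, f, t in zip(("gene_1", "gene_2", "gene_3"), fpkm_values, tpm_values):
    print(f"{gene}\tFPKM={f:.1f}\tTPM={t:.1f}")
```

Note that TPM values always sum to one million within a run, which makes them easier to compare across runs than FPKM.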
4. How to use the RNASeq-er API?
A REST API works in much the same way as a website: a client makes a call to a server and gets data back over the HTTP protocol. In the RNASeq-er API, all calls are made via the HTTP method GET, so you retrieve data without modifying it. The result of an API call is the data matching the specified query.
The first thing you need to know is how to construct the URL. Let’s have a look at the following URL to explain how to use the RNASeq-er API. Paste it into a web browser to see the results:
http://www.ebi.ac.uk/fg/rnaseq/api/tsv/0/getOrganisms/plants
Let’s break down that URL and see how it’s made up:
- http://www.ebi.ac.uk/fg/rnaseq/api, is the place on the web where the API lives.
- /tsv/, is the format of the data returned. The RNASeq-er REST API returns data in tab-delimited (tsv) or JSON formats.
- /0/, is an additional filter that specifies the minimum percentage of reads mapped to the reference genome (mapping quality).
- /getOrganisms/plants, is the part of the URL used to specify the kind of data we want to retrieve. In this case we are searching for all organisms that are plants that have been analysed by the RNASeq-er REST API.
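The URL parts listed above can be assembled programmatically. The following is a minimal sketch, assuming only the structure described in this section; the helper name build_url and its parameters are hypothetical and not part of any official client.

```python
from urllib.parse import quote

BASE_URL = "http://www.ebi.ac.uk/fg/rnaseq/api"


def build_url(fmt, call, *args, min_mapping_quality=None):
    """Assemble an API URL from its parts (hypothetical helper).

    fmt                 -- "tsv" or "json"
    call                -- the call name, e.g. "getOrganisms"
    args                -- path arguments, e.g. "plants"
    min_mapping_quality -- minimum percentage of mapped reads; omitted
                           for calls that do not take this filter
    """
    if fmt not in ("tsv", "json"):
        raise ValueError("format must be 'tsv' or 'json'")
    parts = [BASE_URL, fmt]
    if min_mapping_quality is not None:
        parts.append(str(min_mapping_quality))
    parts.append(call)
    # Percent-encode each argument so that spaces become %20.
    parts.extend(quote(str(arg), safe="") for arg in args)
    return "/".join(parts)


plants_url = build_url("tsv", "getOrganisms", "plants", min_mapping_quality=0)
print(plants_url)
```

The same helper reproduces the curl example from section 1 when called with "json", "getRunsByOrganismCondition", "homo_sapiens" and "hepatocellular carcinoma".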
As a result of this API call you will see a list of plants in two columns:
- The first column is the organism (plant) that has been analysed with the RNASeq-er REST API.
- The second column is the organism (plant) whose genome was used as the reference for the alignment of the reads. As you can see, when there is no reference genome available for a particular species, that of the most closely related species is used instead.
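As a rough illustration of working with this two-column result, the sketch below parses a tab-delimited response and lists the organisms that were aligned against another species' genome. The sample text uses placeholder names rather than real API output, and the presence of a header row and the column titles are assumptions; adjust has_header to match what the API actually returns.

```python
import csv
import io


def parse_organisms(tsv_text, has_header=True):
    """Map each analysed organism to the organism used as its reference genome."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [row for row in reader if len(row) >= 2]
    if has_header:
        rows = rows[1:]
    return {organism: reference for organism, reference in (r[:2] for r in rows)}


# Placeholder response text (not real API output).
sample = (
    "ORGANISM\tREFERENCE_ORGANISM\n"
    "species_a\tspecies_a\n"
    "species_b\tspecies_a\n"
)

mapping = parse_organisms(sample)

# Organisms with no reference genome of their own borrow a related one.
borrowed = {org: ref for org, ref in mapping.items() if org != ref}
print(borrowed)
```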
5. What are the main classes of API calls?
The main classes of API calls for the RNASeq-er REST API are the following ones:
- Analysis Results Per Run (getRun...): to request the results of the alignment (CRAM, bedGraph and bigWig files) per run (getRun/SRR1042759) or for all runs in a particular ENA study (getRunByStudy/SRP033494).
- Analysis Results Per Study (getStudy...): to retrieve the results of the gene/exon expression quantification (raw counts, gene/exon FPKM and gene/exon TPM) for all runs in a particular ENA study.
- Sample Attributes Per Run (getSampleAttributes...): to retrieve all attributes and their ontology annotations for all the samples in a particular ENA study.
- Baseline expression Per Gene (getExpression...): to retrieve the median of expression of a gene across all runs corresponding to a given condition (such as organism part, cell type, developmental stage, sex or strain).
Main classes of API calls. Examples of the four classes of API calls for the RNASeq-er REST API.
5.1. Analysis Results Per Run
5.1.1. Making per-run API calls
When using per-run API calls you will need to specify the format of the data returned (tsv or JSON) and the minimum percentage of reads mapped to the reference genome (mapping quality):
Let's try the following examples:
Example 1. Give me the location of the results of the RNA-seq alignment (CRAM, bedGraph and bigWig files) for all runs in ENA study SRP049001 in which at least 70% of the reads were successfully mapped to the reference genome as a tab-delimited format.
Example 2. Give me the location of the results of the RNA-seq alignment (CRAM/bedGraph/bigWig) for all runs in ENA from Solanum lycopersicum, whatever the mapping quality is, as a tab-delimited format.
Example 3. Give me the location of the results of the RNA-seq alignment (CRAM/bedGraph/bigWig) for all runs in ENA on samples of human lung, whatever the mapping quality is, as a tab-delimited format.
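The requests in Examples 1 and 3 could be written as URLs along the following lines. This is a sketch under stated assumptions: the call name getRunByStudy is copied as written in section 5 and getRunsByOrganismCondition from the curl example in section 1, so please verify both against the live API before relying on them. The helper per_run_url is hypothetical.

```python
from urllib.parse import quote

API = "http://www.ebi.ac.uk/fg/rnaseq/api"


def per_run_url(call, *args, min_mapped_pct=0, fmt="tsv"):
    """Build a per-run URL: /<format>/<mapping quality>/<call>/<arguments>."""
    if not 0 <= min_mapped_pct <= 100:
        raise ValueError("mapping quality is a percentage between 0 and 100")
    path = "/".join(quote(str(arg), safe="") for arg in args)
    return f"{API}/{fmt}/{min_mapped_pct}/{call}/{path}"


# Example 1: runs in study SRP049001 with at least 70% of reads mapped.
example_1 = per_run_url("getRunByStudy", "SRP049001", min_mapped_pct=70)

# Example 3: human lung samples, any mapping quality (threshold 0).
example_3 = per_run_url("getRunsByOrganismCondition", "homo_sapiens", "lung")

print(example_1)
print(example_3)
```

A threshold of 0 means "whatever the mapping quality is", since every run has at least 0% of its reads mapped.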
5.1.2. Results per-run API calls
Here you have the result of the first run from study SRP049001 retrieved after making that particular per-run API call:
5.2. Analysis Results Per Study
5.2.1. Making per-study API calls
When using per-study API calls you will need to specify just the format of the data returned (tsv or JSON). There is no need to choose a mapping quality because expression data are included for all runs in a given study.
Let's try the following examples:
Example 1. Give me the location of the results of the RNA-seq analysis (gene/exon quantification as raw counts, FPKM and TPM) for all runs in ENA study SRP049001 as a tab-delimited format.
Example 2. Give me the location of the results of the RNA-seq analysis (gene/exon quantification as raw counts, FPKM and TPM) for all studies in ENA with runs from Arabidopsis thaliana as a tab-delimited format.
5.2.2. Results per-study API calls
If you want to retrieve the location of the results of the RNA-seq analysis (gene/exon quantification as raw counts, FPKM and TPM) for all studies in ENA with runs from Solanum tuberosum as a tab-delimited format you will need to run the following per-study API call:
As a result you will see the results of the analysis of all studies in ENA for the specified organism Solanum tuberosum (potato). Here you have the result of the first study retrieved after making that particular per-study API call:
5.3. Sample Attributes Per Run
5.3.1. Making sample attributes per-run API calls
When using sample attributes per-run API calls you will need to specify just the format of the data returned (tsv or JSON):
Let's try the following example:
Example 1. Give me the sample attributes and their ontology annotations for all runs in ENA study SRP047482 as a tab-delimited format.
5.3.2. Results of the sample attributes per-run API calls
Here you have the sample attributes and their ontology annotations for all runs in study SRP047482 retrieved after making the corresponding sample attributes per-run API call:
5.4. Baseline Expression Per Gene
5.4.1. Making baseline expression per-gene API calls
When using baseline expression per-gene API calls you will need to specify the format of the data returned (tsv or JSON) and the minimum number of runs that you want to include in the analysis:
Let's try the following example:
Example 1. Give me the median expression (in TPMs) and the coefficient of variation for the human gene SFTPC for all the conditions studied in at least 25 sequencing runs each, as a tab-delimited format.
5.4.2. Results of baseline expression per-gene API calls
Here you have the median expression (TPM) and the coefficient of variation for the human gene SFTPC for all the combined conditions studied in at least 25 sequencing runs each, sorted by high median expression first:
As a result of this kind of call, you will also see a column called 'ALL_SAMPLE_ATTRIBUTES' that returns the API URL that displays all sample attributes and ontology annotations for all runs studying the condition reported. For example: http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getSampleAttributesByCondition/1517
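To show what the reported statistics mean, here is a small sketch that computes the median expression and coefficient of variation per condition, drops conditions with too few runs, and sorts by median (highest first). The TPM values and condition names are invented for illustration. The coefficient of variation is taken here as the sample standard deviation divided by the mean; this document does not state which exact definition the API uses, so treat that as an assumption.

```python
from statistics import mean, median, stdev


def baseline_table(tpm_by_condition, min_runs):
    """Return (condition, median TPM, CV, number of runs), highest median first."""
    if min_runs < 2:
        raise ValueError("at least 2 runs are needed to compute a standard deviation")
    rows = []
    for condition, tpms in tpm_by_condition.items():
        if len(tpms) < min_runs:
            continue  # condition studied in too few runs
        avg = mean(tpms)
        cv = stdev(tpms) / avg if avg else float("nan")
        rows.append((condition, median(tpms), cv, len(tpms)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical TPM values for one gene, grouped by condition.
tpm_by_condition = {
    "condition_a": [10.0, 12.0, 14.0, 16.0, 18.0],
    "condition_b": [100.0, 120.0, 80.0],
    "condition_c": [5.0],
}

table = baseline_table(tpm_by_condition, min_runs=3)
for condition, med, cv, n_runs in table:
    print(f"{condition}\tmedian={med:.1f}\tCV={cv:.3f}\truns={n_runs}")
```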
5.5. Mapping Quality Statistics For All Organisms
5.5.1. Retrieving the mean and standard deviation of mapping quality for all organisms
The API call: http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getOrganismsMappingQuality returns the mean and the standard deviation of mapping quality for each organism available in the API.
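If you would rather make this call from a script than from wget, a minimal sketch using only the Python standard library could look like the one below. The function names are hypothetical, and no assumption is made about the column layout of the response: rows are returned as plain lists of strings for you to inspect.

```python
import csv
import io
from urllib.request import urlopen

MAPPING_QUALITY_URL = (
    "http://www.ebi.ac.uk/fg/rnaseq/api/tsv/getOrganismsMappingQuality"
)


def fetch_text(url, timeout=30):
    """Perform an HTTP GET and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def rows_from_tsv(text):
    """Split tab-delimited text into rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


# Usage (requires network access):
# rows = rows_from_tsv(fetch_text(MAPPING_QUALITY_URL))
# for row in rows[:5]:
#     print(row)
```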