Reference proteomes - Primary proteome sets for the Quest For Orthologs
RELEASE 2013_04
based on UniProt Release 2013_04, Ensembl release 70 and Ensembl Genome release 17
Introduction
The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” group. The release comprises 147 species; the sets are publicly available and are generated from UniProtKB, Ensembl and Ensembl Genomes.
The one-gene-one-protein proteome sets are compiled for species with complete genomes submitted to INSDC, using gene model annotations from:
- genome submitters
- Ensembl or Ensembl Genomes
The current release consists of gene2acc, non-redundant FASTA and idmapping files. All species are Reference Proteomes: 120 Eukaryotes plus 27 Bacteria/Archaea, as in previous QfO releases. They include:
- the 12 species from The Reference Genome Annotation Project plus Candida albicans and Plasmodium falciparum
- a further 130 selected species in UniProt
Download
The gene2acc, fasta and idmapping files for individual species are available for download here:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/current_release
or as a tarball of all species:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/current.tar.gz
SeqXML versions, documented by our partners, are available here:
http://www.seqxml.org/xml/Reference_proteomes.html
Predicted Orthologs
InParanoid
Roundup
OMA
Orthoinspector
Current Composition of Primary Protein Sets
The following table describes the status of the species:
* New species added in release 2013_04 compared to release 2012_04
gene2acc format
Column 1 is the gene symbol taken from the INSDC genome annotation, or the Ensembl or Ensembl Genomes gene name. This column is non-redundant.
Column 2 is the UniProtKB accession for the longest translation available for each gene. This column contains repeated accessions when two or more genes have identical translations.
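As a rough illustration of the two-column layout described above, the following Python sketch reads gene2acc lines and reports accessions shared by more than one gene. It assumes the columns are tab-delimited, which this page does not state; the function names and the sample rows used to exercise it are hypothetical.

```python
from collections import defaultdict


def parse_gene2acc(lines):
    """Parse gene2acc lines into two lookups.

    Assumption: columns are separated by a tab character.
    Returns (gene_to_acc, acc_to_genes).
    """
    gene_to_acc = {}
    acc_to_genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two columns: {line!r}")
        gene, acc = fields[0], fields[1]
        gene_to_acc[gene] = acc          # column 1 is non-redundant
        acc_to_genes[acc].append(gene)   # column 2 may repeat
    return gene_to_acc, dict(acc_to_genes)


def shared_accessions(acc_to_genes):
    """Return only the accessions that are listed for two or more genes."""
    return {acc: genes for acc, genes in acc_to_genes.items() if len(genes) > 1}
```

To apply it to a downloaded file, the open file handle can be passed directly, for example parse_gene2acc(open(path)), where path points to a gene2acc file for one species.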
FASTA header format
The FASTA files are split into canonical and additional sets and contain the non-redundant sequences of each reference proteome.
The additional set contains the variant sequences for a given gene; each of its FASTA headers adds an "Isoform of ..." note naming the canonical accession of which the sequence is a variant.
The FASTA format is the standard UniProtKB format. For further information about the standard UniProtKB format, please see:
http://www.uniprot.org/help/fasta-headers
http://www.uniprot.org/faq/38
E.g. Canonical set:
>sp|A0PJX2|CT118_HUMAN Uncharacterized protein C20orf118 OS=Homo sapiens GN=C20orf118 PE=2 SV=1
E.g. Additional set:
>tr|A2A2J3|A2A2J3_HUMAN Isoform of A0PJX2, Uncharacterized protein C20orf118 (Fragment) OS=Homo sapiens GN=C20orf118 PE=4 SV=1
UniProt accession = column 2 of gene2acc file
GN = Primary UniProt Gene Name
PE = Protein Existence
SV = Sequence Version
Description = The UniProtKB Description
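The header fields listed above can be pulled apart with a regular expression. The Python sketch below is one possible approach, written only against the two example headers shown on this page; the function name, the dictionary keys and the patterns are assumptions, and headers that deviate from those examples may need a different pattern.

```python
import re

# Pattern modelled on the two example headers above (an assumption,
# not an official grammar). GN, PE and SV are treated as optional.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)"
    r"(?:\s+GN=(?P<gene_name>\S+))?"
    r"(?:\s+PE=(?P<protein_existence>\d+))?"
    r"(?:\s+SV=(?P<sequence_version>\d+))?\s*$"
)

# Additional-set headers prefix the description with "Isoform of <accession>, ".
ISOFORM_RE = re.compile(r"^Isoform of (?P<canonical>[^,\s]+),\s*(?P<rest>.*)$")


def parse_header(header):
    """Split one FASTA header line into a dictionary of its fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised FASTA header: {header!r}")
    fields = match.groupdict()
    # PE and SV are numeric in the examples, so convert them when present.
    for key in ("protein_existence", "sequence_version"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    # Record the canonical accession for additional-set entries, else None.
    iso = ISOFORM_RE.match(fields["description"])
    fields["isoform_of"] = iso.group("canonical") if iso else None
    if iso:
        fields["description"] = iso.group("rest")
    return fields
```

For a canonical-set header the isoform_of value is None; for an additional-set header it holds the canonical accession, which can then be looked up in column 2 of the gene2acc file.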
Idmapping format
This file has three columns, delimited by tab:
- UniProtKB-AC
- ID_type
- ID
where ID_type is the database name as appearing in UniProtKB cross-references, and as supported by the ID mapping tool on the UniProt web site, and where ID is the identifier in that cross-referenced database.
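A minimal Python sketch of reading this three-column layout is shown below. It groups identifiers by accession and then by ID_type. The function name is hypothetical, and any ID_type values used to try it out should be taken from the actual file rather than assumed.

```python
def parse_idmapping(lines):
    """Group idmapping rows as {UniProtKB-AC: {ID_type: [ID, ...]}}.

    Each row is expected to hold exactly three tab-delimited columns,
    as described above.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected three tab-delimited columns: {line!r}")
        accession, id_type, identifier = fields
        # One accession can carry several IDs of the same type, so keep a list.
        mapping.setdefault(accession, {}).setdefault(id_type, []).append(identifier)
    return mapping
```

Keeping a list per ID_type avoids silently dropping identifiers when an accession has several cross-references in the same database.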
References:
Joining forces in the quest for orthologs
Genome Biology 2009, 10:403 doi:10.1186/gb-2009-10-9-403
Published: 29 September 2009
Toward community standards in the quest for orthologs
Bioinformatics 2012, 28:900 doi:10.1093/bioinformatics/bts050
Published: 12 February 2012
