Reference proteomes - Primary proteome sets for the Quest For Orthologs

RELEASE 2013_04

based on UniProt Release 2013_04, Ensembl release 70 and Ensembl Genome release 17

Introduction

The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” (QfO) group. This release covers 147 species. The sets are publicly available and are generated from UniProtKB, Ensembl and Ensembl Genomes.

The one-gene-one-protein proteome sets are compiled for species with complete genomes submitted to INSDC, with gene model annotations from:

  1. genome submitters
  2. Ensembl or Ensembl Genomes


The current release is composed of gene2acc, non-redundant FASTA and idmapping files. All species are Reference Proteomes: 120 Eukaryotes plus 27 Bacteria/Archaea, as they were present in previous QfO releases, where:

  • a further 130 selected species in UniProt

 

Download


The gene2acc, fasta and idmapping files for individual species are available for download here:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/current_release

or as a tarball of all species:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/current.tar.gz
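As a convenience, the tarball could be fetched from a script. The following is a minimal sketch using only the Python standard library; the helper names `local_filename` and `download` are illustrative, and the sketch assumes the FTP URL listed above is still reachable.

```python
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse

# URL of the all-species tarball, as listed in this document.
TARBALL_URL = "ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/current.tar.gz"


def local_filename(url):
    """Return the last path component of the URL, used as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download(url, dest=None):
    """Download url to dest (default: the file name taken from the URL)."""
    dest = dest or local_filename(url)
    urllib.request.urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    # Requires network access; the server may have moved since this release.
    print(download(TARBALL_URL))
```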

SeqXML versions are documented by our partners and they are available here:
http://www.seqxml.org/xml/Reference_proteomes.html

Predicted Orthologs


InParanoid

Roundup

OMA

Orthoinspector

Current Composition of Primary Protein Sets

 

The following table describes the status of the species:

Taxonomy ID | Species | Number of Genes | Number of Proteins
*7159 Aedes aegypti 15509 15147
*9646 Ailuropoda melanoleuca 19343 19302
*447093 Ajellomyces capsulata 9254 9214
*400682 Amphimedon queenslandica 29863 29513
*28377 Anolis carolinensis 17805 17722
*43151 Anopheles darlingi 11430 11424
7165 Anopheles gambiae 19969 12072
*7460 Apis mellifera 10656 10560
224324 Aquifex aeolicus 1579 1553
3702 Arabidopsis thaliana 27398 27270
*284811 Ashbya gossypii 7307 4760
*12957 Atta cephalotes 18051 18043
*5865 Babesia bovis 3703 3687
224308 Bacillus subtilis 4228 4193
226186 Bacteroides thetaiotaomicron 4832 4782
684364 Batrachochytrium dendrobatidis 8700 8610
*7091 Bombyx mori 14616 14583
9913 Bos taurus 19981 19799
*15368 Brachypodium distachyon 26551 26468
224911 Bradyrhizobium japonicum 8314 8253
7739 Branchiostoma floridae 28624 28529
*6238 Caenorhabditis briggsae 21295 21113
6239 Caenorhabditis elegans 21420 20791
*281687 Caenorhabditis japonica 29964 29043
*9483 Callithrix jacchus 20972 20883
237561 Candida albicans 18825 9096
9615 Canis familiaris 19824 19662
*10141 Cavia porcellus 18673 18635
272561 Chlamydia trachomatis 896 895
3055 Chlamydomonas reinhardtii 14339 14195
324602 Chloroflexus aurantiacus 3853 3850
7719 Ciona intestinalis 16657 16641
*51511 Ciona savignyi 11604 11575
*246410 Coccidioides immitis 9588 9581
*240176 Coprinopsis cinerea 13355 13334
214684 Cryptococcus neoformans 6249 6242
*237895 Cryptosporidium hominis 3886 3886
7955 Danio rerio 26095 25632
6669 Daphnia pulex 30589 30118
243230 Deinococcus radiodurans 3101 3085
515635 Dictyoglomus turgidum 1744 1743
44689 Dictyostelium discoideum 20400 12729
5786 Dictyostelium purpureum 12399 12347
7227 Drosophila melanogaster 23946 13907
*46245 Drosophila pseudoobscura 15643 15568
227321 Emericella nidulans 17169 10525
*284813 Encephalitozoon cuniculi 2023 2004
5759 Entamoeba histolytica 8157 7954
*9796 Equus caballus 20420 20228
83333 Escherichia coli 4223 4189
190304 Fusobacterium nucleatum 2063 2046
9031 Gallus gallus 16727 16568
*69293 Gasterosteus aculeatus 20786 17996
243231 Geobacter sulfurreducens 3432 3401
184922 Giardia intestinalis 7364 7154
*229533 Gibberella zeae 13031 13030
251221 Gloeobacter violaceus 4427 4406
*3847 Glycine max 53235 52897
*9595 Gorilla gorilla 20868 20551
64091 Halobacterium salinarum 2354 2325
9606 Homo sapiens 22242 20249
6945 Ixodes scapularis 20471 20458
374847 Korarchaeum cryptofilum 1602 1602
*7897 Latimeria chalumnae 18977 18969
*5660 Leishmania braziliensis 8202 8102
*5671 Leishmania infantum 8178 8047
5664 Leishmania major 8312 8035
189518 Leptospira interrogans 3785 3675
*9785 Loxodonta africana 20033 19990
9544 Macaca mulatta 21897 21774
*9103 Meleagris gallopavo 14125 14110
243232 Methanocaldococcus jannaschii 1788 1787
188937 Methanosarcina acetivorans 4539 4468
13616 Monodelphis domestica 21309 21221
81824 Monosiga brevicollis 9100 9084
10090 Mus musculus 22573 21941
1773 Mycobacterium tuberculosis 4035 3986
*59463 Myotis lucifugus 19727 19660
*660122 Nectria haematococca 15475 15446
45351 Nematostella vectensis 24773 24428
*330879 Neosartorya fumigata 9627 9627
367110 Neurospora crassa 16253 9824
*61853 Nomascus leucogenys 18161 18119
*8128 Oreochromis niloticus 21430 18624
9258 Ornithorhynchus anatinus 21697 21571
*9986 Oryctolagus cuniculus 19010 18901
*39947 Oryza sativa 65816 61986
*8090 Oryzias latipes 19686 17419
*30611 Otolemur garnettii 19506 19435
9598 Pan troglodytes 18721 18321
*5888 Paramecium tetraurelia 39642 39350
*121224 Pediculus humanus 10773 10761
*556484 Phaeodactylum tricornutum 10391 10321
321614 Phaeosphaeria nodorum 15989 15979
145481 Physcomitrella patens 35026 34704
403677 Phytophthora infestans 17797 17609
*164328 Phytophthora ramorum 15604 15348
*5823 Plasmodium berghei 11764 11656
*5825 Plasmodium chabaudi 14725 14614
36329 Plasmodium falciparum 5329 5310
*5851 Plasmodium knowlesi 5101 5100
*126793 Plasmodium vivax 5389 5386
*73239 Plasmodium yoelii 7812 7756
*13642 Polysphondylium pallidum 12361 12345
*9601 Pongo abelii 20332 20212
*3694 Populus trichocarpa 42300 41764
54126 Pristionchus pacificus 29201 29078
208964 Pseudomonas aeruginosa 5577 5562
418459 Puccinia graminis 15800 15688
178306 Pyrobaculum aerophilum 2639 2590
69014 Pyrococcus kodakaraensis 2305 2301
10116 Rattus norvegicus 25464 24737
*246409 Rhizopus delemar 17456 16968
243090 Rhodopirellula baltica 7325 7271
559292 Saccharomyces cerevisiae 6494 6434
*9305 Sarcophilus harrisii 18788 18779
6183 Schistosoma mansoni 11764 11711
284812 Schizosaccharomyces pombe 5666 5078
665079 Sclerotinia sclerotiorum 14446 14400
*4081 Solanum lycopersicum 34672 34551
*4113 Solanum tuberosum 39019 38772
*4558 Sorghum bicolor 32890 32714
*43179 Spermophilus tridecemlineatus 18823 18777
100226 Streptomyces coelicolor 8108 7999
7668 Strongylocentrotus purpuratus 28497 28417
273057 Sulfolobus solfataricus 2994 2938
*9823 Sus scrofa 21597 21008
1111708 Synechocystis 3184 3138
*59729 Taeniopygia guttata 17487 17328
31033 Takifugu rubripes 18519 15813
312017 Tetrahymena thermophila 24723 24696
*99883 Tetraodon nigroviridis 19602 16340
35128 Thalassiosira pseudonana 11672 11611
*5874 Theileria annulata 3792 3787
*5875 Theileria parva 4079 4068
289376 Thermodesulfovibrio yellowstonii 2033 1982
243274 Thermotoga maritima 1853 1852
*5811 Toxoplasma gondii 13144 7838
5722 Trichomonas vaginalis 59681 50191
10228 Trichoplax adhaerens 11518 11501
999953 Trypanosoma brucei 8790 8561
*5693 Trypanosoma cruzi 10843 10805
*656061 Tuber melanosporum 7496 7494
237631 Ustilago maydis 6544 6520
*29760 Vitis vinifera 27218 27192
8364 Xenopus tropicalis 18413 18263
284591 Yarrowia lipolytica 6392 6386


* New species added in release 2013_04 compared to release 2012_04
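If the table above is copied into a script, each row can be split into its fields. This is only a sketch: it assumes one row per line in the layout shown here (optional leading asterisk, numeric taxonomy identifier, species name, gene count, protein count), and the name `parse_row` is illustrative.

```python
import re

# Optional "*" (new in this release), taxonomy ID, species name, two counts.
ROW_RE = re.compile(
    r"^(?P<new>\*?)(?P<taxid>\d+)\s+(?P<species>.+?)\s+"
    r"(?P<genes>\d+)\s+(?P<proteins>\d+)\s*$"
)


def parse_row(line):
    """Parse one table row into a dict; raise ValueError if it does not match."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised table row: {line!r}")
    return {
        "new_in_release": match.group("new") == "*",
        "taxid": int(match.group("taxid")),
        "species": match.group("species"),
        "genes": int(match.group("genes")),
        "proteins": int(match.group("proteins")),
    }


row = parse_row("*7159 Aedes aegypti 15509 15147")
# Genes that do not contribute a distinct protein entry in this set.
print(row["species"], row["genes"] - row["proteins"])
```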

gene2acc format


Column 1 is the gene symbol taken from the INSDC genome annotation, or the Ensembl or Ensembl Genomes gene name. This column is non-redundant.

Column 2 is the UniProtKB accession for the longest translation available for each gene. This column will have redundancy when two or more genes have identical translations.
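The two-column layout can be read with a few lines of Python. The sketch below assumes the columns are separated by a tab (the delimiter is not stated here), and the sample rows other than the C20orf118/A0PJX2 pair are made up for illustration. The function names are illustrative as well.

```python
from collections import defaultdict


def read_gene2acc(lines):
    """Map each gene symbol (column 1) to its UniProtKB accession (column 2)."""
    gene_to_acc = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumption: tab-delimited
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_to_acc[fields[0]] = fields[1]
    return gene_to_acc


def shared_accessions(gene_to_acc):
    """Return accessions that appear for two or more genes (column 2 redundancy)."""
    acc_to_genes = defaultdict(list)
    for gene, acc in gene_to_acc.items():
        acc_to_genes[acc].append(gene)
    return {acc: genes for acc, genes in acc_to_genes.items() if len(genes) > 1}


# Hypothetical rows: geneX and geneY stand in for genes with identical translations.
sample = ["C20orf118\tA0PJX2\n", "geneX\tP00001\n", "geneY\tP00001\n"]
mapping = read_gene2acc(sample)
print(shared_accessions(mapping))
```

Because column 1 is non-redundant, a plain dictionary keyed on the gene symbol is enough; the reverse index is only needed to spot genes that share a translation.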

FASTA header format


The FASTA files consist of a canonical set and an additional set, each containing non-redundant sequences for a reference proteome.
The additional set contains the variant sequences for a given gene. Its FASTA header adds the text "Isoform of ..." to indicate the canonical accession of which the sequence is a variant.
The FASTA format is the standard UniProtKB format. For further details about the standard UniProtKB format, please see:
http://www.uniprot.org/help/fasta-headers
http://www.uniprot.org/faq/38

E.g. Canonical set:

>sp|A0PJX2|CT118_HUMAN Uncharacterized protein C20orf118 OS=Homo sapiens GN=C20orf118 PE=2 SV=1

E.g. Additional set:

>tr|A2A2J3|A2A2J3_HUMAN Isoform of A0PJX2, Uncharacterized protein C20orf118 (Fragment) OS=Homo sapiens GN=C20orf118 PE=4 SV=1

UniProt accession = column 2 of gene2acc file
GN = Primary UniProt Gene Name
PE = Protein Existence
SV = Sequence Version
Description = The UniProtKB Description
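The header fields listed above can be pulled apart with a regular expression. This is a sketch written against the two example headers in this section only; the pattern, the field names and the `parse_header` helper are assumptions, and headers with other layouts may need a looser pattern.

```python
import re

# >db|accession|entry_name description OS=organism [GN=gene] PE=n SV=n
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)"
    r"(?:\s+GN=(?P<gene_name>\S+))?"
    r"\s+PE=(?P<protein_existence>\d+)\s+SV=(?P<sequence_version>\d+)\s*$"
)
# Additional-set descriptions start with "Isoform of <canonical accession>, ".
ISOFORM_RE = re.compile(r"^Isoform of (?P<canonical>[^,\s]+),\s*(?P<rest>.*)$")


def parse_header(header):
    """Split a FASTA header line into a dict of its fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Unrecognised FASTA header: {header!r}")
    fields = match.groupdict()
    fields["protein_existence"] = int(fields["protein_existence"])
    fields["sequence_version"] = int(fields["sequence_version"])
    isoform = ISOFORM_RE.match(fields["description"])
    # isoform_of is None for canonical entries.
    fields["isoform_of"] = isoform.group("canonical") if isoform else None
    if isoform:
        fields["description"] = isoform.group("rest")
    return fields


additional = parse_header(
    ">tr|A2A2J3|A2A2J3_HUMAN Isoform of A0PJX2, Uncharacterized protein "
    "C20orf118 (Fragment) OS=Homo sapiens GN=C20orf118 PE=4 SV=1"
)
print(additional["accession"], "is a variant of", additional["isoform_of"])
```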

Idmapping format


This file has three columns, delimited by tab:

  1. UniProtKB-AC
  2. ID_type
  3. ID

where ID_type is the database name as appearing in UniProtKB cross-references, and as supported by the ID mapping tool on the UniProt web site, and where ID is the identifier in that cross-referenced database.
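A three-column file of this shape can be loaded into a nested lookup keyed first by accession and then by ID_type. The sketch below is one possible way to do that; the function name is illustrative, and the ID_type and ID values in the sample rows are placeholders, not real cross-references.

```python
from collections import defaultdict


def read_idmapping(lines):
    """Build {UniProtKB-AC: {ID_type: [ID, ...]}} from tab-delimited rows."""
    mapping = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession, id_type, identifier = fields
        mapping[accession][id_type].append(identifier)
    return mapping


# Placeholder rows: DB_A, DB_B and the ID_n values are invented for illustration.
sample = [
    "A0PJX2\tDB_A\tID_1\n",
    "A0PJX2\tDB_A\tID_2\n",
    "A0PJX2\tDB_B\tID_3\n",
]
ids = read_idmapping(sample)
print(ids["A0PJX2"]["DB_A"])
```

A list is kept per ID_type because one accession may carry several identifiers from the same cross-referenced database.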

References:

Joining forces in the quest for orthologs

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403 doi:10.1186/gb-2009-10-9-403

Published: 29 September 2009

Toward community standards in the quest for orthologs

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer and Javier Herrero

Bioinformatics 2012, 28:900 doi:10.1093/bioinformatics/bts050

Published: 12 February 2012