Reference proteomes - Primary proteome sets for the Quest For Orthologs



RELEASE 2016_04

based on UniProt Release 2016_04, Ensembl release 84 and Ensembl Genome release 31


The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” group. The collection comprises 66 species; the sets are publicly available and are generated using UniProtKB, Ensembl and Ensembl Genomes.

The one-gene-one-protein proteome sets are compiled for species with complete genomes submitted to INSDC, with gene model annotations from:

  1. genome submitters
  2. Ensembl or Ensembl Genomes

The current release is composed of gene2acc, non-redundant fasta and idmapping files for:

  • a further 44 selected species in UniProt

*if you have any questions and suggestions, please contact us at


The gene2acc, fasta and idmapping files for individual species are available for download here:

or as a tarball of all species:

SeqXML versions are documented by our partners and they are available here:

Predicted Orthologs





Current Composition of Primary Protein Sets

The following table describes the status of the species:

Proteome ID | Taxonomy ID | Mnemonic | Species | Number of Genes/Proteins
UP000007062 7165 ANOGA Anopheles gambiae 11988
UP000000798 224324 AQUAE Aquifex aeolicus 1552
UP000006548 3702 ARATH Arabidopsis thaliana 27064
UP000001570 224308 BACSU Bacillus subtilis 4197
UP000001414 226186 BACTN Bacteroides thetaiotaomicron 4775
UP000009136 9913 BOVIN Bos taurus 20055
UP000002526 224911 BRADU Bradyrhizobium diazoefficiens 8024
UP000001554 7739 BRAFL Branchiostoma floridae 28538
UP000001940 6239 CAEEL Caenorhabditis elegans 20137
UP000000559 237561 CANAL Candida albicans 8264
UP000002254 9615 CANLF Canis lupus 19644
UP000000431 272561 CHLTR Chlamydia trachomatis 895
UP000002008 324602 CHLAA Chloroflexus aurantiacus 3819
UP000008144 7719 CIOIN Ciona intestinalis 16641
UP000002149 214684 CRYNJ Cryptococcus neoformans 6602
UP000000437 7955 DANRE Danio rerio 24821
UP000002524 243230 DEIRA Deinococcus radiodurans 3079
UP000007719 515635 DICTD Dictyoglomus turgidum 1731
UP000002195 44689 DICDI Dictyostelium discoideum 12731
UP000000803 7227 DROME Drosophila melanogaster 13707
UP000000625 83333 ECOLI Escherichia coli 4306
UP000002521 190304 FUSNN Fusobacterium nucleatum 2043
UP000000539 9031 CHICK Gallus gallus 15775
UP000000577 243231 GEOSL Geobacter sulfurreducens 3395
UP000001548 184922 GIAIC Giardia intestinalis 7154
UP000000557 251221 GLOVI Gloeobacter violaceus 4318
UP000000554 64091 HALSA Halobacterium salinarum 2415
UP000005640 9606 HUMAN Homo sapiens 21006
UP000001555 6945 IXOSC Ixodes scapularis 20463
UP000001686 374847 KORCO Korarchaeum cryptofilum 1599
UP000000542 5664 LEIMA Leishmania major 8031
UP000001408 189518 LEPIN Leptospira interrogans 3418
UP000006718 9544 MACMU Macaca mulatta 21726
UP000000805 243232 METJA Methanocaldococcus jannaschii 1787
UP000002487 188937 METAC Methanosarcina acetivorans 4296
UP000002280 13616 MONDO Monodelphis domestica 21181
UP000001357 81824 MONBE Monosiga brevicollis 9188
UP000000589 10090 MOUSE Mus musculus 22136
UP000001584 83332 MYCTU Mycobacterium tuberculosis 3987
UP000001593 45351 NEMVE Nematostella vectensis 24428
UP000002530 330879 ASPFU Neosartorya fumigata 9649
UP000001805 367110 NEUCR Neurospora crassa 9756
UP000002279 9258 ORNAN Ornithorhynchus anatinus 21122
UP000002277 9598 PANTR Pan troglodytes 18656
UP000001055 321614 PHANO Phaeosphaeria nodorum 15993
UP000006727 3218 PHYPA Physcomitrella patens 34793
UP000001450 36329 PLAF7 Plasmodium falciparum 5159
UP000002438 208964 PSEAE Pseudomonas aeruginosa 5550
UP000002494 10116 RAT Rattus norvegicus 21330
UP000001025 243090 RHOBA Rhodopirellula baltica 6999
UP000002311 559292 YEAST Saccharomyces cerevisiae 6721
UP000008854 6183 SCHMA Schistosoma mansoni 10716
UP000002485 284812 SCHPO Schizosaccharomyces pombe 5121
UP000001312 665079 SCLS1 Sclerotinia sclerotiorum 14400
UP000001973 100226 STRCO Streptomyces coelicolor 8005
UP000001974 273057 SULSO Sulfolobus solfataricus 2924
UP000001425 1111708 SYNY3 Synechocystis sp. 3424
UP000005226 31033 TAKRU Takifugu rubripes 18492
UP000001449 35128 THAPS Thalassiosira pseudonana 11706
UP000000536 69014 THEKO Thermococcus kodakarensis 2290
UP000000718 289376 THEYD Thermodesulfovibrio yellowstonii 1970
UP000008183 243274 THEMA Thermotoga maritima 1851
UP000001542 5722 TRIVA Trichomonas vaginalis 50188
UP000000561 237631 USTMA Ustilago maydis 6788
UP000008143 8364 XENTR Xenopus tropicalis 18252
UP000001300 284591 YARLI Yarrowia lipolytica 6448
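
A minimal sketch of reading one row of the table above and deriving a per-species file name from it. The column interpretation (proteome ID, taxonomy ID, mnemonic, species, count) and the function name `parse_species_row` are assumptions made for illustration; the file name pattern is inferred from the examples named in this document, such as UP000005640_9606.xml.

```python
def parse_species_row(row):
    """Split one whitespace-separated table row into named fields.

    Assumed layout: proteome ID, taxonomy ID, mnemonic, species name
    (one or more words), number of genes/proteins.
    """
    fields = row.split()
    proteome_id, tax_id, mnemonic = fields[:3]
    return {
        "proteome_id": proteome_id,
        "tax_id": int(tax_id),
        "mnemonic": mnemonic,
        # Everything between the mnemonic and the final count is the name.
        "species": " ".join(fields[3:-1]),
        "count": int(fields[-1]),
    }


row = parse_species_row("UP000005640 9606 HUMAN Homo sapiens 21006")
# File name pattern inferred from the examples in this document.
xml_name = f"{row['proteome_id']}_{row['tax_id']}.xml"
```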

Gene mapping files (*.gene2acc)

Column 1 is a unique gene symbol that is chosen with the following order of preference from the annotation found in:

  1. Model Organism Database (MOD)
  2. Ensembl or Ensembl Genomes database
  3. UniProt Ordered Locus Name (OLN)
  4. UniProt Open Reading Frame (ORF)
  5. UniProt Gene Name

A dash symbol (-) is used when the gene encoding a protein is unknown.

Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.
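
As a rough sketch of how the two columns described above might be read, the following Python groups accessions by gene symbol. The function name `read_gene2acc` is hypothetical, the tab separator is an assumption (the separator is not stated here), and the sample rows are illustrative rather than copied from a release file.

```python
from collections import defaultdict


def read_gene2acc(lines):
    """Group UniProtKB accessions (column 2) by gene symbol (column 1).

    Assumes tab-separated columns; adjust the separator if the files differ.
    """
    genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        symbol, accession = line.split("\t")[:2]
        genes[symbol].append(accession)
    return dict(genes)


# Illustrative rows only; "P00000" is a placeholder accession.
sample = ["MAGIX\tQ9H6Y5\n", "MAGIX\tQ9H6Y5-2\n", "-\tP00000\n"]
mapping = read_gene2acc(sample)
# Proteins whose encoding gene is unknown are collected under "-".
unknown_gene = mapping.get("-", [])
```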

Protein FASTA files (*.fasta and *_additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format.

For further references about the standard UniProtKB format, please see:

E.g. Canonical set:

    >sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens GN=MAGIX PE=1 SV=3

E.g. Additional sets:

    >sp|Q9H6Y5-2|MAGIX_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens GN=MAGIX
    >tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens GN=MAGIX PE=1 SV=2
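
The header examples above can be taken apart with a small regular expression. This is a sketch only: `parse_header` and the two patterns are hypothetical helpers written against the three example headers shown here, and they may need adjusting for headers that deviate from that shape.

```python
import re

# db|accession|entry-name, then the free-text description.
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
# Additional-set headers start their description with "Isoform of <accession>,".
ISOFORM_RE = re.compile(r"^Isoform of ([A-Z0-9]+),\s*")


def parse_header(header):
    """Return database, accession, entry name and canonical accession."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised FASTA header: {header!r}")
    db, accession, entry_name, description = match.groups()
    isoform = ISOFORM_RE.match(description)
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        # A canonical sequence is its own canonical accession.
        "canonical": isoform.group(1) if isoform else accession,
        "is_additional": isoform is not None,
    }


canonical = parse_header(
    ">sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX "
    "OS=Homo sapiens GN=MAGIX PE=1 SV=3"
)
additional = parse_header(
    ">tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing "
    "protein MAGIX (Fragment) OS=Homo sapiens GN=MAGIX PE=1 SV=2"
)
```

Grouping additional sequences under `canonical` gives one bucket per canonical sequence, which matches the one-gene-one-protein layout of the canonical set.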

Coding DNA Sequence FASTA files (*_DNA.fasta)

These files contain the coding DNA sequences (CDS) for those protein sequences that could be mapped to a CDS. The format is as in the following example (UP000005640_9606_DNA.fasta):


The 3 fields of the FASTA header are:

  1. sp (Swiss-Prot reviewed) or tr (TrEMBL)
  2. UniProtKB Accession
  3. EMBL Protein ID or Ensembl/Ensembl Genome ID

Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)

For the species that did not have a perfect mapping for all protein sequences to a CDS, these files contain the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):

    sp A6NF01 CAUTION: Could be the product of a pseudogene.

The 3 fields are:

  1. sp (Swiss-Prot reviewed) or tr (TrEMBL)
  2. UniProtKB accession
  3. Reason why the protein could not be mapped to a CDS
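
A minimal sketch for splitting a *_DNA.miss line into the three fields listed above. The function name `parse_miss_line` is hypothetical, and splitting on runs of whitespace is an assumption; because the reason is free text, only the first two separators are used.

```python
def parse_miss_line(line):
    """Split a *_DNA.miss line into status, accession and reason.

    Assumes whitespace-separated fields; the reason keeps its inner spaces
    because the split stops after the second field.
    """
    status, accession, reason = line.strip().split(None, 2)
    return {
        "reviewed": status == "sp",  # sp = Swiss-Prot, tr = TrEMBL
        "accession": accession,
        "reason": reason,
    }


miss = parse_miss_line(
    "sp A6NF01 CAUTION: Could be the product of a pseudogene."
)
```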

Database mapping files (*.idmapping)

These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:

  1. UniProtKB accession
  2. ID_type:
  3. ID:
    • Identifier in the cross-referenced database.
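
The three tab-separated columns above lend themselves to a nested lookup keyed by accession and then by ID_type. The sketch below is illustrative: `read_idmapping` is a hypothetical name, and the sample rows are assembled from the cross-references of the SeqXML example entry in this document rather than copied from an idmapping file.

```python
from collections import defaultdict


def read_idmapping(lines):
    """Build {accession: {id_type: [ids]}} from tab-separated rows."""
    xrefs = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        accession, id_type, identifier = line.split("\t")
        # One accession can carry several IDs of the same type.
        xrefs[accession][id_type].append(identifier)
    return xrefs


# Illustrative rows, assuming the idmapping file mirrors the SeqXML DBRefs.
sample = [
    "U3KQE9\tUniProtKB-ID\tU3KQE9_HUMAN\n",
    "U3KQE9\tEMBL\tAC096887\n",
    "U3KQE9\tEMBL\tAC099667\n",
]
xrefs = read_idmapping(sample)
embl_ids = xrefs["U3KQE9"]["EMBL"]
```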

SeqXML files (*.xml)

The xml files contain all the information from fasta (canonical and additional), idmapping and CDS in SeqXML format (see
E.g. (from UP000005640_9606.xml, header and one entry)

    <?xml version="1.0" encoding="utf-8"?>
    <seqXML xmlns:xsi="" speciesName="Homo sapiens" xsi:noNamespaceSchemaLocation="" seqXMLversion="0.4" sourceVersion="2016_04" source="QfO http://w" ncbiTaxID="9606">
      <entry source="UniProtKB" id="U3KQE9">
        <DBRef source="UniProtKB-ID" id="U3KQE9_HUMAN"/>
        <DBRef source="UniRef100" id="UniRef100_U3KQE9"/>
        <DBRef source="UniRef90" id="UniRef90_U3KQE9"/>
        <DBRef source="UniRef50" id="UniRef50_U3KQE9"/>
        <DBRef source="UniParc" id="UPI00038BAF87"/>
        <DBRef source="EMBL" id="AC096887"/>
        <DBRef source="EMBL" id="AC099667"/>
        <DBRef source="EMBL" id="AC103589"/>
        <DBRef source="EMBL-CDS" id="-"/>
        <DBRef source="NCBI_TaxID" id="9606"/>
        <DBRef source="STRING" id="9606.ENSP00000296292"/>
        <DBRef source="Ensembl" id="ENSG00000272305"/>
        <DBRef source="Ensembl_TRS" id="ENST00000607283"/>
        <DBRef source="Ensembl_PRO" id="ENSP00000475819"/>
        <DBRef source="UCSC" id="uc062krn.1"/>
        <DBRef source="eggNOG" id="KOG2864"/>
        <DBRef source="eggNOG" id="ENOG410Y1D5"/>
        <DBRef source="GeneTree" id="ENSGT00390000011390"/>
        <DBRef source="OMA" id="ERPRKPW"/>
        <DBRef source="UniProt" id="U3KQG8"/>
        <DBRef source="UniProt" id="U3KQK6"/>
        <property name="DNAsource" value="ENSP00000475819"/>
        <property name="ensemblVersion" value="84,31"/>
        <property
        <property name="PE" value="4"/>
      </entry>
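
Entries of this shape can be read with Python's standard `xml.etree.ElementTree`. The sketch below is hedged accordingly: `FRAGMENT` is a trimmed, well-formed excerpt of the entry above (namespace attributes and most cross-references left out), and `iter_entries` is a hypothetical helper, not part of any QfO tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the example entry; not a full release file.
FRAGMENT = """<seqXML seqXMLversion="0.4" sourceVersion="2016_04"
        speciesName="Homo sapiens" ncbiTaxID="9606">
  <entry source="UniProtKB" id="U3KQE9">
    <DBRef source="UniProtKB-ID" id="U3KQE9_HUMAN"/>
    <DBRef source="Ensembl" id="ENSG00000272305"/>
    <property name="DNAsource" value="ENSP00000475819"/>
    <property name="PE" value="4"/>
  </entry>
</seqXML>"""


def iter_entries(xml_text):
    """Yield (entry id, list of (source, id) cross-refs, property dict)."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        refs = [(r.get("source"), r.get("id")) for r in entry.findall("DBRef")]
        props = {p.get("name"): p.get("value") for p in entry.findall("property")}
        yield entry.get("id"), refs, props


parsed = list(iter_entries(FRAGMENT))
```

For whole-proteome files, `ET.iterparse` would avoid loading the full document into memory; `ET.fromstring` is used here only to keep the sketch short.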

Joining forces in the quest for orthologs

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403

Published: 29 September 2009

Toward community standards in the quest for orthologs

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium

Bioinformatics 2012, 28:900

Published: 12 February 2012

Big data and other challenges in the quest for orthologs

Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium

Bioinformatics 2014, 30:2993

Published: 26 July 2014

Standardized benchmarking in the quest for orthologs

Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz

Nature Methods 2016

Published online: 4 April 2016