Reference proteomes

Reference proteomes - Primary proteome sets for the Quest For Orthologs

RELEASE 2025_04 based on UniProt Release 2025_04, Ensembl release 114 and Ensembl Genomes release 61

Introduction

The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” group. The current release comprises 81 species. The sets are publicly available and are generated from UniProtKB, Ensembl and Ensembl Genomes.

The one-gene-one-protein proteome sets are compiled from complete genomes submitted to INSDC, with gene model annotations from:

genome submitters
Ensembl or Ensembl Genomes

Download

The gene2acc, fasta and idmapping files for individual species are available for download here:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO

or as a tarball of all species:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/QfO_release_2025_04.tar.gz

An additional tarball containing the gbff and gff/gff3 files for all species:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/QfO_release_2025_04_additional.tar.gz

SeqXML versions are documented by our partners and are available here:
https://www.seqxml.org/xml/Reference_proteomes.html

Predicted Orthologs

InParanoid

OMA

Orthoinspector

Current Composition of Primary Protein Sets

The following table describes the status of the species:

Species (UniProt proteome ID, NCBI taxonomy ID, mnemonic, name)	Number of Genes/Proteins
UP000007062 7165 ANOGA Anopheles gambiae	13029
UP000000798 224324 AQUAE Aquifex aeolicus	1553
UP000006548 3702 ARATH Arabidopsis thaliana	27448
UP000001570 224308 BACSU Bacillus subtilis	4271
UP000001414 226186 BACTN Bacteroides thetaiotaomicron	4782
UP000007241 684364 BATDJ Batrachochytrium dendrobatidis	8610
UP000009136 9913 BOVIN Bos taurus	23617
UP000002526 224911 BRADU Bradyrhizobium diazoefficiens	8253
UP000001554 7739 BRAFL Branchiostoma floridae	26629
UP000001940 6239 CAEEL Caenorhabditis elegans	19831
UP000000559 237561 CANAL Candida albicans	6035
UP000805418 9615 CANLF Canis lupus	20992
UP000000431 272561 CHLTR Chlamydia trachomatis	895
UP000006906 3055 CHLRE Chlamydomonas reinhardtii	17614
UP000002008 324602 CHLAA Chloroflexus aurantiacus	3850
UP000008144 7719 CIOIN Ciona intestinalis	16680
UP000002149 214684 CRYNJ Cryptococcus neoformans	6604
UP000000437 7955 DANRE Danio rerio	26701
UP000076858 35525 None Daphnia magna	26600
UP000002524 243230 DEIRA Deinococcus radiodurans	3085
UP000007719 515635 DICTD Dictyoglomus turgidum	1743
UP000002195 44689 DICDI Dictyostelium discoideum	12718
UP000000803 7227 DROME Drosophila melanogaster	13824
UP000000625 83333 ECOLI Escherichia coli	4402
UP000241660 190304 FUSNN Fusobacterium nucleatum	1975
UP000000539 9031 CHICK Gallus gallus	18372
UP000000577 243231 GEOSL Geobacter sulfurreducens	3402
UP000001548 184922 GIAIC Giardia intestinalis	4900
UP000000557 251221 GLOVI Gloeobacter violaceus	4406
UP000001519 9595 GORGO Gorilla gorilla	21788
UP000000554 64091 HALSA Halobacterium salinarum	2424
UP000000429 85962 HELPY Helicobacter pylori	1554
UP000015101 6412 HELRO Helobdella robusta	23328
UP000005640 9606 HUMAN Homo sapiens	20659
UP000001555 6945 IXOSC Ixodes scapularis	20503
UP000001686 374847 KORCO Korarchaeum cryptofilum	1602
UP000000542 5664 LEIMA Leishmania major	8038
UP000018468 7918 LEPOC Lepisosteus oculatus	18323
UP000001408 189518 LEPIN Leptospira interrogans	3676
UP000006718 9544 MACMU Macaca mulatta	21893
UP000000805 243232 METJA Methanocaldococcus jannaschii	1787
UP000002487 188937 METAC Methanosarcina acetivorans	4468
UP000002280 13616 MONDO Monodelphis domestica	21225
UP000001357 81824 MONBE Monosiga brevicollis	9156
UP000000589 10090 MOUSE Mus musculus	21856
UP000001584 83332 MYCTU Mycobacterium tuberculosis	3996
UP000000807 243273 MYCGE Mycoplasma genitalium	483
UP000000425 122586 NEIMB Neisseria meningitidis	2001
UP000001593 45351 NEMVE Nematostella vectensis	24428
UP000002530 330879 ASPFU Neosartorya fumigata	9647
UP000001805 367110 NEUCR Neurospora crassa	9759
UP000000792 436308 NITMS Nitrosopumilus maritimus	1795
UP000059680 39947 ORYSJ Oryza sativa	43673
UP000001038 8090 ORYLA Oryzias latipes	23617
UP000002277 9598 PANTR Pan troglodytes	23027
UP000000600 5888 PARTE Paramecium tetraurelia	39461
UP000663193 321614 PHANO Phaeosphaeria nodorum	17429
UP000006727 3218 PHYPA Physcomitrella patens	31366
UP000005238 164328 PHYRM Phytophthora ramorum	15350
UP000001450 36329 PLAF7 Plasmodium falciparum	5361
UP000002438 208964 PSEAE Pseudomonas aeruginosa	5563
UP000008783 418459 PUCGT Puccinia graminis	15688
UP000002494 10116 RAT Rattus norvegicus	21800
UP000001025 243090 RHOBA Rhodopirellula baltica	7271
UP000002311 559292 YEAST Saccharomyces cerevisiae	6065
UP000002485 284812 SCHPO Schizosaccharomyces pombe	5206
UP000001312 665079 SCLS1 Sclerotinia sclerotiorum	14445
UP000001973 100226 STRCO Streptomyces coelicolor	8035
UP000001974 273057 SACS2 Sulfolobus solfataricus	2937
UP000001425 1111708 SYNY3 Synechocystis sp.	3507
UP000001449 35128 THAPS Thalassiosira pseudonana	11611
UP000000536 69014 THEKO Thermococcus kodakarensis	2301
UP000000718 289376 THEYD Thermodesulfovibrio yellowstonii	1982
UP000008183 243274 THEMA Thermotoga maritima	1852
UP000007266 7070 TRICA Tribolium castaneum	16568
UP000001542 412133 TRIV3 Trichomonas vaginalis	50190
UP000000561 5270 USTMA Ustilago maydis	6788
UP000186698 8355 XENLA Xenopus laevis	36391
UP000008143 8364 XENTR Xenopus tropicalis	22019
UP000001300 284591 YARLI Yarrowia lipolytica	6449
UP000007305 4577 MAIZE Zea mays	39209
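
As a minimal sketch, each row of the table above can be split into its parts and turned into the base name used for the per-species files. The `<proteome ID>_<taxonomy ID>` stem follows the file names quoted in this document (for example UP000005640_9606_DNA.fasta); the function names `parse_row` and `file_stem` are hypothetical, and the sketch assumes the count is the last whitespace-separated token on the line.

```python
from typing import NamedTuple


class ProteomeRow(NamedTuple):
    upid: str       # UniProt proteome ID, e.g. UP000005640
    taxid: int      # NCBI taxonomy ID, e.g. 9606
    mnemonic: str   # species mnemonic, e.g. HUMAN
    species: str    # species name, may contain spaces
    count: int      # number of genes/proteins


def parse_row(line: str) -> ProteomeRow:
    """Split one table row into its five parts."""
    # The count is the last token; everything before it is the species column.
    left, count = line.rstrip().rsplit(None, 1)
    # The species name is whatever remains after the first three tokens.
    upid, taxid, mnemonic, species = left.split(None, 3)
    return ProteomeRow(upid, int(taxid), mnemonic, species, int(count))


def file_stem(row: ProteomeRow) -> str:
    """Base name shared by the files of one species (assumed pattern)."""
    return f"{row.upid}_{row.taxid}"


row = parse_row("UP000005640 9606 HUMAN Homo sapiens\t20659")
print(row.species, row.count)
print(file_stem(row) + "_DNA.fasta")
```

The same stem combined with the suffixes named in the section headings below (.gene2acc, .fasta, _additional.fasta, .idmapping, _DNA.miss, .xml) should give the other file names, though only the _DNA.fasta, _DNA.miss and .xml names are shown explicitly in this document.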

Gene mapping files (*.gene2acc)

Column 1 is a unique gene symbol, chosen from the annotation found in the following sources, in order of preference:

Model Organism Database (MOD)
Ensembl or Ensembl Genomes database
UniProt Ordered Locus Name (OLN)
UniProt Open Reading Frame (ORF)
UniProt Gene Name

A dash symbol (-) is used when the gene encoding a protein is unknown.

Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.

Column 3 is the gene symbol of the canonical accession used to represent the respective gene group. Within each gene group, the first row is the canonical one.
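
The three columns described above can be read into gene groups with a short script. This is only a sketch: it assumes the columns are tab-separated, which this document does not state, and the function name `group_gene2acc` is hypothetical. The example rows are illustrative, built from the MAGIX accessions shown later in this document rather than copied from a real gene2acc file.

```python
from collections import defaultdict


def group_gene2acc(lines):
    """Group (gene symbol, accession) pairs by the canonical gene symbol in column 3.

    Assumes tab-separated columns. Row order is preserved, so the first
    pair in each group is the canonical one.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gene, accession, canonical_gene = line.split("\t")
        groups[canonical_gene].append((gene, accession))
    return dict(groups)


# Illustrative rows only (not taken from a real file).
example = [
    "MAGIX\tQ9H6Y5\tMAGIX\n",
    "MAGIX\tQ9H6Y5-2\tMAGIX\n",
    "MAGIX\tC9J123\tMAGIX\n",
]
for canonical_gene, members in group_gene2acc(example).items():
    canonical_accession = members[0][1]
    print(canonical_gene, canonical_accession, len(members))
```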

Protein FASTA files (*.fasta and *_additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format.

For further details on the standard UniProtKB format, please see:

https://www.uniprot.org/help/fasta-headers
https://www.uniprot.org/help/retrieve_sets

E.g. Canonical set:

    >sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
    MEPRTGGAANPKGSRGSRGPSPLAGPSARQLLARLDARPLAARAAVDVAALVRRAGATLR
    LRRKEAVSVLDSADIEVTDSRLPHATIVDHRPQHRWLETCNAPPQLIQGKAHSAPKPSQA
    SGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHINGE
    STQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGGP
    EVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGSP
    GPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

E.g. Additional sets:

    >sp|Q9H6Y5-2|MAGIX-2_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
    MPLLWITGPRYHLILLSEASCLRANYVHLCPLFQHRWLETCNAPPQLIQGKAHSAPKPSQ
    ASGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHING
    ESTQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGG
    PEVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGS
    PGPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

    >tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=2
    MSPNSPLHCFYLPAVSVLDSADIEVTDSRLPHATIVDHRPQVGDLVLHINGESTQGLTHA
    QAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVDRSPDPGGPEVTGSRSSSTSLV
    QHPPSRTTLKKTRGSPEPSPEAA
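
The headers in the examples above can be taken apart with a regular expression. The following is a sketch written against the three example headers only; the pattern and the function name `parse_header` are assumptions, and headers with other layouts (for instance a gene name containing spaces) would need a more careful pattern.

```python
import re

# Pattern written against the example headers above; not an official grammar.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxid>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?(?:\s+PE=(?P<pe>\d+))?(?:\s+SV=(?P<sv>\d+))?\s*$"
)
# Headers in the additional set start their description with "Isoform of <accession>".
ISOFORM_RE = re.compile(r"^Isoform of (?P<canonical>[^,\s]+)")


def parse_header(header: str) -> dict:
    """Return the header fields as strings, plus the canonical accession if present."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised FASTA header: {header!r}")
    fields = match.groupdict()
    isoform = ISOFORM_RE.match(fields["description"])
    # None for canonical sequences, the canonical accession for additional ones.
    fields["canonical"] = isoform.group("canonical") if isoform else None
    return fields


header = (
    ">sp|Q9H6Y5-2|MAGIX-2_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ "
    "domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4"
)
fields = parse_header(header)
print(fields["accession"], fields["canonical"], fields["gene"])
```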

Coding DNA Sequence FASTA files (*_DNA.fasta)

These files contain the coding DNA sequence (CDS) for each protein sequence that could be mapped to one. The format is as in the following example (UP000005640_9606_DNA.fasta):

    >sp|A0A183|ENSP00000411070
    ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC
    CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA
    GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG
    CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC
    TGA

The 3 fields of the FASTA header, separated by a vertical bar (|), are:

sp (Swiss-Prot reviewed) or tr (TrEMBL)
UniProtKB Accession
EMBL Protein ID or Ensembl/Ensembl Genomes ID
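
A small reader for these files might look like the sketch below. The function names `read_fasta` and `parse_cds_header` are hypothetical; the header handling assumes exactly three fields separated by "|", as in the example above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_cds_header(header: str) -> dict:
    """Split a *_DNA.fasta header into its three fields."""
    db, accession, source_id = header.lstrip(">").strip().split("|")
    if db not in ("sp", "tr"):
        raise ValueError(f"unexpected database tag: {db!r}")
    return {"db": db, "accession": accession, "source_id": source_id}


# The record shown in the example above.
record = [
    ">sp|A0A183|ENSP00000411070",
    "ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC",
    "CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA",
    "GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG",
    "CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC",
    "TGA",
]
for header, sequence in read_fasta(record):
    info = parse_cds_header(header)
    print(info["accession"], info["source_id"], len(sequence))
```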

Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)

For species where not every protein sequence could be mapped to a CDS, these files list the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):

    sp A6NF01 CAUTION: Could be the product of a pseudogene
    sp A4QN01 NOT_ANNOTATED_CDS

The 3 fields are:

"sp" (Swiss-Prot reviewed) or "tr" (TrEMBL)
UniProtKB accession
Reason why the protein could not be mapped to a CDS
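
Because the reason in the third field can itself contain spaces, a parser should split on only the first two separators. A sketch, with the hypothetical function name `parse_miss_line` and assuming the fields are separated by whitespace as in the example above:

```python
from collections import Counter


def parse_miss_line(line: str):
    """Return (database tag, accession, reason) for one *_DNA.miss line."""
    # Split at most twice so that the free-text reason stays in one piece.
    db, accession, reason = line.strip().split(None, 2)
    return db, accession, reason


lines = [
    "sp A6NF01 CAUTION: Could be the product of a pseudogene",
    "sp A4QN01 NOT_ANNOTATED_CDS",
]
# Tally how often each reason occurs.
reasons = Counter(parse_miss_line(line)[2] for line in lines)
for reason, count in reasons.items():
    print(count, reason)
```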

Database mapping files (*.idmapping)

These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:

UniProtKB accession
ID_type: database name as shown in UniProtKB cross-references and supported by the ID mapping tool on the UniProt web site (https://www.uniprot.org/id-mapping)
ID: identifier in the cross-referenced database
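
The three tab-separated columns lend themselves to a nested lookup keyed by accession and then by ID_type. The sketch below uses a hypothetical function name, `load_idmapping`, and illustrative rows assembled from the cross-references in the SeqXML example in this document; they are not copied from a real idmapping file.

```python
from collections import defaultdict


def load_idmapping(lines, wanted_types=None):
    """Build {accession: {ID_type: [ID, ...]}} from idmapping lines.

    If wanted_types is given, only those ID_type values are kept.
    """
    mapping = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        accession, id_type, identifier = line.split("\t")
        if wanted_types is not None and id_type not in wanted_types:
            continue
        # One accession can have several IDs of the same type.
        mapping[accession][id_type].append(identifier)
    return {acc: dict(types) for acc, types in mapping.items()}


# Illustrative rows only.
rows = [
    "A0A075B6H9\tUniProtKB-ID\tLV469_HUMAN\n",
    "A0A075B6H9\tGene_Name\tIGLV4-69\n",
    "A0A075B6H9\tEnsembl\tENSG00000211637\n",
]
xrefs = load_idmapping(rows, wanted_types={"Ensembl", "Gene_Name"})
print(xrefs["A0A075B6H9"])
```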

SeqXML files (*.xml)

The xml files contain all the information from the fasta (canonical and additional), idmapping and CDS files in SeqXML format (see https://seqxml.org).
E.g. (from UP000005640_9606.xml, header and one entry):

<?xml version="1.0" encoding="utf-8"?> 
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo sapiens" xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" sourceVersion="2016_04"
source="QfO http://www.ebi.ac.uk/reference_proteomes/" ncbiTaxID="9606">
<entry id="A0A075B6H9" source="UniProtKB">
<description>Immunoglobulin lambda variable 4-69</description>
<AAseq>MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI</AAseq>
<DBRef id="LV469_HUMAN" source="UniProtKB-ID"></DBRef>
<DBRef id="IGLV4-69" source="Gene_Name"></DBRef>
<DBRef id="1133968632" source="GI"></DBRef>
<DBRef id="UniRef100_A0A075B6H9" source="UniRef100"></DBRef>
<DBRef id="UniRef90_A0A075B6H9" source="UniRef90"></DBRef>
<DBRef id="UniRef50_A0A0B4J1Y8" source="UniRef50"></DBRef>
<DBRef id="UPI0000F30329" source="UniParc"></DBRef>
<DBRef id="AC245452" source="EMBL"></DBRef>
<DBRef id="-" source="EMBL-CDS"></DBRef>
<DBRef id="9606" source="NCBI_TaxID"></DBRef>
<DBRef id="ENSG00000211637" source="Ensembl"></DBRef>
<DBRef id="ENST00000390282" source="Ensembl_TRS"></DBRef>
<DBRef id="ENSP00000374817" source="Ensembl_PRO"></DBRef>
<DBRef id="uc062cba.1" source="UCSC"></DBRef>
<DBRef id="HostDB:ENSG00000211637.2" source="EuPathDB"></DBRef>
<DBRef id="IGLV4-69" source="GeneCards"></DBRef>
<DBRef id="HGNC:5921" source="HGNC"></DBRef>
<DBRef id="NX_A0A075B6H9" source="neXtProt"></DBRef>
<DBRef id="ENSGT00900000140867" source="GeneTree"></DBRef>
<DBRef id="GPRYLMK" source="OMA"></DBRef>
<DBRef id="6C449213D2CD44D7" source="CRC64"></DBRef>
<property name="GN" value="IGLV4-69"></property>
<property name="SV" value="1"></property>
<property name="DNAsource" value="ENSP00000374817"></property>
<property name="ensemblVersion" value="91,38"></property>
<property name="UPID" value="UP000005640"></property>
<property name="DNAseq" value="ATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCAT
GGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCA"></property>
<property name="PE" value="1"></property>
</entry>
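
Entries like the one above can be read with Python's standard xml.etree.ElementTree module. The following is a sketch only: the function name `parse_entries` is hypothetical, and `SAMPLE` is a shortened copy of the example with a closing seqXML tag added so that it parses on its own. For a full file, which may be large, a streaming approach such as `ET.iterparse` would be a more careful choice.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example entry; the closing </seqXML> tag is added here.
SAMPLE = """<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo sapiens" seqXMLversion="0.4" ncbiTaxID="9606">
<entry id="A0A075B6H9" source="UniProtKB">
<description>Immunoglobulin lambda variable 4-69</description>
<DBRef id="LV469_HUMAN" source="UniProtKB-ID"></DBRef>
<DBRef id="ENSG00000211637" source="Ensembl"></DBRef>
<DBRef id="HGNC:5921" source="HGNC"></DBRef>
<property name="GN" value="IGLV4-69"></property>
<property name="UPID" value="UP000005640"></property>
</entry>
</seqXML>"""


def parse_entries(xml_text: str) -> list:
    """Return one dict per <entry> with its cross-references and properties."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        xrefs = {}
        for ref in entry.findall("DBRef"):
            # Several DBRef elements can share the same source.
            xrefs.setdefault(ref.get("source"), []).append(ref.get("id"))
        properties = {p.get("name"): p.get("value") for p in entry.findall("property")}
        entries.append({
            "id": entry.get("id"),
            "description": entry.findtext("description"),
            "sequence": entry.findtext("AAseq"),  # None if the element is absent
            "xrefs": xrefs,
            "properties": properties,
        })
    return entries


for item in parse_entries(SAMPLE):
    print(item["id"], item["properties"]["GN"], item["xrefs"]["Ensembl"])
```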

Joining forces in the quest for orthologs

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403

Published: 29 September 2009

Toward community standards in the quest for orthologs

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium

Bioinformatics 2012, 28:900

Published: 12 February 2012

Big data and other challenges in the quest for orthologs

Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium

Bioinformatics 2014, 30:2993

Published: 26 July 2014

Standardized benchmarking in the quest for orthologs

Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz

Nature Methods 2016

Published online: 25 April 2018