Reference proteomes - Primary proteome sets for the Quest For Orthologs

RELEASE 2017_04

Based on UniProt release 2017_04, Ensembl release 87 and Ensembl Genomes release 34

Introduction

The Reference Proteomes group provides complete, non-redundant proteome sets for species chosen by the “Quest for Orthologs” group. The publicly available sets cover 78 species and are generated using UniProtKB, Ensembl and Ensembl Genomes.

The one-gene-one-protein proteome sets are compiled for species with complete genomes submitted to INSDC, with gene model annotations from:

  1. genome submitters
  2. Ensembl or Ensembl Genomes

The current release is composed of gene2acc, non-redundant fasta and idmapping files for:

  • a further 64 selected species in UniProt

If you have any questions or suggestions, please contact us at help@uniprot.org

Download

The gene2acc, fasta and idmapping files for individual species are available for download here:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO

or as a tarball of all species:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/QfO_release_2017_04.tar.gz
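
As a minimal sketch, the tarball can be fetched with Python's standard library. The function names are illustrative, and the sketch assumes that other releases follow the same naming pattern as 2017_04, which this document does not confirm.

```python
import urllib.request

BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO"


def tarball_url(release: str) -> str:
    """Build the tarball URL for a release label such as '2017_04'.

    Assumption: every release uses the QfO_release_<label>.tar.gz pattern.
    """
    return f"{BASE_URL}/QfO_release_{release}.tar.gz"


def download_release(release: str, destination: str) -> str:
    """Download the release tarball to `destination` and return that path."""
    urllib.request.urlretrieve(tarball_url(release), destination)
    return destination
```

For example, `download_release("2017_04", "QfO_release_2017_04.tar.gz")` would save the tarball in the current directory.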

SeqXML versions are documented by our partners and are available here:
http://www.seqxml.org/xml/Reference_proteomes.html

Predicted Orthologs

InParanoid

Roundup

OMA

Orthoinspector

Current Composition of Primary Protein Sets

The following table describes the status of the species:

Proteome ID   Taxonomy ID   Mnemonic   Species   Number of Genes/Proteins
UP000007062 7165 ANOGA Anopheles gambiae 11930
UP000000798 224324 AQUAE Aquifex aeolicus 1553
UP000006548 3702 ARATH Arabidopsis thaliana 27502
UP000001570 224308 BACSU Bacillus subtilis 4197
UP000001414 226186 BACTN Bacteroides thetaiotaomicron 4782
UP000007241 684364 BATDJ Batrachochytrium dendrobatidis 8610
UP000009136 9913 BOVIN Bos taurus 21987
UP000002526 224911 BRADU Bradyrhizobium diazoefficiens 8253
UP000001554 7739 BRAFL Branchiostoma floridae 28542
UP000001940 6239 CAEEL Caenorhabditis elegans 20057
UP000000559 237561 CANAL Candida albicans 6153
UP000002254 9615 CANLF Canis lupus 20141
UP000000431 272561 CHLTR Chlamydia trachomatis 895
UP000006906 3055 CHLRE Chlamydomonas reinhardtii 14271
UP000002008 324602 CHLAA Chloroflexus aurantiacus 3850
UP000008144 7719 CIOIN Ciona intestinalis 16678
UP000002149 214684 CRYNJ Cryptococcus neoformans 6603
UP000000437 7955 DANRE Danio rerio 25043
UP000002524 243230 DEIRA Deinococcus radiodurans 3085
UP000007719 515635 DICTD Dictyoglomus turgidum 1743
UP000002195 44689 DICDI Dictyostelium discoideum 12735
UP000000803 7227 DROME Drosophila melanogaster 13757
UP000000625 83333 ECOLI Escherichia coli 4306
UP000002521 190304 FUSNN Fusobacterium nucleatum 2046
UP000000539 9031 CHICK Gallus gallus 18557
UP000000577 243231 GEOSL Geobacter sulfurreducens 3402
UP000001548 184922 GIAIC Giardia intestinalis 7154
UP000000557 251221 GLOVI Gloeobacter violaceus 4406
UP000001519 9595 GORGO Gorilla gorilla 20946
UP000000554 64091 HALSA Halobacterium salinarum 2426
UP000000429 85962 HELPY Helicobacter pylori 1553
UP000015101 6412 HELRO Helobdella robusta 23328
UP000005640 9606 HUMAN Homo sapiens 21042
UP000001555 6945 IXOSC Ixodes scapularis 20469
UP000001686 374847 KORCO Korarchaeum cryptofilum 1602
UP000000542 5664 LEIMA Leishmania major 8038
UP000018468 7918 LEPOC Lepisosteus oculatus 18314
UP000001408 189518 LEPIN Leptospira interrogans 3676
UP000000805 243232 METJA Methanocaldococcus jannaschii 1787
UP000002487 188937 METAC Methanosarcina acetivorans 4468
UP000002280 13616 MONDO Monodelphis domestica 21271
UP000001357 81824 MONBE Monosiga brevicollis 9188
UP000000589 10090 MOUSE Mus musculus 22262
UP000001584 83332 MYCTU Mycobacterium tuberculosis 3993
UP000000807 243273 MYCGE Mycoplasma genitalium 483
UP000000425 122586 NEIMB Neisseria meningitidis 2001
UP000001593 45351 NEMVE Nematostella vectensis 24428
UP000002530 330879 ASPFU Neosartorya fumigata 9648
UP000001805 367110 NEUCR Neurospora crassa 9759
UP000000792 436308 NITMS Nitrosopumilus maritimus 1795
UP000059680 39947 ORYSJ Oryza sativa 44321
UP000001038 8090 ORYLA Oryzias latipes 19663
UP000002277 9598 PANTR Pan troglodytes 18980
UP000000600 5888 PARTE Paramecium tetraurelia 39461
UP000001055 321614 PHANO Phaeosphaeria nodorum 15998
UP000006727 3218 PHYPA Physcomitrella patens 34813
UP000005238 164328 PHYRM Phytophthora ramorum 15349
UP000001450 36329 PLAF7 Plasmodium falciparum 5360
UP000002438 208964 PSEAE Pseudomonas aeruginosa 5562
UP000008783 418459 PUCGT Puccinia graminis 15688
UP000002494 10116 RAT Rattus norvegicus 21412
UP000001025 243090 RHOBA Rhodopirellula baltica 7271
UP000002311 559292 YEAST Saccharomyces cerevisiae 6722
UP000002485 284812 SCHPO Schizosaccharomyces pombe 5142
UP000001312 665079 SCLS1 Sclerotinia sclerotiorum 14445
UP000001973 100226 STRCO Streptomyces coelicolor 8038
UP000001974 273057 SULSO Sulfolobus solfataricus 2938
UP000001425 1111708 SYNY3 Synechocystis sp. 3507
UP000001449 35128 THAPS Thalassiosira pseudonana 11717
UP000000536 69014 THEKO Thermococcus kodakarensis 2301
UP000000718 289376 THEYD Thermodesulfovibrio yellowstonii 1982
UP000008183 243274 THEMA Thermotoga maritima 1852
UP000007266 7070 TRICA Tribolium castaneum 16563
UP000001542 5722 TRIVA Trichomonas vaginalis 50190
UP000000561 237631 USTMA Ustilago maydis 6788
UP000008143 8364 XENTR Xenopus tropicalis 24177
UP000001300 284591 YARLI Yarrowia lipolytica 6448
UP000007305 4577 MAIZE Zea mays 39476
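
Each row above appears to carry five fields: proteome identifier, NCBI taxonomy identifier, organism mnemonic, species name and gene/protein count. The sketch below splits a row into those fields and derives the `<proteome ID>_<taxonomy ID>` prefix seen in file names such as UP000005640_9606.xml. The class and function names are illustrative, and the assumption that every file type shares this prefix goes beyond what this document shows.

```python
from typing import NamedTuple


class ProteomeRow(NamedTuple):
    upid: str        # proteome identifier, e.g. UP000005640
    tax_id: int      # NCBI taxonomy identifier
    mnemonic: str    # organism mnemonic, e.g. HUMAN
    species: str     # species name (may contain spaces)
    n_proteins: int  # number of genes/proteins in the set


def parse_row(line: str) -> ProteomeRow:
    """Split one whitespace-separated table row into its five fields."""
    upid, tax_id, mnemonic, rest = line.split(None, 3)
    # The species name can contain spaces, so the count is taken from the right.
    species, count = rest.rsplit(None, 1)
    return ProteomeRow(upid, int(tax_id), mnemonic, species, int(count))


def file_prefix(row: ProteomeRow) -> str:
    """Return the assumed file-name prefix, e.g. 'UP000005640_9606'."""
    return f"{row.upid}_{row.tax_id}"
```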

Gene mapping files (*.gene2acc)

Column 1 is a unique gene symbol, chosen from the annotation found in the following sources, in order of preference:

  1. Model Organism Database (MOD)
  2. Ensembl or Ensembl Genomes database
  3. UniProt Ordered Locus Name (OLN)
  4. UniProt Open Reading Frame (ORF)
  5. UniProt Gene Name

A dash symbol (-) is used when the gene encoding a protein is unknown.

Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.

Column 3 is the gene symbol of the canonical accession used to represent the respective gene group. The first row of each gene group is the canonical one.
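
The three columns can be read into gene groups as in the following sketch. It assumes tab-separated columns, which this document does not state for gene2acc files, and the function name and sample rows are illustrative only.

```python
from collections import defaultdict


def read_gene2acc(lines):
    """Group (gene symbol, accession) pairs by the canonical gene symbol.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumption: columns are separated by tabs.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        gene, accession, canonical_gene = fields[:3]
        groups[canonical_gene].append((gene, accession))
    return dict(groups)


# Hypothetical rows, for illustration only.
SAMPLE = [
    "MAGIX\tQ9H6Y5\tMAGIX\n",
    "MAGIX\tQ9H6Y5-2\tMAGIX\n",
    "MAGIX\tC9J123\tMAGIX\n",
]

if __name__ == "__main__":
    for canonical, members in read_gene2acc(SAMPLE).items():
        print(canonical, [accession for _, accession in members])
```

Because rows are appended in file order, the first member of each group is the row that the format describes as canonical.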

Protein FASTA files (*.fasta and *_additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format.

For further details on the standard UniProtKB format, please see:

http://www.uniprot.org/help/fasta-headers
http://www.uniprot.org/faq/38

E.g. Canonical set:

    >sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens GN=MAGIX PE=1 SV=3
    MEPRTGGAANPKGSRGSRGPSPLAGPSARQLLARLDARPLAARAAVDVAALVRRAGATLR
    LRRKEAVSVLDSADIEVTDSRLPHATIVDHRPQHRWLETCNAPPQLIQGKAHSAPKPSQA
    SGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHINGE
    STQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGGP
    EVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGSP
    GPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

E.g. Additional set:

    >sp|Q9H6Y5-2|MAGIX_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens GN=MAGIX
    MPLLWITGPRYHLILLSEASCLRANYVHLCPLFQHRWLETCNAPPQLIQGKAHSAPKPSQ
    ASGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHING
    ESTQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGG
    PEVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGS
    PGPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH
    >tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens GN=MAGIX PE=1 SV=2
    MSPNSPLHCFYLPAVSVLDSADIEVTDSRLPHATIVDHRPQVGDLVLHINGESTQGLTHA
    QAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVDRSPDPGGPEVTGSRSSSTSLV
    QHPPSRTTLKKTRGSPEPSPEAA
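
The headers shown above can be split into their parts with a small parser. This is a sketch based only on the example headers in this section, not a full implementation of the UniProtKB header specification; the function name and the dictionary keys are illustrative.

```python
import re

# >db|accession|entry_name free text with OS=, GN=, PE= and SV= fields
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)
# Prefix used in the additional set to point at the canonical sequence.
ISOFORM_RE = re.compile(r"^Isoform of (?P<canonical>[^,\s]+),\s*")
FIELD_RE = re.compile(r"\s(OS|GN|PE|SV)=")


def parse_header(line):
    """Parse one FASTA header line into a dict of its components."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    record = {
        "db": match.group("db"),
        "accession": match.group("accession"),
        "entry_name": match.group("entry_name"),
        "canonical": None,  # set only for additional-set headers
    }
    rest = match.group("rest")
    isoform = ISOFORM_RE.match(rest)
    if isoform:
        record["canonical"] = isoform.group("canonical")
        rest = rest[isoform.end():]
    # Split into the description followed by alternating key/value items.
    parts = FIELD_RE.split(" " + rest)
    record["description"] = parts[0].strip()
    for key, value in zip(parts[1::2], parts[2::2]):
        record[key] = value.strip()
    return record
```

Headers from the additional set yield a non-empty `canonical` value, which makes it possible to attach each isoform to its canonical sequence.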

Coding DNA Sequence FASTA files (*_DNA.fasta)

These files contain the coding DNA sequences (CDS) for those protein sequences that could be mapped to a CDS. The format is as in the following example (UP000005640_9606_DNA.fasta):

    >sp|A0A183|ENSP00000411070
    ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC
    CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA
    GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG
    CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC
    TGA

The 3 fields of the FASTA header are:

  1. sp (Swiss-Prot reviewed) or tr (TrEMBL)
  2. UniProtKB Accession
  3. EMBL Protein ID or Ensembl/Ensembl Genome ID
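
A header of this form can be split on the pipe character, as in the sketch below. The second function is an optional sanity check and only a heuristic: fragments and sequences with non-standard start codons will fail it. Both function names are illustrative.

```python
def parse_cds_header(line):
    """Split a *_DNA.fasta header into (status, accession, source_id)."""
    status, accession, source_id = line.strip().lstrip(">").split("|")
    if status not in ("sp", "tr"):
        raise ValueError(f"unexpected status field: {status!r}")
    return status, accession, source_id


def looks_like_complete_cds(sequence):
    """Heuristic: length divisible by 3, ATG start and a standard stop codon."""
    seq = "".join(sequence.split()).upper()  # drop line breaks and spaces
    return (
        len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in {"TAA", "TAG", "TGA"}
    )
```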

Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)

For the species that did not have a perfect mapping for all protein sequences to a CDS, these files contain the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):

    sp A6NF01 CAUTION: Could be the product of a pseudogene.
    sp A6NFI3 NOT_ANNOTATED_CDS

The 3 fields are:

  1. "sp" (Swiss-Prot reviewed) or "tr" (TrEMBL)
  2. UniProtKB accession
  3. Reason why the protein could not be mapped to a CDS
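
Since the reason field can itself contain spaces, a line is best split at most twice. The sketch below assumes whitespace-separated fields, as in the example above; the function names are illustrative.

```python
from collections import Counter


def parse_miss_line(line):
    """Split a *_DNA.miss line into (status, accession, reason)."""
    # maxsplit=2 keeps spaces inside the free-text reason intact.
    status, accession, reason = line.strip().split(None, 2)
    return status, accession, reason


def count_reasons(lines):
    """Tally how often each reason occurs, skipping blank lines."""
    return Counter(parse_miss_line(line)[2] for line in lines if line.strip())
```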

Database mapping files (*.idmapping)

These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:

  1. UniProtKB accession
  2. ID_type
  3. ID: identifier in the cross-referenced database
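
The three tab-separated columns lend themselves to a nested lookup from accession to ID type to identifiers, as sketched below. The function name is illustrative, and any ID_type values used with it are examples rather than a list of the types that actually occur in the files.

```python
from collections import defaultdict


def read_idmapping(lines):
    """Build {accession: {id_type: [identifier, ...]}} from idmapping lines.

    `lines` is any iterable of text lines, e.g. an open file.
    """
    mapping = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        accession, id_type, identifier = line.split("\t", 2)
        mapping[accession][id_type].append(identifier)
    # Convert to plain dicts so missing keys raise KeyError as usual.
    return {acc: dict(types) for acc, types in mapping.items()}
```

One accession can map to several identifiers of the same type, which is why each ID type holds a list.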

SeqXML files (*.xml)

The XML files contain all the information from the FASTA (canonical and additional), idmapping and CDS files in SeqXML format (see http://seqxml.org).
E.g. (from UP000005640_9606.xml, header and one entry):

    <?xml version="1.0" encoding="utf-8"?> <seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo
    sapiens" xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" sourceVersion="2016_04"
    source="QfO http://www.ebi.ac.uk/reference_proteomes/" ncbiTaxID="9606"> <entry source="UniProtKB" id="U3KQE9">
    <description>Uncharacterized protein (Fragment)</description> <AAseq>XLLLAINGVTECFTFAAMSKEEVDRYNFVMLALSSSFLVLSYLLTRWCGSVGFILANCFNMGIRITQSLCFIHRYYRRSPHRPLAGLHLSPVLLGTFALSGGVTAVSERPRKPWRSSGPWCPC</AAseq>
    <DBRef source="UniProtKB-ID" id="U3KQE9_HUMAN"/> <DBRef source="UniRef100" id="UniRef100_U3KQE9"/> <DBRef
    source="UniRef90" id="UniRef90_U3KQE9"/> <DBRef source="UniRef50" id="UniRef50_U3KQE9"/> <DBRef source="UniParc"
    id="UPI00038BAF87"/> <DBRef source="EMBL" id="AC096887"/> <DBRef source="EMBL" id="AC099667"/> <DBRef source="EMBL"
    id="AC103589"/> <DBRef source="EMBL-CDS" id="-"/> <DBRef source="NCBI_TaxID" id="9606"/> <DBRef source="STRING"
    id="9606.ENSP00000296292"/> <DBRef source="Ensembl" id="ENSG00000272305"/> <DBRef source="Ensembl_TRS" id="ENST00000607283"/>
    <DBRef source="Ensembl_PRO" id="ENSP00000475819"/> <DBRef source="UCSC" id="uc062krn.1"/> <DBRef source="eggNOG"
    id="KOG2864"/> <DBRef source="eggNOG" id="ENOG410Y1D5"/> <DBRef source="GeneTree" id="ENSGT00390000011390"/>
    <DBRef source="OMA" id="ERPRKPW"/> <DBRef source="UniProt" id="U3KQG8"/> <DBRef source="UniProt" id="U3KQK6"/>
    <property name="DNAsource" value="ENSP00000475819"/> <property name="ensemblVersion" value="84,31"/> <property
    name="UPID" value="UP000005640"/> <property name="SV" value="1"/> <property name="DNAseq" value="NTTCTCCTGCTTGCCATCAATGGAGTGACAGAGTGTTTCACATTTGCTGCCATGAGCAAAGAGGAGGTCGACAGGTACAATTTTGTGATGCTGGCCCTGTCCTCCTCATTCCTGGTGTTATCCTATCTCTTGACCCGTTGGTGTGGCAGCGTGGGCTTCATCTTGGCCAACTGCTTTAACATGGGCATTCGGATCACGCAGAGCCTTTGCTTCATCCACCGCTACTACCGAAGGAGCCCCCACAGGCCCCTGGCTGGCCTGCACCTATCGCCAGTCCTGCTCGGGACATTTGCCCTCAGTGGTGGGGTTACTGCTGTTTCGGAGAGGCCAAGAAAACCATGGAGGAGCAGTGGACCTTGGTGTCCCTGCTGA"/>
    <property name="PE" value="4"/> </entry>
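
Entries like the one above can be read with Python's standard XML parser. This sketch relies only on the element and attribute names visible in the example (entry, description, AAseq, DBRef, property); the function name and the returned dictionary layout are illustrative, and the sample document is a heavily abbreviated version of the entry shown.

```python
import xml.etree.ElementTree as ET


def read_entries(xml_text):
    """Return one dict per <entry> element of a SeqXML document."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        # A source such as EMBL can occur several times, so collect lists.
        dbrefs = {}
        for ref in entry.findall("DBRef"):
            dbrefs.setdefault(ref.get("source"), []).append(ref.get("id"))
        properties = {
            prop.get("name"): prop.get("value")
            for prop in entry.findall("property")
        }
        entries.append({
            "id": entry.get("id"),
            "source": entry.get("source"),
            "description": entry.findtext("description"),
            "sequence": entry.findtext("AAseq"),
            "dbrefs": dbrefs,
            "properties": properties,
        })
    return entries


# Abbreviated sample; the sequence is truncated for illustration.
SAMPLE_XML = """<seqXML seqXMLversion="0.4" ncbiTaxID="9606">
  <entry source="UniProtKB" id="U3KQE9">
    <description>Uncharacterized protein (Fragment)</description>
    <AAseq>XLLLAINGVTECF</AAseq>
    <DBRef source="EMBL" id="AC096887"/>
    <DBRef source="EMBL" id="AC099667"/>
    <property name="UPID" value="UP000005640"/>
    <property name="PE" value="4"/>
  </entry>
</seqXML>"""

if __name__ == "__main__":
    for item in read_entries(SAMPLE_XML):
        print(item["id"], item["properties"].get("UPID"))
```

For a complete proteome file, an incremental parser such as `xml.etree.ElementTree.iterparse` may be preferable to loading the whole document at once.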

Joining forces in the quest for orthologs

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403

Published: 29 September 2009

Toward community standards in the quest for orthologs

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium

Bioinformatics 2012, 28:900

Published: 12 February 2012

Big data and other challenges in the quest for orthologs

Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium

Bioinformatics 2014, 30:2993

Published: 26 July 2014

Standardized benchmarking in the quest for orthologs

Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz

Nature Methods 2016, 13:425

Published online: 4 April 2016