Reference proteomes - Primary proteome sets for the Quest For Orthologs
RELEASE 2022_02 based on UniProt Release 2022_02, Ensembl release 105 and Ensembl Genome release 52
Introduction
The Reference Proteomes group provides complete,
non-redundant proteome sets for species chosen by
the “Quest for Orthologs” group. This release comprises 78 species.
The sets are publicly available and are generated
using UniProtKB, Ensembl and Ensembl Genomes.
The one-gene-one-protein proteome sets are compiled
for species with complete genomes submitted
to INSDC, with gene model annotations from:
- genome submitters
- Ensembl or Ensembl Genomes
Download
The gene2acc, fasta and idmapping files for
individual species are available for download
here:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO
or as a tarball of all species:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/QfO_release_2022_02.tar.gz
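As a minimal sketch, the release tarball could be fetched from Python with the standard library. The helper names `release_tarball_url` and `download_release` are hypothetical, and the idea that other releases follow the same `QfO_release_<label>.tar.gz` naming is an assumption based only on the URL above.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the download section above.
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO"


def release_tarball_url(release: str) -> str:
    """Build the tarball URL for a release label such as '2022_02'.

    Assumption: the file name pattern matches the 2022_02 tarball.
    """
    return f"{BASE_URL}/QfO_release_{release}.tar.gz"


def download_release(release: str, dest_dir: str = ".") -> Path:
    """Download the release tarball into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"QfO_release_{release}.tar.gz"
    urllib.request.urlretrieve(release_tarball_url(release), str(archive))
    return archive
```

Calling `download_release("2022_02")` would save the archive in the current directory; the tarball covers all species, so the per-species directory listing may be preferable when only a few proteomes are needed.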
SeqXML versions are documented by our partners and
they are available here:
https://www.seqxml.org/xml/Reference_proteomes.html
Predicted Orthologs
Current Composition of Primary Protein Sets
The following table describes the status of the species:
Gene mapping files (*.gene2acc)
Column 1 is a unique gene symbol that is chosen with the following order of preference from the annotation found in:
- Model Organism Database (MOD)
- Ensembl or Ensembl Genomes database
- UniProt Ordered Locus Name (OLN)
- UniProt Open Reading Frame (ORF)
- UniProt Gene Name
A dash symbol (-) is used when the gene encoding a protein is unknown.
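The order of preference above can be sketched as a simple lookup. The source keys (`MOD`, `Ensembl`, `OLN`, `ORF`, `GeneName`) and the function name are hypothetical labels chosen for illustration; this is not the pipeline that actually builds the files.

```python
from typing import Mapping, Optional

# Hypothetical keys for the annotation sources, in the order of
# preference listed above.
PREFERENCE = ("MOD", "Ensembl", "OLN", "ORF", "GeneName")


def choose_gene_symbol(candidates: Mapping[str, Optional[str]]) -> str:
    """Return the first available symbol in preference order, or '-'."""
    for source in PREFERENCE:
        symbol = candidates.get(source)
        if symbol:
            return symbol
    # No source provides a symbol: the gene is unknown.
    return "-"
```

For example, a protein with only an ORF name and a UniProt gene name would be assigned the ORF name, and a protein with no annotation at all would get the dash.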
Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.
Column 3 is the gene symbol of the canonical accession used to represent the respective gene group. Within each gene group, the first row is the canonical one.
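A minimal sketch of reading a gene2acc file and grouping accessions by their gene group follows. It assumes the three columns are tab-separated, which is not stated above, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, str]


def read_gene2acc(lines: Iterable[str]) -> List[Row]:
    """Parse rows of (gene symbol, accession, canonical gene symbol).

    Assumption: columns are separated by tabs.
    """
    rows: List[Row] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        rows.append((fields[0], fields[1], fields[2]))
    return rows


def group_by_canonical(rows: Iterable[Row]) -> Dict[str, List[str]]:
    """Collect accessions per gene group (column 3), keeping file order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for _gene, accession, canonical_gene in rows:
        groups[canonical_gene].append(accession)
    return dict(groups)
```

With an open file handle `fh`, `group_by_canonical(read_gene2acc(fh))` gives one list of accessions per gene group, and file order is preserved within each list.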
Protein FASTA files (*.fasta and *_additional.fasta)
These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of ..."). The FASTA format is the standard UniProtKB format.
For further information about the standard UniProtKB format, please see:
https://www.uniprot.org/help/fasta-headers
https://www.uniprot.org/faq/38
E.g. Canonical set:
>sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
MEPRTGGAANPKGSRGSRGPSPLAGPSARQLLARLDARPLAARAAVDVAALVRRAGATLR
LRRKEAVSVLDSADIEVTDSRLPHATIVDHRPQHRWLETCNAPPQLIQGKAHSAPKPSQA
SGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHINGE
STQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGGP
EVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGSP
GPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH
E.g. Additional sets:
>sp|Q9H6Y5-2|MAGIX-2_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4
MPLLWITGPRYHLILLSEASCLRANYVHLCPLFQHRWLETCNAPPQLIQGKAHSAPKPSQ
ASGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHING
ESTQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGG
PEVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGS
PGPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH
>tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=2
MSPNSPLHCFYLPAVSVLDSADIEVTDSRLPHATIVDHRPQVGDLVLHINGESTQGLTHA
QAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVDRSPDPGGPEVTGSRSSSTSLV
QHPPSRTTLKKTRGSPEPSPEAA
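The header layout shown in the examples can be parsed with a couple of regular expressions. This is a minimal sketch based only on the headers above; the function name and the returned keys are hypothetical, and headers that deviate from these examples may need extra handling.

```python
import re
from typing import Dict, Optional

_HEADER = re.compile(r"^>(sp|tr)\|([^|\s]+)\|(\S+)(?:\s+(.*))?$")
_FIELDS = re.compile(r"\s(OS|OX|GN|PE|SV)=")
_ISOFORM_OF = re.compile(r"^Isoform of (\S+?),")


def parse_uniprot_header(header: str) -> Dict[str, Optional[str]]:
    """Split a UniProtKB-style FASTA header into its parts.

    'canonical' is set only for entries of the additional set, whose
    description starts with 'Isoform of <accession>,'.
    """
    match = _HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    status, accession, entry_name, rest = match.groups()
    # Splitting on the KEY= markers yields [description, key, value, ...].
    parts = _FIELDS.split(" " + (rest or ""))
    description = parts[0].strip()
    record: Dict[str, Optional[str]] = {
        "status": status,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "canonical": None,
    }
    for key, value in zip(parts[1::2], parts[2::2]):
        record[key] = value.strip()
    isoform = _ISOFORM_OF.match(description)
    if isoform:
        record["canonical"] = isoform.group(1)
    return record
```

Applied to the additional-set headers above, the `canonical` key would hold `Q9H6Y5`, which allows isoforms to be linked back to their canonical sequence without consulting the gene2acc file.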
Coding DNA Sequence FASTA files (*_DNA.fasta)
These files contain the coding DNA sequences (CDS) for the protein sequences, where a CDS could be mapped. The format is shown in the following example (UP000005640_9606_DNA.fasta):
>sp|A0A183|ENSP00000411070
ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC
CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA
GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG
CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC
TGA
The 3 fields of the FASTA header are:
- sp (Swiss-Prot reviewed) or tr (TrEMBL)
- UniProtKB Accession
- EMBL Protein ID or Ensembl/Ensembl Genome ID
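A minimal reader for these files might look like the sketch below. It assumes the three header fields are separated by `|` as in the example, and that sequences may wrap over several lines; the class and function names are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class CdsHeader(NamedTuple):
    status: str       # "sp" or "tr"
    accession: str    # UniProtKB accession
    protein_id: str   # EMBL Protein ID or Ensembl/Ensembl Genomes ID


def parse_cds_header(header: str) -> CdsHeader:
    """Split a header such as '>sp|A0A183|ENSP00000411070'."""
    fields = header.strip().lstrip(">").split("|")
    if len(fields) != 3 or fields[0] not in ("sp", "tr"):
        raise ValueError(f"unexpected CDS header: {header!r}")
    return CdsHeader(*fields)


def read_cds_fasta(lines: Iterable[str]) -> Iterator[Tuple[CdsHeader, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header: Optional[CdsHeader] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = parse_cds_header(line), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Keying the result by `accession` makes it straightforward to pair each CDS with the protein sequence of the same accession in the protein FASTA files.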
Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)
For the species that did not have a perfect mapping for all protein sequences to a CDS, these files contain the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):
sp A6NF01 CAUTION: Could be the product of a pseudogene
sp A4QN01 NOT_ANNOTATED_CDS
The 3 fields are:
- "sp" (Swiss-Prot reviewed) or "tr" (TrEMBL)
- UniProtKB accession
- Reason why the protein could not be mapped to a CDS
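To summarise why proteins could not be mapped, the entries can be parsed and tallied by reason. This sketch assumes the first two fields are separated by whitespace (spaces or tabs) and that the rest of the line is the reason; the names used are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class MissEntry(NamedTuple):
    status: str     # "sp" or "tr"
    accession: str  # UniProtKB accession
    reason: str     # free-text reason, may contain spaces


def read_miss(lines: Iterable[str]) -> List[MissEntry]:
    """Parse .miss lines; the reason is everything after field two."""
    entries: List[MissEntry] = []
    for line in lines:
        fields = line.strip().split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or incomplete lines
        entries.append(MissEntry(*fields))
    return entries


def count_reasons(entries: Iterable[MissEntry]) -> Counter:
    """Count how often each reason occurs."""
    return Counter(entry.reason for entry in entries)
```

Splitting at most twice keeps reasons such as "CAUTION: Could be the product of a pseudogene" intact as a single field.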
Database mapping files (*.idmapping)
These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:
- UniProtKB accession
- ID_type: Database name as shown in UniProtKB cross-references and supported by the ID mapping tool on the UniProt web site (https://www.uniprot.org/mapping)
- ID: Identifier in the cross-referenced database.
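The three-column layout lends itself to a nested lookup from accession to database to identifiers. The sketch below is one possible way to load it; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_idmapping(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Build {accession: {ID_type: [ID, ...]}} from tab-separated lines."""
    mapping: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        accession, id_type, identifier = fields
        mapping[accession][id_type].append(identifier)
    # Convert to plain dicts so missing keys raise KeyError as usual.
    return {acc: dict(types) for acc, types in mapping.items()}
```

A list is kept per database because one accession can carry several identifiers of the same type.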
SeqXML files (*.xml)
The XML files contain all the information from the FASTA
(canonical and additional), idmapping and CDS files in
SeqXML format (see
https://seqxml.org).
E.g. (from UP000005640_9606.xml, header and one
entry)
<?xml version="1.0" encoding="utf-8"?>
<seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo sapiens" xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" sourceVersion="2016_04" source="QfO http://www.ebi.ac.uk/reference_proteomes/" ncbiTaxID="9606">
  <entry id="A0A075B6H9" source="UniProtKB">
    <description>Immunoglobulin lambda variable 4-69</description>
    <AAseq>MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI</AAseq>
    <DBRef id="LV469_HUMAN" source="UniProtKB-ID"></DBRef>
    <DBRef id="IGLV4-69" source="Gene_Name"></DBRef>
    <DBRef id="1133968632" source="GI"></DBRef>
    <DBRef id="UniRef100_A0A075B6H9" source="UniRef100"></DBRef>
    <DBRef id="UniRef90_A0A075B6H9" source="UniRef90"></DBRef>
    <DBRef id="UniRef50_A0A0B4J1Y8" source="UniRef50"></DBRef>
    <DBRef id="UPI0000F30329" source="UniParc"></DBRef>
    <DBRef id="AC245452" source="EMBL"></DBRef>
    <DBRef id="-" source="EMBL-CDS"></DBRef>
    <DBRef id="9606" source="NCBI_TaxID"></DBRef>
    <DBRef id="ENSG00000211637" source="Ensembl"></DBRef>
    <DBRef id="ENST00000390282" source="Ensembl_TRS"></DBRef>
    <DBRef id="ENSP00000374817" source="Ensembl_PRO"></DBRef>
    <DBRef id="uc062cba.1" source="UCSC"></DBRef>
    <DBRef id="HostDB:ENSG00000211637.2" source="EuPathDB"></DBRef>
    <DBRef id="IGLV4-69" source="GeneCards"></DBRef>
    <DBRef id="HGNC:5921" source="HGNC"></DBRef>
    <DBRef id="NX_A0A075B6H9" source="neXtProt"></DBRef>
    <DBRef id="ENSGT00900000140867" source="GeneTree"></DBRef>
    <DBRef id="GPRYLMK" source="OMA"></DBRef>
    <DBRef id="6C449213D2CD44D7" source="CRC64"></DBRef>
    <property name="GN" value="IGLV4-69"></property>
    <property name="SV" value="1"></property>
    <property name="DNAsource" value="ENSP00000374817"></property>
    <property name="ensemblVersion" value="91,38"></property>
    <property name="UPID" value="UP000005640"></property>
    <property name="DNAseq" value="ATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCA"></property>
    <property name="PE" value="1"></property>
  </entry>
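Entries in this layout can be read with Python's standard `xml.etree.ElementTree` module. The sketch below relies only on the element and attribute names visible in the example (`entry`, `description`, `AAseq`, `DBRef`, `property`); the function name and the abbreviated `SAMPLE` document, which adds a closing root tag, are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, Iterator

# Abbreviated, well-formed sample modelled on the entry shown above.
SAMPLE = """<seqXML seqXMLversion="0.4" speciesName="Homo sapiens" ncbiTaxID="9606">
  <entry id="A0A075B6H9" source="UniProtKB">
    <description>Immunoglobulin lambda variable 4-69</description>
    <AAseq>MAWTPLLFLTLLLHCTGSLSQ</AAseq>
    <DBRef id="LV469_HUMAN" source="UniProtKB-ID"></DBRef>
    <DBRef id="ENSG00000211637" source="Ensembl"></DBRef>
    <property name="GN" value="IGLV4-69"></property>
    <property name="UPID" value="UP000005640"></property>
  </entry>
</seqXML>"""


def parse_seqxml_entries(xml_text: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per <entry> element of a SeqXML document."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        yield {
            "id": entry.get("id"),
            "description": entry.findtext("description"),
            "aa_seq": entry.findtext("AAseq"),
            # (source, id) pairs, e.g. ("Ensembl", "ENSG00000211637")
            "dbrefs": [
                (ref.get("source"), ref.get("id"))
                for ref in entry.findall("DBRef")
            ],
            # property name -> value, e.g. {"GN": "IGLV4-69"}
            "properties": {
                prop.get("name"): prop.get("value")
                for prop in entry.findall("property")
            },
        }
```

For the full per-species files, which can be large, `ET.iterparse` may be a better fit than loading the whole document into memory at once.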
Joining forces in the quest for orthologs
Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella, Erik LL Sonnhammer and Suzanna Lewis
Genome Biology 2009, 10:403
Published: 29 September 2009
Toward community standards in the quest for orthologs
Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium
Bioinformatics 2012, 28:900
Published: 12 February 2012
Big data and other challenges in the quest for orthologs
Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium
Bioinformatics 2014, 30:2993
Published: 26 July 2014
Standardized benchmarking in the quest for orthologs
Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz
Nature Methods 2016
Published online: 25 April 2018