- Why are the UniProt association files on the GOA and GO sites different?
- I want to look only at mouse/rat/zebrafish/Arabidopsis GO annotations - which file should I use to get the fullest set of GO annotations for this species?
- Why can I/can't I see an annotation in a UniProt record when it appears in the association file?
- Why are the species-specific files and multi-species (UniProt) gene association files different?
- How do I change between UniProt accessions and other identifiers, e.g. Ensembl, EMBL, RefSeq Gene ID, UniGene?
- What is in the PDB gene association file?
- What GO tools can I use to display/compare annotations to my selected proteins/genes?
- Is it correct to assume that genes that belong to a child category will automatically be a member of the parent term?
- There's an annotation error in your file (from SWIS/PINC/external MODs/electronic sources)
- How do I download a bulk set of GO annotations? What is the format? I need GO term names to be included.
- How is GOA created?
- GO slims: what are they, and how do I make one?
- What do the evidence codes mean?
- Why has this annotation disappeared from this entry? (InterPro2GO mapping changes)
- How do I cite GOA?
1. Why are the UniProt association files on the GOA and GO sites different?
Within the GO Consortium, a number of groups have been given responsibility for acting as the authoritative repository for GO annotations to certain model species. For instance, MGI is responsible for collecting all mouse annotations, and GOA is responsible for human, bovine and chicken annotations.
There are two folders on the GO ftp site. The main folder ( ftp://ftp.geneontology.org/pub/go/gene-associations/ ) contains a non-redundant set of annotations for the model species: annotations are filtered by taxon id to remove annotations to any species that another group is responsible for. For example, as MGI is responsible for annotating mouse, GOA files in this folder have been filtered to remove all mouse annotations, and any new annotations we make for this species are collected by MGI directly from us. This filtering has been applied since September 2005.
The filtering procedure has made a big difference to the GOA UniProt association file because, in addition to producing our own annotations, we also integrate manual annotations from many of the model organism groups to provide the UniProt database with a comprehensive set of GO annotations. To see a list of the species that are filtered from the gene association files on the GO ftp site, please see:
Gene association files not filtered by taxon id are available from the ftp://ftp.geneontology.org/pub/go/gene-associations/submission directory of the GO ftp site.
The main section of the GO Current Annotation web page displays links to the filtered ( http://www.geneontology.org/GO.current.annotations.shtml?all#filter ) and unfiltered ( http://www.geneontology.org/GO.current.annotations.shtml?all#unfilter ) versions of gene association files.
Alternatively, you can download the unfiltered UniProt association file (and species-specific files) directly from the GOA ftp site:
ftp://ftp.ebi.ac.uk/pub/databases/GO/GOA/UNIPROT/gene_association.goa_uniprot.gz (please note the size of this file: ~120 MB)
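The taxon filtering described above can be pictured with a minimal Python sketch. It is not the GO Consortium's filtering script; it only assumes that the taxon sits in column 13 of each tab-delimited line (as in the example records shown in FAQ 4). The function name, the helper `make_record` and the two sample records are hypothetical.

```python
# Column 13 of a gene association line holds the taxon, e.g. "taxon:10090".
TAXON_COLUMN = 12  # zero-based index of column 13


def filter_by_taxon(lines, excluded_taxa):
    """Yield lines whose taxon is not in excluded_taxa (hypothetical helper)."""
    for line in lines:
        if line.startswith("!"):  # comment/header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # The taxon field may hold two ids joined by "|"; the first one is
        # the species of the annotated gene product.
        taxon = fields[TAXON_COLUMN].split("|")[0]
        if taxon not in excluded_taxa:
            yield line


def make_record(accession, taxon):
    """Build an illustrative 15-column record; the values are placeholders."""
    fields = ["UniProt", accession, "SYMBOL", "", "GO:0005488", "REF:placeholder",
              "IDA", "", "F", "", "", "protein", taxon, "20060125", "UniProt"]
    return "\t".join(fields)


records = [make_record("ACC_HUMAN_EXAMPLE", "taxon:9606"),
           make_record("ACC_MOUSE_EXAMPLE", "taxon:10090")]

# Remove mouse (taxon:10090) annotations, as in the MGI example above.
kept = list(filter_by_taxon(records, excluded_taxa={"taxon:10090"}))
print(len(kept))
```

The same function, called with a different exclusion set, could be used to pull a single species out of the unfiltered file.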
2. I want to look only at mouse/rat/zebrafish/Arabidopsis GO annotations - which file should I use to get the fullest set of GO annotations for this species?
In the GO Consortium there are a number of model organism groups which provide association files containing annotations for their species; these groups are the authoritative repository for their particular species. In these files, annotations will be associated to the protein/gene identifier of the corresponding model organism group. These groups also integrate annotations from other GO annotation sources such as GOA (a multi-species resource) on a regular basis.
The GOA group also provides a number of species-specific files (including human, mouse, rat, zebrafish, Arabidopsis, chicken, cow and Drosophila proteomes). These are created using the UniProtKB Complete Proteome sets ( http://www.uniprot.org/faq/15 ), which consist of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. At each monthly release, GOA integrates annotations from all other GO Consortium groups, as well as a number of external annotating groups (such as the Human Protein Atlas, which provides subcellular localisations, and the IntAct protein-protein interaction database). Alongside each file we also supply a file of cross-references which can be used to map between a number of different sequence identifiers, including Ensembl, UniProtKB and RefSeq identifiers.
Both model organism group and GOA species-specific files are available from: http://www.geneontology.org/GO.current.annotations.shtml
When using the GOA UniProt gene association file, we recommend using the unfiltered version, available from either the GOA ftp site or the submission folder on the GO Consortium site ( ftp://ftp.geneontology.org/pub/go/gene-associations/submission ). For details on the GO taxon filtering script, please see FAQ 1.
GOA xref files are available from the GOA home page ( http://www.ebi.ac.uk/GOA/ ) or our ftp site ( ftp://ftp.ebi.ac.uk/pub/databases/GO/GOA/ ).
Reasons for annotations to a species not being present in the GOA gene association file include:
- IEA annotations applied by the external model organism database are not integrated into the GOA files as we provide our own IEA data.
- When there is a formatting error in an annotation
- If there is no mapping to a UniProtKB accession
- If the MOD used a reference other than a PMID (e.g. an internal reference)
- Where the MOD has annotated a protein to the same GO term twice with the same PMID, only differing in evidence code (this is due to database restriction on our side).
3. Why can I/can't I see an annotation in a UniProtKB record when it appears in the association file?
There could be a number of reasons for this:
A. If it appears that a manual annotation is missing:
If the GO annotation has been recently created, then UniProtKB may not yet have cross-referenced the annotation; there can be a time lag of up to 3 months.
B. If it appears that an electronic annotation is missing:
If you are looking at a curated UniProtKB entry (i.e. one in the Swiss-Prot section of UniProtKB), then not all electronic annotations are displayed here. Only annotations from certain methods, such as the HAMAP2GO and EC2GO mappings, are included.
In addition, the sets of GO annotations displayed in UniProtKB are filtered to try to provide a comprehensive yet concise set of cross-references.
For instance, for protein O19470 - compare the view in UniProtKB with the more extensive list of annotations in QuickGO:
To get from the UniProtKB record to the QuickGO browser (which will show the most up-to-date and full set of manual and electronic annotations for a protein) click on the '[QuickGO]' link at the bottom of the GO cross-references section of the UniProtKB entry.
However if none of these reasons appear to apply to your missing annotation please let us know and we will investigate!
4. Why are the species-specific files and multi-species (UniProt) gene association files different?
The GOA UniProt gene association file contains all manual and electronic annotations that GOA has assigned to UniProtKB entries. This dataset contains annotations to more than 400,000 different species ( http://www.ebi.ac.uk/GOA/uniprot_release.html ) and is redundant for electronic annotations where two different electronic methods have assigned the same or less granular GO term.
The species-specific files are created using the UniProt Complete Proteome sets to determine the protein composition of the files. Further information on UniProt Complete Proteome sets is available here: http://www.uniprot.org/faq/15 . The species-specific files can contain annotations to both reviewed (Swiss-Prot) and unreviewed (TrEMBL) UniProtKB accessions. Any user wishing to identify only the reviewed (Swiss-Prot) subset of annotations will be able to continue to do so using the information supplied in the gp_information.goa_uniprot file, which can be found here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gp_information.goa_uniprot.gz .
We aim to remove electronic annotations from the species-specific files that have been created by the same technique and that have predicted the same or less granular GO terms.
An example would be for annotations created by the InterPro2GO mapping technique. In the redundant UniProt gene association file, there are three annotations to binding terms for protein P02144:
UniProt P02144 MYG_HUMAN GO:0005488 GOA:interpro IEA InterPro:IPR000971 F IPI00217493 protein taxon:9606 20060125 UniProt
UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F IPI00217493 protein taxon:9606 20060125 UniProt
UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F IPI00217493 protein taxon:9606 20060125 UniProt
(GO:0005488 - 'binding', GO:0019825 - 'oxygen binding', GO:0020037 - 'heme binding')
However within the human species-specific file there exist only two of these three:
UniProt P02144 MYG_HUMAN GO:0019825 GOA:interpro IEA InterPro:IPR002335 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt
UniProt P02144 MYG_HUMAN GO:0020037 GOA:interpro IEA InterPro:IPR012292 F Myoglobin IPI00217493 protein taxon:9606 20060223 UniProt
(GO:0019825 - 'oxygen binding' and GO:0020037 - 'heme binding')
The GO term for 'binding' has been removed from the human file as it does not provide users with any extra information, as it is a less granular parent to the oxygen and heme binding terms. This can be done because of the 'true path rule' that GO follows.
In the true path rule "the pathway from a child term all the way up to its top-level parent(s) must always be true" so a protein which is annotated to a term such as 'oxygen binding' automatically indicates that the protein would also be correctly annotated to its parent term 'binding'. This is known because the 'binding' GO term is displayed in GO as a parent of 'oxygen binding'.
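The removal of less granular terms can be sketched in Python as follows. This is an illustration of the idea, not GOA's production procedure: the `PARENTS` table is a toy fragment in which both specific terms are shown as direct children of 'binding' (intermediate terms are omitted), the function names are hypothetical, and the sketch ignores the requirement that the annotations come from the same technique.

```python
# Toy ontology fragment: child term -> set of direct parent terms.
# Assumption for illustration: intermediate terms are left out.
PARENTS = {
    "GO:0019825": {"GO:0005488"},  # oxygen binding -> binding
    "GO:0020037": {"GO:0005488"},  # heme binding   -> binding
    "GO:0005488": set(),           # binding (top of this fragment)
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from term."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def drop_less_granular(terms, parents):
    """Keep only the terms that are not an ancestor of another term in the set."""
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return {term for term in terms if term not in implied}


p02144_terms = {"GO:0005488", "GO:0019825", "GO:0020037"}
print(sorted(drop_less_granular(p02144_terms, PARENTS)))
# prints ['GO:0019825', 'GO:0020037']
```

Because of the true path rule, the dropped 'binding' annotation can always be recovered from either of the two remaining terms.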
5. How do I change between UniProt accessions and other identifiers, e.g. Ensembl, EMBL, RefSeq Gene ID, UniGene?
UniProt provides an identifier mapping file which includes all UniProtKB accessions mapped to identifiers from other databases such as the EMBL/GenBank/DDBJ nucleotide sequence databases, Ensembl, and GeneID and RefSeq at the NCBI. This file can be accessed from the GOA downloads page.
The readme for this file can be found here.
If you require any more information on this file you can mail firstname.lastname@example.org.
In addition, the information to link the protein and nucleotide data exists in almost every UniProt entry. The specific format for cross-references from Swiss-Prot or TrEMBL to coding sequences (CDS) in the DDBJ/EMBL/GenBank nucleotide sequence database is in the DR line, e.g.:
DR EMBL; AF043736; AAC02090.1.
AF043736 is the EMBL/GenBank/DDBJ accession number and AAC02090 is the protein-id (Protein Sequence Identifier) for the CDS within the EMBL/GenBank/DDBJ entry. These two are universal IDs shared by all three of the collaborating nucleotide sequence databases.
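As a rough illustration, the DR line above could be pulled apart with a few lines of Python. The function name and the returned keys are hypothetical, and the sketch only handles EMBL cross-references laid out like the example.

```python
def parse_embl_dr_line(line):
    """Split a 'DR EMBL; ...' line into its accession and protein identifier."""
    tag, _, rest = line.strip().partition(" ")
    if tag != "DR":
        raise ValueError("not a DR line")
    # Drop the trailing full stop, then split the fields on semicolons.
    parts = [part.strip() for part in rest.strip().rstrip(".").split(";")]
    if len(parts) < 3 or parts[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference")
    database, nucleotide_accession, protein_id = parts[:3]
    return {
        "database": database,
        "nucleotide_accession": nucleotide_accession,
        "protein_id": protein_id,  # includes the version suffix
        "protein_id_unversioned": protein_id.split(".")[0],
    }


xref = parse_embl_dr_line("DR EMBL; AF043736; AAC02090.1.")
print(xref["nucleotide_accession"], xref["protein_id_unversioned"])
# prints AF043736 AAC02090
```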
In addition, BioMart ( http://www.biomart.org/ ) provides users with GO annotation for Ensembl IDs. Click on "MartView". You can choose a specific database such as Ensembl and the attribute you want to download (such as a GO identifier). If you would like annotations to one term, or to a term and its children, you can then filter the data by entering the GO ID under the "Gene Ontology" filter.
6. What is in the PDB gene association file?
The PDB file is made differently from the GOA UniProt gene association file. PDB entries are only assigned GO terms based on matches between PDB entries and InterPro domains. This file no longer contains annotations from sources where GO terms have been assigned to entire UniProt protein accessions (i.e. from GOA:manual, GOA:SPKW, GOA:SPEC or GOA:HAMAP sources). This change has been made to avoid assigning GO terms to PDB chains where some terms might only be correct for the corresponding whole protein.
InterPro2GO (SCOP and CATH) signatures and PDB chains are superimposed on the UniProtKB protein, and if there is a good overlap then the InterPro mapping is produced. This data is provided by the InterPro3D group at the EBI. In future we intend to supplement this data by including manual protein binding annotations via the IntAct protein-protein interaction database.
The format of this file is described in our ReadMe, available at: http://www.ebi.ac.uk/GOA/goaHelp.html
7. What GO tools can I use to display/compare annotations to my selected proteins/genes?
You may like to look at some of the GO tools that have been created to help in this area. Software available to profile gene groups using GO includes programs such as EASE, Onto-Express, FatiGO and GoMiner. With these tools you can enter a group of gene identifiers and the software will find annotations and carry out analyses to see if a subpopulation of GO terms is statistically significant. Different tools use different sources of GO annotation and different statistical methods.
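To give a feel for what such an analysis involves, here is a minimal Python sketch of one commonly used test, the one-sided hypergeometric test. It is only an assumption that any particular tool listed here uses this method, and the function name and numbers are hypothetical. It asks how likely it is to see at least `hits` genes annotated to a term in a selected group, given how many genes in the whole background carry that term.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement.

    population: number of genes in the background set
    annotated:  number of background genes annotated to the GO term
    sample:     number of genes in the selected group
    hits:       number of selected genes annotated to the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Hypothetical numbers: 20 background genes, 5 annotated to the term,
# and all 5 genes in the selected group carry the term.
p_value = hypergeom_upper_tail(population=20, annotated=5, sample=5, hits=5)
print(p_value)
```

In practice many terms are tested at once, so real tools also correct for multiple testing; that step is left out of this sketch.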
All of the tools designed to help in gene expression analysis, which GO are aware of, are listed with descriptions at: http://www.geneontology.org/GO.tools.microarray.html
If you have difficulty locating a specific tool in this list, it is worth e-mailing the GO friends list ( email@example.com ); many tool developers subscribe to it and are very good at pointing you to their tool if it can be of help.
The following paper, describing the use of GO in proteomic studies, may also be of use:
Dimmer EC, Huntley RP, Barrell DG, Binns D, Draghici S, Camon EB, Hubank M, Talmud PJ, Apweiler R, Lovering RC.
The Gene Ontology - Providing a Functional Role in Proteomic Studies.
Proteomics. 2008, 8 Suppl. (Practical Proteomics), DOI 10.1002/pmic.200800002
A supplementary Powerpoint presentation has been prepared to show the features of four 'third-party' GO analysis tools; Blast2GO, FatiGO, Onto-Express and Ontologizer. Each presentation was prepared by the developers of the tools.
8. Is it correct to assume that genes that belong to a child category will automatically be a member of the parent term?
Yes, as every GO term must obey the true path rule: if the child term describes the gene product, then all its parent terms must also apply to that gene product ( http://www.geneontology.org/GO.usage.shtml#truePathRule ). The ontologies are structured as directed acyclic graphs, which are similar to hierarchies but differ in that a more specialized (child) term can have several less specialized (parent) terms. For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis.
(the above information originated from: http://www.geneontology.org/GO.doc.shtml#look )
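The propagation of an annotation to all parent terms can be sketched in Python. The `PARENTS` table is a toy fragment built around the hexose biosynthesis example; the shared grandparent 'monosaccharide metabolism' is added for illustration, and the gene name and function names are hypothetical.

```python
# Toy directed acyclic graph: child term -> set of direct parent terms.
PARENTS = {
    "hexose biosynthesis": {"hexose metabolism", "monosaccharide biosynthesis"},
    "hexose metabolism": {"monosaccharide metabolism"},            # illustrative
    "monosaccharide biosynthesis": {"monosaccharide metabolism"},  # illustrative
    "monosaccharide metabolism": set(),
}


def with_ancestors(term, parents):
    """Return the term together with all terms above it in the graph."""
    result = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in result:  # a term with two children is visited once
                result.add(parent)
                stack.append(parent)
    return result


def propagate(direct_annotations, parents):
    """Expand each gene's direct annotations to include every ancestor term."""
    return {
        gene: set().union(*(with_ancestors(term, parents) for term in terms))
        for gene, terms in direct_annotations.items()
    }


expanded = propagate({"geneA": {"hexose biosynthesis"}}, PARENTS)
print(sorted(expanded["geneA"]))
```

Note that 'monosaccharide metabolism' is reached along two different paths but appears only once in the result, which is the practical difference between a graph and a simple hierarchy.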
Please note however, that the true path rule is broken for those annotations which contain the 'NOT' qualifier (column 4 in gene association files). If you are intending to carry out an analysis with a large set of annotations it might be easiest to filter 'NOT' annotations out first. Further information on the usage of 'NOT' can be found at: http://www.geneontology.org/GO.annotation.shtml?#qual
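A minimal Python sketch of such a filter is shown below. It assumes tab-delimited lines with the qualifier in column 4 and treats a qualifier field as a list of values separated by "|"; the function names and the two sample records are hypothetical.

```python
QUALIFIER_COLUMN = 3  # zero-based index of column 4


def is_negated(line):
    """Return True if the qualifier column of an annotation line contains NOT."""
    fields = line.rstrip("\n").split("\t")
    qualifiers = fields[QUALIFIER_COLUMN].split("|")
    return "NOT" in qualifiers


def drop_not_annotations(lines):
    """Keep comment lines and every annotation without a NOT qualifier."""
    return [line for line in lines
            if line.startswith("!") or not is_negated(line)]


def make_record(qualifier):
    """Build an illustrative 15-column record; the values are placeholders."""
    fields = ["UniProt", "ACC_EXAMPLE", "SYMBOL", qualifier, "GO:0005488",
              "REF:placeholder", "IDA", "", "F", "", "", "protein",
              "taxon:9606", "20060125", "UniProt"]
    return "\t".join(fields)


lines = ["!illustrative header", make_record(""), make_record("NOT")]
filtered = drop_not_annotations(lines)
print(len(filtered))
# prints 2
```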
9. There's an annotation error in your file (from SWIS/PINC/external MODs/electronic sources)
- Responding to an annotation error if it's for manual annotations from UniProt, GDB or Proteome Inc
We are continually trying to improve and update our annotations. We would be grateful if you could send us as much detail as you have to hand about the annotation that is incorrect so that we can quickly update it. Changes to annotations will be visible in our QuickGO browser ( http://www.ebi.ac.uk/QuickGo ), which is updated weekly, from the following Monday, and will be visible in the AmiGO browser and the GOA gene association file after our next monthly release.
- Responding to an annotation error if it's for electronic (IEA) annotations.
GOA only provides electronic annotations from methods that have been shown to produce high-quality annotations. If you do see any incorrect IEA annotations, please send us as much detail as you can about the error and we or one of our collaborating groups will update or remove these annotations as quickly as possible. Changes to electronic annotations will be visible after our next monthly release.
- Responding to an annotation error if it's from an external database that we integrate from.
If the annotation error you have spotted originates from a different database whose annotations we have integrated into our own releases (i.e. if the source of the annotation is one of: AgBase, DictyBase, FlyBase, GDB, GeneDB, Gramene, MGI, RGD, SGD, TAIR, TIGR, WormBase, ZFIN, IntAct or LIFEdb), please send us as much detail as you can on these annotations and we will pass it on to the group concerned. The group that provided these annotations is responsible for any corrections, and changes will only be visible after GOA next integrates their annotations (which occurs just before our monthly release).
- Responding to comments related to GO terms or the GO structure.
If you e-mail GOA ( firstname.lastname@example.org ), we will pass your comments on to the GO editors. Alternatively, you can send your comments directly by going to the GO Consortium site on SourceForge (at: https://sourceforge.net/projects/geneontology/ ) and submitting a new request via the Curator Request section. Requests for changes to the ontology are discussed and resolved on this forum by many different groups. Submitting information directly to this site means that GO editors can e-mail you directly if any discussion about the change is needed, and you will be alerted as soon as any change is made.
10. How do I download a bulk set of GO annotations? What is the format? I need GO term names to be included.
All GOA GO annotations to UniProtKB accessions are available from:
The GOA gene association file is a 15 column tab-delimited file. The file format conforms to the specifications required by the GO Consortium, and therefore GO IDs and not GO term names are shown. For more information on the format of the GOA gene association file you might like to read the ReadMe available at: http://www.ebi.ac.uk/GOA/goaHelp.html
You may also be interested in the gene2go file available from NCBI. The NCBI have used the gene association files submitted to the GO site (therefore including the GOA file) to produce a 7 column tab-delimited file containing GO annotations from 33 organisms. This file contains, among other things, Gene IDs (and for genomes previously available from LocusLink, the identifiers are equivalent) and GO term names:
11. How is GOA created?
What kinds of GO annotations are in the GOA files?
The GOA project at the EBI aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and is a central dataset for other major multi-species databases such as Ensembl and NCBI.
GOA has been a member of the GO Consortium since 2001, and is responsible for the integration and release of GO annotations to the human, chicken and cow proteomes. In 2006 GOA became a central participant in the new GOC Reference Genome Annotation project and is committed to the comprehensive annotation of a set of disease-related proteins in human. With this project the GOC intends to generate a reliable set of GO annotations for the twelve selected genomes that will also empower comparative methods used in first pass annotation of other proteomes. GOA works closely with Swiss-Prot, InterPro and IntAct curators at the EBI, as well as external curators from University College London, AgBase and DictyBase to create manual annotations.
Because of the multi-species nature of the UniProtKB, GOA also assists in the curation of over 400,000 species. This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups (e.g. Human Protein Atlas, Reactome pathways and the IntAct protein-protein interaction database). This effort ensures that the GOA dataset remains a key reference and a comprehensive source of GO annotation for all species. GOA does not integrate any electronic annotation from external databases, as many of the model organism databases apply the same mappings that GOA does.
GOA provides electronic GO annotations to UniProtKB proteins by using five different mappings of external concepts to GO terms (InterPro IDs, HAMAP IDs, Swiss-Prot Keywords, Swiss-Prot Subcellular Locations and Enzyme Commission numbers) as well as transferring experimentally derived annotations to orthologs identified by the Ensembl Compara group. All these annotations are identified by the 'IEA' evidence code. More information on these techniques can be found on our website ( http://www.ebi.ac.uk/GOA/ElectronicAnnotationMethods.html ). All the mapping files can be viewed/downloaded from: http://www.geneontology.org/GO.indices.shtml .
The BioCreative paper (PMID:15960829) provides a detailed explanation of how GOA manual and electronic annotations are produced. In addition there is extensive documentation for manual GO annotation on the GO Consortium website ( http://www.geneontology.org/GO.annotation.shtml ).
What files are available from the GOA project?
All GOA annotations (electronic and manual) can be downloaded from our ftp site in a simple 17 column tab-delimited format: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
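A tab-delimited annotation file of this kind can be read with Python's standard `csv` module. The sketch below is an illustration, not an official parser: the column names are chosen here for readability, and rows shorter than 17 columns are padded so that older 15-column lines can go through the same code. The sample record is the human myoglobin line quoted in FAQ 4.

```python
import csv
import io

# Readable names for the columns; the last two exist only in 17-column files.
COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
    "annotation_extension", "gene_product_form_id",
]


def read_annotations(handle):
    """Yield one dict per annotation line, skipping '!' comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        # Pad short rows so 15- and 17-column files share one code path.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


sample = "!illustrative header\n" + "\t".join([
    "UniProt", "P02144", "MYG_HUMAN", "", "GO:0019825", "GOA:interpro",
    "IEA", "InterPro:IPR002335", "F", "Myoglobin", "IPI00217493",
    "protein", "taxon:9606", "20060223", "UniProt",
]) + "\n"

annotations = list(read_annotations(io.StringIO(sample)))
print(annotations[0]["db_object_id"], annotations[0]["go_id"])
# prints P02144 GO:0019825
```

For a downloaded `.gz` file, the same function can be given a handle opened with `gzip.open(path, "rt")`.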
The GOA project offers users a number of different files so people can choose whether to look at the entire collection of GO annotations to proteins in UniProtKB: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/
Or, if you were only interested in proteins from a particular species, we also provide non-redundant, species-specific files for human, mouse, rat, zebrafish, chicken, cow and Arabidopsis proteins (these files are created using the UniProt Complete Proteomes sets).
For example, the human gene association file can be downloaded from here: ftp://ftp.ebi.ac.uk/pub/databases/GO/GOA/HUMAN/gene_association.goa_human.gz
The UniProt identifier mapping file is also provided to map between UniProt accessions and identifiers from other common databases ( ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz )
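As an illustration of how such a mapping file might be used, the Python sketch below assumes three tab-separated columns per line (UniProtKB accession, identifier type, external identifier). The function name, the type labels and the identifier values in the sample are placeholders, not real records.

```python
from collections import defaultdict


def load_mapping(lines, wanted_type):
    """Map each UniProtKB accession to its identifiers of one type."""
    mapping = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:  # skip anything that is not a 3-column record
            continue
        accession, id_type, external_id = parts
        if id_type == wanted_type:
            mapping[accession].add(external_id)
    return dict(mapping)


# Placeholder records; real files use database-specific type labels.
sample_lines = [
    "P02144\tTYPE_A\tID_A1\n",
    "P02144\tTYPE_A\tID_A2\n",
    "P02144\tTYPE_B\tID_B1\n",
]

type_a_ids = load_mapping(sample_lines, wanted_type="TYPE_A")
print(sorted(type_a_ids["P02144"]))
# prints ['ID_A1', 'ID_A2']
```

A set is used for the values because one accession can map to several identifiers of the same type.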
In addition, GOA provides species-specific subsets of the GOA UniProtKB file for those species whose genome has been completely sequenced, whose sequence is in the public domain and which have more than 25% GO annotation coverage. All proteome files are available from: http://www.ebi.ac.uk/GOA/proteomes.html
Further information on the format of our gene association files and xrefs files can be found in the GOA ReadMe located at: http://www.ebi.ac.uk/GOA/goaHelp.html
12. GO slims: what are they, and how do I make one?
GO slims are cut-down versions of the GO ontologies containing terms that cover the main aspects of each of the three GO ontologies. They give a broad overview of the ontology content without the detail of the specific fine-grained terms.
As each community has different needs, a variety of GO-slim files have been archived on the GO home page by Consortium members.
Further documentation and links to these slims can be found at: http://www.geneontology.org/GO.slims.shtml
The QuickGO tool from GOA can be used to access or modify the GO Consortium's slims or to create one of your own. You can access this functionality from the GO Slims and GO Term Comparison page of QuickGO.
Depending on which organism your list of genes is from, the GO Term Mapper may be useful. This will take an input list of genes, a set of annotations to those genes, and a list of GO terms that have been selected to represent a subset of the ontology. This list of GO terms, selected to represent major branches in the ontology and not level of indentation, is known as a GO Slim. The GO Term Mapper will take your list of genes and bin them into the appropriate GO ID in a GO Slim.
GO Term Mapper:
Alternatively you can always create your own GO slim from the complete GO ontologies, using the OBO-Edit Editor (which you can download from: http://sourceforge.net/project/showfiles.php?group_id=36855 ). You can then use the map2slim.pl script ( http://www.geneontology.org/GO.slims.shtml#script ) to take the GO slim file and a gene association file, and output the associations mapped to the slim terms.
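The binning of genes into slim terms can be illustrated with a short Python sketch. This is not the map2slim.pl algorithm or the GO Term Mapper implementation: it simply assigns a gene to every slim term that equals, or lies above, one of its annotated terms. The `PARENTS` table is a simplified toy fragment and the one-term slim is hypothetical.

```python
# Toy ontology fragment: child -> direct parents (intermediate terms omitted).
PARENTS = {
    "GO:0019825": {"GO:0005488"},  # oxygen binding -> binding
    "GO:0020037": {"GO:0005488"},  # heme binding   -> binding
    "GO:0005488": {"GO:0003674"},  # binding        -> molecular_function
    "GO:0003674": set(),
}
SLIM = {"GO:0005488"}  # hypothetical slim containing only 'binding'


def self_and_ancestors(term, parents):
    """Return the term together with all terms above it in the graph."""
    result = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def bin_genes(gene_terms, slim, parents):
    """Group genes under every slim term reachable from their annotations."""
    bins = {}
    for gene, terms in gene_terms.items():
        for term in terms:
            for slim_term in self_and_ancestors(term, parents) & slim:
                bins.setdefault(slim_term, set()).add(gene)
    return bins


bins = bin_genes({"P02144": {"GO:0019825", "GO:0020037"}}, SLIM, PARENTS)
print(bins)
# prints {'GO:0005488': {'P02144'}}
```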
13. What do the evidence codes mean?
Every annotation submitted to GO must be attributed to a source - such as a literature reference, another database or a computational analysis. In addition, these annotations must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. A simple controlled vocabulary is used to record different, broad evidence categories.
There are 13 different evidence codes currently used by curators:
- IMP = inferred from mutant phenotype
- IGI = inferred from genetic interaction
- IPI = inferred from physical interaction
- ISS = inferred from sequence similarity
- IDA = inferred from direct assay
- IEP = inferred from expression pattern
- IEA = inferred from electronic annotation
- IGC = inferred from genomic context
- TAS = traceable author statement
- NAS = non-traceable author statement
- ND = no biological data available
- IC = inferred by curator
- RCA = reviewed computational analysis
If you would like to find more detailed information on the meaning and usage of these evidence codes, documentation can be found at the GO web site at: http://www.geneontology.org/GO.evidence.html
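For scripts that process annotation files, the codes listed above fit naturally into a lookup table. The Python sketch below uses that table to separate electronic (IEA) annotations from all others; the function name is hypothetical and the second sample annotation is invented for illustration.

```python
EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence similarity",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IEA": "inferred from electronic annotation",
    "IGC": "inferred from genomic context",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
    "IC": "inferred by curator",
    "RCA": "reviewed computational analysis",
}


def split_by_evidence(annotations):
    """Separate (accession, go_id, code) tuples into non-IEA and IEA lists."""
    other, electronic = [], []
    for accession, go_id, code in annotations:
        if code not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {code}")
        (electronic if code == "IEA" else other).append((accession, go_id, code))
    return other, electronic


sample = [
    ("P02144", "GO:0019825", "IEA"),       # from the InterPro2GO example
    ("ACC_EXAMPLE", "GO:0005488", "TAS"),  # hypothetical annotation
]
other, electronic = split_by_evidence(sample)
print(len(other), len(electronic))
# prints 1 1
```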
14. Why has this annotation disappeared from this entry? (InterPro2GO mapping changes)
There are a number of reasons why an annotation could have disappeared. If it is an annotation that was produced by the InterPro2GO mapping technique, then it may be that InterPro have revised their mapping to GO or their protein matches, and when GOA carried out its monthly update the annotation was lost.
Equally for other electronic mapping methods and manual annotations, curators often update annotations when possible to try and provide the most accurate dataset. If you would like to know whether there was a specific reason for an annotation removal please contact us at: email@example.com .
15. How do I cite GOA?
If you use any data obtained from GOA or QuickGO in a publication, please cite the following paper:
Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R.
The GOA database in 2009--an integrated Gene Ontology Annotation resource.
Nucleic Acids Res. 2008 Oct 27. [Epub ahead of print]