The GOA project aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB) and is a central dataset for other major multi-species databases, such as Ensembl and NCBI.
GOA has been a member of the GO Consortium since 2001, and is responsible for the integration and release of GO annotations to the human, chicken and cow proteomes. Because of the multi-species nature of the UniProtKB, GOA also assists in the curation of another 200,000 species. This involves electronic annotation and the integration of high-quality manual GO annotation from all GO Consortium model organism groups and specialist groups. This effort ensures that the GOA dataset remains a key reference and a comprehensive source of GO annotation for all species.
- Why do we need GOA?
- How is GOA curated?
- Precautions when viewing GO annotations
- What can I do with GOA?
- Searching GOA
- Downloading gene association files
- Do you have more questions?
Why do we need GOA?
Proteins are important parts of all living cells and include many substances which have different and very particular functions, such as enzymes, hormones, and antibodies that are necessary for the proper functioning of an organism. They are essential in the diet of humans, birds and all animals for the growth and repair of tissue and can be obtained from foods such as meat, fish, eggs, milk, and vegetables.
How proteins work, and in which part of a living cell they act, has been studied in many different organisms by scientists from all over the world. Scientists share information publicly by submitting descriptions of their experiments and results to scientific journals. An important task of these publications is to describe exactly how an experiment was carried out so that it can be repeated and verified, to report results, and to encourage discussion and new ideas. One common problem with these publications is that the language used by scientists is not very precise: words often have several meanings. For example, 'cell' could mean a 'prison cell', a 'battery cell' or a 'living cell'. Different words can also have the same meaning, such as 'car' and 'automobile', and, as anyone who has tried to explain something complicated by e-mail will know, this can cause all sorts of confusion. The same confusion happens in scientific publications daily, and it can delay the interpretation and use of scientific results.
Unfortunately, the dictionaries designed for scientists do not really help, as they offer all the possible meanings of a word (or phrase); what is needed is a list of words that each have only one 'official' meaning. This is the purpose of a 'controlled vocabulary'; these vocabularies are used by scientists working on large collections of proteins to describe their properties. The best-known controlled vocabulary in use today is the 'Gene Ontology' or 'GO'. GO was designed to describe the biological processes of a protein, its role and its location in a living cell of any organism.
As scientists are overwhelmed with protein information from large-scale experimental projects, organising the data with a standard vocabulary speeds up further analysis by making the data machine-readable (the 'machine' in this case being a computer). Further analysis might include using a computer to infer the purpose or function of a protein in a songbird or turkey from a protein in a similar species, the chicken, where the protein has been experimentally studied in detail. This assistance is crucial these days, as the volume of data being generated is far too large for any one person to cope with unaided. That is the main reason why we are doing this work: to use a structured vocabulary (GO) to describe how proteins work, based on information taken from scientific publications, so that other people can analyse these and other descriptions quickly by computer. This means that scientists will be able to make use of more of other researchers' work with less effort, so they can do their science more quickly and better, which will be a benefit to us all.
How is GOA curated?
GO terms are organised into three ontologies: Molecular Function, Biological Process and Cellular Component.
See the Gene Ontology website for more information on the GO.
GO terms are assigned to gene products using a combination of high-quality electronic mappings and manual curation.
- Electronic annotation
We use existing information within database entries, including Swiss-Prot keywords (SPKW2GO), Swiss-Prot subcellular locations (SPSL2GO), Enzyme Commission numbers (EC2GO) and cross-references to InterPro (InterPro2GO) and HAMAP (HAMAP2GO), which are manually mapped. Electronically combining these mappings with a table of matching UniProtKB entries generates a table of associations. An additional electronic annotation method uses orthology data from Ensembl Compara to project GO annotations from a source species onto one or more target species. For each GOA association, we provide an evidence code, which summarizes how the association is made. Associations that are made electronically are labelled as 'inferred from electronic annotation' (IEA).
- Manual annotation
Curators assign GO terms manually using published literature. Associations that are made manually are given an evidence code that describes what evidence supports the annotation. In addition to the manual annotations generated by GOA curators, we also integrate high-quality manual GO annotations from all GO Consortium model organism groups and specialist groups.
Here is an example of how we find GO terms in the scientific literature. Watch a flash tutorial showing how three different GO annotations were made to the human apolipoprotein APOA4. For better playback, please wait a few minutes for the video to download before playing.
More on evidence codes can be found in the Evidence Code Guide on the Gene Ontology website.
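The electronic annotation step described above (combining a mapping table with a table of matching UniProtKB entries) can be sketched as a simple join. This is only an illustration under stated assumptions, not GOA's actual pipeline: the function name, the data structures, the accessions, the keyword labels and the GO identifiers are all hypothetical placeholders.

```python
from collections import namedtuple

# One electronically derived association. "source" names the mapping used.
Association = namedtuple("Association", ["protein", "go_id", "evidence", "source"])


def annotate_electronically(entries, mapping, source):
    """Join database entries with a feature-to-GO mapping table.

    entries: dict of protein accession -> set of features (e.g. keywords)
    mapping: dict of feature -> list of GO identifiers
    Every association produced this way is labelled IEA
    ('inferred from electronic annotation').
    """
    associations = []
    for protein, features in sorted(entries.items()):
        for feature in sorted(features):
            for go_id in mapping.get(feature, []):
                associations.append(Association(protein, go_id, "IEA", source))
    return associations


# Placeholder data, invented for illustration only.
entries = {
    "P00001": {"KW-A", "KW-B"},
    "P00002": {"KW-B"},
    "P00003": {"KW-C"},  # no mapping exists for KW-C, so no association
}
keyword_to_go = {
    "KW-A": ["GO:9999991"],
    "KW-B": ["GO:9999992", "GO:9999993"],
}

associations = annotate_electronically(entries, keyword_to_go, "SPKW2GO")
for association in associations:
    print(association)
```

The same join would apply to any of the mapping tables listed above; only the feature extracted from the database entry changes.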
GOA data is released every month in the form of gene association files. These are tab-delimited files of the associations between gene product and GO terms.
More information about the format of gene association files can be found at the Gene Ontology website Annotation File Format.
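Because gene association files are tab-delimited, they can be read with a few lines of code. The sketch below assumes the 15-column layout of the gene association format and that header lines start with "!"; only the positions of the qualifier (column 4) and taxon (column 13) are stated on this page, so the other column names follow the Annotation File Format documentation and should be checked against it. The sample line is invented.

```python
import io

# Assumed column order of a gene association file (first 15 columns).
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "reference", "evidence_code", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def read_gaf(handle):
    """Yield one dict per annotation line, skipping '!' header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # zip() ignores any extra trailing columns beyond the 15 named here.
        yield dict(zip(GAF_COLUMNS, fields))


# Invented sample content; the accession, GO ID and reference are placeholders.
sample = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tP00001\tGENE1\t\tGO:9999991\tPMID:0000000\tIDA\t\tF\t"
    "Example protein\t\tprotein\ttaxon:9606\t20200101\tUniProt\n"
)

records = list(read_gaf(io.StringIO(sample)))
for record in records:
    print(record["db_object_id"], record["go_id"], record["evidence_code"])
```

For a downloaded file, the same function can be given an open file handle instead of the in-memory sample.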
Precautions to be considered when viewing GO annotations
- Use of 'Qualifiers'
A curator can choose to alter the meaning of an annotation by using a ‘qualifier’.
There are three qualifiers: NOT, colocalizes_with and contributes_to. If used, they are present in column 4 of the gene association file.
Special attention must be paid to the NOT qualifier as this completely reverses the meaning of the annotation.
NOT is used to make an explicit note that the gene product is not associated with the GO term. For example, if a protein has sequence similarity to an enzyme (whose activity is GO:nnnnnnn), but has been shown experimentally not to have the enzymatic activity, it can be annotated as NOT GO:nnnnnnn.
Colocalizes_with is used only with terms in the Cellular Component ontology and is given to gene products that are transiently or peripherally associated with an organelle or complex.
Contributes_to is used only with terms in the Molecular Function ontology and is given to a gene product that is a member of a complex which has an activity, where the individual gene product does not itself have this activity. All gene products annotated using 'contributes_to' must also be annotated to a cellular component term representing the complex that possesses the activity.
- Two Taxon ID annotation
A gene product is annotated with terms reflecting its normal activity and location. A function, process, or component observed only in a mutant or disease state is therefore not usually included. In some circumstances, however, what is "normal" is a matter of perspective, depending on the organism being annotated and on the point of view of the curator. For example, many viruses use host proteins to carry out viral processes. The host protein is then doing something abnormal from the perspective of the host, but completely normal from the perspective of the virus. GO curators handle these cases by including two taxon IDs in the "Taxon" column of the gene association file (column 13). The first taxon ID is that of the organism that encodes the gene product (e.g. host), and the second ID is that of the organism that uses the gene product (e.g. virus), and whose perspective is considered "normal" for that annotation.
Please see the GO Consortium Annotation Guide for more information.
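The two precautions above can be applied when processing a gene association file. The following sketch is an illustration only: the function names are hypothetical, and it assumes that multiple values in the qualifier column (column 4) and the taxon column (column 13) are separated by "|" and that taxon IDs carry a "taxon:" prefix. The taxon IDs in the usage lines are examples.

```python
def parse_qualifiers(qualifier_field):
    """Split the qualifier column into a set of lower-cased qualifiers."""
    return {q.strip().lower() for q in qualifier_field.split("|") if q.strip()}


def is_negated(qualifier_field):
    """True if the annotation carries NOT, which reverses its meaning."""
    return "not" in parse_qualifiers(qualifier_field)


def parse_taxon(taxon_field):
    """Return (encoding_taxon, using_taxon) from the taxon column.

    The first ID is the organism that encodes the gene product; the second,
    if present, is the organism that uses it (e.g. a virus using a host
    protein). using_taxon is None for ordinary single-taxon annotations.
    """
    ids = [part.replace("taxon:", "", 1) for part in taxon_field.split("|") if part]
    if not ids:
        raise ValueError("taxon column is empty")
    encoding_taxon = ids[0]
    using_taxon = ids[1] if len(ids) > 1 else None
    return encoding_taxon, using_taxon


print(is_negated("NOT"))                      # negated annotation
print(is_negated("contributes_to"))           # qualified, but not negated
print(parse_taxon("taxon:9606"))              # single-taxon annotation
print(parse_taxon("taxon:9606|taxon:11676"))  # host protein used by a second organism
```

Filtering out or separately handling rows where `is_negated` is true avoids reading a NOT annotation as a positive one.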
What can I do with GOA?
The success of GO can be measured by the number of databases that use it to annotate and exchange biological knowledge. The GOA project has made an important contribution to this global effort. GOA allows you to:
- Access functional information for the human, mouse, rat, cow, chicken, zebrafish and Arabidopsis proteomes or for any protein in the UniProt Knowledgebase (UniProtKB) using QuickGO or by downloading our gene association files.
- Find common functional information for interacting proteins using the IntAct database.
- Use a GO-slim (cut-down versions of the ontologies useful for providing an overview of GO) to summarise the biological attributes of a proteome, compare proteomes, or find out what proportion of a proteome is involved in e.g. 'transport'.
- Incorporate our manual annotation into your own databases to enhance your dataset, or use it to validate your automated way of deriving information about gene function.
- Map GO terms to your own datasets; for example, our GO mapping to InterPro entries (InterPro2GO) can be used to annotate mass spectrometry or microarray data.
- Find the location of human genes mapped to a particular GO term using Ensembl.
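As an illustration of the GO-slim use case in the list above, the proportion of proteins involved in a category such as 'transport' can be computed once each GO term has been mapped to its slim term(s). This is a minimal sketch under assumptions: the function name is hypothetical, the GO-term-to-slim mapping is taken as already computed (in practice it is derived from the ontology structure), the identifiers are placeholders, and the denominator is the set of proteins present in the input rather than a full proteome.

```python
def slim_proportions(annotations, go_to_slim):
    """Return slim term -> fraction of annotated proteins assigned to it.

    annotations: iterable of (protein, go_id) pairs
    go_to_slim:  dict of go_id -> set of slim term names (assumed precomputed)
    """
    proteins = set()
    members = {}  # slim term -> set of proteins annotated under it
    for protein, go_id in annotations:
        proteins.add(protein)
        for slim in go_to_slim.get(go_id, ()):
            members.setdefault(slim, set()).add(protein)
    if not proteins:
        return {}
    return {slim: len(group) / len(proteins) for slim, group in members.items()}


# Placeholder data, invented for illustration only.
annotations = [
    ("P1", "GO:A"),
    ("P1", "GO:B"),
    ("P2", "GO:B"),
    ("P3", "GO:C"),
    ("P4", "GO:D"),  # GO:D has no slim mapping in this example
]
go_to_slim = {
    "GO:A": {"transport"},
    "GO:B": {"transport"},
    "GO:C": {"metabolism"},
}

proportions = slim_proportions(annotations, go_to_slim)
print(proportions)
```

Each protein is counted once per slim term, however many of its annotations map to that term, so heavily annotated proteins do not inflate the proportions.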
If you are new to GOA, one of the fastest and easiest ways to view our data is the EBI's web-based GO browser, QuickGO. QuickGO is updated weekly with electronic and manual GO annotations from the EBI.
In addition to QuickGO, the Ontology Lookup Service can be used to search for GO terms.
For more ways to search GOA data, see the Searching GOA page.
Downloading gene association files
GOA provides annotations to over 110,000 species. These data can be accessed from the GOA downloads page.
From this page you can download:
- species-specific gene association files, if you are interested in annotations to a particular species
- the UniProtKB gene association file, which provides annotations to a non-redundant set of proteins from all species present in the UniProtKB
- proteome gene association files containing annotations to hundreds of proteomes
For more ways to access GOA data, see the Downloads page.
Do you have more questions?