GOA - README
1. Contents
- Contents
- Introduction
- Differences in the UniProt gene association file from GO and GOA ftp sites
- List of files and file formats
- The non-redundant human proteome set
- The non-redundant IPI higher eukaryotic proteome sets
- Ancillary mappings
- Assignment of GO terms to UniProtKB/Ensembl data
- Additional information on Manual Annotation in GOA
- Addition of GO assignments from other data sources
- Further information on the PDB association file
- Contacts
- Copyright Notice
2. Introduction
GOA (GO Annotation@UniProt) is a project run by the European
Bioinformatics Institute that aims to provide assignments
of gene products to the Gene Ontology (GO) resource. The
goal of the Gene Ontology Consortium is to produce a dynamic
controlled vocabulary that can be applied to all eukaryotes,
even while the knowledge of gene and protein roles in cells is
still accumulating and changing.
In the GOA project, this vocabulary is applied to all proteins
described in the UniProt (Swiss-Prot and TrEMBL) knowledgebase.
GOA also provides non-redundant, species-specific annotation sets
using either the complete proteome set available from UniProtKB or
the International Protein Index (IPI), where sequence
identifiers from the GOA, Ensembl, H-Invitational Database, TAIR,
RefSeq and Vega groups are combined.
GOA manual annotations are created by EBI curators from the GOA,
UniProt and IntAct groups. The dataset is supplemented with manual
GO annotation from external model organism databases: AgBase, BHF-UCL,
DictyBase, Ensembl, FlyBase, GDB, GeneDB (S. pombe), Gramene, HGNC, MGI,
Reactome, RGD, Roslin, SGD, TAIR, TIGR, WormBase, ZFIN, the IntAct
protein-protein interaction database, LIFEdb, the Human Protein Atlas
and the Proteome Inc dataset (see section 10). The source of an
annotation is always indicated in column 15 ('assigned by') of an
association file.
The following describes the philosophy behind the EBI curated
annotation dataset:
GOA curators prioritise human proteins for GO annotation, especially
those proteins which:
- have no GO annotation,
- have disease relevance, and
- are important for high-throughput method analyses.
In GOA our aim is to capture the most recent papers that
provide experimental evidence for the unique features of a given
protein. Our approach is protein-centric rather than paper-centric,
as we don't read all papers that might be used to assign the same
GO term. However when experimental evidence is read which further
experimentally verifies a function, redundant annotations to a term
using different references are created as this can provide greater
confidence to a GO annotation.
For further information please refer to our web site at:
http://www.ebi.ac.uk/GOA
3. Differences in the UniProt gene association file from GO and GOA ftp sites.
Please note that in addition to the human, chicken and cow gene association files,
filtered and unfiltered versions of the GOA UniProt
gene association file are available from the GO Consortium ftp site
(ftp.geneontology.org). The filtered UniProt file version does not contain annotations
for those species where a different Consortium group is primarily responsible
for annotating the species to GO.
If you would like to download an unfiltered GOA UniProt gene association
file, please use either the GOA ftp site:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
Or the submissions folder in the GO Consortium ftp site:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_uniprot.gz
Species which are not present in the filtered version of the gene_association.goa_uniprot.gz
file on the GO Consortium site include:
Danio rerio, Drosophila melanogaster, Mus musculus, Rattus norvegicus,
Arabidopsis thaliana, all rice species,
Bacillus anthracis str. Ames, Campylobacter jejuni RM1221, Candida albicans,
Caenorhabditis elegans, Coxiella burnetii RSA 493, Dehalococcoides ethenogenes 195,
Dictyostelium sp., Dictyostelium discoideum, Geobacter sulfurreducens PCA,
Glossina morsitans morsitans, Leishmania major, Listeria monocytogenes str. 4b F2365,
Methylococcus capsulatus str. Bath, Pseudomonas syringae pv. tomato str. DC3000,
Plasmodium falciparum, Saccharomyces cerevisiae, Schizosaccharomyces pombe,
Shewanella oneidensis MR-1, Silicibacter pomeroyi DSS-3, Trypanosoma brucei and
Vibrio cholerae O1 biovar eltor.
Further information on this filtering script can be found at:
http://www.geneontology.org/GO.annotation.shtml#taxon
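The effect of this filtering can be approximated locally on the unfiltered
file. The following is a minimal sketch only, not the Consortium's filtering
script. It assumes the column layout described in section 4 (taxon identifier
in column 13) and that header or comment lines start with '!'. The function
name and the set of excluded taxon identifiers are illustrative.

```python
def drop_taxa(lines, excluded_taxa):
    """Yield lines whose taxon (column 13) is not in excluded_taxa.

    Comment lines (assumed to start with '!') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        cols = line.rstrip("\r\n").split("\t")
        # Column 13 is index 12; if several pipe-separated IDs are present,
        # this sketch only inspects the first one.
        taxon = cols[12].split("|")[0] if len(cols) > 12 else ""
        if taxon not in excluded_taxa:
            yield line
```

For example, passing a set of identifiers of the form 'taxon:NNNN' for the
species listed above would drop their annotations and keep everything else.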
4. List of files and file formats
The GOA project produces the following gene association files:
- gene_association.goa_uniprot
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_uniprot.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
This file contains all GO assignments for the UniProt KnowledgeBase (UniProtKB).
- gene_association.goa_human
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_human.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/gene_association.goa_human.gz
This file contains the GO assignments for the non-redundant human proteome set. Please note
that as of February 2009 this file is constructed using only proteins from UniProtKB/Swiss-Prot.
- gene_association.goa_mouse
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_mouse.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/gene_association.goa_mouse.gz
This file contains the GO assignments for the proteins of the non-redundant mouse proteome set.
- gene_association.goa_rat
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_rat.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/gene_association.goa_rat.gz
This file contains the GO assignments for the proteins of the non-redundant rat proteome set.
- gene_association.goa_arabidopsis
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_arabidopsis.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ARABIDOPSIS/gene_association.goa_arabidopsis.gz
This file contains the GO assignments for the proteins of the non-redundant Arabidopsis proteome set.
- gene_association.goa_chicken
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_chicken.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/gene_association.goa_chicken.gz
This file contains the GO assignments for the proteins of the non-redundant chicken proteome set.
- gene_association.goa_cow
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/gene_association.goa_cow.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/COW/gene_association.goa_cow.gz
This file contains the GO assignments for the proteins of the non-redundant cow proteome set.
- gene_association.goa_zebrafish
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_zebrafish.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ZEBRAFISH/gene_association.goa_zebrafish.gz
This file contains the GO assignments for the proteins of the non-redundant zebrafish proteome set.
- gene_association.goa_pdb
Locations:
ftp://ftp.geneontology.org/pub/go/gene-associations/submission/gene_association.goa_pdb.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/PDB/gene_association.goa_pdb.gz
This file contains the GO assignments for the proteins present in the PDB database.
- gene_association.goa_bhf-ucl
Location:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/bhf-ucl/gene_association.goa_bhf-ucl.gz
This file contains all GO annotations available for proteins implicated in cardiovascular development
and disease. The set of identifiers included in this proteome set was compiled by the Cardiovascular
GO Annotation Initiative, funded by the British Heart Foundation, http://www.cardiovasculargeneontology.com/
We comply with the file format described by the Gene Ontology
Consortium for annotation files
(http://www.geneontology.org/GO.annotation.html#file).
Since we deal with proteins rather than genes, the semantics of some
fields in our files may be slightly different to other gene association files.
- DB
Database from which annotated entry has been taken.
For the UniProtKB, Human and Proteomes gene association files:
UniProtKB (UniProt:Swiss-Prot/TrEMBL)
For the species-specific association files created using IPI
(Arabidopsis, chicken, cow,
mouse, rat or zebrafish):
One of either: UniProtKB, UniProtKB/Swiss-Prot,
UniProtKB/TrEMBL, ENSEMBL (Ensembl),
HINV (H-Invitational Database), TAIR, RefSeq or VEGA.
For the PDB association file: PDB
- DB_Object_ID
A unique identifier in the DB for the item being annotated.
Here: an accession number or identifier of the annotated protein
(or protein chain for the gene_association.goa_pdb file)
For the UniProtKB, Human and Proteomes gene association files:
- either a UniProtKB accession number or IPI identifier.
For IPI species-specific association files (Arabidopsis, chicken, cow,
mouse, rat or zebrafish):
- one of either UniProtKB, Ensembl, VEGA, HINV, TAIR or RefSeq
peptide identifiers
For the PDB association file:
- a PDB entry identifier (could be any non-control ASCII character).
Examples: O00165, O43526-1, PENSP00000241656, OTTDARP00000014036,
HIT000018908, AT1G12760.2, NP_671756, 117E
- DB_Object_Symbol
A (unique and valid) symbol (gene name) to which DB_Object_ID is matched.
An officially approved gene symbol will be added to this field when available.
Alternatively, other gene symbols, or locus names will be applied.
If no symbols are available, the identifier applied in column 2 will be used.
N.B. the contents of this field changed in August 2008.
Examples: G6PC, CYB561, MGCQ309F3, C10H14ORF1, ENSBTAP00000000027, NP_671756, 117E_A
- Qualifier
This column is used for flags that modify the interpretation of an
annotation.
This field may be equal to: NOT, colocalizes_with, contributes_to,
NOT | contributes_to, NOT | colocalizes_with
Example: NOT
- GO ID
The GO identifier for the term attributed to the DB_Object_ID.
Example: GO:0005634
- DB:Reference
Reference cited to support the annotation.
For annotation methods which cannot reference a paper as being
the direct source of an annotation, this field will contain
a GO_REF identifier. See section 8 and
http://www.geneontology.org/doc/GO.references for an
explanation of the reference types used.
Examples: PUBMED:9058808, GOA:interpro|GO_REF:0000002,
GOA:hamap|GO_REF:0000020, GOA:spkw|GO_REF:0000004,
GOA:spec|GO_REF:0000003, GOA:compara|GO_REF:0000019,
GOA:spsl|GO_REF:0000023, GO_REF:0000024
- Evidence
One of either EXP, IMP, IC, IGC, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS,
NR, ND or RCA.
Example: TAS
- With
An additional identifier to support annotations using certain
evidence codes (including IEA, IPI, IGI, IC and ISS evidences).
Examples: UniProtKB:O00341, InterPro:IPR001878,
Ensembl:ENSG00000136141, GO:0000001, EC:3.1.22.1
- Aspect
One of the three ontologies: P (biological process),
F (molecular function) or C (cellular component).
Example: P
- DB_Object_Name
Name of protein
The full UniProt protein name will be present here,
if available from UniProtKB. If a name cannot be added, this field
will be left empty.
Examples: Glucose-6-phosphatase
Cellular tumor antigen p53
Coatomer subunit beta
- Synonym
Gene_symbol [or other text]
Alternative gene symbol(s), IPI identifier(s) and UniProtKB/Swiss-Prot identifiers are
provided pipe-separated, if available from UniProtKB. If none of these identifiers
have been supplied, the field will be left empty.
Example: RNF20|BRE1A|IPI00690596|BRE1A_BOVIN
IPI00706050
MMP-16|IPI00689864
- DB_Object_Type
What kind of entity is being annotated.
Here: protein (or protein_structure for the
gene_association.goa_pdb file).
Example: protein
- Taxon_ID
Identifier for the species being annotated.
Example: taxon:9606
- Date
The date of last annotation update in the format 'YYYYMMDD'
Example: 20050101
- Assigned_By
Attribute describing the source of the annotation. One of
either UniProtKB, AgBase, BHF-UCL, DictyBase, Ensembl, FB, GDB, GeneDB,
GR (Gramene), HGNC, LIFEdb, MGI, Reactome, RGD, Roslin Institute,
SGD, TAIR, TIGR, ZFIN, IntAct, PINC (Proteome Inc.) or WormBase.
Example: UniProtKB
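The fifteen columns described above can be read into named fields. The
following is a minimal sketch, assuming tab-separated columns in the order
listed and comment lines starting with '!'. The names Annotation,
parse_gaf_line and read_gaf are illustrative and are not part of any GOA tool.

```python
import gzip
from collections import namedtuple

GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon_id", "date", "assigned_by",
]
Annotation = namedtuple("Annotation", GAF_COLUMNS)


def parse_gaf_line(line):
    """Return an Annotation for a data line, or None for comment/blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != len(GAF_COLUMNS):
        raise ValueError(f"expected 15 columns, found {len(cols)}")
    return Annotation(*cols)


def read_gaf(path):
    """Yield Annotation records from a gzip-compressed association file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            annotation = parse_gaf_line(line)
            if annotation is not None:
                yield annotation
```

Empty columns (for example an empty Qualifier or With field) are kept as
empty strings, so field positions always match the column numbers used in
this README.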
- xrefs.goa
Locations:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/mouse.xrefs.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/RAT/rat.xrefs.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ARABIDOPSIS/arabidopsis.xrefs.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/ZEBRAFISH/zebrafish.xrefs.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/chicken.xrefs.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/COW/cow.xrefs.gz
N.B. As the human gene association file from GOA is no longer constructed using
the IPI resource, users are now invited to make use of the UniProtKB identifier mapping file,
available from:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
The ReadMe for this file's format is available from:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/README
In addition to the principal IPI files with mappings of UniProtKB/Ensembl/Vega
to GO, files have been prepared describing the relationship between the
entries in this set and other databases, such as the EMBL/Genbank/DDBJ
nucleotide sequence databases, HUGO, and Entrez Gene and RefSeq at
the NCBI. This file is tab-delimited (multiple entries in individual
fields are separated by commas) with each row in the file representing
one protein in the IPI set. The fields are as follows:
- Database from which master entry of this IPI entry has been taken.
One of either SP (UniProtKB/Swiss-Prot), TR (UniProtKB/TrEMBL),
ENSEMBL (Ensembl), REFSEQ_STATUS (where STATUS corresponds to the RefSeq entry
revision status), VEGA (Vega), TAIR (TAIR Protein data set)
or HINV (H-Invitational Database).
- UniProtKB accession number or Vega ID or Ensembl ID or RefSeq ID
or TAIR Protein ID or H-InvDB ID.
- International Protein Index identifier.
- Supplementary UniProtKB/Swiss-Prot entries associated with this IPI entry.
- Supplementary UniProtKB/TrEMBL entries associated with this IPI entry.
- Supplementary Ensembl entries associated with this IPI entry.
Havana curated transcripts are preceded by the key HAVANA:
(e.g. HAVANA:ENSP00000237305;ENSP00000356824;).
- Supplementary list of RefSeq STATUS:ID couples (separated by a semi-colon
';') associated with this IPI entry (RefSeq entry revision status details).
- Supplementary TAIR Protein entries associated with this IPI entry.
- Supplementary H-Inv Protein entries associated with this IPI entry.
- Protein identifiers (cross reference to EMBL/Genbank/DDBJ nucleotide databases).
- List of HGNC number, HGNC official gene symbol couples (separated by a
semi-colon ';') associated with this IPI entry.
- List of NCBI Entrez Gene gene number, Entrez Gene Default Gene Symbol couples
(separated by a semi-colon ';') associated with this IPI entry.
- UNIPARC identifier associated with the sequence of this IPI entry.
- UniGene identifiers associated with this IPI entry.
- CCDS identifiers associated with this IPI entry.
- RefSeq GI protein identifiers associated with this IPI entry.
- Supplementary Vega entries associated with this IPI entry.
The mouse, rat, zebrafish and arabidopsis xref files have the following differences:
- Column 11 in the mouse file contains the MGI (Mouse Genome Informatics)
identifier and symbol for the genes.
- Column 11 in the rat file contains the RGD (Rat Genome Database)
identifier and symbol for the genes.
- Column 11 in the zebrafish file contains the ZFIN (Zebrafish Information Network)
identifier and symbol for the genes.
- Column 11 in the arabidopsis file contains the TAIR Gene (The Arabidopsis
Information Resource) symbol and locus identifier for the genes.
- Column 11 does not contain any data for chicken and cow.
N.B. Entrez Gene is the successor database to LocusLink.
For species covered by LocusLink, it will still be possible
to access the data using the Entrez Gene identifiers.
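As an illustration of the xrefs layout described above, the following
minimal sketch indexes rows by their IPI identifier. It relies only on the
first three columns (source database, master accession, IPI identifier) and
assumes tab-separated rows; the function name index_xrefs is hypothetical.

```python
def index_xrefs(lines):
    """Map each IPI identifier (column 3) to (database, master accession)."""
    index = {}
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 3:
            continue  # skip blank or truncated rows
        database, accession, ipi_id = cols[0], cols[1], cols[2]
        index[ipi_id] = (database, accession)
    return index
```

The remaining columns can be read in the same way by position, bearing in
mind that column 11 differs between species as noted above.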
5. The non-redundant human proteome set
In February 2009, the production of the gene_association.goa_human file
changed from using the International Protein Index (IPI) to using the
complete human proteome set available from UniProtKB/Swiss-Prot
(http://www.uniprot.org/news/2008/09/02/release).
The name and format of this human file has remained the same, however
annotations are now assigned to proteins from just the 'UniProtKB' (column 1)
database source. Human IPI identifiers continue to be included
in column 11 of annotations.
In addition, new releases of the cross-references file for human IPI set (human.xrefs.gz),
will no longer be provided. Instead, identifier mapping is possible
using the UniProt ID mapping file, available from:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz
idmapping.dat.gz is a tab-delimited table, which includes mappings for 20
different sequence identifier types, including IPI identifiers.
A readme for this file is available from:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping/README
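A mapping from one identifier type back to UniProtKB accessions can be
built from this file. The following is a minimal sketch, assuming each row
has three tab-separated columns (UniProtKB accession, identifier type,
identifier); please check the README above for the exact layout and for the
label used for each identifier type. The function name load_id_mapping is
illustrative.

```python
import gzip
from collections import defaultdict


def load_id_mapping(path, id_type):
    """Return {external identifier: set of UniProtKB accessions} for id_type."""
    mapping = defaultdict(set)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            cols = line.rstrip("\r\n").split("\t")
            # Assumed layout: accession, identifier type, identifier.
            if len(cols) == 3 and cols[1] == id_type:
                mapping[cols[2]].add(cols[0])
    return mapping
```

A set is used for the values because one external identifier may be
associated with more than one UniProtKB accession.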
6. The non-redundant IPI higher eukaryotic proteome sets
The non-redundant mouse, rat, arabidopsis, zebrafish,
chicken and cow files are produced using the monthly IPI
(International Protein Index) releases, which provide a top-level
overview of the main databases that describe proteomes:
UniProtKB, Ensembl, TAIR, Vega, H-Invitational and NCBI's RefSeq
databases. IPI assigns stable identifiers to clusters of matching
proteins from its contributing databases.
Information on how the IPI sets are obtained can be found at:
http://www.ebi.ac.uk/IPI/Algorithm.html
IPI sets can be downloaded from:
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current
7. Ancillary mappings
Mappings between UniProtKB and EMBL/Genbank/DDBJ are derived from
the cross references to these databases found in UniProt entries.
Mappings between UniProt and HUGO, Entrez Gene and RefSeq are
derived from various publicly available sources of information
that allow the electronic tracking of identifiers between databases.
Contentious or contradictory data is referred to a curator for
judgement.
8. Assignment of GO terms to UniProtKB/Ensembl data
In this release, we have used the following data sources to assign GO terms to
proteins.
- PUBMED:nnnnnnnn
All such annotations are manually curated and can contain any of the
evidence codes available, except 'IEA' (see section 4). Curators have
read the abstract or full paper with the PubMed identifier nnnnnnnn
and assigned the GO terms manually. Where a journal is not indexed
by PubMed, an internal identifier is provided, e.g. PBTnnnnnnnn.
The GOA manual annotation set is created by the curators from the
GOA, UniProt and IntAct groups, and is also supplemented with manual
annotation (excluding annotation containing the ISS and IEA codes)
from external model organism databases, see section 2.
Please contact goa@ebi.ac.uk for details.
- GOA:interpro|GO_REF:0000002
Transitive assignment of GO terms based on InterPro classification.
For any protein that has been annotated with one or more InterPro
domains, the corresponding GO terms are obtained from a translation
table of InterPro entries to GO terms (interpro2go) generated
manually by the InterPro team at EBI. The mapping file is available at:
http://www.geneontology.org/external2go/interpro2go.
- GOA:hamap|GO_REF:0000020
GO terms are manually assigned to each HAMAP family rule. HAMAP family
rules are a collection of orthologous microbial protein families,
from bacteria, archaea and plastids, generated manually by expert
curators. The assigned GO terms are then transferred to all the
proteins that belong to each HAMAP family. Only GO terms from the
molecular function and biological process ontologies are assigned.
GO annotations using this technique will receive the evidence code
Inferred from Electronic Annotation (IEA). These annotations are
updated monthly by HAMAP and are available for download on both
GO and GOA EBI ftp sites. HAMAP (High-quality Automated and
Manual Annotation of Microbial proteins) is a project based at
the Swiss Institute of Bioinformatics (Gattiker et al. 2003,
Comp. Biol and Chem. 27: 49-58).
For further information, please see: http://www.expasy.org/sprot/hamap
- GOA:spkw|GO_REF:0000004
Transitive assignment using Swiss-Prot keywords. This method is used
for any database record that has one or more Swiss-Prot keywords assigned.
Each keyword is mapped to the corresponding GO term in the spkw2go file,
which was originally constructed manually by MGI curators and is now
maintained by the GOA team at EBI. The mapping file is available at:
http://www.geneontology.org/external2go/spkw2go.
- GOA:spec|GO_REF:0000003
Transitive assignment using Enzyme Commission identifiers.
This method is used for any database entry, such as a protein record
in Swiss-Prot or TrEMBL, that has had an Enzyme Commission number
assigned. The corresponding GO term is determined using the EC
cross-references in the GO molecular function ontology.
Also see Hill et al., Genomics (2001) 74:121-128.
The mapping file is available at:
http://www.geneontology.org/external2go/ec2go.
- GOA:compara|GO_REF:0000019
GO terms from a source species are projected onto one or more target
species based on gene orthology obtained from the Ensembl Compara system.
Only one to one and apparent one to one orthologies are used, and only GO
annotations with an evidence type of IDA, IEP, IGI, IMP or IPI are
projected. Projected GO annotations using this technique will receive the
evidence code 'IEA' (Inferred from Electronic Annotation). The UniProtKB
protein accession of the annotation source will be indicated in the 'With'
column of the GOA association file.
- GOA:spsl|GO_REF:0000023
Transitive assignment of GO terms based on Swiss-Prot Subcellular Location
vocabulary annotation. The UniProt Consortium has developed a Subcellular
Location vocabulary (SPSL) to annotate UniProt Knowledgebase entries (in
CC_SUBC LOCATION lines). The GOA curators at EBI have manually mapped this
vocabulary to the GO cellular component ontology. This mapping file, spsl2go,
is used to obtain corresponding GO terms for any UniProtKB entry that has
SPSL annotation; the mapping file is available from:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/spsl2go
- GO_REF:0000024
Method for transferring manual annotations to an entry based on a curator's
judgment of its similarity to a putative ortholog which has annotations with
experimental evidence. Annotations are created when a curator judges that the
sequence of a protein shows high similarity to another protein that has
annotation(s) supported by experimental evidence (IDA, IGI, IMP, IPI or IEP).
Annotations resulting from the transfer of GO terms display the 'ISS' evidence
code and include an accession for the protein from which the annotation was
projected in the 'with' field (column 8). This field can contain either a
UniProtKB Accession or an IPI (International Protein Index) identifier.
Further information on this method can be found at:
http://www.ebi.ac.uk/GOA/ISS_method.html
- GO_REF:0000015
The Gene Ontology (GO) Consortium created the evidence code "ND" to indicate
"no biological data available". This code is used for annotations to
any of the three terms 'molecular function unknown ; GO:0005554', 'biological process
unknown ; GO:0000004' or 'cellular component unknown ; GO:0008372'. The use of
any of these three GO terms, attributed to this reference and supported
by the ND evidence code, signifies that a curator has examined the available
literature and sequence for this gene and that as of the date of the annotation
to the unknown term, there is no information supporting an annotation to any GO
term in that ontology. (Note that ND can be used with any one (or two) of the
'unknown' terms, even if there is data available to support annotation to a term
from one or both of the other ontologies; e.g., ND can be used with GO:0008372 if the
function and process are known but component is not).
- GO_REF:0000029
Method for GO terms which were manually assigned to
UniProt KnowledgeBase accessions using either a NAS or TAS evidence code by
applying information extracted from a publicly available, manually curated
UniProtKB entry. Such GO annotations were submitted by the GOA-UniProt group
from 2001; however, this annotation practice was discontinued in 2007.
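Several of the electronic methods above (interpro2go, spkw2go, ec2go and
spsl2go) share one pattern: an external identifier attached to a protein is
looked up in a mapping file and the corresponding GO terms are assigned.
The following is a minimal sketch of that lookup, not the GOA pipeline. It
assumes mapping lines of the form 'ExternalID optional text > GO:term name ;
GO:nnnnnnn' with comment lines starting with '!'; both function names are
illustrative.

```python
from collections import defaultdict


def parse_external2go(lines):
    """Return {external identifier: set of GO identifiers} from mapping lines."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank and comment lines
        left, sep, right = line.partition(" > ")
        if not sep:
            continue  # not a mapping line in the assumed format
        external_id = left.split()[0]
        # The GO identifier is assumed to follow the last ';' on the line.
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping[external_id].add(go_id)
    return mapping


def transitive_go_terms(external_ids, mapping):
    """Collect the GO identifiers mapped from any of the given identifiers."""
    terms = set()
    for external_id in external_ids:
        terms |= mapping.get(external_id, set())
    return terms
```

Annotations produced this way would carry the 'IEA' evidence code and the
relevant GO_REF identifier in the reference field, as described above.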
9. Additional information on Manual Annotation in GOA
For information on manual annotation guidelines and the usage of
manual evidence codes please see:
http://www.geneontology.org/GO.annotation.html
http://www.geneontology.org/GO.evidence.html
Usage of the ISS code within GOA
There are three ways in which a curator can use the ISS evidence code:
- If a curator reads a paper that provides functional information
for a protein and also states an orthology between it and another
protein, then manual annotation can be transferred to the ortholog.
The ortholog's annotation will contain the evidence code 'ISS' and
the original literature identifier is displayed in the DB:reference
field (column 6). Any information previously in the 'with' column
of the original protein's annotation is replaced, in the ortholog's
annotation, by the sequence identifier (UniProt accession number)
of the original protein. This allows the source of the 'ISS'
annotation to be traced.
- If a curator is confident that a protein shows high similarity
to another protein (e.g. from using BLAST) and it
seems reasonable to infer that the two proteins have a common
function, then manual annotation can be transferred to an ortholog.
The ortholog's annotation will contain the evidence code 'ISS', an
accession for the protein from which the annotation was projected
will be present in the 'with' field (column 8) and
the reference field (column 6) will contain the GO_REF:0000024.
Further information on this method can be found at:
http://www.ebi.ac.uk/GOA/ISS_method.html
- If sequence similarity and functional information is reported in
two different papers, then the primary annotation can be transferred
to an ortholog. The ortholog's annotation will contain the evidence
code 'ISS', the identifier of the paper which describes the sequence
similarity is displayed in the DB:reference field (column 6) and any
information that was previously contained in the 'with' column of
the original entry is changed in that of the ortholog to contain the
original entry's accession number. This allows the source of the
annotation to be traced.
N.B. For all of the methods described above, only annotations that
have an experimental evidence code (either: IDA, IEP, IGI, IMP or IPI)
can be further transferred to other proteins. In addition, annotations
having the 'NOT' qualifier cannot be transferred by ISS.
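The two restrictions in this note can be expressed as a simple check. The
following is a minimal sketch; the function name can_transfer_by_iss is
illustrative, and the qualifier is assumed to use the pipe-separated form
shown for the Qualifier column in section 4.

```python
EXPERIMENTAL_CODES = {"IDA", "IEP", "IGI", "IMP", "IPI"}


def can_transfer_by_iss(evidence, qualifier=""):
    """Return True if an annotation may be used as the source of an ISS transfer."""
    # Qualifiers may be combined, e.g. 'NOT | contributes_to'.
    flags = {flag.strip() for flag in qualifier.split("|") if flag.strip()}
    return evidence in EXPERIMENTAL_CODES and "NOT" not in flags
```

An annotation with evidence 'IDA' and no qualifier passes the check, while
one with evidence 'TAS', or with the 'NOT' qualifier, does not.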
10. Addition of GO assignments from other data sources
The GOA dataset has also been supplemented with the last (2001) public
release of manual annotation from Proteome Incorporated. A number of
annotations from Proteome Inc. contain the NR evidence code, which is
not explicitly related to a journal reference; the replacement of this
subset with more up-to-date and detailed GO annotation is one of GOA's
priorities.
GOA has integrated annotations from the EBI's IntAct protein-protein
interaction database. Only those binary interactions which are of high
enough quality to be integrated into the UniProt database have been
included (this is decided on experimental method type). All GO terms
in these annotations are children of the protein binding term
(GO:0005515), use the 'IPI' evidence code along with the sequence
identifier of the protein's binding partner in column 8 ('with').
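The IntAct-derived annotations can be picked out of an association file
using the columns described in section 4. The following is a minimal
sketch, assuming 'IntAct' in column 15, the 'IPI' evidence code in column 7
and pipe-separated identifiers in column 8; the function name
intact_binding_pairs is illustrative.

```python
def intact_binding_pairs(lines):
    """Return a set of (annotated protein, binding partner) pairs from IntAct."""
    pairs = set()
    for line in lines:
        if line.startswith("!"):
            continue  # assumed comment/header line
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) < 15:
            continue
        if cols[14] == "IntAct" and cols[6] == "IPI":
            # Column 8 ('with') holds the binding partner identifier(s).
            for partner in cols[7].split("|"):
                if partner:
                    pairs.add((cols[1], partner))
    return pairs
```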
11. Further information on the PDB association file
The 'gene_association.goa_pdb' gene association file provided by the
GOA group contains GO assignments to PDB entries. In this file PDB
entries are only assigned GO terms based on matching InterPro domains.
12. Contacts
Please direct any questions to goa@ebi.ac.uk. We welcome any
feedback.
13. Copyright Notice
GOA - GO Annotation@EBI
Copyright 2009 (C) The European Bioinformatics Institute.
This README and the accompanying databases may be copied and
redistributed freely, without advance permission, provided that this
copyright statement is reproduced with each copy.