
BioBabel: Successful Outcome of the BioBabel project

Project Outline


We are pleased to announce the completion of BioBabel, a project funded by the European Union as part of the Quality of Life and Management of Living Resources program. The project was coordinated by the EBI and ran from December 2001 to November 2004. It has met its objective of enhancing the interoperability of biological databases by improving the standardisation of biochemical terminology and by introducing shared ontologies. The project successfully drew on the expertise of databases maintained at major bioinformatics centres throughout Europe.

Why do we need standardised vocabularies? Biological databases describe a wide spectrum of information, and this diversity makes database integration difficult. This project allowed the development and implementation of common ontologies to describe biological attributes in databases. This work will allow users to run complex queries across databases in a simpler way. The partners in this project aimed to implement standardised terminology in all the databases they produce and maintain. These include the UniProt Knowledgebase, BRENDA, Newt, GOA, GO, CitEXplore, IntEnz, ChEBI, InterPro and CluSTr. Text- or sequence-based searches of the databases will allow researchers to infer knowledge about the structure and function of genes and proteins and to relate these to the existing corpus of scientific knowledge.

The project consists of 12 workpackages that form an integrated whole. These workpackages fall into 6 different classes:
  1. Research and development of controlled vocabulary for biological and biochemical terminology (WP1-5)

  2. Research and development of a structured controlled vocabulary, the Gene Ontology (GO), to describe gene products in terms of their molecular function, biological role and cellular location (WP6)

  3. Development and implementation of a database system to store the controlled vocabulary for biological and biochemical terminology and the Gene Ontology, enabling us to keep the controlled vocabulary up-to-date (WP7)

  4. Implementation of controlled vocabulary for biological and biochemical terminology in the databases of the BioBabel partners (WP8)

  5. Rigorous classification of data in protein sequence, protein signature, enzyme and enzyme function databases with GO terms (WP9-11)

  6. Development and implementation of new access and retrieval tools that will enable researchers to exploit maximally the data in the databases participating in this project (WP12)

Achievements of BioBabel


Data resources (BioBabel workpackages 1-12)

Shared Taxonomy (WP 1)
The NEWT database was set up to integrate the taxonomy data compiled in the NCBI database with the data specific to the UniProt protein knowledgebase. It is updated daily.

  • Achievements:
    • Populated NEWT with taxa from NCBI.

    • Enhanced taxonomy for automatic annotation of UniProt/TrEMBL and InterPro.

    • Supplemented with the latest information from specialized databases, articles, websites and research.

    • Corrected typos, Latin grammar and capitalization.

    • Established a systematic way of naming viral and bacterial strains and removed redundancy.

    • Substantially cleaned up the Influenza virus taxonomy as well as several other taxa: snakes, red algae and Viridiplantae.

    • Collaborated with NCBI to improve consistency; some new classifications were created.

    • Will continue to work on standardizing the representation of infected host species and endosymbionts.

  • Access this data:
    http://www.ebi.ac.uk/newt/


Shared Citation Database (WP 2)
We have developed a relational database and tools to represent and maintain a shared citation component. Clear standards are being established for the content and storage of information in the Citation database, and for propagating new literature from the database to other BioBabel partners. These standards will make future developments easier, such as checking entries for missing citations, extracting information and querying the citation database. Already, 85-90% of EMBL entries with incomplete or incorrect reference records have been validated and completed using the shared citation component.

The citation database currently holds 14 million PubMed/MEDLINE entries, which are accompanied by cross-references to the proteins stored in UniProt. Additional cross-references from citations to records in EMBL, InterPro and ASD (Alternative Splicing Database) have been added. Future plans include adding more cross-references from citations to records in the GOA and IntAct databases.

A demonstration of CitEXplore, a web-based browser for the Citation database, has been made available and already allows a variety of search query options. CitEXplore combines literature search with text mining tools for biology. Search results are cross-referenced to EBI applications based on publication identifiers. Links to full-text versions are provided where available.

Further improvements to the Citation database (CitDB) will continue beyond the BioBabel grant.

Shared Tissue List (WP 3)
We made good progress on the development of a controlled vocabulary for tissues to be shared by the partners. So far we have produced a tissue list that is already used to standardise the tissue annotation in UniProt, and a synchronised list used by BRENDA. The tissue ontology can be browsed online from the BRENDA website (see links below).

The further development of the controlled vocabulary for Tissues is still dependent on international efforts like OBO (Open Biological Ontologies). The BRENDA tissue ontology has been made available on the OBO site so that other ontology developers can comment and contribute to a gold standard.

For international database interoperability, it is important that BioBabel partners consider and assist in the development of a single gold-standard tissue vocabulary.

Shared Strain List (WP 4)
This workpackage was tightly coupled with WP1. We populated the NEWT database with all strain data from the NCBI taxonomy data. We then cleaned up the information for 6,498 strains and 1,563 curated species. The number of species with strain information increased by 11%. Mainly viral strains have been added or cleaned up (see WP1). This work helped us to update thousands of UniProt records with the correct strain information. This workpackage requires a lot of manual intervention.

Shared biochemical terminology (WP 5)
The Integrated relational Enzyme database (IntEnz) project was initiated for the development and maintenance of biochemical reactions. It is supported by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) and will contain enzyme data approved by that committee. During the course of this work, BioBabel partners made considerable progress in the classification of new enzymes and the reclassification of existing ones. We also aim to create a "definitive", freely available database of Chemical Entities of Biological Interest (ChEBI).

Shared GO Vocabularies (WP 6)


Molecular Function


Biological Process

Cellular Component
The Gene Ontology (GO) project was set up as a collaborative effort to address the need for consistent descriptions of gene products in terms of their molecular function, biological process and location of action. The EBI hosts the GO editorial office which is responsible for coordinating all changes in the GO vocabularies. GO is stored in a relational database by the GO Consortium and mirrored at the EBI. Both are updated daily. This work is tightly coupled with WP 9. Currently there are >18,000 terms included in GO. The GO vocabularies have been a great success worldwide with over 400 publications citing GO usage.

Implementation of a relational database system (WP 7)
We created robust relational database systems to store and maintain the shared database components (WP1-4), the controlled vocabulary for biological and biochemical terminology (WP5) and the Gene Ontology (WP6). We were already using a relational database to store sequence data and protein family signature data. In addition, we had already implemented a version of the enzyme nomenclature database of partner 2 in the relational database. This database was further extended to populate it with the equivalent data from the Enzyme Classification list of partner 3 and from the enzyme function database of partner 4 (see WP5). An additional extension of the database will accommodate the dictionary of chemical compounds (ChEBI) used in the databases of the BioBabel partners (WP5).

Implementation of shared vocabularies (WP 8)
We have implemented the shared vocabularies, wherever possible, in databases such as the EMBL nucleotide sequence database, UniProt, InterPro, the Proteome Analysis database, ENZYME, IntAct and IntEnz. The common taxonomy of WP 1 has been included in BRENDA, and direct links to the EBI NewTaxonomy have been installed. This involved the redirection of ca. 83,000 links for more than 9,800 different organisms.

GO Annotation (WP 9)
The Gene Ontology Annotation (GOA) project was set up to classify all known gene products with the GO vocabulary (WP 6). Using a combination of electronic (WP 10) and manual techniques, 6.3 million GO annotations have been transitively associated with over 1.3 million gene products represented in UniProt (statistics as of June 2005). This dataset is publicly available as GOA-UniProt. In addition to our annotation of more than 90,000 species, we have produced non-redundant data sets of GO annotation for Mouse, Human, Rat, Arabidopsis, Chicken and Zebrafish. GOA datasets are updated monthly in accordance with the latest data released by the primary data sources.
Downloads
Gene Association File
This is a tab-delimited file of associations between gene products and GO terms and is the most common form of data transfer within the GO Consortium. For more information on our format, read the GOA readme file.

Download:
  • Arabidopsis GOA file: Access - via FTP.
  • Human GOA file: Access - via FTP.
  • UniProt GOA file: Access - via FTP.
  • Mouse GOA file: Access - via FTP.
  • Proteomes GOA file: Access - via FTP.
  • Rat GOA file: Access - via FTP.
  • Zebrafish GOA file: Access - via FTP.
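
As an illustration of how a tab-delimited gene association file might be read, here is a minimal Python sketch. It assumes the 15-column layout of the GO Consortium gene association format and that comment lines start with "!"; the GOA readme file remains the authority on the exact format of a given release. The function names and the sample rows (accessions, symbols and references) are made up for this example and are not real GOA data.

```python
import io
from collections import defaultdict

# Assumed column order: the 15-column GO gene association layout.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def read_gene_associations(handle):
    """Yield one dict per association line, skipping blank and '!' lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short rows so every column name gets a value.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


def go_terms_by_product(handle):
    """Group GO identifiers by gene product identifier."""
    terms = defaultdict(set)
    for row in read_gene_associations(handle):
        terms[row["db_object_id"]].add(row["go_id"])
    return dict(terms)


# Hypothetical sample rows, for illustration only.
SAMPLE = "\n".join([
    "!illustrative comment line",
    "\t".join(["UniProt", "P00001", "EX1", "", "GO:0005515", "PMID:0000000",
               "TAS", "", "F", "Example protein 1", "", "protein",
               "taxon:9606", "20050601", "UniProt"]),
    "\t".join(["UniProt", "P00001", "EX1", "", "GO:0005634", "GOA:interpro",
               "IEA", "", "C", "Example protein 1", "", "protein",
               "taxon:9606", "20050601", "UniProt"]),
    "\t".join(["UniProt", "P00002", "EX2", "", "GO:0005634", "GOA:interpro",
               "IEA", "", "C", "Example protein 2", "", "protein",
               "taxon:10090", "20050601", "UniProt"]),
])

if __name__ == "__main__":
    for product, go_ids in sorted(go_terms_by_product(io.StringIO(SAMPLE)).items()):
        print(product, sorted(go_ids))
```

To use this on a downloaded file, the same functions can be called with an open file handle in place of the in-memory sample.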
GOA xref File
For each GOA release we also distribute a file of cross-references that shows the relationship between the entries in the GOA data set and those in other databases, such as the EMBL/GenBank/DDBJ nucleotide sequence databases, HUGO, LocusLink and RefSeq.

Download:


Web-based tools
QuickGO: A fast web-based browser with access to core GO data and up-to-date electronic and manual EBI GO annotations.
SRS: Search our GOA database or our mirror of the GO Consortium repository (GO).
Proteome Analysis Pages: GO annotations have been produced for the classification of proteins belonging to each complete proteome. On the Proteome Analysis Pages, a slimmed-down version of GO (the EBI GO-slim), representing high-level GO terms, is displayed as a proteome overview.
InterPro: GO annotations made by InterPro are visible directly in InterPro entries.



InterPro and CluSTr GO annotation (WP 10)
InterPro2GO mapping:

InterPro is a key database maintained at the EBI. It provides an integrated documentation resource for proteins, families, and domains. A single InterPro entry provides comprehensive annotation describing a set of related proteins some of which may have identical functions, be involved in the same processes and act in the same locations.

During the curation of each InterPro entry, high-level GO terms are manually assigned, based on a review of the literature available on the related well-known proteins. This annotation is used to generate an InterPro2GO mapping and also serves as a biological summary in the InterPro entry. Sometimes unknown protein sequences in UniProt have cross-references to identifiable InterPro features. The transfer of GO annotation from InterPro to the UniProt protein generates an important first-round annotation of its possible function. This information is integrated in GOA releases (WP 9).

CluSTr2GO mapping:

As part of the TEMBLOR grant we have expanded the CluSTr database. Data from CluSTr is now available for over 50 organisms with completely deciphered genomes, and we will soon make available further data that will increase the number of clustered organisms to over 100.

Our first CluSTr2GO mapping was created and released in May 2004. This mapping was produced by selecting all CluSTr clusters that share at least 70% of their SPTR accession numbers with an InterPro domain or family. The InterPro2GO mapping was then used to assign the appropriate GO terms to those clusters.
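
The selection step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used to build CluSTr2GO: "share at least 70%" is read here as the fraction of a cluster's accession numbers that also belong to the InterPro entry, and the function name, data structures and all identifiers in the example are hypothetical.

```python
def cluster_to_go(clusters, interpro_members, interpro2go, threshold=0.70):
    """Assign GO terms to clusters via overlapping InterPro entries.

    clusters:         dict of cluster id -> set of protein accessions
    interpro_members: dict of InterPro id -> set of protein accessions
    interpro2go:      dict of InterPro id -> set of GO ids
    threshold:        minimum shared fraction of the cluster's accessions
                      (assumed interpretation of the 70% rule)
    """
    mapping = {}
    for cluster_id, accessions in clusters.items():
        if not accessions:
            continue
        go_terms = set()
        for interpro_id, members in interpro_members.items():
            shared_fraction = len(accessions & members) / len(accessions)
            if shared_fraction >= threshold:
                # Transfer the GO terms mapped to this InterPro entry.
                go_terms |= interpro2go.get(interpro_id, set())
        if go_terms:
            mapping[cluster_id] = go_terms
    return mapping


# Hypothetical identifiers, for illustration only.
example_clusters = {
    "CL1": {"A%d" % i for i in range(1, 11)},  # 10 accessions
    "CL2": {"B1", "B2"},
}
example_members = {
    "IPR_A": {"A%d" % i for i in range(1, 9)} | {"B1"},  # 8 of CL1, 1 of CL2
}
example_interpro2go = {"IPR_A": {"GO:0005515"}}

if __name__ == "__main__":
    print(cluster_to_go(example_clusters, example_members, example_interpro2go))
```

In this example CL1 shares 8 of its 10 accessions with IPR_A and so inherits its GO term, while CL2 shares only 1 of 2 and is left unmapped.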

Enzyme to GO mapping (WP 11)
A mapping of GO terms to Enzyme Commission (EC) numbers has already been achieved and applied in the GOA releases (WP 9).

Dissemination (WP 12)
The EBI is historically strong in ensuring academic exploitation of its services. We have already made good progress on the dissemination of BioBabel project results to the scientific community. Text-based searches against the protein sequence databases, the protein family signature database, GO, the shared database components, the enzyme database and the enzyme function database are possible from SRS at EBI (http://srs.ebi.ac.uk) and various other entry points.


BioBabel partners





Funding

The BioBabel project is funded by the European Commission under contract no. QLRI-CT-2000-00981 of the RTD program "Quality of Life and Management of Living Resources".


Contact

Please contact  .

