BOOTStrep - Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project

BOOTStrep is a pan-European, interdisciplinary project in the IST programme of the EC's Sixth Framework Programme. The project started in April 2006 and will terminate in March 2009. The overall project budget is €4.2 million.

Central Goals

  • To exploit existing terminological resources in the biomedical domain to generate a new resource (the BioLexicon) based on a common representation framework. The BioLexicon includes terminology from domain-specific literature to fill gaps in existing resources and is interlinked with ontological concepts (the BioOntology).
  • To create a repository of biological facts from the literature (the FactStore) via automatic text processing. The FactStore is interoperable with the BioLexicon and the BioOntology.
  • To develop open-access NLP tools for text-based knowledge harvesting in order to support information extraction and text mining in the biomedical domain.

For further details regarding goals and achieved results, please consult:

  1. the official BOOTStrep Web page,
  2. the BOOTStrep Web page at NaCTeM,
  3. the official BOOTStrep Web page at the CEC.


The website for the Gene Regulation Ontology (GRO) is now available!

GRO has been developed as part of the BOOTStrep project and is hosted at the EBI. More information can be found here.
Beisswanger,E., Lee,V., Kim,J.J., Rebholz-Schuhmann,D., Splendiani,A., Dameron,O., Schulz,S., Hahn,U. Gene Regulation Ontology (GRO): Design Principles and Use Cases. Stud Health Technol Inform. 2008;136:9-14. PMID: 18487700

The Gene Regulation Ontology (GRO) has been submitted to the Open Biomedical Ontologies (OBO) library!

The BOOTStrep Gene Regulation Ontology has been submitted to the Open Biomedical Ontologies (OBO) library and is currently under review. By now it can be found at

MedEvi is a novel search engine that retrieves and aligns sentences from Medline abstracts.

MedEvi has been developed as part of the BOOTStrep project. The search engine identifies sentences in Medline abstracts that contain the query terms, then sorts, prioritizes and aligns them according to those terms.
Kim,J.J., Pezik,P., Rebholz-Schuhmann,D. (2008) MedEvi: Retrieving textual evidence of relations between biomedical concepts from Medline. Bioinformatics (open online access).
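
The retrieve, sort and align idea described above can be illustrated with a small sketch. This is not MedEvi's actual implementation: the function names (split_sentences, find_evidence, align), the naive sentence splitter, the ranking by character distance between query terms, and the fixed-width alignment are all assumptions made for illustration only.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_evidence(abstracts, terms):
    """Collect sentences that contain every query term.

    Returns (span, sentence, positions) tuples, tightest span first, where
    span is the character distance between the first occurrences of the
    terms. Ranking by span is an illustrative heuristic.
    """
    if not terms:
        return []
    hits = []
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            lowered = sentence.lower()
            positions = [lowered.find(t.lower()) for t in terms]
            if min(positions) < 0:
                continue  # at least one query term is missing
            span = max(positions) - min(positions)
            hits.append((span, sentence, positions))
    hits.sort(key=lambda h: h[0])
    return hits

def align(hits, width=30):
    """Keyword-in-context view: the first query term starts at column `width`."""
    lines = []
    for _, sentence, positions in hits:
        start = positions[0]
        left = sentence[max(0, start - width):start]
        right = sentence[start:start + width]
        lines.append(left.rjust(width) + right)
    return lines

if __name__ == "__main__":
    abstracts = [
        "p53 is a tumour suppressor. MDM2 binds p53 and inhibits it.",
        "The MDM2 gene, which is amplified in many sarcomas, "
        "encodes a regulator of p53.",
    ]
    for line in align(find_evidence(abstracts, ["MDM2", "p53"])):
        print(line)
```

Sentences in which the query terms occur close together are listed first, and the fixed column for the first query term makes the surrounding context easy to compare across sentences.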

Evaluation of the Term Repository against standard corpora

The BOOTStrep consortium is developing a lexical resource called the BioLexicon. At present, its core content, the Term Repository, has been generated and exchanged with the partners, who augment it with terms from the literature (NaCTeM/UoM) and feed the results into a database schema that fulfils the standard requirements of a lexical resource (CNR, Pisa). The content of the Term Repository has been assessed against the corpus of the BioCreAtIvE II / Task 1b challenge (gene name normalisation).
Pezik, P., Jimeno, A., Lee, V., Rebholz-Schuhmann, D. (2008) Static Dictionary Features for Term Polysemy Identification. In Proceedings of the Language Resources and Evaluation Conference (LREC-2008), workshop on "Building and evaluating resources for biomedical text mining". Marrakech (Morocco), 28-30 May 2008.

Access to the BioLexicon
The content of the BioLexicon is available in different formats:

  1. XML interchange format (XIF): The collected terms are contained in special XML-formatted files, and the whole set of files is called the Term Repository. The different XIF files of the Term Repository can be accessed here.
  2. The BioLexicon will also be available as a dump of a relational database (MySQL). While the database dump has already been generated, it still requires some maintenance to produce the most efficient, lean version of the database.


Several resources are available from the project partners (the list and links will be updated by the end of November).

Contributing Partners

Coordinator: Prof. Udo Hahn (FSU-JENA)
Friedrich-Schiller-Universität Jena
Jena University Language & Information Engineering (JULIE) Lab
Dr. Dietrich Rebholz-Schuhmann (EMBL-EBI)
European Molecular Biology Laboratory - European Bioinformatics Institute
United Kingdom
Prof. Nicoletta Calzolari (CNR-ILC)
Consiglio Nazionale delle Ricerche
Istituto di Linguistica Computazionale
Dr. Sophia Ananiadou (UOM-NACTEM)
University of Manchester
National Centre for Text Mining
United Kingdom
Dr. Stefan Schulz (UKLFR)
Universitätsklinikum Freiburg
Prof. Anita Burgun-Parenthoine (UR1)
Université de Rennes 1
Prof. Su Jian (I2R)
Institute for Infocomm Research