Lexical resource from the BOOTStrep project


The BioLexicon brings together terminologies from large public bioinformatics data resources such as UniProtKB, ChEBI and the NCBI Taxonomy to deliver a terminological resource to the different parts of the BOOTStrep project and to the public. The BioLexicon is geared towards representing terms together with lexical and statistical information in order to improve information extraction and text mining.
At present, the core content, called the Term Repository, has been generated and exchanged with the project partners, who augment it with terms from the literature (NaCTeM/UoM) and feed the results into a database schema that fulfils the standard requirements of a lexical resource (CNR, Pisa). The final version of the BioLexicon will be delivered as an image of a relational database (MySQL, ongoing work). For further information, see the ISMB poster.
The content of the Term Repository has been assessed against the corpus of the BioCreAtIvE II / Task 1b challenge (gene name normalisation): Pezik, P., Jimeno, A., Lee, V., Rebholz-Schuhmann, D. (2008) Static Dictionary Features for Term Polysemy Identification. In Proceedings of the Language Resources and Evaluation Conference (LREC-2008), workshop on "Building and evaluating resources for biomedical text mining". Marrakech (Morocco), 28-30 May 2008.


The content of the BioLexicon is available in different formats:

  1. XML interchange format (XIF): The collected terms are contained in special XML-formatted files, and the whole set of files is called the Term Repository. The different XIF files of the Term Repository can be accessed here.
  2. The BioLexicon will also be available as a dump of a relational database (MySQL). While the database dump has already been generated, it still requires some maintenance to produce a lean, efficient version of the database.

A workshop tutorial from the Language Resources and Evaluation Conference (LREC-2008) is available here.

The ISMB 2008 poster can be found here.


