
LexEBI
Lexical Entities of Biological Interest

Introduction

Lexical resources are essential for processing the scientific literature and for linking named entities to the primary data resources. Standardisation of terminological resources is required for text mining and for the normalisation of database resources. Querying across different terminological resources leads to the retrieval of ambiguous alternatives. A standard resource is required to increase interoperability between IT solutions.
LexEBI uses an XML format for the representation and storage of the terminological resource. Explicit references are provided to the preferred term, the term variants, the concept ids, the term frequency in the British National Corpus and in Medline, and the frequency of the term variants. An additional table records the nestedness of the terms in the resources.
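
As an illustration of how such an entry might be read, the following minimal Python sketch parses a single term cluster. The element and attribute names (Cluster, PreferredTerm, Variant, conceptId, bncFreq, medlineFreq), the concept id and the frequency values are all assumptions made up for this example; they are not taken from the actual LexEBI schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: tag names, attribute names, the concept id and all
# frequency values are invented for illustration, not the real XIF schema.
SAMPLE = """
<Cluster conceptId="EXAMPLE:0001">
  <PreferredTerm bncFreq="12" medlineFreq="3400">insulin</PreferredTerm>
  <Variant medlineFreq="150">insulin precursor</Variant>
  <Variant medlineFreq="20">INS</Variant>
</Cluster>
"""


def parse_cluster(xml_text):
    """Turn one cluster element into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    preferred = root.find("PreferredTerm")
    return {
        "concept_id": root.get("conceptId"),
        "preferred_term": preferred.text,
        "bnc_freq": int(preferred.get("bncFreq")),
        "medline_freq": int(preferred.get("medlineFreq")),
        # Map each variant string to its (assumed) Medline frequency.
        "variants": {
            v.text: int(v.get("medlineFreq")) for v in root.findall("Variant")
        },
    }


cluster = parse_cluster(SAMPLE)
print(cluster["preferred_term"], sorted(cluster["variants"]))
```

The real files should be inspected before adapting the tag names used here.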

Contents

The Lexicon makes use of the following resources:
  1. BioThesaurus: Extraction of the term clusters and the term variants. Nonsensical terms such as “hypothetical gene”, “putative gene”, “probable gene”, “possible gene” and single numbers have been removed. The concept identifier of each term from each resource has been kept for later reference purposes. All term variants for a given concept have been organised in a single cluster, where the preferred term forms the label of the cluster.
  2. The terms from ChEBI and Jochem have been processed in the same way.
  3. Enzyme and InterPro have served as terminological resources for enzyme terms and protein family terms, respectively.
  4. The NCBI taxonomy provided the species names.
  5. Disease terms have been extracted from Medline.
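
The filtering and clustering described for BioThesaurus can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used to build LexEBI: the input record layout (concept id, term, preferred flag), the function names and the exact filter patterns are hypothetical.

```python
import re
from collections import defaultdict

# Assumed noise patterns, modelled on the examples given in the list above.
NONSENSE = re.compile(
    r"^(hypothetical|putative|probable|possible) gene$", re.IGNORECASE
)
SINGLE_NUMBER = re.compile(r"^\d+$")


def is_noise(term):
    """True for nonsensical terms and for terms that are a single number."""
    t = term.strip()
    return bool(NONSENSE.match(t) or SINGLE_NUMBER.match(t))


def build_clusters(records):
    """Group (concept_id, term, is_preferred) records into one cluster per
    concept; the preferred term becomes the label of the cluster."""
    clusters = defaultdict(lambda: {"label": None, "variants": []})
    for concept_id, term, is_preferred in records:
        if is_noise(term):
            continue
        entry = clusters[concept_id]
        if is_preferred and entry["label"] is None:
            entry["label"] = term
        elif term != entry["label"] and term not in entry["variants"]:
            entry["variants"].append(term)
    return dict(clusters)


# Made-up records for illustration only.
records = [
    ("X:1", "insulin", True),
    ("X:1", "INS", False),
    ("X:1", "hypothetical gene", False),
    ("X:1", "42", False),
    ("X:2", "putative gene", True),
]
clusters = build_clusters(records)
```

In this sketch a concept whose terms are all filtered out produces no cluster at all, and the concept identifier is kept as the dictionary key so that it remains available for later reference.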

Overview of the content of LexEBI

Resource       # Labels    # Variants    Total
Biothes.7.0    516,113     4,005,040     4,521,153
Biothes.6.0    488,577     3,389,316     3,877,893
InterPro       20,671      0             20,671
Enzymes        4,905       8,082         12,987
JoChem         278,578     1,691,980     1,970,558
ChEBI          19,645      94,748        114,393
ChEBI (all)    549,838     1,187,322     1,737,160
Diseases       56,010      165,581       221,591
Species        643,280     199,130       842,410
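
In the overview, each total is the number of labels plus the number of variants. A short Python check of this relationship, using the figures from the overview (the dictionary layout and function name are only for illustration), could look like this:

```python
# (labels, variants, total) per resource, copied from the overview.
COUNTS = {
    "Biothes.7.0": (516_113, 4_005_040, 4_521_153),
    "Biothes.6.0": (488_577, 3_389_316, 3_877_893),
    "InterPro": (20_671, 0, 20_671),
    "Enzymes": (4_905, 8_082, 12_987),
    "JoChem": (278_578, 1_691_980, 1_970_558),
    "ChEBI": (19_645, 94_748, 114_393),
    "ChEBI (all)": (549_838, 1_187_322, 1_737_160),
    "Diseases": (56_010, 165_581, 221_591),
    "Species": (643_280, 199_130, 842_410),
}


def inconsistent_rows(counts):
    """Return the resources whose total differs from labels + variants."""
    return [
        name
        for name, (labels, variants, total) in counts.items()
        if labels + variants != total
    ]


mismatches = inconsistent_rows(COUNTS)
print(mismatches)  # prints [] because every row is consistent
```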

Download

The content of LexEBI is available as follows:

  1. XML interchange format (XIF): The collected terms are contained in specially formatted XML files, and the whole set of files is called the term repository.
  2. The different XIF files of the term repository can be accessed here.
  3. Use, reuse and distribution are permitted under the Creative Commons License.


Acknowledgements

This work has been funded by the EU Support Action grant 231727 under the 7th EU Framework Programme (ICT 2007.4.2) and by the EC STREP project “BOOTStrep” (FP6-028099, www.bootstrep.org). UKPMC funding was received from the Wellcome Trust, the Medical Research Council, and Cancer Research UK.
