LexEBI
Lexical Entities of Biological Interest
Introduction
Lexical resources are essential for processing the scientific literature and linking named entities to the primary data resources. Standardisation of terminological resources is required for text mining and for the normalisation of database resources. Querying across different terminological resources leads to the retrieval of ambiguous alternatives. A standard resource is required to increase interoperability between IT solutions.
LexEBI uses an XML format for the representation and storage of the terminological resource. Explicit references are made to the preferred term, the term variants, the concept ids, the term frequency in the British National Corpus and in Medline, and the frequency of the term variants. An additional table records the nestedness of the terms in the resources.
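To make the described record layout more concrete, the following is a minimal sketch of how one entry could be modelled and serialised. The actual LexEBI schema is not given here, so every class, tag, and attribute name (`Entry`, `Variant`, `Cluster`, `ConceptId`, `Frequency`, `medlineFreq`) is an assumption made for illustration, and the sample values are invented placeholders.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variant:
    text: str
    medline_freq: int = 0  # frequency of this variant in Medline


@dataclass
class Entry:
    preferred_term: str
    concept_ids: List[str]   # identifiers kept from the source resources
    bnc_freq: int = 0        # term frequency in the British National Corpus
    medline_freq: int = 0    # term frequency in Medline
    variants: List[Variant] = field(default_factory=list)


def entry_to_xml(entry: Entry) -> ET.Element:
    """Serialise one entry. Tag and attribute names are hypothetical."""
    cluster = ET.Element("Cluster", label=entry.preferred_term)
    for concept_id in entry.concept_ids:
        ET.SubElement(cluster, "ConceptId").text = concept_id
    ET.SubElement(cluster, "Frequency", source="BNC").text = str(entry.bnc_freq)
    ET.SubElement(cluster, "Frequency", source="Medline").text = str(entry.medline_freq)
    for variant in entry.variants:
        node = ET.SubElement(cluster, "Variant", medlineFreq=str(variant.medline_freq))
        node.text = variant.text
    return cluster


# Placeholder values only; they do not come from LexEBI.
example = Entry(
    preferred_term="example protein",
    concept_ids=["EX:0001"],
    bnc_freq=0,
    medline_freq=12,
    variants=[Variant("example prot.", 3)],
)
print(ET.tostring(entry_to_xml(example), encoding="unicode"))
```

Keeping the preferred term as an attribute of the cluster and the variants as child elements mirrors the description above, but the real files may organise these fields differently.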
Contents
The Lexicon makes use of the following resources:
BioThesaurus: The term clusters and their term variants have been extracted. Nonsensical terms such as “hypothetical gene”, “putative gene”, “probable gene”, “possible gene” and single numbers have been removed. The concept identifier of each term from each resource has been kept for later reference. All term variants for a given concept have been organised in a single cluster, where the preferred term forms the label of the cluster.
The terms from ChEBI and Jochem have been processed in the same way.
Enzyme and InterPro have served as terminological resources for enzyme terms and protein family terms.
The NCBI taxonomy provided the species names.
Disease terms have been extracted from Medline.
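The filtering and clustering described for BioThesaurus can be sketched roughly as follows. This is not the pipeline actually used to build LexEBI: the input record shape `(concept_id, term, is_preferred)`, the function names, and the choice to match the listed phrases exactly and to treat "single numbers" as digit-only strings are all assumptions.

```python
import re
from collections import defaultdict

# Uninformative phrases listed in the text; exact matching is an assumption.
NONSENSE = {"hypothetical gene", "putative gene", "probable gene", "possible gene"}
SINGLE_NUMBER = re.compile(r"\d+")  # assumption: a "single number" is digits only


def is_nonsensical(term):
    """Return True for terms that should be dropped from the lexicon."""
    normalised = term.strip().lower()
    return normalised in NONSENSE or SINGLE_NUMBER.fullmatch(normalised) is not None


def build_clusters(records):
    """Group (concept_id, term, is_preferred) records into one cluster per concept.

    The preferred term becomes the cluster label; every other surviving
    term is stored as a variant under the same concept identifier.
    """
    clusters = defaultdict(lambda: {"label": None, "variants": []})
    for concept_id, term, is_preferred in records:
        if is_nonsensical(term):
            continue
        cluster = clusters[concept_id]
        if is_preferred and cluster["label"] is None:
            cluster["label"] = term
        else:
            cluster["variants"].append(term)
    return dict(clusters)


# Invented records, for illustration only.
records = [
    ("C1", "alpha kinase", True),
    ("C1", "AK1", False),
    ("C1", "hypothetical gene", False),
    ("C1", "42", False),
    ("C2", "beta ligase", True),
]
print(build_clusters(records))
```

In this sketch a cluster whose preferred term is filtered out keeps a label of `None`; how such cases were handled in the actual resource is not stated.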
Overview of the content of LexEBI

| Resource | # Labels | # Variants | Total |
|---|---|---|---|
| Biothes.7.0 | 516,113 | 4,005,040 | 4,521,153 |
| Biothes.6.0 | 488,577 | 3,389,316 | 3,877,893 |
| InterPro | 20,671 | 0 | 20,671 |
| Enzymes | 4,905 | 8,082 | 12,987 |
| JoChem | 278,578 | 1,691,980 | 1,970,558 |
| ChEBI | 19,645 | 94,748 | 114,393 |
| ChEBI (all) | 549,838 | 1,187,322 | 1,737,160 |
| Diseases | 56,010 | 165,581 | 221,591 |
| Species | 643,280 | 199,130 | 842,410 |
Download
The content of LexEBI is available as:
XML interchange format (XIF): The collected terms are contained in special XML-formatted files, and the whole set of files is called the term repository.
The different XIF files of the term repository can be accessed here.
Use, reuse and distribution under the Creative Commons License
Acknowledgements
This work has been funded by the EU Support Action grant 231727 under the 7th EU Framework Programme (ICT 2007.4.2) and by the EC STREP project “BOOTStrep” (FP6-028099, www.bootstrep.org). UKPMC funding was received from the Wellcome Trust, the Medical Research Council, and Cancer Research UK.