spacer
spacer

Mining the Biomedical Literature


Go to the main EBI website What is the expected gain, if we process scientific text automatically?


Scientific publications form the major tool for the exchange of new scientific facts. Therefore, scientists have to process large amounts of publications, without being able to read everything. This explains, why automatic methods for the analysis of scientific text become very popular. If an automatic method could extract the details needed with good quality, the scientist would safe lot of time reading. If an automatic method could retrieve and pre-select at 100% certainty the relevant documents from the complete set of documents available, then the scientist would safe lot of time searching for the best publications, but he still would have to read them to find the facts.

Text mining methods have been developed and applied in all fields, where large amounts of data were available: analysis of large archives of news documents, internet documents and in recent years scientific publication in the biomedical field. The strong increase in publications in the biomedical field is the consequence of the strong increase of information gathered throughout numerous sequencing projects.

What is text mining (TM)?

The term ‘text mining’ refers to a variety of techniques which fulfill different tasks mainly:
- Information retrieval (IR): Such automatic methods extract the features of a document (words, parts of sentences), classify them and then compare them to a profile, which consists in general of a list of terms. If the document features match to the profile, then the document is selected.
- Information extraction (IE): Such automatic methods seek written evidence of facts in the text, which are in general represented by combinations of words (patterns). Such patterns select a sequence of words and take their morphological changes into consideration. Other techniques include grammar rules to ‘understand’ the sentence (Natural Language Processing (NLP)) or complex classification techniques, e.g. Support Vector Machines (SVM) or Bayesian Networks, to identify patterns of high complexity.
Other approaches use similar techniques to archive documents for better retrieval or to answer questions on the basis of existing documents.

What data is used for TM?


Scientific abstracts from Medline (PubMed) form the standard data resource. In addition scientific groups analyze complete scientific publications. Patent text from the European Patent Office is an additional source.
The importance of correct terminology
The use of correct terminology is necessary for the identification of the object behind the described fact and for the correct link between the literature and scientific databases. Unfortunately natural language is often ambiguous in the use of terminology. The biological community is addressing these issues with work on nomenclature and ontologies. The following ambiguities are most prominent.
- Syntactic ambiguity occurs when a word has a noun or a verb interpretation.
- Semantic ambiguity occurs when two identical terms have different meanings. Disambiguation has to take the context of the term into consideration.

Which advantages result from natural language processing (NLP) ?

A variety of different systems is available to do natural language processing, but no system can cope with the complexity of natural language in a spoken language. In addition, the time for the analysis of a sentence increases with its complexity. As a conclusion, NLP based systems provide solutions on chunks of sentences. The main advantage in NLP based approaches can be found in the fact that sentence analysis can be shaped following a hand-crafted linguistical model and thus they do not rely on the correct annotation of a training data set.

Which TM tasks are important in bioinformatics ?

The most prominent information extraction task is the extraction of protein-protein interactions from Medline (numerous publications). Other tasks are concerned with the association of a gene to a phenotype, e.g. a disease or a symptome, or the identification of a gene in conjunction with a mutation. In addition, research groups are working on the correct identification of protein and gene names from Medline, which is again a non-trivial task due to the complexity of the applied names.


Evaluation of extracted results


The evaluation quantifies how many correct facts were selected from all facts available (= recall) and how many correct facts are contained in all selected facts (= precision). The sweat part of this task is to generate a random sample from the complete set of documents and to count how many facts are contained in the sample. Other evaluation methods compare the extracted information by the systems with independent information stored in databases and large experimental repositories.

by Dr. Dietrich Rebholz-Schuhmann (February, 2004).



spacer
spacer