CALBC Project Overview
Providing a gold standard for semantically annotated data is time-consuming and costly, predominantly because of the manual curation work involved. We advocate an alternative: a “silver standard” that results from harmonizing automatically generated annotations. Different annotation groups deliver the metadata generated by their in-house annotation systems, and these contributions are then merged to form a compromise set of annotations.
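As an illustration only, the merging step described above can be sketched as a simple voting scheme. The text does not specify how the project harmonizes annotations, so this sketch rests on assumptions: annotations count as the same only when document, span boundaries and entity type match exactly, and an annotation enters the silver standard when at least `min_votes` systems propose it. The names `Annotation`, `harmonize` and `min_votes` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """One entity mention (hypothetical representation)."""
    doc_id: str
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    entity_type: str  # e.g. "protein", "disease"


def harmonize(system_outputs: Iterable[Iterable[Annotation]],
              min_votes: int = 2) -> List[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems.

    Assumption: exact-match voting. Partial span overlaps are not
    reconciled in this sketch.
    """
    votes: Counter = Counter()
    for annotations in system_outputs:
        # Use a set so that one system cannot vote twice for the same annotation.
        votes.update(set(annotations))
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Made-up output of three annotation systems for one abstract.
system_a = [Annotation("d1", 0, 5, "protein"), Annotation("d1", 10, 18, "disease")]
system_b = [Annotation("d1", 0, 5, "protein")]
system_c = [Annotation("d1", 0, 5, "protein"),
            Annotation("d1", 10, 18, "disease"),
            Annotation("d1", 20, 25, "species")]

silver = harmonize([system_a, system_b, system_c], min_votes=2)
```

Raising `min_votes` favours precision over recall in the resulting set. A more realistic harmonization would also have to handle differing span boundaries and entity type hierarchies.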
The corpus covers 150,000 Medline abstracts on immunology, a reasonably broad topic addressed in more than 1M of the 18M abstracts in Medline. The organizers and the participants will collaboratively annotate this document set with five to ten semantic entity types in two consecutive annotation challenges.
A secondary goal of this project is to define a standardized format for representing the annotations contributed by the participants and for comparing them effectively. Currently, the lack of such a format hinders progress in the evaluation of named entity recognition systems. The final corpus will also be made available in RDF for use in Semantic Web applications.
The annotated corpus becomes a resource for the community, to be used as a reference for improving text mining applications.
The CALBC project and the CALBC corpus are innovative solutions:
Biggest expected benefits:
Acknowledgement / Funding