CALBC Project Overview


Producing a gold standard for semantically annotated data is time-consuming and costly, predominantly because of the manual curation work involved. We advocate an alternative approach, a “silver standard”, which results from harmonizing automatically generated annotations. Different annotation groups deliver the metadata generated by their in-house annotation systems, and these contributions are then merged into a compromise set of annotations.
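
As a rough illustration of the harmonization idea, the sketch below merges the outputs of several annotation systems by simple voting: an annotation enters the consensus set when at least min_votes systems propose it with identical boundaries and semantic type. The Annotation record, the harmonize function, the exact-match criterion, the vote threshold, and the sample data are all assumptions made for this example; they are not the project's actual consensus model.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical record for one annotated text span."""
    doc_id: str
    start: int          # character offset where the span begins
    end: int            # character offset where the span ends
    semantic_type: str  # e.g. "PROTEIN", "DISEASE"


def harmonize(system_outputs: Iterable[Iterable[Annotation]],
              min_votes: int = 2) -> List[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems (exact match)."""
    votes = Counter()
    for output in system_outputs:
        # set() ensures each system votes at most once per annotation
        votes.update(set(output))
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Made-up outputs of three annotation systems for one document
system_a = [Annotation("doc1", 0, 4, "PROTEIN"),
            Annotation("doc1", 10, 18, "DISEASE")]
system_b = [Annotation("doc1", 0, 4, "PROTEIN")]
system_c = [Annotation("doc1", 0, 4, "PROTEIN"),
            Annotation("doc1", 10, 18, "DISEASE"),
            Annotation("doc1", 20, 25, "SPECIES")]

consensus = harmonize([system_a, system_b, system_c], min_votes=2)
```

With a threshold of two votes, the consensus keeps the PROTEIN and DISEASE spans and drops the SPECIES span that only one system proposed. Other agreement criteria (for instance partial boundary overlap) would yield a different silver standard.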

We cover 150,000 Medline abstracts on immunology, a reasonably broad topic that is addressed in more than 1 million of the 18 million abstracts in Medline. This document set will be annotated collaboratively with five to ten semantic entity types by the organizers and the participants in two consecutive annotation challenges.

A secondary goal of this project is to define a standardized format for representing the annotations contributed by the participants and for comparing them effectively. Currently, the lack of such a format hinders progress in the evaluation of named entity recognition systems. The final corpus will also be made available in RDF for use in Semantic Web applications.

The annotated corpus becomes a resource for the community, to be used as a reference for improving text mining applications.

The CALBC project and the CALBC corpus are innovative solutions:

  • first corpus that contains a large number of annotations
  • first corpus that makes use of shared terminological resources for the annotation
  • first corpus that contains the annotations from several systems
  • first corpus that has been generated fully automatically

Biggest expected benefits:

  • train an NER solution against the corpus to identify a large number of semantic groups in any corpus
  • use the corpus to disambiguate different semantic groups
  • contribute a different type of annotation (more specific or more general), receive an assessment against the silver standard corpus (SSC), and contribute to the next SSC
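
An assessment of a contributed annotation set against the SSC could, for example, be expressed as precision, recall and F-measure over exactly matching annotations. The following is a minimal sketch under that assumption; the function name assess, the tuple layout (document id, start, end, semantic type), the exact-match criterion, and the sample data are hypothetical and not taken from the project's evaluation procedure.

```python
from typing import Iterable, Tuple

# Hypothetical annotation layout: (document id, start offset, end offset, semantic type)
Span = Tuple[str, int, int, str]


def assess(candidate: Iterable[Span],
           reference: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of candidate against reference."""
    cand, ref = set(candidate), set(reference)
    true_positives = len(cand & ref)  # annotations found in both sets
    precision = true_positives / len(cand) if cand else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up reference (SSC) and submission for one document
ssc = [("doc1", 0, 4, "PROTEIN"), ("doc1", 10, 18, "DISEASE")]
submission = [("doc1", 0, 4, "PROTEIN"), ("doc1", 20, 25, "SPECIES")]

precision, recall, f_measure = assess(submission, ssc)
```

In this made-up case one of the two submitted annotations matches the reference and one of the two reference annotations is found, so all three scores come out at 0.5.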


Known limitations:

  • Quality of the SSC is unknown
  • SSC contains systematic errors due to the use of incomplete standard resources
  • No criteria available yet for the proper choice of an adequate consensus model


Acknowledgement / Funding

  • European Commission: 7th FRAMEWORK PROGRAMME
  • THEME: Intelligent Content and Semantics [ICT-2007.4.2]
  • Grant Agreement Number: 231727