CALBC – the Challenge

Proceedings and presentations of the CALBC Workshop II are now available.

CALBC (Collaborative Annotation of a Large Biomedical Corpus) is a European Support Action addressing the automatic generation of a very large, community-wide shared text corpus annotated with biomedical entities. We propose to create a broadly scoped and diversely annotated corpus (about one million Medline immunology-related abstracts annotated with different semantic types) by automatically integrating the annotations from different named entity recognition systems.
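As an illustration only, integrating the annotations of several systems could be sketched as a simple voting scheme. The text above does not describe the actual CALBC integration procedure, so everything here is an assumption: the function name `harmonize`, the `min_votes` threshold, the exact-match criterion, and the representation of an annotation as a `(document id, start, end, semantic type)` tuple are all hypothetical.

```python
from collections import Counter


def harmonize(system_annotations, min_votes=2):
    """Keep every annotation that at least `min_votes` systems agree on.

    Hypothetical sketch: an annotation is a tuple
    (doc_id, start, end, semantic_type), and two annotations agree
    only if all four fields match exactly.
    """
    votes = Counter()
    for annotations in system_annotations:
        # Use a set so one system cannot vote twice for the same annotation.
        votes.update(set(annotations))
    return {ann for ann, count in votes.items() if count >= min_votes}


# Three hypothetical systems annotating the same abstract.
system_a = [("doc1", 0, 5, "protein"), ("doc1", 10, 18, "disease")]
system_b = [("doc1", 0, 5, "protein"), ("doc1", 20, 27, "species")]
system_c = [("doc1", 0, 5, "protein"), ("doc1", 10, 18, "disease")]

harmonized = harmonize([system_a, system_b, system_c])
```

Here the annotation proposed by only one system is dropped, while those supported by two or three systems are kept. A real integration would probably also need to handle overlapping boundaries and differing semantic types, which this sketch ignores.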

To collect annotations from as many different systems as possible, we need help from all interested research groups. Two challenges are therefore being organised. CALBC Challenge II continues the efforts of CALBC Challenge I, but you can participate in Challenge II without having taken part in Challenge I.

Participation and Benefits

Participation is open to any team willing to submit annotations obtained with its own named entity recognition or concept identification system. Participants will receive a fully automated assessment of their results against the silver standard corpus (SSC).
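A minimal sketch of what such an automated assessment might compute is shown below. It assumes exact-match scoring with precision, recall and F1. The scoring rules actually used by CALBC are not stated here, and the function name `score` and the annotation tuples `(document id, start, end, semantic type)` are hypothetical.

```python
def score(submitted, reference):
    """Return (precision, recall, f1) of submitted annotations.

    Hypothetical sketch: annotations are hashable tuples and count as
    correct only when they match a reference annotation exactly.
    """
    submitted, reference = set(submitted), set(reference)
    true_positives = len(submitted & reference)
    precision = true_positives / len(submitted) if submitted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference corpus with four annotations.
reference = [
    ("doc1", 0, 5, "protein"),
    ("doc1", 10, 18, "disease"),
    ("doc2", 3, 9, "species"),
    ("doc2", 15, 22, "protein"),
]
# A hypothetical submission that finds two of them and nothing else.
submitted = [("doc1", 0, 5, "protein"), ("doc2", 3, 9, "species")]

precision, recall, f1 = score(submitted, reference)
```

In this example every submitted annotation is correct, but only half of the reference annotations are found, so recall is lower than precision.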

If you train NER solutions against the SSC, which will be made publicly available, you will be able to identify a large number of semantic groups in any other corpus.

You can also contribute a different type of annotation (more specific or more general) to the challenge. You will then receive an assessment against the current SSC while contributing to the next SSC.

The resulting corpus can be exploited for different goals:

  • The text mining community can train existing text mining solutions to reproduce the CALBC annotations.
  • Novel text mining solutions can be developed using the corpus, such as new methods for the disambiguation of entities.
  • CALBC will provide a larger body of biomedical information than is currently available to the text mining community.
  • The corpus will be delivered in a Resource Description Framework (RDF) representation so that it can be integrated in the Semantic Web. The corpus will serve as a data resource for data mining solutions that contribute to the understanding of immunological questions.

Timeline


  • October 2009: Challenge I opens
  • February 14th, 2010 (extended): Challenge I closes
  • June 17th to 18th, 2010: First CALBC Workshop (EBI, Hinxton, Cambridge, U.K.)
  • September 13th, 2010: CALBC Challenge II opens
  • October 19th, 2010: CALBC training data available
  • December 15th, 2010: Challenge II closes
  • March 2011: Second and final CALBC Workshop
  • June 30th, 2011: Final harmonized corpus available