Gold standard corpora

The following gold standard corpora (GSC) are used to benchmark our approach. Results of the performance evaluation of the CALBC data against the GSC will be published in the near future.

BioCreative II Gene Mention Test

  • Protein mention annotation
  • 5,144 annotations, 4,171 sentences
  • Year of release: 2005
  • Original source [download]
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 13 partners, 6 vote agreement

PennBioIE-Oncology

  • Protein mention annotation
  • 18,148 annotations, 1,414 abstracts
  • Year of release: 2008
  • Original source [download]
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 13 partners, 6 vote agreement

JNLPBA-Test

  • Protein mention annotation
  • 6,142 annotations, 401 abstracts (one abstract from the original is missing, since PMID 93343972 seems to be deprecated)
  • Year of release: 2004
  • Original source [download]
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 13 partners, 6 vote agreement

FSU-PRGE

  • Protein mention annotation
  • 59,483 annotations, 3,236 abstracts
  • Year of release: 2009
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 13 partners, 6 vote agreement

Arizona Disease

  • Disease mention annotation
  • 3,206 annotations, 2,775 sentences
  • Year of release: 2008
  • Original source [download]
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 11 partners, 5 vote agreement

SCAI-Test

  • Chemical mention annotation
  • 1,206 annotations, 100 abstracts
  • Year of release: 2008
  • Original source [download]
  • Publication [download]
  • Download the corpus in IeXML format [download]
  • Download the harmonized corpus in IeXML format [download]: 11 partners, 5 vote agreement

Please note that the sentences included in the CALBC corpora named bc2 and azdc are derived from the BioCreative II and Arizona Disease corpora but do not include all the sentences in those corpora. The original sentences were mapped, as far as we were able, to their source PubMed abstracts, and it is the abstracts (not just the sentences) that were tagged by CALBC participants. Some content may be lost in this mapping process. Some content may also be lost during harmonization if participants non-trivially alter the content while tagging.
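To illustrate why a sentence-to-abstract mapping can lose content, here is a minimal Python sketch. It is not the procedure actually used to build bc2 and azdc, which is not described here. It assumes a simple whitespace-normalized substring match, and the names `normalize`, `map_sentence` and `map_corpus` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Collapse runs of whitespace so spacing differences do not block a match."""
    return " ".join(text.split())


def map_sentence(sentence: str, abstracts: Dict[str, str]) -> Optional[Tuple[str, int]]:
    """Find the first abstract containing the sentence (assumed matching rule).

    Returns (abstract id, character offset in the normalized abstract),
    or None when no abstract contains the sentence.
    """
    needle = normalize(sentence)
    for abstract_id, abstract in abstracts.items():
        offset = normalize(abstract).find(needle)
        if offset != -1:
            return abstract_id, offset
    return None


def map_corpus(
    sentences: List[str], abstracts: Dict[str, str]
) -> Tuple[Dict[str, Tuple[str, int]], List[str]]:
    """Split sentences into those mapped to an abstract and those that are lost."""
    mapped: Dict[str, Tuple[str, int]] = {}
    lost: List[str] = []
    for sentence in sentences:
        hit = map_sentence(sentence, abstracts)
        if hit is None:
            lost.append(sentence)  # no matching abstract: this content drops out
        else:
            mapped[sentence] = hit
    return mapped, lost
```

Under this assumed rule, any sentence whose text differs from its abstract by more than whitespace ends up in `lost`, which is one way the derived corpora can contain fewer sentences than the originals.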