Student Projects and Internships

We encourage students with a background in computer science, bioinformatics and/or computational linguistics to apply for projects and internships. Students should have some experience with a Unix/Linux environment and Java, and must be prepared to work with shell scripting and XML annotations. Biological knowledge is a plus. Student projects and internships last 6 months or longer. They may count toward a master's thesis or diploma thesis. Students receive a lump sum for living expenses, and travel costs within Europe are covered.
If you agree to these conditions, we invite you to review the currently available projects and apply via email. In your application, please answer the following questions:
  1. Which project do you want to apply for? (see below)

  2. What credit will the project earn you toward your studies?
    (Note that these projects are not suitable as PhD projects.)
    • Master/Bachelor or similar thesis
    • internship with final report
    • internship only
    • other (please explain)

  3. Briefly explain why you think you are the right person to pursue the project.

    If you want to propose your own project on or around the topics of text mining and information extraction in biology, please submit your proposal to me (Dietrich Rebholz-Schuhmann) for further discussion and assessment. Project proposals should not exceed 300 words.

Available projects:

  • Identification of disease types from biomedical text. 
    Terminological resources have to be exploited to identify disease names in scientific literature.  These resources will be harvested, assessed and integrated into the information extraction pipeline available in the research group.
  • Identification of phenotype information from the biomedical literature
    Phenotypic information is available from terminological resources.  The terminological resource has to be harvested and WHAT ...
  • Disambiguation of species or drug names based on contextual information
    This requires generation of a training corpus for a machine learning classifier.  An initial corpus could be generated by mining Medline with language patterns.
  • Identification of chemical compounds in literature (together with collaboration partners)
    We want to assess a machine learning approach for this task.  It has to be suitable to be integrated into our information extraction pipeline.
  • Assessment of approaches for anaphora resolution.
  • Assessment of summarisation techniques.
  • Integration of information extraction into bioinformatics workflows (e.g. microarray experiment annotation, ongoing curation work at the EBI)
  • Linguistically motivated categorisation of protein-protein interactions based on samples from Medline.

Past projects:

  • Automatic generation of a training corpus for the disambiguation of protein names, where the protein name is also used in common English with a different meaning.  The training corpus was generated from Medline. 
    (German diploma thesis, Georg Schumann, University of Applied Sciences, Weihenstephan, Germany)
  • Theoretical foundation of HPSG parsers: assessment of their use in information extraction
    (internship Damiano Somenzi, University of Milan)
  • Automatic generation of language patterns for the identification of protein-protein interactions from Medline. 
    (internship Joerg Hakenberg, Humboldt University, Berlin, Germany;  DAAD exchange program)
  • Extraction of kinetics parameters from the scientific literature.
    (master's thesis Sebastian Schmeier, Freie University Berlin, Germany)
  • Assessment of the use of iHOP in the IntAct curation project
    (internship Robert Hofmann, CSIC-CNB, Madrid, Spain; Marie Curie Fellowship)
  • Extraction of GO annotation evidence from Medline to annotate proteins from the TrEMBL database. 
    (internship Francisco Couto, University of Lisbon, Lisbon, Portugal; Marie Curie Fellowship)