Student Internships


The EMBL Visitors and Scholars Programme also offers opportunities at other EMBL sites.


Cheminformatics and Metabolism Team Christoph Steinbeck


Placement duration: Placements typically last 4-6 months
Contact: Christoph Steinbeck
Deadline for applications: Open until further notice
Special requirements: All applicants should have a good computational background. See also the individual project descriptions.
Group page

How are our databases used? (User research)

Usage of the ChEBI database has been growing steadily since its original release in 2004. The same is true for our recently released public databases such as MetaboLights. We would like to better understand the ways that our databases are being used by analysing the citations to our publications. This user research study will involve reading many papers and grouping the users into broad categories based on what they are using the database for.

Development of an adaptive binning/centroiding algorithm

Mass spectra acquired in profile mode contain several mass-to-charge (m/z) signals representing one 'real' measured ion. The differences between the m/z values are typically in the sub-ppm range. Downstream analysis of mass spectrometry data requires extraction of ion traces, which relies on reproducibly picking the representative m/z signal from each signal set. This process, also called centroiding, is complicated by baseline noise, drift, and signal convolution, e.g., due to instrument-specific low mass resolution and mass accuracy. This project aims to develop a simple and fast centroiding algorithm, primarily for mass spectra. Currently, most open-source centroiding algorithms are parameter-heavy (e.g., wavelets) and/or do not yield robust and reproducible results. The project requires basic programming knowledge in Java.
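To make the centroiding idea concrete, here is a minimal sketch of one common baseline approach: each contiguous profile peak is collapsed into a single intensity-weighted mean m/z. It is an illustration only, not the algorithm the project is expected to produce. The function name `centroid_profile` and the `min_intensity` noise-floor parameter are assumptions, and the sketch ignores drift and overlapping (convolved) signals.

```python
from typing import List, Tuple


def centroid_profile(mz: List[float], intensity: List[float],
                     min_intensity: float = 0.0) -> List[Tuple[float, float]]:
    """Collapse each profile-mode peak into one (m/z, summed intensity) pair.

    Hypothetical helper: a peak is a run of points above `min_intensity`
    that rises to an apex and then falls to the next valley.
    """
    centroids = []
    n = len(mz)
    i = 0
    while i < n:
        # Skip points at or below the assumed noise floor.
        if intensity[i] <= min_intensity:
            i += 1
            continue
        start = i
        # Climb towards the apex.
        while (i + 1 < n and intensity[i + 1] > min_intensity
               and intensity[i + 1] >= intensity[i]):
            i += 1
        # Descend to the valley (or to the noise floor).
        while (i + 1 < n and intensity[i + 1] > min_intensity
               and intensity[i + 1] < intensity[i]):
            i += 1
        end = i
        window_mz = mz[start:end + 1]
        window_int = intensity[start:end + 1]
        total = sum(window_int)
        # Intensity-weighted mean m/z represents the whole peak.
        weighted_mz = sum(m * a for m, a in zip(window_mz, window_int)) / total
        centroids.append((weighted_mz, total))
        i = end + 1
    return centroids


peaks = centroid_profile(
    [100.000, 100.001, 100.002, 100.003, 100.004],
    [0, 10, 30, 10, 0],
)
```

The only parameter is the noise floor, which keeps the sketch far from the parameter-heavy methods mentioned above, but a robust solution would also need to estimate that floor adaptively.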

Sequence alignment for structure elucidation with NMR

Deterministic approaches for NMR structure elucidation are used as a method of last resort when data-driven approaches fail to predict a structure with reasonable spectral similarity to the queried spectrum. However, the deterministic methods are computationally demanding, as they require generating all possible molecular structures for a given chemical formula and predicting the NMR spectrum for each of the structures. This project aims to develop a new method that uses the spectral features to restrict the number of structural fragments that need to be assembled to generate an incumbent structure. This project will require some basic skills in (integer) linear programming and modelling. Support regarding structure elucidation and NMR analysis will be given.

Generation of inhibitor records for all EC enzyme classes

When a compound which acts as an enzyme inhibitor is added to the ChEBI database, the compound must be linked by a has_role relationship to the appropriate enzyme inhibitor term. Thus paracetamol, CHEBI:46195, is linked by a has_role relationship to cyclooxygenase 1 inhibitor, CHEBI:8249. If the appropriate enzyme inhibitor term is not present in the ChEBI database, it must be created in a separate (time-consuming) step. In fact, relatively few enzyme inhibitor terms are currently present in the ChEBI database. Since the action of any enzyme could in principle be inhibited by a compound which may be required in the ChEBI database, it would be beneficial to automatically create an "inhibitor" entity in ChEBI for every enzyme classified by the IUBMB, using the information stored in the IntEnz database to populate these entries with appropriate names, synonyms, definitions, etc. These records could reside quietly in ChEBI as 1-star entries (i.e. not visible on the public website) until required by a particular ChEBI compound in a has_role classification. The project would also encompass upgrading/deduplicating existing enzyme inhibitor records.
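The record generation described above could be sketched roughly as follows. This is an illustration under stated assumptions: the dictionary layout for enzyme entries and inhibitor records, the naming pattern, the definition wording and the function names are all hypothetical and do not reflect the actual IntEnz or ChEBI schemas. The "ADH" synonym is illustrative input data only.

```python
from typing import Dict, List, Set


def make_inhibitor_record(enzyme: Dict) -> Dict:
    """Build a hypothetical 1-star inhibitor record from one enzyme entry."""
    ec = enzyme["ec_number"]
    accepted = enzyme["accepted_name"]
    # Assumed naming pattern; the real curation convention may differ.
    name = f"EC {ec} ({accepted}) inhibitor"
    synonyms = [f"{s} inhibitor" for s in [accepted] + enzyme.get("synonyms", [])]
    definition = (f"An enzyme inhibitor that interferes with the action of "
                  f"{accepted} (EC {ec}).")
    return {"name": name, "synonyms": synonyms,
            "definition": definition, "stars": 1}


def generate_records(enzymes: List[Dict], existing_names: Set[str]) -> List[Dict]:
    """Create records only for enzymes that have no inhibitor term yet."""
    records = []
    seen = set(existing_names)
    for enzyme in enzymes:
        record = make_inhibitor_record(enzyme)
        if record["name"] in seen:
            continue  # avoid duplicating an existing inhibitor term
        seen.add(record["name"])
        records.append(record)
    return records


# Illustrative input, not real IntEnz output.
enzymes = [{"ec_number": "1.1.1.1",
            "accepted_name": "alcohol dehydrogenase",
            "synonyms": ["ADH"]}]
records = generate_records(enzymes, existing_names=set())
```

Checking new names against the existing ones is only the simplest form of deduplication; matching existing records that use older naming styles would need more work.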

Build knowledgebase for pKa values

Gather information on pKa values from a variety of databases and the literature. We want to map those onto compounds so that we have a gold standard for pKa values. It would involve some programming to collate information but also require curation to check the data.

Gather what's out there already, curate (Paula)

Retrieve & display structures of component parts for mixtures in ChEBI

Also known as "Display two structures in ChEBI at the same time"! For compounds classed as mixtures, CHEBI:60004, such as racemates and diastereoisomeric mixtures, no single structure can represent the entity. In such cases, the compound is entered in ChEBI without a structure, but is classified in the ontology as is_a racemate, CHEBI:60991, and is linked by has_part relationships to the two separate enantiomers which make up the racemate (and which each have a structure). The same applies to compounds classed as is_a diastereoisomeric mixture, CHEBI:60915. What is required is the ability to display the structures of all of the has_part children when the ChEBI record has no structure but is classed as is_a racemate or is_a diastereoisomeric mixture.
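The lookup logic described above might look like the following sketch. It works on a small in-memory stand-in for the ontology. The identifiers (`EX:...`), the triple layout, the placeholder structure strings and the function name `structures_to_display` are hypothetical, and the real ChEBI data model and web code will differ.

```python
from typing import Dict, List, Tuple

# Class names used to recognise mixtures; names are used here instead of IDs.
MIXTURE_CLASSES = {"racemate", "diastereoisomeric mixture"}


def structures_to_display(entity_id: str,
                          entities: Dict[str, Dict],
                          relations: List[Tuple[str, str, str]]) -> List[str]:
    """Return the entity's own structure, or its parts' structures for mixtures."""
    entity = entities[entity_id]
    if entity.get("structure"):
        return [entity["structure"]]
    # Names of the classes this entity is linked to via is_a.
    parents = {entities[target]["name"]
               for source, rel, target in relations
               if source == entity_id and rel == "is_a"}
    if not parents & MIXTURE_CLASSES:
        return []
    # Collect structures of every has_part child that has one.
    parts = [target for source, rel, target in relations
             if source == entity_id and rel == "has_part"]
    return [entities[p]["structure"] for p in parts if entities[p].get("structure")]


# Hypothetical toy data; structure strings stand in for MOL files.
entities = {
    "EX:1": {"name": "racemate", "structure": None},
    "EX:2": {"name": "rac-compound", "structure": None},
    "EX:3": {"name": "(R)-compound", "structure": "molfile-R"},
    "EX:4": {"name": "(S)-compound", "structure": "molfile-S"},
}
relations = [
    ("EX:2", "is_a", "EX:1"),
    ("EX:2", "has_part", "EX:3"),
    ("EX:2", "has_part", "EX:4"),
]
shown = structures_to_display("EX:2", entities, relations)
```

The sketch only checks direct is_a parents; an entity classified under a subclass of racemate would need a walk up the is_a hierarchy.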

Automatic methods for enrichment of the ChEBI Role Ontology Branch

The ChEBI Role Ontology is especially useful when trying to derive a biological meaning for a set of small molecules. However, the annotation of roles to existing ChEBI molecules is rather sparse: many entities that should have a has_role relationship lack one. It would be useful to generate automatic suggestions for the curators to add roles to old and new entries, increasing the number of roles assigned in ChEBI. One way to do this would be to integrate the content of KEGG Brite, the MetaCyc chemical ontology, ChemIDplus categories and MeSH annotations of PubChem compounds. Natural language processing techniques (along the lines of Adam Bernard's work) could also be used to mine this knowledge from the literature. The results could be fed into the OWL ontology, which could then be used to decide whether they make sense or contradict previous knowledge (this last step is speculative, but we think it is doable).

Glycan visualization (ME)

Glycans (collections of carbohydrate monosaccharides linked glycosidically) are notoriously difficult to portray clearly, as the differences between the structures of the monosaccharide residues may be subtle. A number of carbohydrate databases have developed new methods, or adopted existing ones, for displaying glycan structures more clearly, using either sets of coloured symbols or standard abbreviations; see for instance KEGG GLYCAN, GlycomeDB and JCGGDB. We need to be able to show something similar in ChEBI. The student will need a basic understanding of carbohydrate structure in order to research what systems for displaying glycan structures are already available, along with sufficient imagination and programming skills to develop the system most appropriate for ChEBI.

A one-stop search application for public databases

A significant part of the curation process involves searching public databases (KEGG Compound, KEGG Drug, KEGG Glycan, ChemID+, Reaxys, HMDB, PDBe, MetaCyc, LipidMaps, DrugBank, NIST, PubChem, ChemSpider etc.) to establish database links and obtain synonyms for the individual compound. As part of the design for the new curator tool, it would be good to have an application where you can input a structure or name and web-crawl through the public databases in a single step. KEGG does have a text search application where you can search for a compound name, and this covers all KEGG databases as well as DrugBank, LipidMaps and PubChem (and ChEBI!). For each of the other databases, it is necessary to open the individual database and paste in a structure. Some databases (KEGG, NIST, MetaCyc) do not have structure-based search facilities, so this application may have to be limited to text searches only, but this would still save time on the seemingly endless copying and pasting involved when searching for database links, synonyms etc.
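The single-step search could be structured as a thin aggregator over per-database search functions, as in the sketch below. Everything here is an assumption made for illustration: the function name `search_all`, the backend signature, and the stub backends, which return canned values and do not call any real database API.

```python
from typing import Callable, Dict, List


def search_all(query: str,
               backends: Dict[str, Callable[[str], List[str]]]) -> Dict[str, List[str]]:
    """Run one text query against every backend and collect hits per database."""
    hits: Dict[str, List[str]] = {}
    for name, search in backends.items():
        try:
            hits[name] = search(query)
        except Exception:
            # One unavailable database should not abort the whole search.
            hits[name] = []
    return hits


def stub_a(query: str) -> List[str]:
    """Hypothetical backend returning a canned hit."""
    return [f"stub_a hit for {query}"]


def stub_down(query: str) -> List[str]:
    """Hypothetical backend simulating a service outage."""
    raise RuntimeError("service unavailable")


results = search_all("paracetamol", {"stub_a": stub_a, "stub_down": stub_down})
```

Keeping each database behind the same small function signature means text-only sources and structure-capable sources can be added one at a time without changing the aggregator.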

Generation of structures from a class of compounds

Rhea contains classes of reactions, in which the structures of reaction participants contain R groups. Curators need a facility to create concrete instances of these reactions, with a precise stoichiometry, by specifying the R groups involved (for example, deriving ethanol, propanol etc. from primary alcohol), which would then be applied to the appropriate reaction participants. This would facilitate the creation of new reactions in the database. This project would require an understanding of the MDL MOL format and Java programming skills. Knowledge of CDK would be a plus.
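As a toy illustration of instantiating a generic reaction, the sketch below substitutes one R group into every participant. It deliberately uses SMILES strings with `*` as the R-group placeholder instead of the MDL MOL format named above, and a naive string replacement instead of CDK, so it only works for simple acyclic substituents written without ring-closure digits. The function name and data layout are hypothetical.

```python
from typing import Dict


def instantiate(generic: Dict[str, str], r_group: str) -> Dict[str, str]:
    """Replace the '*' placeholder in every participant with the same R group.

    Assumes each generic SMILES starts with '*' and that `r_group` is a plain
    acyclic fragment attached through its last atom (e.g. 'C' for methyl).
    """
    return {role: smiles.replace("*", r_group) for role, smiles in generic.items()}


# Generic participants: primary alcohol R-CH2-OH and the matching aldehyde R-CHO.
generic_participants = {"alcohol": "*CO", "aldehyde": "*C=O"}

with_methyl = instantiate(generic_participants, "C")   # ethanol / acetaldehyde
with_ethyl = instantiate(generic_participants, "CC")   # propan-1-ol / propanal
```

Applying the same substitution to all participants is what keeps the concrete reaction balanced; a real implementation would operate on atom and bond tables so that branched or cyclic R groups are handled correctly.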

Enrichment of open natural products data

Natural products (NPs) are small molecules synthesised by living organisms. A good knowledge base of natural products is critical for filtering out less probable structures in Computer Assisted Structure Elucidation (CASE) and in virtual screening of combinatorial libraries. Currently around 25000 open-data natural products are known to us. Text-mining of the scientific literature has the potential to increase the open knowledge base of natural products, and we are looking for a candidate to work towards that end. The candidate should come up with heuristics to comprehensively mine NP-like names from the literature, develop automated workflows to extract names, resolve synonyms, analyse the metadata of compounds in large databases (PubChem) for clues that a compound is a natural product, and eventually extract the structures. The resulting high-quality structural data will then be integrated with our existing ChEBI resource. The candidate should be familiar with the concepts of text-mining and the basics of chemistry and statistics. Programming knowledge in Java is a plus. The ideal candidate will be required to work for 3-4 months.

Automatic Structure Diagram Generation

In cheminformatics it is often necessary to create 2D coordinates for compounds which either have wrong coordinates or none at all. This task is difficult because, on the one hand, the algorithm needs to lay out arbitrary structures, while on the other hand there are conventions on how certain cases should be handled. In this project, the student would work on the existing code for structure diagram generation (SDG) in the Chemistry Development Kit (CDK) and improve it with respect to, e.g., handling of stereochemistry, deterministic layout, IUPAC-conforming alignment of the final structure (major ring system/largest chain placed horizontally) and globally optimized layout. This requires good general programming skills, ideally in an object-oriented style, and a basic command of the Java language. The project would include assessment of the literature with respect to standards, specification of the requirements and implementation of a solution.

Developing a web application for a text-mining-based data capture system

This project will allow you to develop your web application framework skills while working with our bio- and cheminformaticians and data curators to develop a web-based system for graphically approving and editing text-mined information, turning it into data sets ready to be published in our databases ChEBI, MetaboLights and Rhea. The actual information mining will be done by our collaborators at the University of Cambridge. You should have some experience with Java programming and web application development.



  ChEMBL Team John Overington


Placement duration Functional Genomics Team
Contact John Overington

Deadline for applications

Open until further notice

Special requirements

Good computer knowledge in a Linux-type environment

Group page


Current projects include:

1) ChEMBL is a database containing bioactivity data on drug-like molecules. We are interested in analysing the distribution of aromatic rings and molecular frameworks within the database and in investigating whether the occurrence of these molecular fragments is related to activity at specific protein targets. A good knowledge of chemistry is required, and knowledge of statistical methods and pipelining tools would be useful.

2) We are interested in characterising the target space of the molecular targets in the ChEMBL database by using knowledge of the properties of compounds that bind to specific targets. A knowledge of chemistry and biology is needed. Experience in manipulating large datasets and in statistical analysis would also be useful.

3) The ChEMBL database is a rich source of pharmacological data. We are interested in mining these data to gain insights into the history of the development of drugs and pharmacological techniques. Of key interest is the identification of bioassays of greatest therapeutic relevance. Such a project would require a 4-month internship. Scripting skills (Perl, Python, etc.) would be required. Some knowledge of SQL and an understanding of the fundamentals of pharmacology would be an advantage.

4) Enhancement of ChEMBLdb web interface:
ChEMBLdb is an online database of information on the properties and activities of drugs and drug-like small molecules and their targets. We are interested in extending the capabilities of the web interface to our large SAR (Structure Activity Relationship) database. A good knowledge of web and database programming is required; experience in the development and use of novel visualisation methods would be greatly advantageous.

5) Protein structure analysis of drug target domains: 
We are interested in extending the annotation of our drug targets to include Pfam and structural domain coverage. Experience of sequence searching strategies and structural bioinformatics is required.



Proteomics Services Team Henning Hermjakob 


Placement duration Negotiable, ideally 6 months
Contact Henning Hermjakob

Deadline for applications

Open until further notice

Special requirements

All applicants should have solid knowledge of Java and relational databases. It is advantageous to have domain-specific knowledge in molecular biology. Prior knowledge of proteomics is not essential.

Group page


Computational projects will usually implement new features for existing database systems, in particular the IntAct, PRIDE, Reactome, and BioModels databases. The available projects range across a broad spectrum, from data analysis, evaluation, and statistics, to web interfaces and data visualisation. Projects will always be based on our open source, production quality database applications, and will contribute to the publicly accessible systems.

The following 2011/2012 publications all result from a traineeship or visit in the Proteomics Services Team and have the trainee/visitor as first author:

  1. Salazar GA, et al. MyDas, an Extensible Java DAS Server. PLoS One. 2012;7(9):e44180. doi: 10.1371/journal.pone.0044180. Epub 2012 Sep 13.
  2. Wein SP, et al. Improvements in the Protein Identifier Cross-Reference service. Nucleic Acids Res. 2012 Jul;40(Web Server issue):W276-80.
  3. Koh GC, et al. Analyzing protein-protein interaction networks. J Proteome Res. 2012 Apr 6;11(4):2014-31.
  4. Ndegwa N, et al. Critical amino acid residues in proteins: a BioMart integration of Reactome protein annotations with PRIDE mass spectrometry data and COSMIC somatic mutations. Database (Oxford). 2011 Oct 23;2011:bar047.
  5. Villaveces JM, et al. Dasty3, a WEB framework for DAS. Bioinformatics. 2011 Sep 15;27(18):2616-7. Epub 2011 Jul 28.
  6. Griss J, et al. Published and perished? The influence of the searched protein database on the long-term storage of proteomics data. Mol Cell Proteomics. 2011 Sep;10(9):M111.008490.
  7. Salazar GA, et al. DAS writeback: a collaborative annotation system. BMC Bioinformatics. 2011 May 10;12:143.
  8. Gel Moreno B, et al. easyDAS: automatic creation of DAS servers. BMC Bioinformatics. 2011 Jan 18;12:23.


 Saez-Rodriguez Research Group Julio Saez-Rodriguez  


Placement duration Typically at least 4 months, shorter placements in exceptional cases
Contact Please check our projects, and explain in your application why you are interested in one or more of our research interests. Non-specific applications without this expression of interest will not be considered.

Deadline for applications

Open until further notice

Special requirements

At least some programming experience is expected, preferably in R, Python, or Java. Biology knowledge is not always required, but is advantageous.

Group page

We are broadly interested in how the dynamics of signal transduction, mediated for example by protein post-translational modifications, ultimately influence cell fate decisions. We build predictive mathematical models using high-throughput experimental data collected after applying many different perturbations to the pathways of interest to get at the underlying network structure. Specifically, research in our group aims to combine statistical methods with models describing the mechanisms of signal transduction either as logical or physico-chemical systems. We then use these models to better understand how signalling is altered in human disease and predict effective therapeutic targets.

Projects connected to our ongoing work are frequently available. They can range from methods development to specific applications, but in most cases entail a bit of both.



UniProt Group – Rolf Apweiler 


Placement duration 2-8 months
Who should I contact about this internship? Maria Martin

What is the deadline for applying for this internship?

Open until further notice

Special requirements

Good computational background, including the ability to program in Perl, Java or another common bioinformatics language. Biology knowledge is not required but is advantageous.

Group page