Joint EBI-Industry Workshop: Cheminformatics in R
Day 1
The R programming environment has emerged as a powerful platform for a variety of bioinformatics and chemometric analyses. However, for chemometric and cheminformatics problems, the environment does not natively support manipulation of molecular representations. This session will describe how the integration of the CDK with R provides the ability to load, manipulate and analyse chemical structures and associated data seamlessly within the R environment. The session will start with a brief R tutorial and then explore the capabilities of the R-CDK package using examples from QSAR modelling and similarity searching. We will also learn about the R-Pubchem package, which allows one to access PubChem structure and bioassay data directly from within R. The session will end with a discussion of how the packages can be extended by writing R or Java code.
Dr Rajarshi Guha is a research scientist at the NIH Chemical Genomics Center and has been using R for QSAR modelling and chemical data mining for the last 7 years.
Day 2
In metabolomics research, many experiments comprise hundreds to thousands of samples. This amount of data requires automated processing. Several packages in the Bioconductor project can analyse mass spectrometry data, combining powerful statistics and visualisation. The session will focus on the processing of LC/MS profiling data and the identification of compounds with the Metlin spectral library.
Dr Steffen Neumann is the head of the Bioinformatics and Mass spectrometry group at the Institute for Plant Biochemistry in Halle, where several tools and databases for MS profiling and identification are developed, including several Bioconductor packages.
Speakers
Rajarshi Guha, NIH Chemical Genomics Center (R-CDK and R-Pubchem)
Steffen Neumann, AG Massenspektrometrie & Bioinformatik (XCMS, Rdisop, CAMERA)
H. Paul Benton, Imperial College London
David Broadhurst, Cork University Maternity Hospital
| Time | PRELIMINARY AGENDA |
|---|---|
| 17 May 2010 | |
| EMBL-EBI IT Training Room, Hinxton, UK | |
| 09:00 | Welcome and Introductions |
| 09:15 | Brief Introduction to R (R. Guha) A very brief overview of R from a programming and application point of view. Will look at some basic programming constructs and then survey packages that are useful for cheminformatics problems. This session will also briefly touch on RDBMS access from R. |
| 10:45 | Brief Introduction to the CDK (R. Guha) An overview of CDK functionality. This will be relatively high level and will not go into the nitty gritty details of Java programming. Will serve to highlight what can (and cannot) be done from R. |
| 11:15 | Tea/Coffee |
| 11:30 | Input/Output & Molecular Manipulations (R. Guha) Reading and writing chemical structures from various sources and in various formats. What does I/O entail in the CDK programming model? How does it affect working in R? Once we have a set of molecules, what can we do with them? We’ll cover accessing atoms and bonds, setting and getting properties on molecules and so on. |
| 12:30 | Lunch |
| 13:30 | Fingerprints & Similarity Searching (R. Guha) I’ll discuss accessing the various fingerprint methods of the CDK and manipulating fingerprints using the fingerprint R package. I’ll also address reading fingerprint data from files generated by other programs. |
| 14:00 | Descriptors and QSAR Modeling (R. Guha) QSAR modeling is a common cheminformatics task. Key to developing QSAR models is the evaluation of molecular descriptors. In this session, I’ll cover the available types of descriptors and how one evaluates them. We’ll then run through examples of developing QSAR models, starting from molecule loading and ending at a final model. |
| 15:30 | Tea/Coffee |
| 15:45 | R, CDK and Chemical Databases (R. Guha) Getting chemical structure and bioassay information from PubChem using R. This section will overview the functions that let one retrieve structures, assay information and assay data. I will also highlight current limitations in terms of data size and ways around these limits. |
| 16:45 | Adding New Functionality (R. Guha) This session will highlight how one might go about extending the package – either by wrapping calls to the CDK in R or by adding your own Java methods and then calling them. |
| 17:15 | Additional Practical and Questions (R. Guha) |
| 18:15 | Close of first day |
| 18:30 | Transport to workshop dinner |
| 19:00 | Workshop Dinner - The Cricketers, Clavering |
| 22:00 | Transport from The Cricketers, Clavering to Hinxton and Cambridge Train Station |
| 18 May 2010 | |
| EMBL-EBI IT Training Room, Hinxton, UK | |
| 09:00 | Welcome and Introductions |
| 09:15 | A brief fly-through of a typical Analysis Session (P. Benton) The first talk of the workshop will be a fly-through of the plain-vanilla analysis of LCMS data. |
| 09:45 | Picking with waves (S. Neumann) For high-resolution, centroided data we will give details of the centWave peak picker. This includes details on how to debug unexpected results, which might be either bugs in the algorithm, or (more likely) unfulfilled assumptions about the data. |
| 10:15 | Let's do the time-warp again (S. Neumann) The Obiwarp alignment is a member of the "dynamic time warping" family of algorithms. It was proposed a while ago, and we recently imported the library into XCMS. We'll present a few more details on both the established LOESS alignment and Obiwarp. |
| 10:45 | Tea/Coffee |
| 11:00 | Applying XCMS to large LCMS datasets (David Broadhurst) |
| 11:45 | Parameters and Best practices (P. Benton & S. Neumann) We'll give a few tried-and-tested parameter settings, and ask for yours (!). Also, there will be assorted Tips & Tricks on various aspects. |
| 12:15 | Lunch |
| 13:15 | Identify your buddies (S. Neumann) Identification of metabolites is one of the most difficult challenges in metabolomics. Following up on RPubChem, we'll demonstrate how to obtain sum formulae with Rdisop. |
| 13:45 | Take your CAMERA out (S. Neumann) The world beyond XCMS. CAMERA is one of the newest additions to the Bioconductor Mass Spectrometry ecosystem. We will demonstrate some of the functionality and give details of the algorithms behind it. |
| 14:15 | Taking XCMS to higher levels (P. Benton) XCMS is learning to analyse MSn data. This will focus on using fragmentation trees for MSn data and metabolite identification for high mass accuracy MS2 data. |
| 14:45 | Tea/Coffee |
| 15:15 | Hacking XCMS (P. Benton & S. Neumann) Want to do more debugging? Stuff we never envisioned one could do with XCMS? Then you need some insight into how data is represented internally. We'll also show how to install the latest-and-greatest of XCMS, and give a sneak preview of stuff to come. |
| 15:45 | Additional Practical and Questions (P. Benton & S. Neumann) |
| 17:45 | Closing remarks |
There is a charge of £25 per day (£50 for both days) for this course to cover the cost of refreshments and course materials.
Registration Closed.
