MSDchem,
PDB Ligand Chemistry
Introduction
The "Ligand Chemistry" service provides web access to the "ligands and small molecule dictionary" of the MSD database developed by the MSD group at EBI. This dictionary is part of the core "reference" information of the MSD relational database and is consistently referenced by all macromolecular structures for all bound molecules as well as standard and modified amino acids. Since every residue and every atom in the MSD database references a ligand and an atom in this dictionary, this is the repository that defines the link between proteins and chemistry.
Ligand in MSDchem
The term ligand refers to the distinct chemical entity of a stereoisomer of a small molecule or monomer.
This means that structural isomers, geometric isomers, and enantiomers (but not conformational isomers) are distinct ligands in MSDchem. The properties that define the chemical identity of a ligand are:
- atoms (including hydrogens) and atom elements,
- bonds and bond orders as well as
- atom and bond stereo descriptors
These allow the exact identification of stereoisomers and are maintained explicitly in MSDchem.
Atom coordinates and nomenclature (names) are not fundamental properties of the molecule, except to the extent that they correspond to the specific stereoisomer. Different sets of coordinates may therefore belong to the same ligand and be perfectly valid, provided they agree with the molecular structure and its stereo configuration. This is modeled in the dictionary by having different sets of coordinates for the same ligand for different "libraries" (different sources). The set of coordinates used by default is the set of idealised coordinates provided by CORINA.
Different stereoisomers are therefore defined as different ligands with different 3-letter codes in MSDchem, while it is not possible to have two different 3-letter codes when the chemical entity is the same and only the coordinates or atom names differ. Legacy cases with this type of problem are explicitly marked as obsolete and superseded.
To summarize: a ligand in MSDchem is a distinct chemical graph with atoms of a particular element as nodes and bonds of a particular order as edges. Some of the atoms and bonds have an additional stereo-descriptor property (R or S for atoms and E or Z for bonds).
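The graph view summarized above can be sketched in a few lines of Python. This is only an illustration, not the MSD schema: the class names, the `graph_key` helper and the heavy-atom-only alanine skeleton are assumptions made for the example (the dictionary itself also stores hydrogens), and the comparison assumes both ligands list their atoms in the same order rather than performing a real graph isomorphism test.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Atom:
    name: str                     # label only, not part of the chemical identity
    element: str
    stereo: Optional[str] = None  # "R", "S" or None


@dataclass(frozen=True)
class Bond:
    atom1: int                    # index into Ligand.atoms
    atom2: int
    order: int
    stereo: Optional[str] = None  # "E", "Z" or None


@dataclass
class Ligand:
    code: str
    atoms: List[Atom]
    bonds: List[Bond]


def graph_key(ligand: Ligand, with_stereo: bool) -> Tuple:
    """Comparable key for the graph (hypothetical helper).

    Assumes both ligands list their atoms in the same order.
    """
    atoms = tuple(
        (a.element, (a.stereo or "") if with_stereo else "")
        for a in ligand.atoms
    )
    bonds = tuple(sorted(
        (min(b.atom1, b.atom2), max(b.atom1, b.atom2), b.order,
         (b.stereo or "") if with_stereo else "")
        for b in ligand.bonds
    ))
    return atoms, bonds


def alanine(code: str, ca_stereo: str) -> Ligand:
    """Heavy-atom skeleton of alanine; hydrogens omitted for brevity."""
    atoms = [
        Atom("N", "N"), Atom("CA", "C", ca_stereo), Atom("C", "C"),
        Atom("O", "O"), Atom("CB", "C"), Atom("OXT", "O"),
    ]
    bonds = [Bond(0, 1, 1), Bond(1, 2, 1), Bond(2, 3, 2),
             Bond(1, 4, 1), Bond(2, 5, 1)]
    return Ligand(code, atoms, bonds)


ala = alanine("ALA", "S")  # L-alanine
dal = alanine("DAL", "R")  # D-alanine

print(graph_key(ala, with_stereo=False) == graph_key(dal, with_stereo=False))
print(graph_key(ala, with_stereo=True) == graph_key(dal, with_stereo=True))
```

The two enantiomers share the same graph when stereo descriptors are ignored and differ once they are included, which is why they are separate ligands with separate codes.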
Ligands in wwPDB
The ligand dictionary is not an isolated effort of the MSD group. The fundamental parts of the dictionary are exchanged on a weekly basis with collaborators of the international wwPDB (RCSB - PDBj) in the form of mmCIF chem_comp_group files and are in sync with the PDB archive. During this process new ligands are manually and semi-automatically processed by the wwPDB members before they become official 3-letter code identifiers of the PDB. In addition to the common fundamental parts, the MSD database also includes derived information such as stereo-smiles, gifs, idealised coordinates, energy types etc. Furthermore, the ligand dictionary is an integrated, consistent and enriched library with clear and enforced relationships with the rest of the MSD database.
How to search with MSDchem
There is a wide range of possibilities for searching and exploring the dictionary.
- Short code: This is the PDB 3-letter code for the ligand (e.g. ATP).
You may also select the "like" operator with a wildcard expression ('*' matches any sequence of characters and '.' matches any single character).
For example *TP will match most triphosphate ligands.
- Code: This is the standard extended code of the ligand as defined by EBI. It is identical to the short code, but it can be used to distinguish between topological variants of amino acids and amino acids in various protonation states. For example, using the 'like' operator with the *_LSN3 wildcard for this item will return the N-terminus amino acid variants, while HIS* will return the variants for the different protonation states of histidine.
- Molecule name: An expression or word that is part of any known name of the molecule (standard name, common name, systematic name). The special character '*' matches any sequence of characters and '.' matches any single character.
Examples:
- 'amino',
- 'galactose',
- '*ALPHA*galactose*'.
You may also use the '=' operator for an exact match.
- Formula: An expression that sets range constraints for the number of atoms of each element. The value that you have to provide is of the form [<E><n>-<m> ]* where <E> is an element, <n> is the minimum and <m> is the maximum number of times that the element must appear in the formula. The order in which the elements are given is not important. For example, if you want to find ligands that have between 10 and 15 carbons, 3 nitrogens and one oxygen, you should give 'C10-15 N3 O1'. Other examples:
- - 'CL3 N0' finds molecules with exactly 3 chlorines and no nitrogens,
- - 'C40-50 N5-10 O5-15 S1-10' finds molecules with 40-50 carbons, 5-10 nitrogens, 5-15 oxygens and 1-10 sulfurs.
By clicking on the button next to the item, you may use the formula range editor to build your formula expression interactively.
You may also use the '=' operator for an exact formula match.
- Non stereo smile: For structure based searches. By clicking on the edit button, a form appears that will allow you to specify a molecule or a molecule segment by using one of the three options:
- - Draw the molecule using the JME Molecular editor
- - Upload a standard chemical file such as Mol2, SDF, PDB etc. in the JME editor. You may specify any file types and formats accepted by the CACTVS system
- - Give the standard code (e.g. ATP) of a ligand that already exists in the database in order to load it in the JME editor.
After you load a ligand you may also modify it. For example, if you are looking for ligands similar to ATP you may load ATP in the JME editor and then remove some atoms and bonds, keeping just the substructure you are interested in.
As soon as a molecule or molecule segment is specified then you may use it to search the dictionary using one of the following operators:
- - contains: Find all the ligands whose graph contains the specified molecule as a subgraph. This uses the MSD subgraph algorithm with a prefiltering step. Please be patient, since this operation may take a few (2-4) minutes in the worst case.
- - is contained: Find all the ligands that are contained as subgraphs in the graph of the specified molecule. This uses the MSD subgraph algorithm with a prefiltering step. Please be patient, since this operation may take a few (2-4) minutes in the worst case.
- - exact match: Find any ligands with exactly the same graph as the specified molecule. This uses the CACTVS hashcode and is instantaneous.
All these search operations ignore stereochemistry. This means that a molecule will also match its stereoisomers.
Additionally, aromatic bonds are treated as single-double. This means that in the case of aromatic rings etc. there may also be some false positives.
- Stereo smile: Similar to the exact match described above, but in this case stereochemistry is not ignored, so stereoisomers will not match.
- Fragments: Similar to the formula search, but the search items are chemical fragments rather than chemical elements. An expression sets range constraints for the number of occurrences of each fragment. The value that you have to provide is of the form [<E><n>-<m> ]* where <E> is a fragment, <n> is the minimum and <m> is the maximum number of times that the fragment must be contained in the molecule. The order in which the fragments are given is not important. For example, if you want to find ligands that have one or two adenine groups and a furan ring, you should give 'adenine:1-2 furan:1'.
The library of chemical fragments is predefined and includes about 84 fragments, and the fragment expression search is quite fast.
By clicking on the button next to the item, you may use the fragment pattern editor, which is in practice the easiest way to build a fragment expression.
- Fingerprint: This is a fuzzy similarity search operation. The user uses a form like the one described above to input a molecule, and the result is the set of ligands that contain almost the same segments as that molecule (at least 99% of their segments are common). There are 500 segments in the predefined library used by the CACTVS system. This search is tricky and gives useful results mainly for large molecules. For example, by giving ATP as input you will get back molecules with similar chemical areas as ATP (like phosphate complexes and adenine segments).
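The wildcard and formula range syntaxes described in the list above can be illustrated with a short Python sketch. It is not the MSDchem implementation: the function names are hypothetical, the wildcard match is assumed to apply to the whole value and to be case-insensitive, and a term with a single number (such as 'N3') is assumed to mean an exact count, as in the 'CL3 N0' example.

```python
import re

# One formula term: element symbol, minimum, optional "-maximum".
TERM = re.compile(r"^([A-Za-z]{1,2})(\d+)(?:-(\d+))?$")


def wildcard_to_regex(pattern):
    """Translate the search wildcards: '*' is any run of characters,
    '.' is exactly one character, everything else is literal."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == ".":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    # Case-insensitive matching is an assumption of this sketch.
    return re.compile("".join(parts), re.IGNORECASE)


def parse_formula_expr(expr):
    """Parse e.g. 'C10-15 N3 O1' into {'C': (10, 15), 'N': (3, 3), 'O': (1, 1)}."""
    constraints = {}
    for term in expr.split():
        m = TERM.match(term)
        if m is None:
            raise ValueError("cannot parse formula term: %r" % term)
        element = m.group(1).capitalize()   # 'CL' -> 'Cl'
        low = int(m.group(2))
        high = int(m.group(3)) if m.group(3) else low
        constraints[element] = (low, high)
    return constraints


def formula_matches(counts, constraints):
    """True if every constrained element count lies within its range.
    Elements absent from `counts` are treated as zero."""
    return all(low <= counts.get(el, 0) <= high
               for el, (low, high) in constraints.items())


# Element counts for ATP (C10 H16 N5 O13 P3).
atp = {"C": 10, "H": 16, "N": 5, "O": 13, "P": 3}

print(bool(wildcard_to_regex("*TP").fullmatch("ATP")))
print(formula_matches(atp, parse_formula_expr("C10-15 N5 P3")))
print(formula_matches(atp, parse_formula_expr("C10-15 N3 O1")))
```

Elements that are not mentioned in the expression are left unconstrained here, which matches the reading that the expression only sets ranges for the elements it names.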
Functionality of MSDchem
The MSDchem service offers a generic browsing interface to all areas of the ligand dictionary. The user may follow links that are available from every record in order to navigate through the relationships of the dictionary. For example, the user may follow a relationship link to view the atoms of a ligand and then, for a particular atom, its bonds and energy types and so on.
The "contents" link provides on a single page all the primary information for a ligand (atoms and bonds), while the "complete" link provides a single page with all the information available (including coordinates and energy types).
The user may also export the data available on any page in various data formats like XML, mmCif, etc.
There is additional functionality provided for ligands. From a ligand page you may also:
- 3-D View: Choose the set of coordinates you want to use (e.g. idealised or PDB) as well as the viewer you prefer (e.g. JMol, a variant of RasMol, or another plugin for PDB files) and click the view button. In order to use a plugin like RasMol you need to install it on your workstation and configure your browser to activate it for the "chemical/x-pdb" MIME type with the .pdb extension.
- File Export: Choose the set of coordinates you want to use (e.g. idealised, PDB, or PDB with CACTVS hydrogens), the export format (PDB, SDF, mmCIF or XYZ), and the output target (HTML in your browser, or save as a unix file (no linefeed), or as a windows file (with linefeed)) and press the save button.
- PDB entries: Follow links to the atlas pages of the entries that include this ligand
- Site interactions: Follow links to the particular instances of the ligands in the entries and their binding sites through MSDSite
- Binding statistics: View binding interaction statistics for the ligand from MSDSite
- Direct code reference: If you want to include a reference to a ligand (e.g. ATP) in your web pages using its 3-letter code directly, you may use the following URL parameters: FUNCTION=getByCode&CODE=ATP. You may also specify multiple 3-letter codes, separated with the | character: FUNCTION=getByCode&CODE=ATP|GTP|ACP|AVP
You may also specify a CONTENTS parameter with values "contents" or "complete" and alternative export formats like XML, e.g. FUNCTION=getByCode&CODE=ATP&CONTENTS=complete&FORMAT=XML
- Direct entry code reference: If you want to include a reference to the ligands that occur in particular PDB entries (e.g. 1tob) in your web pages using the entry code directly, you may use the following URL parameters: FUNCTION=getByEntry&CODE=1tob. You may also specify multiple entry codes, separated with the | character: FUNCTION=getByEntry&CODE=1tob|2tob. Take into account that there are still unresolved issues with occurrences of ligands in PDB entries, mainly in cases where there are conflicts in the chemistry or the nomenclature in old PDB entries or when new entries are not loaded yet. In such cases there is no direct link to the referenced ligand but just a reference to its 3-letter code
- Export the complete dictionary: Select the output format on the starting page of MSDchem and simply press the "Export" button. You will get a summary file with the most interesting information for each ligand (like code, smiles etc.)
- Useful URLs for direct searching: If you want to include a reference to ligands using the search functionality of MSDchem, such as ligands that match a particular name pattern or contain a particular subgraph, these are some useful example URLs:
- Using one of the direct search functions getByName and getByFormula that can be called like
FUNCTION=getByName&NAME=warfarin|succinamide|tricyclo
for ligands that contain warfarin or succinamide or tricyclo
-
FUNCTION=getByFormula&FORMULA=Cl3-10|Br3-10
for ligands whose formula has 3 to 10 chlorines or 3 to 10 bromines
- Any of the search operators can be used by passing the operator and the values as parameters in the URL like
FUNCTION=list&CHEM_COMP_NAME=S-WARFARIN&CHEM_COMP_NAME_OPERATOR=0
or
FUNCTION=list&CHEM_COMP_NAME=S-WARFARIN&CHEM_COMP_NAME_OPERATOR=&eq;
for chem_comps whose name is exactly equal to S-WARFARIN.
-
FUNCTION=list&CHEM_COMP_NONSTEREO_SMILES=Brc1ccccc1
for chem_comps that contain the subgraph with smile Brc1ccccc1
FUNCTION=list&CHEM_COMP_NONSTEREO_SMILES=ATP
for chem_comps that contain ATP as a subgraph
-
FUNCTION=list&CHEM_COMP_FINGERPRINT=ATP
FUNCTION=list&CHEM_COMP_FINGERPRINT=ATP&CHEM_COMP_FINGERPRINT_OPERATOR=0
FUNCTION=list&CHEM_COMP_FINGERPRINT=ATP&CHEM_COMP_FINGERPRINT_OPERATOR=common segments
for chem_comps with common segments with ATP
FUNCTION=list&CHEM_COMP_FINGERPRINT=ATP&CHEM_COMP_FINGERPRINT_OPERATOR=1
FUNCTION=list&CHEM_COMP_FINGERPRINT=ATP&CHEM_COMP_FINGERPRINT_OPERATOR=contains segments
for chem_comps with at least all the segments of ATP
- Adding the option &FORMAT=XML or &FORMAT=Perl will return the data as XML or as a Perl data structure, for example
FUNCTION=list&CHEM_COMP_NONSTEREO_SMILES=Brc1ccccc1&FORMAT=XML
FUNCTION=list&CHEM_COMP_NONSTEREO_SMILES=Brc1ccccc1&FORMAT=Perl
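The parameter strings listed above can also be assembled programmatically, which avoids escaping mistakes when a SMILES string or name contains special characters. The following Python sketch only builds query strings from the parameter names shown in the examples. The base address `BASE_URL` is a placeholder, since the service address is not given here, and the helper names are hypothetical. Keeping '|' unescaped is a choice made to reproduce the examples literally.

```python
from urllib.parse import urlencode

# Placeholder address: substitute the real MSDchem service URL.
BASE_URL = "https://example.org/msdchem"


def by_code(codes, contents=None, fmt=None):
    """Query string for FUNCTION=getByCode with one or more 3-letter codes."""
    params = {"FUNCTION": "getByCode", "CODE": "|".join(codes)}
    if contents is not None:
        params["CONTENTS"] = contents        # "contents" or "complete"
    if fmt is not None:
        params["FORMAT"] = fmt               # e.g. "XML" or "Perl"
    # safe="|" keeps the separator readable, as in the examples above.
    return BASE_URL + "?" + urlencode(params, safe="|")


def list_query(field, value, operator=None, fmt=None):
    """Query string for FUNCTION=list on one search field,
    e.g. CHEM_COMP_NAME or CHEM_COMP_NONSTEREO_SMILES."""
    params = {"FUNCTION": "list", field: value}
    if operator is not None:
        params[field + "_OPERATOR"] = operator
    if fmt is not None:
        params["FORMAT"] = fmt
    return BASE_URL + "?" + urlencode(params, safe="|")


print(by_code(["ATP", "GTP"], contents="complete", fmt="XML"))
print(list_query("CHEM_COMP_NONSTEREO_SMILES", "Brc1ccccc1", fmt="XML"))
print(list_query("CHEM_COMP_FINGERPRINT", "ATP", operator=1))
```

Characters such as '=' or '#' in a SMILES string are percent-encoded by `urlencode`, so the value survives as a single parameter.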
Usage Examples
MSDchem back-end
The database that is accessible through the service is the MSDSD search database, which is based on a transformation of the MSD deposition database. The MSD deposition database is used internally by the MSD group during the processing of new PDB entries. The MSD search database is derived from it and kept synchronised on a weekly basis; it keeps only non-confidential information that has been released and is publicly accessible.
Additionally the dictionary contains classification of the atoms of the ligands in energy types, and associates them with the energy types reference dictionary for different set of libraries (different classification sets).
An overview of the database schema may provide an easy way to become familiar with the information contained in the ligand database while for more detailed information there is also a reference dictionary.
The MSDchem interface is based on a generic underlying mechanism that allows the user to interact with database "entities" and the forms are generated based on 5 basic templates that apply to each entity.
Please contact the MSD group with suggestions, comments or problem reports. Your input is very helpful.
Derived information
Several external programs are also used for the ligand dictionary in order to provide derived information like
- - Gif Images
- - Smiles - Stereo smiles
- - Hash - Stereo Hash
- - Atom stereochemistry (R/S)
- - Bond stereochemistry (E/Z)
- - Atom chiral neighbours (chirality bond sequence for chiral atoms)
- - Atom ring flag
- - Bond aromatic flag
- - Rings and ring atoms
- - Planes and plane atoms
- - Fingerprints
- - Missing hydrogen coordinates (from CACTVS)
- - Idealised 3D Coordinates
- - Atom energy types
- - Molecule systematic names
- - IUPAC InChI strings
The derivation of this information is performed in the MSD group by following these steps:
The MSD curators will ensure that the basic structure of every ligand is correct. This means that the atoms, their names and elements as well as the bonds between atoms and the bond orders are defined and checked manually during deposition processing.
- CACTVS is then used to get
- - non-stereo gif image
- - non-stereo smiles
- - non-stereo hash
- - fingerprints
- - check if there are potential stereo centers
- MSD in-house software is also used to derive
- - rings, ring atoms and flags
- If there are no potential stereo centers, the non-stereo smiles is given to CORINA to generate idealised 3D coordinates
- The MSD curators will find a good representative set of coordinates from the PDB for this ligand (if one exists) and this is loaded as an initial set of coordinates
These coordinates are then used by CACTVS to get
- - stereo gif image
- - stereo smiles
- - stereo hash
- - atom stereochemistry (R/S)
- - bond stereochemistry (E/Z)
- - atom chiral neighbours (chirality substituent priorities for chiral atoms)
- - hydrogen missing coordinates
- MSD in-house software is also used to derive
- The stereo-smile is given to CORINA to generate idealised 3D coordinates, which in turn are exported into MDL/SDF format and given to the InChI software in order to derive the InChI strings
- Using the complete information of the particular stereo-isomer, curators use ACD-Labs software to get the systematic name for the molecule
- Finally, an in-house implementation of the VEGA ideas is used to get the atom energy types
Citing MSDchem
- UNIT 14.3: Using MSDchem to Search the PDB Ligand Dictionary
Dimitropoulos, D., Ionides, J. and Henrick K. (2006) In Current Protocols in Bioinformatics (A.D. Baxevanis, R.D.M. Page, G.A. Petsko, L.D. Stein, and G.D. Stormo, eds.) pp 14.3.1-14.3.3 John Wiley & Sons, Hoboken, N. J. ISBN: 978-0-471-25093-7
- MSDsite: behind the scene: The technology used in database searching and retrieval for the analysis and viewing of bound ligands and active sites.
Golovin, A., Dimitropoulos, D., Oldfield, T. and Henrick, K. (2004)
The eCheminfo 2004 Conference "Applications of Cheminformatics and Modelling to Drug Discovery", 8-19 November.
- MSD database and MSD database services
A. Golovin, T. J. Oldfield, J. G. Tate, S. Velankar, G. J. Barton,
H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Hussain,
J. M. C. Ionides, M. John, P. A. Keller, E. Krissinel, P. McNeil,
A. Naim, R. Newman, A. Pajon, J. Pineda, A. Rachedi, J. Copeland,
A. Sitnov, S. Sobhany, A. Suarez-Uruena, J. Swaminathan, M. Tagari,
S. Tromm, W. Vranken and K. Henrick (2004) E-MSD: an integrated data
Nucleic Acids Research, 32 (Database issue), D211-D216. 2004
The following methods and packages have also been used for MSDchem
- CACTVS (http://www2.chemie.uni-erlangen.de/software/cactvs/index.html)
CACTVS: A Chemistry Algorithm Development Environment
W. D. Ihlenfeldt, Y. Takahasi, H. Abe, S. Sasaki,
in: Daijuukagakutouronkai Dainijuukai Kouzoukasseisoukan Shinpojiumu Kouenyoushishuu,
Machida, K., Nishioka, T. (Eds),
Kyoto University Press (1992), 102-105
- CORINA (http://www2.chemie.uni-erlangen.de/software/corina/index.html)
Gasteiger, J.; Rudolph, C.; Sadowski, J.
Automatic Generation of 3D-Atomic Coordinates for Organic Molecules.
Tetrahedron Comp. Method. 1990, 3, 537-547.
- JME (http://www.molinspiration.com/jme/)
- JMOL (http://jmol.sourceforge.net/)
Christoph Steinbeck, Yongquan Han, Stefan Kuhn, Oliver Horlacher, Edgar Luttmann, and Egon Willighagen, The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics, J.Chem.Inf.Comp.Sci., 2003
- ACDLABS (http://www.acdlabs.com/)
- VEGA (http://users.unimi.it/~ddl/vega/index_noanim.htm)
A. Pedretti, L. Villa, G. Vistoli
"Vega - an open platform to develop chemo-bio-informatics applications, using plug-in architecture and script programming"
J.C.A.M.D., Vol. 18, 167-173 (2004)
- InChI: The IUPAC International Chemical Identifier (InChI TM)
Copyright © The International Union of Pure and Applied Chemistry 2005: IUPAC International Chemical Identifier (InChI) (contact: secretariat@iupac.org)