-------------------------------------------------------------
* Chemical substructure search in SQL
* Copyright (C) 2008 Adel Golovin
* The aim of this project is to provide a fast deterministic chemical
* substructure search and an easy way of its embedding into relational
* databases. The technique is as follows: the chemical package generates an SQL
* query given SMILES of MDL file. Then this SQL query is executed on a database.
*
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
* MA 02110-1301, USA.
-------------------------------------------------------------
This chemical search package allows
- to create a database of chemical compounds
- to load MDL(SDF) files into the database and perform updates with new data
- to install a web search interface
- to search substructures inside database by SMILES
The supported database brands are Oracle and Postgres. Mysql is a may be in the future.
The system runs in client server mode, when the SQL query is generated on the client side and
then executed on the database server side. We use a web application as a client, its role is
to generate SQL query.
The software has been tested with Oracle 9, 10, 11 and PostgreSQL 8.3.
To run this package on the server side only, it is available to
install it as the java package inside Oracle and then use it in SQL queries.
The same can be done in PostgreSQL by the use of PL/Java.
The package requires java 1.5 or above to be installed on the database server,
this generally limits Oracle version to 11 or above.
To install the chemical search you need an account with DBA privileges.
It implies that the RDBMS is installed and the database is created.
The chemical search user accounts and schemas will be created by our utility.
An Oracle installation creates a database with the name "ORCL" by default.
A PostgreSQL installation creates a database with the name "postgres" by default.
If you do not have Oracle software then a trial version is available for download from oracle.com
If you decided to install Oracle then after installation is complete you will need
to create a user with DBA privileges.
For Oracle 11g it is available through oracle web interface:
select tag "Server" and in the "Security" column select "Users", then create a user,
for instance CHEM and, for the sake of simplicity, grant it all available privileges.
!!! Important notice for PostgreSQL installation.
-------------------------------------------------
The installation requires that the user under which you load the database has rights
to create folders in the PostgreSQL/data folder. These folders are tablespaces.
The tablespaces are created by the loading procedure in pairs - one tablespace
for data (tables) and another for indexes.
By default the access to the data folder is available to
a user with administrative privileges under which the installation of postgres was done.
Right after the installation this data folder will be read-only.
To remove this flag you will need to stop the postgres server and then change
the privilege using an administrative account.
Alternatively you can create the folders manually.
The folders name start with the user name which is chosen for chemical data in the chmload.properties file
(mol) then underscore '_' followed by 'data' and 'data_i'
For example:
C:/PostgreSQL/8.3/data/mol/mol_data
C:/PostgreSQL/8.3/data/mol/mol_data_i
After creation of the folders make sure that created by the installation of PostgreSQL "postgres" user
has a full access them.
The choice of using tablespaces is made to easier removal of the chemical database.
It may assist the copying too.
-------------------------------------------------------------
DATABASE LOADING
The psm.db.CHEMLoadManager class creates database of chemical compounds and
loads data from CML/MDL/SDF files.
To create a database of chemical compounds the following files are required:
jar files:
- chempackage.jar - our jar file, it is distributed under GPL
- ant.jar - ANT jar file from jakarta ant project
Apache ANT. It is covered by Apache License: http://ant.apache.org/license.html
- ojdbc14.jar - jar file with oracle JDBC drivers, it can be down loaded from oracle.com.
This is Oracle JDBC driver. It is covered by OTN Development and Distributil License.
You must agree with this license before using this jar file. Note, that is is not opensource and
inherited classes from classes of this jar can not be distributed.
Consider the license and the distribution see oracle.com.
- postgresql-8.3-604.jdbc3.jar - jar file with postgres JDBC drivers, it can be down loaded from jdbc.postgresql.org
This is PostgreSQL JDBC driver.
- cdk-1.0.4.jar CDK library, it can be down loaded from the sourceforge.net
This is Chemistry Development Kit. It is distributed under GPLL and is available from http://sourceforge.net/projects/cdk
configuration files:
- chmload.properties - configuration of the database and the source of chemical compounds
It is necessary to edit this file. The instructions can be found in it.
For the sake of an example you must change the "database" and "database.filepath" values only.
- compound.property - this file contains mapping of MDL(SDF) tags to the database fields
It is optional to edit this file, because if database fields are not assigned they will be calculated
- logging.cfg - java logging properties
SQL Script files
- chemTables.sql creates necessary database tables for data loading
- chemSearch.sql creates necessary database tables and indexes for search
These files are RDBMS specific and therefore there are copies for Oracle, PostgreSQL
and the other supported vendors.
HOW TO EXECUTE DATABASE LOADING
Java 1.5 or above is required.
1) edit chmload.properties and specify "database" and "database.filepath"
fields in according with your database.
2) The command line for Unix and MS Windows looks similar to:
java -jar chempackage.jar dba=usr/pwd conf=chmload.properties
where it is necessary to assign usr and pwd in dba=usr/pwd to
the database administrator user name and password.
But if you load additional data into an already created database then
this parameter can be omitted:
java -jar chempackage.jar conf=chmload.properties
When the "database" attribute in the configuration file is specified by its TNS name
then it is necessary to start java with the parameter: -Doracle.net.tns_admin={$TNS_ADMIN}
otherwise make sure that "database" is given by its URL (host:port-number:database-name)
With the TNS names the command line for Unix looks like:
java -Doracle.net.tns_admin={$TNS_ADMIN} -jar chempackage.jar dba=usr/pwd conf=chmload.properties
for MS Windows
java -Doracle.net.tns_admin=%TNS_ADMIN% -jar chempackage.jar dba=usr/pwd conf=chmload.properties
TNS_ADMIN system variable points to oracle file with TNS names,
if you do not have access to this file then it is necessary to specify database URL
in chmload.properties file.
To log messages into a file (psm.log) the java run-time attribute can be included:
-Djava.util.logging.config.file=logging.cfg
the file logging.cfg contains java logger configuration in according with which
all messages are written into psm.log file.
----------------------------------------
INSTALL Web Application
Java 1.5 or above is required.
The installation consists of webchem.war file and webchem.xml
webchem.war contains the web application
webchem.xml contains the web application configuration (context)
You must edit webchem.xml
1) set docBase field to specify path to the webchem.war file,
or if you use tomcat like server then put webchem.war into webapps folder.
2) set context parameters
2.1) database
Database TNS name or URL
This value ("postgres" in the example) must be the same as the value of "database" property
in the chmload.properties file. It is the database where the chemical compounds were loaded.
URL example for oracle: localhost:1521:orcl
URL example for postgresql: localhost:5432:postgres
2.2) databases
Number of search schemas (users) where to search chemical compounds
Value of this parameter (1 in the example) must correspond to search.number.of.users
property from the chmload.properties file which was used during loading.
The recommended value is the number of CPUs (CPU cores) of the database server.
2.3) database-user
Database schema (user) name. If there are more than one schemas (users)
to search then the schema name will be made as the value of this parameter plus
the integer schema number as suffix starting with 1: mol1, mol2,mol3, ...
This value (mol in the example) must correspond to search.user property from the
chmload.properties file which was used during loading.
2.4) database-password
Database password. If there are more than one schemas (users) to search
then all these schemas (users) must have the same password
This value (mol64 in the example) must correspond to search.password property from the
chmload.properties file which was used during loading.
2.5) use-bond-orders-only
The database chm_bonds table has two attributes - bond_type and bond_order.
bond_type is the original bond_type from the source file whether the bond order
can be assigned to aromatic when the bond is considered to belong to an aromatic system by
the loading procedure. The original data files from which the loading was done can have
aromatic bonds assigned or have conjugated systems instead.
Our system can calculate aromaticity but it can not reverse it. So if the original data has
aromaticity already assigned then the information about conjugated systems will be lost
and can not be used by the search engine. Therefore it is recommended then data is
loaded from files with conjugated systems instead of aromatic.
if this parameter true (like in the example) then the search engine will use
bond_order fields in bond tables whether if it is false then bond_type will be
searched instead when the system is not aromatic.
2.6) hitlist-rows
It is the web interface parameter.
The number of rows in the hitlist page when it is in non detail (icon) mode
2.7) hitlist-columns
It is the web interface parameter.
The number of columns in the hitlist page when it is in non detail (icon) mode
2.8) hitlist-detail
It is the web interface parameter.
Indicates whether the hitlist page is in detail or non detail (icon) mode
Now, after the webchem.xml contains correct information,
it is ready to deploy into your Web Application Server.
To deploy it to a Tomcat server it is available to use tomcat manager interface or
copy this file into conf/Catalina/localhost/ folder.
If the folder does not exist then create it.
-------------------------------------
INSTALL chemical search package into Oracle.
This allows an easy way of use of the substructure search in another SQL query, e.g:
select * from table(chem.compounds('C1CC1N', 'Y', 'Y'));
Java 1.5 or above is required to be install on oracle.
Assume that the data was loaded into a single schema (one user) and it is called mol
with password mol64 and the database ora_db.
change the directory into the installation folder.
to load java classes into the oracle use the following commands:
loadjava -user mol/mol64@ora_db ant.jar
loadjava -user mol/mol64@ora_db cdk-1.0.4.jar
loadjava -verbose -resolve -user mol/mol64@ora_db chempackage.jar
loadjava is oracle utility to load java classes into the database server.
then load sql wrapper
sqlplus mol/mol64@ora_db @oraPackage
sqlplus is oracle client utility.
some queries by smiles:
select * from table(chem.compounds('C1CC1N', 'Y', 'Y'));
select * from table(chem.compounds('NCCCN', 'Y', 'Y'));
select * from table(chem.compounds('OC1CCc2ccccc12', 'N', 'Y'));
select * from table(chem.compounds('[nH]1cnc2cncnc12', 'N', 'Y'));
select * from table(chem.compounds('C1CC2C3CC1C23', 'N', 'Y'));
the function understands whether the query is by SMILES or by MDL. MDL must have '|' instead
of next line ('\n') separators:
select * from table(chem.compounds('C1CC1|JME 2007.03.2 Wed Oct 29 11:10:00 GMT 2008| | 3 3 0 0 0 0 0 0 0 0999 V2000| 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0|0.7000 1.2124 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0| 1.4000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0| 1 2 1 0 0 0 0| 1 3 1 0 0 0 0| 2 3 1 0 0 0 0|M END|', 'Y', 'Y'));
---------------
Troubleshooting:
1) Ignore all loadjava errors except when loading our jar - chempackage.jar
If there are errors the first thing to check is which version of Java your Oracle supports.
It must be 1.5 or above.
2) the first time you execute a query like
select * from table(chem.compounds('C1CC1N', 'Y', 'Y'));
may throw an exception because the java package is not in sink due to it was just compiled.
Repeat the query and all next queries must run Ok.
------------------------------------
Any problems e-mail: golovin@ebi.ac.uk with the subject "chemical search in SQL"