README 2005/03/30 MSDSD mySQL Release - Database version: beta00.05_2005-03-10 - Data version: 13-2005_03_10 - mySQL data files suitable for any server platform and recommended version of mySQL 4.1 - Contact melford@ebi.ac.uk dimitris@ebi.ac.uk, Last modified Mar 2005 Documentation Documentation for the structure of MSDSD at the table level is available online at: http://www.ebi.ac.uk/msd/dbdoc/ Installation of MSDSD In order to install and use the mySQL version of MSDSD you need to downnload and install the latest recommended release of the mySQL relational database system under the free software/open source GNU General Public License (commonly known as the "GPL") or under a commercial license from http://dev.mysql.com/downloads/. Installation involves the creation of a database MSDSD and the downloading of the required mySQL datafiles on its data directory. Updates 1a. We do not supply regular incremental updates to the data in MSDSD for mySQL. Occasionally newer up-to-date releases of the mySQL version of MSDSD will become available from the MSD web site and they may be used to replace older releases. 1. Notes 1a. The data and scripts supplied with this release of MSDSD is suitable for any hardware server platform using mySQL 4.1. There were some non-extensive tests done for mySQL version 4.0 and 5.0 that seemed to work without problems. 1b. Minimum Requirements * Disk storage - At least 35 GB on as many disks as possible for the mySQL datafiles in mySQL compressed format without indeces. - 300 GB in order to use mySQL uncompressed format and create all recommended indeces * Available memory of 512GB (your hardware should have 1GB) * 1 CPUs * mySQL 4.1 * Character set of database must be ascii and collation ascii-bin 1c. Directory structure of the mySQL MSDSD release * data/msdsd: The data directory of the MSDSD database. Download the .MYD,.MYI,.frm files for any tables that you want to replicate in your local data/msdsd data directory * data/mysql: The data directory of the mySQL dictionary. Use only for debugging purposes * schema: A directory with SQL scripts that could be used in order to create a complete but empty (without data) MSDSD schema. You will need this to download the "CREATE INDEX" and any CONSTRAINT statements to apply after the actual data download * scripts: Configuration files (like my.ini) and utility scripts 1d. Character set compatibility with Oracle Oracle and mySQL deal differently with the VARCHAR data type. In particular mySQL ignores character case in VARCHAR types for most character sets and has introduced a different data type VARBINARY for case sensitive textual data. Oracle VARCHARs in the contrary are always case sensitive. So for example the expression 'Text'='text' is true for most character sets in mySQL and if these values appear in the primary/unique key of 2 different rows of a table, there is going to be a unique key violation error. In Oracle this is not true so in such cases, data that is perfectly valid in Oracle, will not be able to exist in mySQL tables. The way around this is the use the "ascii" character set and the "ascii-bin" collation in mySQL that are case sensitive. While the character set that we have in MSDSD in Oracle is UTF8, in fact all the textual data can be represented by ascii and the behaviour of the mySQL ascii/ascii-bin character set is pefectly adequate to MSDSD data. 1e. MyIsam and Compressed MyIsam table formats From the various table types available in mySQL we have choosen to use the MyIsam tables because they are generally simple, perform better, are really open source and use disk space efficiently. We have also been experimenting with the InnoDB tables that seem to be closer to Oracle technology (they support foreign keys and transactions) but we found that they do not use disk space usage efficiently (http://forums.mysql.com/read.php?20,3876,3876#msg-3876). Additionally the MyIsam table data are stored in 3 distinct files per table, that can be copied and distributed seperately without any external dependencies. Furthermore MyIsam allows the option of Compressed MyIsam tables that use disk space even more efficiently and can be downloaded faster in a format which is ready for use (no need to unzip to use). These tables though will need unpacking if you need to create indexes on them. The unpacking script is also supplied. The compressed tables are ATOM_DATA,ATOM,RESIDUE_DATA,RESIDUE (their partitions - see below), CATH_MAPPING , PFAM_MAPPING , SCOP_MAPPING , SWISS_PROT_MAPPING , CONTACT , PLANE_CENTER , SITE , HELIX , HELIX_HELIX , SHEET_HBOND , STRAND , TURN. The decision about which tables are compressed, depends on their size (tables of datafiles >100MB are compressed) 1f. Partitioned merged tables There is a small number of very large tables in MSDSD (tables on the atom and residue level). In order to be manageable and perform well, these tables are "partitioned" in Oracle. MySQL offers "Merged tables" a similar feature that we have used for the following tables: ATOM, ATOM_DATA, RESIDUE, RESIDUE_DATA. In particular we have divided the entries in 32 ranges (using the ordering implied by their entry_id) that include about the same number of residues in total. Details of this partitioning are availabe from the "MYSQL_PARTITIONS" table which is part of the distribution. Based on this range partitioning we have generated 32 ATOM_DATA_* tables (ATOM_DATA_1,ATOM_DATA_2 .. ATOM_DATA_32) that include the atom coordinates from the entries of the corresponing partitions. Then the actual ATOM_DATA table is a "MERGE" table created by merging these 32 partition tables. What we have achieved is that each one of the 32 ATOM_DATA_* tables is about 600MB (in MyIsam compressed) and can be used directly if one first uses the MYSQL_PARTITIONS table to find the partition that a particular entry belongs in. If we had a single ATOM_DATA table it would be about 20GB, making it very difficult to download and even hitting the operating system filesize limit in older systems. In a similar fashion table ATOM is organised in 16 partitions (2 ATOM_DATA partitions for a single ATOM partition) and RESIDUE and RESIDUE_DATA in 8 partitions (4 partitions in 1). 1h. Creation of indexes There are no precreated in the distribution release. The reason is that it is often faster to re-create the indexes locally than download them. The distribution includes scripts with the SQL commands that can be used to created the set of recommended indexes. What we suggest though is to use these scripts based on the needs of your application and create only the indexes that are really usefull for you. 1g. Creation of constraints There are no constraints defined in the distribution release since this means creation of indexes. There is no need to re-check the validity of the constraints in MSDSD since this is done at our end, in the Oracle database, as a part of the production of the database. The constraints are available in SQL script files for documentation purposes. We do not recommend using them, because they would not serve your application in any obvious way. In any case in mySQL the MyIsam tables do not check the validity of foreign keys. There is an additional point to that has to be clarified. Foreign key constrains for MSDSD are valid only just after a fresh new release of MSDSD (after each new transformation). During the process of incremental updates, these constraints can become invalid since the database is patched with new data and the re-created from scratch. Occasionally we will provide mySQL distributions of the more up-to-date incremented MSDSD. It has to be made clear that for these distributions the foreign key constrains may not be valid. 2. Installation Steps The installation process involves the following steps and could easily take several hours up to a a couple of days. Time estimates are for a complete replication of the MSDSD. The instructions are based on a unix (i.e. linux) based enviroment, but the corresponding set of steps is also required for Windows platforms. Variable MYSQL_HOME below refers to the location where the mySQL software has been installed i.e (/u0/mysql/mysql-4.1.5) and MYSQL_DATADIR to the location where the database datafiles will be located (35GB of free space required there) (i.e. /u0/mysql/data) (Total estimated time:44 hours - for a host base on the specifications above and very fast intenet connection - 4Mbps effective ftp download rate. Note: if you have a reasonable host and you don't want to wait, there are obvious ways of parallelising many of the time-consuming tasks listed below and finish the installation in a fraction of this time. For example you may download that MSDSD datafiles in parallel through different ftp connections and you may create the database indexes you want in parallel through different mysql sessions till you utilise properly your system I/O and CPU) a. Download and modify the mySQL initialisation file my.ini (quick) Download "scripts/my.ini" to ~/.my.cnf (or C:\my.ini on Windows). Modify the parameter "datadir" to the value of $MYSQL_DATADIR. Explanation of other parameters: * "lower_case_table_names" : this is required in unix machines to make table names case insensitive. Otherwise in mySQL table "test" is different than "Test" on Unix. * "character-set, collation" : see discussion above b. Start the database (quick) Run the command available in scripts/startdb (mysqld --character-set-server=ascii --collation-server=ascii_bin --skip-grant-tables --skip-innodb &) c. Create the MSDSD database (quick) Connect to mySQL and create the MSDSD database by running "schema/database.sql": $MYSQL_HOME/bin/mysql mysql>\. schema/database.sql mysql>exit d. Shutdown the database: $MYSQL_HOME/bin/mysqladmin -u root shutdown (quick) e. Download the mySQL data files in the MSDSD data area (minutes - up to 10 hours dependind on what you replicate) cd $MYSQL_DATADIR/msdsd Download any tables you need from the "data" area on the FTP server. You need 3 files per table .frm .MYI .MYD If you need to have any of the partitioned tables you need to download, all their partitions (i.e. ATOM_DATA_[1-32]) You may want to do a checksum (cksum command) for any of your files (ie find . -type f -exec cksum {} \;) and compare the results with the file cksum.txt located on the same directory in order to check for corruption during the ftp download f. (Optional) Uncompress the compressed mySQL tables in order to be able to create indeces (about 3 hours for all tables) Download and run: "scripts/uncompress". Ignore errors if you have not downloaded the complete MYSQL datafile set Note: This script assumes that your system is capable of uncompressing 3 datafiles in parallel. If you have more processors and better I/O you may increase the parallelisation (run more uncompress processes in the background), or otherwise even reduce the parallelisation. g. Restart the database by following instruction in setp "b.". (quick) h. (Optional) Creating any indexes that can be usefull for your particular applications (minutes - up to several hours dependind on what you replicate, and what indexes you need details: about 12h for residue tables, 4h for atom tables, 14h for all other, total 30 hours, attention: be selective with index creation and perhaps leave this step for later when you have an idea about the queries you will run. The reason that the indexes are not pre-built and available on the ftp server is because we believe that very few of them will be usefull to each particular case) $MYSQL_HOME/bin/mysql mysql>\. indexes.sql mysql>\. big_indexes.sql Ignore any errors if you have not downloaded the complete MYSQL datafile set i. Create the MERGED tables for the ones you have installed (quick) mysql>\. merge_big.sql j. (Optional-not recommended): Create any constraints mysql>\. pkeys.sql mysql>\. ukeys.sql mysql>\. fkeys.sql k. Fix the "connect" and "table" privileges by following the mySQL documentation and restart the database without the "skip-grant-tables" option l. Test the database tcsh> $MYSQL_HOME/bin/mysql mysql> select count(1) from helix; mysql> select assembly_serial,assembly_type,assembly_class,chain_code, chain_code_1_letter,chain_type non_assembly_valid,associated_chain_code,num_residues, chainmol_serial,molecule_code,molecule_name from chain where accession_code='1b5t' order by assembly_serial,chain_type,chain_code; If you experience any problems or you want to repeat the steps from the start, you may drop the MSDSD database by doing mysql>\. DROP DATABASE MSDSD; and start again from step "a.". Note that this command will also delete any data files that you have downloaded, so make a local copy of them if you intend to reuse them without re-downloading. Consult the mySQL documentation from * http://dev.mysql.com/doc/ * http://dev.mysql.com/doc/mysql/en/index.html