spacer
spacer

IPI - International Protein Index - MySQL Dump (NEW SCHEMA SEPT 07)



IPI MySQL (.mysql) files provide a relational view of IPI data sets suitable for loading into MySQL databases.

They can be downloaded for the current release from the IPI FTP site.

See below for how to load back these dumps into a MySQL server.

Each species dump should be loaded into a different database (although there is the
possibility for more expert users to load them all into the same database [more...]):

e.g.

gunzip ipi.HUMAN.mysql.gz;
mysql -h host_name -u username -ppassword IPIhuman < ipi.HUMAN.mysql;
gunzip ipi.MOUSE.mysql.gz;
mysql -h host_name -u username -ppassword IPImouse < ipi.MOUSE.mysql;
etc...

or directly

gunzip -c ipi.HUMAN.mysql.gz | mysql -h host_name -u username -ppassword IPIhuman
gunzip -c ipi.MOUSE.mysql.gz | mysql -h host_name -u username -ppassword IPImouse
etc...

Ask your MySQL administrator to create the databases for you if you don't have "CREATE
DATABASE" permissions.

Please Note: In order to comply with MySQL Server (Distrib 5.0 onwards) as of Sept 07 the "release" table has been renamed to "current_release".
This is because the word "release" is now a reserved word in the latest MySQL. In order to continue using existing dumps with version MySQL 5.0+ please use the qualifier of the database name when querying the table e.g.

select * from IPIhuman.release

If you want to import data from old releases into the new schema rename the table first e.g:

rename IPIhuman.release to current_release


A graphic of the IPI shcema is presented below (generated by DbVisualizer) and some query examples follow.

IPI MySQL distribution schema



Query examples:
  • In order to get IPI entries and their associated source entries which description contains the keyword 'receptor'
    select p.id as ipi_id,
    ds.name as data_source,
    pe.primary_id as primary_source_id,
    pe.secondary_id as secondary_source_id,
    p.description as description
    
    from protein p,
    protein_entry pe,
    data_source ds
    
    where p.description like '%receptor%'
    and p.id = pe.protein_id
    and pe.data_source_id = ds.id
    
    order by p.id,
    ds.id,
    pe.primary_id;

  • In order to get all ensembl genes with their chromosomal location mapped to IPI entries
    select g.chromosome as chromosome,
    ge.primary_id as ensembl_gene,
    p.id as ipi_id,
    pe.primary_id as master_protein_entry
    
    from protein p,
    gene g,
    protein2gene p2g,
    gene_entry ge,
    protein_entry pe,
    data_source ds
    
    where ds.name = 'ENSEMBL'
    and ds.id = ge.data_source_id
    and ge.gene_id = g.id
    and g.id = p2g.gene_id
    and p2g.protein_id = p.id
    and p.master_id = pe.id
    
    order by g.chromosome,
    ge.primary_id,
    p.id;

  • In order to dump InterPro hits in a tab-delineated file
    mysql -h hostname -P port -u user -ppassword IPIhuman -N -e "select ds.name, p.id, pe.primary_id, im.entry_ac, im.method_ac, im.from_pos, im.to_pos from interpro_match im, protein p, protein_entry pe, data_source ds where im.protein_id = p.id and p.master_id = pe.id and pe.data_source_id = ds.id order by ds.id, p.id" > ipi.human.ipc

  • In order to dump IPI sequences in a FASTA format file
    mysql -h hostname -P port -u user -ppassword IPIhuman -r -B -N -e "select concat('>', p.id, '.', p.version, ' ', p.description, ' [', o.org_sci_name, ']\n', u.sequence ) from protein p, protein_entry pe, data_source ds, uniparc u, organism o, current_release r where p.release_id = r.id and r.org_name = o.name and p.uniparc_id = u.id and p.master_id = pe.id and pe.data_source_id = ds.id order by p.id" > ipi.human.fasta

MySQL software is available under a number of flexible licenses (see here for details).

These dumps were generated and tested on a MySQL server (Distrib 4.1.12).

Any feedback is welcome.

spacer
spacer