
Other Resources

The following are links to various tools, companies and general resources related to text mining:




Sites with Text Mining Resources

  • National Centre for Text Mining (NaCTeM)

    NaCTeM Software Tools

    The Web site offers access to software components that serve different purposes. The Cheshire 3 retrieval engine was initially designed to retrieve entries from database catalogues and processes XML documents. The GENIA toolkit comprises several components: (1) the CFG Parser for general English, (2) a POS Tagger for English, (3) the GENIA Tagger, which performs POS tagging and shallow parsing for biomedical text (including a demo interface), (4) a named entity recognizer (part of the GENIA Tagger), (5) Enju, a deep syntactic parser for English, and (6) Moriv, a GUI client for browsing feature structures extracted from text.

    Furthermore, the Web site offers a sentence and paragraph breaker, access to the BioMinT suite of text mining tools for biomedical articles, and a system for clinical document classification.




Text Mining Tools

A list of software tools that have been presented in the past:
  • AbXtract

    "Paste abstracts related to a protein in the window. The results may be deleted after a week."

    No publication found.

    URL: http://columba.ebi.ac.uk:8765/andrade/abx

  • AcroMed

    "AcroMed is a computer generated database of biomedical acronyms and the associated long forms extracted from the last year of Medline abstracts (2001). AcroMed is a part of the Medstract project whose goal is to apply natural language processing technologies to extraction of knowledge from biomedical texts."

    J. Pustejovsky, J. Castaño, B. Cochran, M. Kotecki, M. Morrell, A. Rumshisky. (2001)
    Linguistic Knowledge Extraction from Medline: Automatic Construction of an Acronym Database
    An updated version of the paper presented at Medinfo, 2001.

    URL: http://medstract.med.tufts.edu/acro1.1/index.htm

  • EASE

    "EASE is a customizable software application for rapid biological interpretation of gene lists that result from the analysis of microarray, proteomics, SAGE and other high-throughput genomic data. The biological themes returned by EASE recapitulate manually determined themes in previously published gene lists and are robust to varying methods of normalization, intensity calculation and statistical selection of genes."

    Douglas A. Hosack, Glynn Dennis, Jr, Brad T. Sherman, H Clifford Lane and Richard A. Lempicki (2003)
    Identifying biological themes within lists of genes with EASE
    Genome Biol, Vol 4 (10):R70.

    URL: No online link available

  • Genies

    "We present a system, GENIES, that extracts and structures information about cellular pathways from the biological literature in accordance with a knowledge model that we developed earlier. We implemented GENIES by modifying an existing medical natural language processing system, MedLEE, and performed a preliminary evaluation study."

    Friedman C, Kra P, Yu H, Krauthammer M and Rzhetsky A. (2001)
    GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles.
    Bioinformatics. 17 Suppl 1:S74-82.

    URL: No online link available

  • GoMiner

    "We have developed GoMiner, a program package that organizes lists of 'interesting' genes (for example, under- and overexpressed genes from a microarray experiment) for biological interpretation in the context of the Gene Ontology. GoMiner provides quantitative and statistical output files and two useful visualizations. The first is a tree-like structure analogous to that in the AmiGO browser and the second is a compact, dynamically interactive 'directed acyclic graph'. Genes displayed in GoMiner are linked to major public bioinformatics resources."

    Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, & Weinstein JN (2003)
    GoMiner: a resource for biological interpretation of genomic and proteomic data
    Genome Biol, 4:R28.

    URL: http://discover.nci.nih.gov/gominer

  • IHOP

    "We report the development of an information system that provides this network as a natural way of accessing the more than ten million abstracts in PubMed.
    By employing genes and proteins as hyperlinks between sentences and abstracts, we convert the information in PubMed into one navigable resource and bring all the advantages of the internet to scientific literature investigation."

    Hoffmann R & Valencia A. (2004)
    A gene network for navigating the literature.
    Nature Genetics 36, 664.

    URL: http://www.pdg.cnb.uam.es/UniPub/iHOP/

  • MatchMiner

    "MatchMiner is a freely available program package for batch navigation among gene and gene product identifier types commonly encountered in microarray studies and other forms of 'omic' research. The user inputs a list of gene identifiers and then uses the Merge function to find the overlap with a second list of identifiers of either the same or a different type or uses the LookUp function to find corresponding identifiers."

    Bussey KJ, Kane D, Sunshine M, Narasimhan S, Nishizuka S, Reinhold WC, Zeeberg B, Ajay W, & Weinstein JN (2003)
    MatchMiner: a tool for batch navigation among gene and gene product identifiers
    Genome Biol, 4:R27.

    URL: http://discover.nci.nih.gov/matchminer

  • MedMiner

    "The MedMiner filters will extract and organize relevant sentences in the literature based on a gene, gene-gene or gene-drug query. This tool combines the GeneCards and PubMed search engines with user input and automated server-side scripts in an integrated text filtering system."

    L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein (1999)
    MedMiner: an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling.
    BioTechniques 27(6):1210-1217.

    URL: http://discover.nci.nih.gov/textmining/main.jsp

  • MedSynDiKate


    Hahn, Udo; Romacker, Martin; Schulz, Stefan (2000)
    medSynDiKATe: Design considerations for an ontology-based medical text understanding system.
    In: AMIA 2000 - Proc.Annual Symposium of the American Medical Informatics Association. Converging Information, Technology, and Health Care.
    Los Angeles, CA, November 4-8, 2000.
    Ed. by J.M. Overhage. Philadelphia/PA: Hanley & Belfus, 2000, pp.330-334.

    URL: No online link available

  • MeshMap

    "MeSHmap supports searches via PubMed followed by user driven exploration of the MeSH terms and subheadings in the retrieved set."

    Srinivasan (2001)
    MeSHmap: a text mining tool for MEDLINE.
    Proc. AMIA Symp pp. 642-6.

    URL: No online link available

  • NLProt

    "NLProt is a tool for finding protein-names in natural language-text. It is based on Support Vector Machines (SVMs), which are trained on contextual-features of named entities in scientific language."

    URL: http://cubic.bioc.columbia.edu/services/nlprot/submit.html

  • PASTA

    "PASTA ... aims at creating a database of protein active sites using novel text extraction methods. ... The work for PASTA concentrated on extracting information concerning the roles of particular amino acid residues in known three-dimensional protein structures. The ultimate objective is the provision of a WWW-based searchable knowledge base of the roles of the important residues in each protein structure in the Protein Data Bank."

    Gaizauskas R, Demetriou G, Artymiuk PJ and Willett P. (2003)
    Protein structures and information extraction from biological texts: the PASTA system.
    Bioinformatics 19(1):135-43.

    URL: http://www.dcs.shef.ac.uk/nlp/pasta/

  • PubMatrix

    "PubMatrix is a simple way to rapidly and systematically compare any list of terms against any other list of terms in PubMed. It reports back the frequency of co-occurrence between all pairwise comparisons between the two lists as a matrix table. Lists of terms can be anything; gene names, diseases, gene functions, authors... pretty much anything. The user can then quickly sort or browse the frequency matrix table to do individual searches independently."

    Kevin G Becker, Douglas A Hosack, Glynn Dennis Jr, Richard A Lempicki, Tiffani J Bright, Chris Cheadle and Jim Engel (2003)
    PubMatrix: a tool for multiplex literature mining
    BMC Bioinformatics 4: 61.

    URL: http://pubmatrix.grc.nia.nih.gov/

  • TAMBIS

    "TAMBIS aims to aid researchers in biological science by providing a single access point for biological information sources round the world. The access point will be a single interface (via the World Wide Web) which acts as a single information source. It will find appropriate sources of information for user queries and phrase the user questions for each source, returning the results in a consistent manner which will include details of the information source."

    Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A. (2000)
    TAMBIS: transparent access to multiple bioinformatics information sources.
    Bioinformatics 16(2): 184-5.

    URL: http://imgproj.cs.man.ac.uk/tambis/

  • Textpresso

    "The Textpresso search engine for C. elegans abstracts and fulltexts was developed at Wormbase to service the C. elegans community."

    "Textpresso is an information extracting and processing package for C. elegans literature developed by Eimear Kenny and Hans-Michael Muller, with contributions from Juancarlos Chan. We are part of the WormBase group at the California Institute of Technology, California."

    URL: http://www.textpresso.org/

  • TextQuest

    "We present an algorithm for large-scale document clustering of biological text, obtained from Medline abstracts. The algorithm is based on statistical treatment of terms, stemming, the idea of a 'go-list', unsupervised machine learning and graph layout optimization."

    Iliopoulos I, Enright AJ and Ouzounis CA. (2001)
    Textquest: document clustering of Medline abstracts for concept discovery in molecular biology.
    Pac Symp Biocomput: 384-95.

    URL: No online link available

  • Whatizit

    "Whatizit can tell you the meaning of words found in your text, depending on the kind of information you want to see highlighted."

    Kirsch, H and Rebholz-Schuhmann, D (2004)
    Distributed Modules for Text Annotation and IE applied to the Biomedical Domain.
    [BioNLP] Coling 2004 Workshop, Geneva.

    URL: http://www.ebi.ac.uk/Rebholz-srv/whatizit/form.jsp

  • XPLORMED

    "The XplorMed server allows you to explore a set of abstracts derived from a MEDLINE search. The system gives you the main associations between the words in groups of abstracts. Then, you can select a subset of your abstracts based on selected groups of related words and iterate your analysis on them."

    Perez-Iratxeta C, Bork P, Andrade MA. (2001)
    XplorMed: a tool for exploring MEDLINE abstracts.
    Trends Biochem. Sci. 26, 573-575.

    URL: http://www.bork.embl-heidelberg.de/xplormed/
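
Several of the tools above rest on simple term co-occurrence counts. As an illustration of the idea behind the PubMatrix entry (a frequency matrix of pairwise co-occurrences between two term lists), here is a minimal Python sketch. It is not PubMatrix's implementation and does not query PubMed: the function name cooccurrence_matrix, the toy abstracts and the term lists are all hypothetical, and matching is plain case-insensitive substring search over an in-memory list of texts.

```python
from itertools import product


def cooccurrence_matrix(abstracts, row_terms, col_terms):
    """Count, for each (row_term, col_term) pair, the abstracts mentioning both.

    Assumption: matching is a case-insensitive substring test, which is far
    cruder than a real literature search.
    """
    lowered = [text.lower() for text in abstracts]
    matrix = {}
    for row, col in product(row_terms, col_terms):
        matrix[(row, col)] = sum(
            1 for text in lowered if row.lower() in text and col.lower() in text
        )
    return matrix


# Hypothetical toy data standing in for abstracts retrieved from a search.
abstracts = [
    "TP53 mutations are frequent in breast cancer.",
    "BRCA1 and TP53 interact in breast cancer cell lines.",
    "BRCA1 expression in ovarian cancer.",
]
genes = ["TP53", "BRCA1"]
diseases = ["breast cancer", "ovarian cancer"]

matrix = cooccurrence_matrix(abstracts, genes, diseases)

# Print one row per gene, one column per disease.
for gene in genes:
    print(gene, [matrix[(gene, disease)] for disease in diseases])
```

A real tool would replace the substring test with search-engine hit counts per term pair; the matrix layout (one list as rows, the other as columns) stays the same.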




Text Mining Companies

A list of companies that claim to do information extraction. If I have forgotten your company, please send me details.



Other Text Mining Resources

Useful links to other resources and pages with links to resources:
  • Alex Morgan's Homepage - "I have tried to compile a list of many of the freely available resources either used directly and cited by the participants of BioCreAtIvE or that seem potentially useful. I have put in descriptions taken largely from the respective webpages, but I have also included some reviews based on my own experiences or what I have heard from users other than the developers of the resource. This is probably rife with errors and comissions, so please contact me to fix anything." - Alex Morgan.

  • BioNlp.org - Bob Futrelle's gatherings of useful data+links+literature. No mission statement.

Numerous labs also deserve to be acknowledged here; I will add them when I have the time.































