Other Resources
The following are links to various tools, companies and general resources related to text mining:
Sites with Text Mining Resources
- National Centre for Text Mining (NaCTeM)
NaCTeM Software Tools
The Web site offers access to software components that serve different
purposes. The Cheshire 3 retrieval engine was initially designed to
retrieve entries from database catalogues; it also processes XML documents.
The GENIA Toolkit comprises several components: (1) the CFG Parser for
general English, (2) a POS Tagger for English, (3) the GENIA Tagger,
which performs POS tagging and shallow parsing for biomedical text
(including a demo interface), (4) a named entity recognizer (part of
the GENIA Tagger), (5) Enju, a deep syntactic parser for English,
and (6) Moriv, a GUI client for browsing feature structures extracted from
text.
Furthermore, the Web site offers a sentence and paragraph breaker, access
to the BioMinT suite of text mining tools for biomedical articles, and a
system for clinical document classification.
Text Mining Tools
A list of software tools that have been presented in the past:
AbXtract
"Paste abstracts related to a protein in the window. The results may be
deleted after a week."
No publication found.
URL: http://columba.ebi.ac.uk:8765/andrade/abx
AcroMed
"AcroMed is a computer generated database
of biomedical acronyms and the associated long forms extracted from the last
year of Medline abstracts (2001). AcroMed is a part of the Medstract project
whose goal is to apply natural language processing technologies to extraction
of knowledge from biomedical texts."
J. Pustejovsky, J. Castaño, B. Cochran, M. Kotecki, M. Morrell, A. Rumshisky. (2001)
Linguistic Knowledge Extraction from Medline: Automatic Construction of an Acronym Database
An updated version of the paper presented at Medinfo, 2001.
URL: http://medstract.med.tufts.edu/acro1.1/index.htm
EASE
"EASE is a customizable software application
for rapid biological interpretation of gene lists that result from the analysis
of microarray, proteomics, SAGE and other high-throughput genomic data. The
biological themes returned by EASE recapitulate manually determined themes in
previously published gene lists and are robust to varying methods of
normalization, intensity calculation and statistical selection of genes."
Douglas A. Hosack, Glynn Dennis, Jr, Brad T. Sherman, H Clifford Lane and Richard A. Lempicki (2003)
Identifying biological themes within lists of genes with EASE
Genome Biol, Vol 4 (10):R70.
URL: No link available
Genies
"We present a system, GENIES, that extracts
and structures information about cellular pathways from the biological
literature in accordance with a knowledge model that we developed earlier. We
implemented GENIES by modifying an existing medical natural language processing
system, MedLEE, and performed a preliminary evaluation study."
Friedman C, Kra P, Yu H, Krauthammer M and Rzhetsky A. (2001)
GENIES: a natural-language processing system for the extraction of molecular pathways
from journal articles.
Bioinformatics. 17 Suppl 1:S74-82.
URL: No online link available
GoMiner
"We have developed GoMiner, a program package that organizes lists of
'interesting' genes (for example, under- and overexpressed genes from a
microarray experiment) for biological interpretation in the context of the Gene
Ontology. GoMiner provides quantitative and statistical output files and two
useful visualizations. The first is a tree-like structure analogous to that in
the AmiGO browser and the second is a compact, dynamically interactive
'directed acyclic graph'. Genes displayed in GoMiner are linked to major public
bioinformatics resources."
Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC,
Lababidi S, Bussey KJ, Riss J, Barrett JC, & Weinstein JN (2003)
GoMiner: a resource for biological interpretation of genomic and proteomic data
Genome Biol, 4:R28.
URL: http://discover.nci.nih.gov/gominer
IHOP
"We report the development of an information system that provides
this network as a natural way of accessing the more than ten million abstracts
in PubMed.
By employing genes and proteins as hyperlinks between sentences and abstracts,
we convert the information in PubMed into one navigable resource and bring all
the advantages of the internet to scientific literature investigation."
Hoffmann R & Valencia A. (2004)
A gene network for navigating the literature.
Nature Genetics 36, 664.
URL: http://www.pdg.cnb.uam.es/UniPub/iHOP/
MatchMiner
"MatchMiner is a freely available program
package for batch navigation among gene and gene product identifier types
commonly encountered in microarray studies and other forms of 'omic' research.
The user inputs a list of gene identifiers and then uses the Merge function to
find the overlap with a second list of identifiers of either the same or a
different type or uses the LookUp function to find corresponding identifiers."
Bussey KJ, Kane D, Sunshine M, Narasimhan S,
Nishizuka S, Reinhold WC, Zeeberg B, Ajay W, & Weinstein JN (2003)
MatchMiner: a tool for batch navigation among gene and gene product identifiers
Genome Biol, 4:R27.
URL: http://discover.nci.nih.gov/matchminer
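The Merge and LookUp functions described in the MatchMiner entry above can be illustrated with a minimal sketch. This is not MatchMiner's code or interface: the function names look_up and merge, the mapping table, and the identifiers (GENE_A, ID-1, and so on) are all hypothetical and invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical mapping between two identifier types (invented values).
SYMBOL_TO_ID: Dict[str, str] = {
    "GENE_A": "ID-1",
    "GENE_B": "ID-2",
    "GENE_C": "ID-3",
}


def look_up(identifiers: List[str], table: Dict[str, str]) -> List[Optional[str]]:
    """Translate each identifier via the table; unknown ones become None."""
    return [table.get(identifier) for identifier in identifiers]


def merge(
    first: List[str],
    second: List[str],
    table: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the overlap of two identifier lists, in the order of `first`.

    If `table` is given, `first` is translated into the identifier type
    of `second` before the overlap is computed.
    """
    translated = look_up(first, table) if table is not None else list(first)
    wanted = set(second)
    return [i for i in translated if i is not None and i in wanted]


# Same identifier type: plain overlap.
same_type = merge(["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"])

# Different identifier types: translate first, then overlap.
cross_type = merge(["GENE_A", "GENE_B", "GENE_C"], ["ID-2", "ID-3", "ID-9"], SYMBOL_TO_ID)

print(same_type)
print(cross_type)
```

A real tool would draw the mapping from curated databases; the in-memory dictionary here only stands in for that step.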
MedMiner
"The MedMiner filters will extract and organize relevant sentences in
the literature based on a gene, gene-gene or gene-drug query. This tool
combines the GeneCards and PubMed search engines with user input and automated server-side scripts in an integrated text
filtering system."
L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N.
Weinstein (1999)
MedMiner: an Internet Text-Mining Tool for Biomedical Information, with
Application to Gene Expression Profiling.
BioTechniques 27(6):1210-1217.
URL: http://discover.nci.nih.gov/textmining/main.jsp
MedSynDiKate
Hahn, Udo; Romacker, Martin; Schulz, Stefan (2000)
medSynDiKATe: Design considerations for an ontology-based medical text
understanding system.
In: AMIA 2000 - Proc. Annual Symposium of the American Medical Informatics
Association. Converging Information, Technology, and Health Care.
Los Angeles, CA, November 4-8, 2000.
Ed. by J.M. Overhage. Philadelphia/PA: Hanley & Belfus, 2000, pp. 330-334.
URL: No online link available
MeshMap
"MeSHmap supports searches via PubMed followed by user driven
exploration of the MeSH terms and subheadings in the retrieved set."
Srinivasan (2001)
MeSHmap: a text mining tool for MEDLINE.
Proc. AMIA Symp pp. 642-6.
URL: No online link available
NLProt
"NLProt is a tool for finding
protein-names in natural language-text. It is based on Support Vector Machines
(SVMs), which are trained on contextual-features of named entities in
scientific language."
URL: http://cubic.bioc.columbia.edu/services/nlprot/submit.html
PASTA
"PASTA ... aims at creating a database of protein active sites using
novel text extraction methods. ... The work for PASTA concentrated on extracting
information concerning the roles of particular amino acid residues in known
three-dimensional protein structures. The ultimate objective is the provision
of a WWW-based searchable knowledge base of the roles of the important residues
in each protein structure in the Protein Data Bank."
Gaizauskas R, Demetriou G, Artymiuk PJ and Willett P. (2003)
Protein structures and information extraction from biological texts: the PASTA system.
Bioinformatics 19(1):135-43.
URL: http://www.dcs.shef.ac.uk/nlp/pasta/
PubMatrix
"PubMatrix is a simple way to rapidly and
systematically compare any list of terms against any other list of terms in
PubMed. It reports back the frequency of co-occurrence between all pairwise
comparisons between the two lists as a matrix table. Lists of terms can be
anything; gene names, diseases, gene functions, authors... pretty much
anything. The user can then quickly sort or browse the frequency matrix table
to do individual searches independently."
Kevin G Becker, Douglas A Hosack, Glynn Dennis Jr, Richard A Lempicki,
Tiffani J Bright, Chris Cheadle and Jim Engel (2003)
PubMatrix: a tool for multiplex literature mining
BMC Bioinformatics 4: 61.
URL: http://pubmatrix.grc.nia.nih.gov/
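The pairwise co-occurrence matrix described in the PubMatrix entry above can be sketched in a few lines. This is only an illustration of the idea, not PubMatrix's implementation: it counts matches in a small in-memory list of invented abstracts instead of querying PubMed, and the function name cooccurrence_matrix and the simple case-insensitive substring matching are assumptions.

```python
from typing import Dict, List, Tuple


def cooccurrence_matrix(
    rows: List[str], cols: List[str], abstracts: List[str]
) -> Dict[Tuple[str, str], int]:
    """Count abstracts that mention both a row term and a column term."""
    lowered = [abstract.lower() for abstract in abstracts]
    matrix: Dict[Tuple[str, str], int] = {}
    for row_term in rows:
        for col_term in cols:
            matrix[(row_term, col_term)] = sum(
                1
                for text in lowered
                if row_term.lower() in text and col_term.lower() in text
            )
    return matrix


# Toy abstracts, invented for illustration only.
abstracts = [
    "TP53 mutations are frequent in breast cancer.",
    "BRCA1 and TP53 interact in breast cancer cell lines.",
    "BRCA1 loss is associated with ovarian cancer.",
]
genes = ["TP53", "BRCA1"]
diseases = ["breast cancer", "ovarian cancer"]

matrix = cooccurrence_matrix(genes, diseases, abstracts)

# Print the matrix as a small table: one row per gene, one column per disease.
for gene in genes:
    print(gene, [matrix[(gene, disease)] for disease in diseases])
```

Substring matching is the crudest possible choice; a literature search service would apply its own query semantics, so counts from a real tool would differ.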
TAMBIS
"TAMBIS aims to aid researchers in
biological science by providing a single access point for biological
information sources round the world. The access point will be a single
interface (via the World Wide Web) which acts as a single information source.
It will find appropriate sources of information for user queries and phrase the
user questions for each source, returning the results in a consistent manner
which will include details of the information source."
Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A. (2000)
TAMBIS: transparent access to multiple bioinformatics information sources.
Bioinformatics 16(2): 184-5.
URL: http://imgproj.cs.man.ac.uk/tambis/
Textpresso
"The Textpresso search engine for C. elegans abstracts and fulltexts was developed
at Wormbase to service the C. elegans community."
"Textpresso is an information extracting
and processing package for C. elegans literature developed by Eimear
Kenny and Hans-Michael Muller, with contributions from Juancarlos Chan.
We are part of the WormBase group at the California Institute of
Technology, California."
URL: http://www.textpresso.org/
TextQuest
"We present an algorithm for large-scale
document clustering of biological text, obtained from Medline abstracts. The
algorithm is based on statistical treatment of terms, stemming, the idea of a
'go-list', unsupervised machine learning and graph layout optimization."
Iliopoulos I, Enright AJ and Ouzounis CA. (2001)
Textquest: document clustering of Medline abstracts for concept discovery in molecular
biology.
Pac Symp Biocomput: 384-95.
URL: No online link available
Whatizit
"Whatizit can tell you the meaning of words found in your text,
depending on the kind of information you want to see highlighted."
Kirsch, H and Rebholz-Schuhmann, D (2004)
Distributed Modules for Text Annotation and IE applied to the Biomedical Domain.
[BioNLP]
Coling 2004 Workshop, Geneva.
URL: http://www.ebi.ac.uk/Rebholz-srv/whatizit/form.jsp
XPLORMED
"The XplorMed server allows you to explore a set of abstracts derived from a MEDLINE search. The system gives you the main associations
between the words in groups of abstracts. Then, you can select a subset of your abstracts based
on selected groups of related words and iterate your analysis on them."
Perez-Iratxeta C, Bork P, Andrade MA. (2001)
XplorMed: a tool for exploring MEDLINE abstracts.
URL: http://www.bork.embl-heidelberg.de/xplormed/
Text Mining Companies
A list of companies that claim to do information extraction:
If I have forgotten your company, please send me details.
Other Text Mining Resources
Useful links to other resources and pages with links to resources:
- Alex Morgan's Homepage
- "I have tried to compile a list of many of the freely available resources either used directly and cited by the participants of BioCreAtIvE or that seem potentially useful. I
have put in descriptions taken largely from the respective webpages, but I have also included some reviews based on my own experiences or what I have heard from users other than the
developers of the resource. This is probably rife with errors and comissions, so please contact me to fix anything." - Alex Morgan.
- BioNlp.org
- Bob Futrelle's gatherings of useful data+links+literature. No mission statement.
Numerous labs also deserve to be acknowledged here; that list will be added when I have the time for it.