Clustering through the integration of similarity matrices from heterogeneous data sets is a recent and interesting topic in machine learning research. In bioinformatics, clustering by integrating multiple gene or protein profiles is a promising application that improves clustering by exploiting similar or complementary information extracted from multiple data sets. However, the information sources are not always equally informative or useful for a given problem. In some cases, a parametric data integration that optimally weighs the different data sources might be more sensitive (or more robust to noise) than a uniformly weighted integration method. In this poster, we present an adaptive clustering method that integrates multiple similarity matrices. The method is based on the convex integration of kernel matrices, in the spirit of the kernel-based data fusion framework for machine learning (Lanckriet et al., 2004a; De Bie et al., 2007), which, however, did not deal with clustering. We demonstrate the proposed algorithm on several toy data sets. Furthermore, the proposed algorithm is validated on a real-life application: clustering by integrating text mining data and other heterogeneous data sources, with validation by pathway information. According to our experimental results, the proposed method is able to distinguish "informative" from "uninformative" data sources and to adapt the weights of these sources in the integration accordingly. The proposed method shows advantages in three aspects. First, it is shown empirically to outperform the uniformly weighted method on clustering problems of toy examples and on a real application of gene clustering validated by pathway information. Second, it appears to be more sensitive to "informative" data sources and more robust to "noise" data sources in the context of clustering by data integration.
Third, where appropriate, the proposed method can yield a significant enhancement in clustering performance by combining gene profiles from heterogeneous data sources.
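The convex integration of kernel matrices mentioned above can be illustrated with a minimal sketch. The abstract does not describe how the weights are optimized, so the weights below are simply chosen by hand; the function name `combine_kernels` and the toy matrices are hypothetical and serve only to contrast a uniform weighting with a weighting that favours the more informative source.

```python
def combine_kernels(kernels, weights):
    """Return the convex combination sum_j weights[j] * kernels[j].

    kernels: list of square similarity (kernel) matrices as nested lists.
    weights: non-negative numbers that sum to 1 (a convex combination).
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: one source with clear structure (items 0 and 1
# are similar, item 2 is not) and one source with no structure at all.
K_informative = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
K_noise = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]

# Uniform weighting versus a hand-picked weighting that favours the
# informative source (a stand-in for weights found by optimization).
uniform = combine_kernels([K_informative, K_noise], [0.5, 0.5])
adaptive = combine_kernels([K_informative, K_noise], [0.9, 0.1])
```

In this toy case the hand-picked weighting keeps the contrast between the similar pair (0, 1) and the dissimilar pair (0, 2) sharper than the uniform weighting does; any standard kernel-based clustering algorithm could then be run on the combined matrix.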

Background: The functional annotation of genes is still a major challenge in the post-genomic era. Traditional manual annotations by literature curation are reliable and of high quality. However, as both the volume of literature and the number of genes requiring characterization increase, manual processing capabilities are becoming overloaded. To efficiently annotate genes with controlled vocabularies such as Gene Ontology (GO), computational methods to automate the process of functional annotation are required. We are working towards a novel annotation system for ONDEX that includes data integration and literature analysis methods to predict the function of previously unannotated genes. Our aim is to provide comprehensive integrated networks in which genes are enriched with manual (if available) and automatic annotations. Here, we present an integrated text mining approach that can extract relevant information for biological research from referenced literature. Results: We have developed the first steps towards a text mining framework for the ONDEX data integration system. In the current version, the text mining package exploits advanced information retrieval and named entity recognition (NER) steps. It enables the automated integration (mapping) of natural text (publications) with relevant biomedical resources such as GO, EC, Gene, Disease, and Species. To indicate the quality of the mappings, a score and evidence sentences are assigned to each relationship created by the mapping process. For the assessment of automatic GO annotation methods, we have established a hierarchical evaluation measure to compare automatic with manual annotations. The statistical evaluation of the generated text-mining-based GO mappings against manual annotations from GOA results in a precision and recall of 59% and 52%, respectively. Conclusion: Our work shows that integrated gene-annotation networks can provide substantial support for semi-automated genome annotation projects. 
Using text mining together with automatic annotation methods opens up the wealth of indispensable knowledge in the scientific literature. The methods and algorithms presented in this thesis are an integral part of the ONDEX system, which is freely available for download at http://ondex.sourceforge.net/.

CRAB – Cancer Risk Assessment and Biomedical Text Mining
The amount of scientific evidence showing a strong link between environmental chemicals and cancer calls for urgent efforts to issue exposure limits on the use of harmful chemicals. The critical tool used by authorities in making decisions on exposure limits is Risk Assessment (RA). Cancer RA involves examining existing published evidence to determine the relationship between exposure to a substance and the likelihood of developing cancer from that exposure. Performed by teams of highly qualified experts in health-related institutions worldwide (e.g. IARC, WHO), RA is a costly and challenging task which requires combining scientific expertise with elaborate literature search and review. It involves manually searching, locating and interpreting the relevant information in repositories of peer-reviewed scientific journal articles, a process which can be extremely time-consuming because the data required for RA of just a single carcinogen may be scattered across thousands of journal articles. Given the exponentially growing volume of articles under inspection, the rapid development of molecular biology techniques, the increasing knowledge of mechanisms involved in cancer development, and the accelerating need for chemical assessment, RA is gradually becoming too challenging to manage by manual means. We are investigating a more effective approach to RA based on text mining (TM). To our knowledge, no TM technology has yet been developed for the needs of cancer RA. Such technology could greatly assist risk assessors with the management of large textual data, increase their productivity, aid knowledge discovery, and lead to more consistent and standardized RA. From the perspective of TM, cancer RA is an excellent example of an important real-world task which provides a suitably complex test bed for tackling the most timely problems in the field. 
The task involves (1) identifying the optimal set of journal articles relevant for RA of the chemical in question and (2) studying the relevant experimental results in these articles to determine (i) whether and (ii) exactly how the chemical causes cancer. Each step of RA requires examining specific types of scientific evidence in journal articles. Identifying the evidence is not straightforward, and no detailed classification of the range of evidence required for comprehensive RA is publicly available which would enable a fully systematic and automatic approach. In this first paper on the topic, we describe the work we did on identifying and organizing the key types of scientific evidence into a taxonomy. Data: The main types of evidence used for cancer RA are 1) scientific tests related to the carcinogenic activity: human studies, animal studies (in vivo) and cell studies (in vitro), and 2) the mode of action (MOA) of the carcinogen. The two most frequent types of MOA are genotoxic (i.e. chemicals affect the cell's genetic material and thereby cause mutations) and nongenotoxic (i.e. chemicals induce tumours, e.g. by increasing cell proliferation). To obtain a more comprehensive and finer-grained classification of the required evidence, we composed a representative corpus of RA data for further analysis. Four test chemicals were first selected, and 15 journals that are used frequently for cancer RA were then identified (e.g. Toxicological Sciences, Mutation Research). From these journals (years 1998 to 2008), all the PubMed abstracts that include the four test chemicals were downloaded for further analysis. Annotation Tool: An annotation tool was then designed for the analysis of the abstracts (including their titles) by experts in cancer RA. The tool enables the experts to annotate those keywords (words and phrases) in the abstracts and titles which indicate scientific evidence relevant for examining the carcinogenic properties of chemicals. 
It also enables the experts to classify abstracts using the classical Information Retrieval concept of Document Relevance. An abstract is marked as relevant or irrelevant. Annotation: The annotation was carried out by three experts in cancer RA. All abstracts for the initial two chemicals were classified by one of the experts according to the initial shallow taxonomy and document relevance. The results were reviewed by another expert. The review resulted in updates to the classification and a considerable extension of the taxonomy. The Resulting Taxonomy and Corpus: The taxonomy created by manual annotation includes three classes at the top level: scientific tests, MOA, and toxicokinetics. The complete taxonomy contains as many as 45 nodes, with individual keywords falling under different nodes. Just over 62% of the abstracts returned by the PubMed queries were deemed relevant for the cancer RA task by the expert reviewers (based on the title or the abstract), 10% were deemed irrelevant, and 28% were marked as unsure. Automatic classification: In order to determine whether the taxonomy is machine learnable and thus suitable for TM purposes, we trained and tested a series of Multinomial Naive Bayes classifiers on the abstracts using document-level classifications. All the F-scores were above 75%, which is promising. We also tested whether a similar Multinomial Naive Bayes classifier could prove a reliable predictor of document relevance. The document relevance classifier performed very well, with a precision of 95.8, a recall of 89.5 and an F-score of 92.6. Future work: Our future work will include embedding the automatic classifiers into the RA workflow and evaluating the impact of the classifiers on the workflow and on overall task efficiency. We will also be widening the scope of our data collection beyond the four chemicals considered so far and fine-tuning the classifiers to raise their performance to the best achievable. 
In the more distant future, we intend to expand on this initial work and tackle the later stages of cancer RA.
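As an illustration of the kind of document relevance classifier described above, the following is a minimal sketch of a Multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the classifier used in the study: the tokenised toy "abstracts", the labels, the smoothing parameter `alpha` and the function names are all hypothetical and chosen only to make the technique concrete.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class.

    docs: list of token lists; labels: one class label per document.
    alpha: additive smoothing constant (an assumption, not from the study).
    """
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    model = {}
    for label, n_docs in class_counts.items():
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_lik": {
                tok: math.log((token_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model


def predict(model, doc):
    """Return the class with the highest posterior log score for doc."""
    def score(label):
        params = model[label]
        # Tokens never seen in training are ignored.
        return params["log_prior"] + sum(
            params["log_lik"][tok] for tok in doc if tok in params["log_lik"]
        )
    return max(model, key=score)


# Hypothetical toy corpus standing in for annotated PubMed abstracts.
train_docs = [
    ["tumour", "incidence", "rats"],
    ["mutation", "dna", "damage"],
    ["solvent", "boiling", "point"],
    ["synthesis", "yield", "catalyst"],
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = train_multinomial_nb(train_docs, train_labels)
label_a = predict(model, ["dna", "mutation", "rats"])
label_b = predict(model, ["catalyst", "yield"])
```

The same structure would apply to the taxonomy classes, with one classifier trained per node on document-level labels.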

Comparative analysis of expression microarray studies is difficult due to the large influence of technical factors on experimental outcome. Still, the identified differentially expressed genes may hint at the same biological processes. However, manually curated assignment of genes to biological processes, such as that pursued by the Gene Ontology (GO) consortium, is incomplete and limited. We hypothesised that automatic association of genes with biological processes through thesaurus-controlled mining of Medline abstracts would be more effective. Therefore, we developed a novel algorithm (LASSO: literature-based association analysis) to quantify the similarity between transcriptomics studies. We evaluated our algorithm on a large compendium of 102 microarray studies published in the field of muscle development and disease, and compared it to similarity measures based on gene overlap and over-representation of biological processes assigned by GO. While the overlap in both genes and over-represented GO terms was poor, LASSO retrieved many more biologically meaningful links between studies, with substantially lower influence of technical factors. LASSO correctly grouped muscular dystrophy, regeneration and myositis studies, and linked patient studies with the corresponding mouse model studies. LASSO also retrieves the connecting biological concepts. Among other new discoveries, we associated cullin proteins, a class of ubiquitinylation proteins, with genes down-regulated during muscle regeneration, whereas ubiquitinylation was previously reported to be activated during the inverse process, muscle atrophy. Our literature-based association analysis is capable of finding hidden common biological denominators in microarray studies, and circumvents the need for raw data analysis or curated gene annotation databases.

Introduction: We hypothesize that protein concept profiles can be used to predict novel protein interactions. The concept profile of a concept, e.g. a protein, is a list of concepts in which a weight factor for each concept reflects the strength of its association with the protein as found in Medline abstracts. The strength of the associations is calculated from co-occurrence information between the protein and other concepts at the abstract level. The information in Medline abstracts is extracted by software called Peregrine, which finds concepts stored in an ontology, taking into account spelling variation, synonyms, and homonyms. Concept profiles can be matched with each other, resulting in a matching score, which makes it possible to relate two concepts that have no direct co-occurrence in the literature. We assume that if two proteins share a physical interaction, then the matching score will in general be higher than the matching score of two proteins that are not related to each other. We try to find those protein pairs whose matching score passes a certain prediction threshold, indicating that the two proteins are strongly related to each other even though they have never been co-mentioned in an abstract or stored in a database as interacting. Results: To evaluate our hypothesis we generated two distributions of matching scores, one for a set of interacting and one for a set of non-interacting protein pairs. The former set consisted of protein pairs that were indicated to interact in at least 4 different protein interaction databases out of the 8 databases used in our analysis. The latter consisted of randomly selected pairs of proteins not reported to interact in any of the 8 protein databases. These databases are BioGrid, DIP, HPRD, IntAct, Mint, Reactome, and Uniprot (consisting of Swiss-Prot and TrEMBL). We set a prediction threshold on the matching score between concept profiles to discriminate between non-interacting and interacting proteins. 
ROC curve analysis demonstrated that the discriminative power was large, with the area under the curve (AUC) exceeding 0.9. Of the protein pairs not known to interact, 0.75% passed the prediction threshold and reflect potential newly discovered protein-protein interactions. Two protein concepts share a direct co-occurrence if there is at least one abstract in which the two proteins are co-mentioned. For every protein database, the sensitivity exceeded 97% for the group of protein pairs that share a direct co-occurrence. To evaluate the predictive power of the approach, we went two years back in time and determined concept profiles and a prediction threshold at that time point, using only the Swiss-Prot database and Medline records as they were available at that time. Of the pairs that are stored in Swiss-Prot 2007 (release 54.0) but not in Swiss-Prot 2005 (release 46.0), 45% were predicted to interact. The ROC curve for those pairs showed an AUC of 0.77. Of the protein pairs that are also stored in other protein databases besides Swiss-Prot (release 46.0), 60% were predicted to interact. Here the ROC curve showed an AUC of 0.88, indicating that those pairs are among the highest matching scores of all predicted protein pairs. Materials and Methods: To generate concept profiles for each protein in the database we need a Medline corpus. We collected abstracts from 1980 to February 2005 and from 1980 to July 2007 for Swiss-Prot releases 46.0 and 54.0, respectively. Peregrine tags the concepts in each abstract as found in the ontology, resulting in a document profile. The document profiles belonging to a protein are merged, resulting in a concept profile for that protein. In a mathematical sense, a concept profile is nothing more than a vector. Concept profiles can now be matched with each other using a similarity measure such as the inner product. The matching score then reflects the strength of association between the two concepts based on the literature. 
Conclusions: We conclude that concept profiles have predictive power to find protein-protein interactions that have not yet been recorded in protein-protein interaction databases. Of the protein interactions that we predicted in 2005, the protein pairs known to interact in 2007 tended to have the highest matching scores. From a practical point of view, this means that protein pairs at the top of the ranking will have a high probability of being confirmed to interact when they are analyzed in a lab experiment. Therefore, the concept profile methodology is useful for biologists to select, for follow-up experiments, those protein pairs that are most likely to interact with each other. In future analyses, the predictive power for other paired entities, such as drug-disease relationships, will be evaluated.
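The matching of concept profiles described above can be sketched as follows, treating each profile as a sparse vector and using the inner product as the matching score. The concept names, weights, protein identifiers and threshold value are hypothetical; the abstract does not state how the weights are computed or which threshold was used.

```python
from itertools import combinations


def matching_score(profile_a, profile_b):
    """Inner product of two sparse concept profiles (concept -> weight)."""
    # Iterate over the smaller profile; only shared concepts contribute.
    if len(profile_a) > len(profile_b):
        profile_a, profile_b = profile_b, profile_a
    return sum(
        weight * profile_b[concept]
        for concept, weight in profile_a.items()
        if concept in profile_b
    )


def predict_interactions(profiles, threshold):
    """Return protein pairs whose matching score exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if matching_score(profiles[a], profiles[b]) > threshold
    ]


# Hypothetical concept profiles; real profiles would be derived from
# concepts tagged in Medline abstracts.
profiles = {
    "protein_A": {"dna repair": 0.8, "cell cycle": 0.3},
    "protein_B": {"dna repair": 0.7, "apoptosis": 0.2},
    "protein_C": {"lipid metabolism": 0.9},
}

THRESHOLD = 0.5  # hypothetical value, for illustration only
predicted_pairs = predict_interactions(profiles, THRESHOLD)
```

Here protein_A and protein_B share the concept "dna repair" and therefore obtain a non-zero score even though nothing in the sketch requires them to be co-mentioned in the same abstract, which is the property the approach relies on.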