
ECCB 2008 Workshop: Annotation, Interpretation and Management of Mutations (AIMM)

Agenda of the AIMM workshop (Preliminary)

Morning Session

Keynote:
Catherine L. Worth, G. Richard J. Bickerton, Adrian Schreyer, Julia R. Forman, Tammy M.K. Cheng, Semin Lee, Sungsam Gong, David F. Burke, and Tom L. Blundell:
A structural bioinformatics approach to the analysis of nsSNPs and prediction of disease association.

Abstract:
Understanding the impact that non-synonymous single nucleotide polymorphisms (nsSNPs) have on the structures of gene products, proteins, is important in identifying the origins of complex diseases. Predicting the effects that these mutations have on protein function depends critically on exploiting all information available on the three-dimensional structures of proteins. We have developed software and databases for the analysis of nsSNPs that allow a user to move from SNP to sequence to structure to function. In both structure prediction and in the analysis of the effects of nsSNPs, we exploit information about protein evolution, in particular, that derived from investigation of the relation of sequence to structure gained from the study of amino acid substitutions in divergent evolution. The techniques developed in our laboratory have allowed fast and automated sequence-structure homology recognition to identify templates and to perform comparative modelling, as well as simple, robust and generally applicable algorithms to assess the likely impact of amino acid substitutions on structure and interactions. We describe our strategy for relating SNPs to disease [1] and the results of benchmarking our approach on a set of human proteins of known structure and recognized mutation [2].

[1] Burke DF, Worth CL, Priego EM, Cheng TMK, Smink LJ, Todd JA and Blundell TL (2007) Genome bioinformatic analysis of nonsynonymous SNPs. BMC Bioinformatics 8:301

[2] Worth CL*, Bickerton GRJ*, Schreyer A, Forman JR, Cheng TMK, Lee S, Gong S, Burke DF and Blundell TL (2007) A structural bioinformatics approach to the analysis of non-synonymous single nucleotide polymorphisms and their relation to disease. Journal of Bioinformatics and Computational Biology special issue: Making Sense of Mutations requires Knowledge Management vol.5 no 6. *these authors contributed equally to this work

Presentation

Talks:

  • Joke Reumers, Joost Schymkowitz, and Frederic Rousseau:
    Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations

    Abstract:
    Background: Linking structural effects of mutations to functional outcomes is a major issue in structural bioinformatics, and many tools and studies have shown that specific structural properties such as stability and residue burial can be used to distinguish neutral variations and disease associated mutations. Results: We have investigated 39 structural properties on a set of SNPs and disease mutations from the Uniprot Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the annotation process from mutation to result and identified the liabilities in each step. Conclusions: Although excellent annotation results of various research groups underline the great potential of using structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such annotations cannot always be extrapolated to proteome wide variation studies. Difficulties for large-scale studies can be found both on the technical level, i.e. the scarcity of data and the incompleteness of the structural toolsuites, and on the conceptual level, i.e. the correct interpretation of the results in a cellular context.

    Presentation - Publication

  • Simon Forbes, Gurpreet Tang, Jon Teague, Andrew Futreal, and Mike Stratton:
    COSMIC, curating the cancer variome.

    Abstract:
    Background. COSMIC is a system designed to curate the world's literature on somatic mutations in known cancer genes. Initially conceived to capture the mutation spread in point-mutated genes, COSMIC has now grown to encompass gene fusion products of genome rearrangement events which generate completely novel transcripts, together with all the somatic mutation data from candidate gene screens at the Cancer Genome Project, UK (CGP), covering almost 5000 genes of potential interest in cancer genetics. Results. The latest release of COSMIC (version 37; July 2008) now holds full and up-to-date curation of over 5,900 scientific papers, examining over 268,000 tumours, in which over 59,000 mutations are detailed through 60 point-mutated genes. Fusion gene products have been curated for 16 pairs of genes, described through over 4200 tumours. 2246 papers were rejected during manual curation, usually due to significant inconsistencies in the publication. A relational database holds the captured information, which is warehoused for each release. The information is presented on the internet with a series of graphical and tabulated views aiding navigation and interpretation. Conclusions. The current version of COSMIC is close to fulfilling its original intentions, with curation of most point-mutated genes in cancer complete. However, new challenges are emerging with the need to calculate the effect of high numbers of observed sequence changes to identify those driving tumour formation, and the need to meaningfully handle the increasing quantities of data from high-throughput screens and next-generation sequencing technologies.

    Presentation - Publication

  • Kirsty Lee:
    An analysis of different ontological approaches to describe renal mutant phenotypes

    Abstract:
    The abundance of phenotypic data emerging from mouse mutagenesis screens [1][2] implies a need to describe phenotypes in a way that is amenable to computational comparison. Phenotype comparison is imperative in order to study the underlying genetic mechanisms, and may involve identifying subtle differences between mutant phenotypes. When phenotypic descriptions come in the form of free text, placing lexical and syntactic constraints on them may allow for a more effective comparison. Recently, ontologies have provided these constraints and have increasingly been used in the representation of a variety of biological data [3]. The major alternative ontologies available for mouse phenotype description are the MPO (Mammalian Phenotype Ontology) [4] and PaTO (Phenotype and Trait Ontology) [5]. Ontologies should be able to contribute to the analysis of mutant phenotypes by providing a framework for reasoning. However, any reasoning task will be of limited value if a phenotype ontology cannot represent the majority of phenotypes in publication accurately and in sufficient detail. Therefore, it is important to investigate the accessibility and expressivity of phenotype ontologies, firstly to ensure the scope and consistency of phenotype databases but also as a prerequisite for meaningful automatic reasoning methods. This paper will incorporate the findings of a 6-month case study which explored potential methods of phenotype description for the EuReGene project [8]. During the course of the case study, it was possible to visit the participating laboratories, which gave a unique and pragmatic insight into how phenotype ontologies can match the requirements of the mouse research community.

    Publication


Afternoon Session

Keynote:
Kevin B. Cohen:
Mutations and representations in natural language semantics.

Abstract:
Years of work in linguistics and computational semantics have led to a variety of theories about the nature of semantic representations. They generally share an assumption about the relationship between semantic and syntactic constituents of an utterance. Recent work on the automatic processing of language about mutations has brought to light a phenomenon that raises questions about this basic assumption. This talk will examine the Argument Realization principle in the light of data on how scientists speak about mutations in genes and proteins.

Presentation

Talks:

  • Kevin Nagel, Antonio Jimeno, and Dietrich Rebholz-Schuhmann:
    Annotation of residues based on a literature analysis: cross-validation against UniProtKB

    Abstract:
    Background: A protein annotation database, such as the Universal Protein Resource (UniProtKB), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Previously, results have been reported on point mutation extraction methods from biomedical literature, which can be used to support the time-consuming work of manual database curation. However, these methods were limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level. Results: This work introduces a system that identifies protein residue sites in abstract texts and annotates them with features extracted from the context. The performance of all text mining modules was evaluated against a manually annotated corpus. The identified annotation features can be attributed to at least one of seven targeted categories, e.g. enzymatic reaction. Extracted results were cross-validated against UniProtKB, and for 13 annotations of residues that have not been confirmed in the UniProtKB a manual assessment was performed. Conclusions: This work proposes a solution for the automatic extraction of protein residue annotation from biomedical articles. The presented approach is an extension to other existing systems in that a wider range of residue entities are considered and that features of residues are extracted as annotations.

    Presentation - Publication

  • Rainer Winnenburg, Conrad Plake, and Michael Schroeder:
    Mutation tagging with gene identifiers applied to membrane protein stability prediction

    Abstract:
    The automated retrieval and integration of information about protein point mutations in combination with structure, domain and interaction data from literature and databases promises to be a valuable approach to study structure-function relationships in biomedical data sets. As a prerequisite, we developed a rule- and regular expression-based protein point mutation retrieval pipeline for PubMed abstracts, which shows an F-measure of 87% for the pure mutation retrieval task on a benchmark dataset. In order to link mutations to their proteins, we utilised a named entity recognition algorithm for the identification of gene names co-occurring in the abstract, and established links based on sequence checks. We identified more than 10 million genes/proteins in nearly 3.5 million abstracts and 260,000 mutations in 80,000 of these abstracts (2.3%). In 52% of cases the identified gene's sequence and the mutation are consistent. We evaluated the use of mutations in gene identification in detail on a small test set of 22 abstracts. Identifying the correct gene improved from 77% to 91% when considering the mutations. To demonstrate practical relevance, we set up a mutation screening for five membrane proteins from the family of G protein coupled receptors to evaluate a solvation energy based model for the prediction of stabilizing regions in membrane proteins. We identified 35 mutations in text. 25 out of 35 mutation phenotypes reported in literature were in compliance with the prediction of the energy model, which supports a relation between mutations and stability issues in membrane proteins.

    Presentation - Publication

  • Kar Heng Choo, Wee Tiong Ang, Rajaraman Kanagasabai, and Christopher J. O. Baker:
    Ontology-driven selection of mutation annotations from text for protein structure annotation

    Abstract:
    Protein structure annotations are frequently buried within full-text scientific documents. In particular, mutations and their impacts are described in detail, and a number of published systems facilitate their extraction. The mSTRAP system extracts mutations and annotations from texts and maps them onto homology models of the corresponding protein structure. Indiscriminate mapping of all mutations to multiple models is computationally expensive. User-driven selection of mutations, based on their impact on protein properties before homology modelling, serves to ameliorate this load. Knowlegator is an ontology-centric knowledge navigation tool that facilitates the construction of precise queries over indexed literature sources. We report on the integration of Knowlegator's query construction capacity, over mutation impacts and protein properties found in indexed sentences, with mSTRAP's visualisation of annotated protein structures with corresponding impact annotations. The integrated system will be demonstrated on mutated protein phosphatases.

    Presentation
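
Illustrative sketch (not part of the workshop material): the abstract by Winnenburg, Plake, and Schroeder above describes a regular expression-based retrieval of point mutations and a sequence check that links a mutation to a candidate gene. The Python below shows one way such a step could look. The patterns, the function names find_mutations and is_consistent, and the toy sequences are assumptions made for illustration only; they are not the authors' pipeline, which also uses rules and named entity recognition.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Hypothetical patterns: wild-type residue, position, mutant residue.
# ONE_LETTER matches mentions such as "A123T"; THREE_LETTER matches "Gly5Ala".
AA = "ACDEFGHIKLMNPQRSTVWY"
ONE_LETTER = re.compile(rf"\b([{AA}])(\d+)([{AA}])\b")
THREE_LETTER = re.compile(r"\b([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b")


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    hits = []
    for m in ONE_LETTER.finditer(text):
        hits.append((m.group(1), int(m.group(2)), m.group(3)))
    for m in THREE_LETTER.finditer(text):
        wild_type = THREE_TO_ONE.get(m.group(1))
        mutant = THREE_TO_ONE.get(m.group(3))
        # Skip three-letter tokens that are not amino acid codes.
        if wild_type and mutant:
            hits.append((wild_type, int(m.group(2)), mutant))
    return hits


def is_consistent(mutation, sequence):
    """Sequence check: does the wild-type residue occur at the 1-based position?"""
    wild_type, position, _ = mutation
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


if __name__ == "__main__":
    sentence = "The A3T and Gly5Ala substitutions reduced stability."
    candidate = "MKAVGL"  # toy sequence, not a real protein
    for mutation in find_mutations(sentence):
        print(mutation, is_consistent(mutation, candidate))
```

A mutation whose wild-type residue does not match a candidate sequence at the stated position argues against that gene, which is the intuition behind using mutations to improve gene identification. Real text would need further handling that this sketch omits, such as numbering offsets from signal peptides and one-letter patterns that collide with non-mutation tokens.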
