MASAMB :: 16-17 April 2015

Mathematical and Statistical
Aspects of Molecular Biology

25th Annual MASAMB Workshop @ Helsinki, Finland


Thursday 16 April 2015

9:00-10:00 Registration, coffee and poster setup
10:00-10:10 Opening
10:10-12:00 Networks and association studies

Empirical assessment of causal network inference methods: The HPN-DREAM network inference challenge
Steven M. Hill, Laura M. Heiser, Thomas Cokelaer, Michael Unger, Heinz Koeppl, Gustavo Stolovitzky, Julio Saez-Rodriguez and Sach Mukherjee

High-dimensional statistical approaches for differential network testing and network-based clustering in heterogeneous molecular data
Nicolas Staedler, Frank Dondelinger and Sach Mukherjee

Testing the hygiene hypothesis: microbiome development of infants with different lifestyles
Tommi Vatanen, Aleksandar Kostic, Dirk Gevers, Moran Yassour, Raivo Kolde, Heli Siljander, Curtis Huttenhower, Harri Lähdesmäki, Mikael Knip and Ramnik Xavier

Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data
Liang He, Janne Pitkäniemi, Antti-Pekka Sarin, Veikko Salomaa, Mikko J. Sillanpää and Samuli Ripatti

12:00-13:30 Lunch break
13:30-14:45 Systems biology

Recovering a reaction network after linear elimination of species
Meritxell Saez

Obtaining Persistence from Simplified Models
Michael Marcondes de Freitas

Parameter-free methods for systems biology distinguish Wnt pathway models and guide design of experiments
Adam L MacLean, Zvi Rosen, Helen M Byrne and Heather A Harrington

14:45-15:15 Coffee
15:15-16:30 Systems and synthetic biology

Efficient experimental design for systems biology dynamical models
Karol Nienałtowski and Michał Komorowski

Reachable state set computation of stochastic biological systems with controlled input and uncertainty
Eszter Lakatos and Michael P H Stumpf

Bayesian Optimization for Synthetic Gene Design
Javier Gonzalez, Joseph Longworth, David James and Neil D. Lawrence

16:30-16:45 Vote for MASAMB 2016
16:45-18:30 Posters
19:00- Dinner

Friday 17 April 2015

9:00-10:15 Phylogenies

Phylogenetic trees without a single sequence alignment
Marcin Bogusz and Simon Whelan

Inferring rooted phylogenies via non-reversible substitution models
Svetlana Cherlin

The Impact of Ancestral Population Size and Incomplete Lineage Sorting on Bayesian Estimation of Species Divergence Times
Konstantinos Angelis and Mario Dos Reis

10:15-10:45 Coffee
10:45-12:00 Phylogenies and more

Computing the Internode Certainty using partial gene trees.
Kassian Kobert, Leonidas Salichos, Antonis Rokas and Alexandros Stamatakis

Coalescent Theory in Tissue Growth Modelling with Lineage Tracing Applications
Patrick Smadbeck and Michael Stumpf

Alignment by numbers: sequence assembly using compressed numerical representations
Avraam Tapinos, Bede Constantinides, Douglas Kell and David Robertson

12:00-13:30 Lunch break
13:30-14:45 Transcriptomics

Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses
Matthew Ritchie, Ruijie Liu and Gordon Smyth

Clustering Gene Expression Time Series of Mouse Model for Speed Progression of ALS
Sura Zaki Alrashid, Muhammad Arifur Rahman, Nabeel H. Al-Aaraji, Paul R. Heath and Neil D. Lawrence

Shrinkage estimators in data analysis for comparative high-throughput sequencing experiments
Simon Anders, Michael I Love and Wolfgang Huber

14:45-14:50 Closing

Talk abstracts

Empirical assessment of causal network inference methods: The HPN-DREAM network inference challenge

Steven M. Hill, Laura M. Heiser, Thomas Cokelaer, Michael Unger, Heinz Koeppl, Gustavo Stolovitzky, Julio Saez-Rodriguez and Sach Mukherjee
Abstract: Data-driven characterisation of molecular network topology continues to be an active area of research. It is often of interest to infer networks that are specific to a biological context. For example, disease-specific networks could help to better understand disease mechanisms and response to therapy. In addition, since molecular networks describe causal relationships between components, it is desirable to assign a causal interpretation to inferred network edges. However, causal network inference is a challenging problem. Many methods focus on correlation rather than causation, while approaches that do have a causal emphasis (e.g. causal Bayesian networks) require strong assumptions that may not hold in molecular biology settings. It is therefore important to empirically assess the ability of network inference methods to recover causal relationships. Methods are often assessed using simulated data, where a “gold-standard” causal network structure is available. Yet, among other limitations of this approach, performance does not necessarily generalise to biological settings of interest. For real-world biological problems, lack of a “gold-standard” makes assessment of performance challenging.

We developed the HPN-DREAM network inference challenge, which focussed on assessing the ability of methods to infer context-specific, causal molecular networks. In particular, participants were tasked with inferring protein signalling networks in individual breast cancer cell lines, using time-course proteomics data with interventions on network nodes. Submitted networks were assessed using a metric that quantified the extent to which causal relationships encoded in inferred networks agreed with held-out interventional data. In total more than 2000 networks submitted by more than 70 teams were assessed in this way, allowing, for the first time, empirical assessment of causal network learning in a complex, mammalian setting. In addition to describing the assessment approach, I will also present key results and insights from post-challenge analyses.

High-dimensional statistical approaches for differential network testing and network-based clustering in heterogeneous molecular data

Nicolas Staedler, Frank Dondelinger and Sach Mukherjee
Abstract: Molecular interplay plays a central role in basic and disease biology. Patterns of interplay are thought to differ between biological contexts, such as cell type, tissue type, or disease state. Many high-throughput studies now span multiple such contexts and the data may therefore be heterogeneous with respect to patterns of interplay. This motivates a need for statistical approaches that can cope with molecular data that are heterogeneous in a multivariate sense.

In this work, we exploit recent advances in high-dimensional statistics (Staedler and Mukherjee, 2013a,b) to put forward tools for analysing heterogeneous molecular data. We model the data using Gaussian graphical models (Rue and Held, 2005), and develop two useful techniques based on estimation of partial correlations using the graphical lasso (Friedman et al., 2008): a two-sample test that captures differences in molecular interplay or networks, and a mixture model clustering approach that simultaneously learns cluster assignments and multi-variate network models that are cluster-specific.

We demonstrate the characteristics of our methods using an in-depth simulation study, and proceed to apply them to proteomic data from The Cancer Genome Atlas (TCGA) “pan-cancer” study (Akbani et al., 2014). Our analysis of the TCGA data provides formal statistical evidence that protein networks differ significantly by cancer type (see attached Fig. 1). Furthermore, we show how multivariate models can be used to refine cancer subtypes and learn associated networks (attached Fig. 2). Our results demonstrate the challenges involved in truly multivariate analysis of heterogeneous molecular data and the substantive gains that high-dimensional methods can offer in this setting.

Testing the hygiene hypothesis: microbiome development of infants with different lifestyles

Tommi Vatanen, Aleksandar Kostic, Dirk Gevers, Moran Yassour, Raivo Kolde, Heli Siljander, Curtis Huttenhower, Harri Lähdesmäki, Mikael Knip and Ramnik Xavier
Abstract: The frequency of immune-mediated diseases is increasing in developed countries. According to the hygiene hypothesis, a modern lifestyle might be linked to alterations of gut microbial colonization, mediating the susceptibility to these diseases.

The DIABIMMUNE cohort was motivated by the marked difference in T1D incidence between Finland and the geographically adjacent Estonia and Russian Karelia. We have studied two unique subcohorts in DIABIMMUNE, a T1D case-control cohort with 33 infants and a life-style cohort with 222 infants (74 per country) from all the three countries. All infants are genetically predisposed to autoimmune diseases and they were followed by monthly stool sampling from birth until the age of three years. We have analysed both these data with a collection of tools including QIIME [1] for 16S data, MetaPhlAn 2.0 [2] for more precise detection of microbial species using metagenomics data, the second generation of the HUMAnN [3] method for microbial gene and pathway profiling and ConStrains [4] for strain level haplotyping, and an array of statistical techniques such as mixed-effects models. canonical correlation analysis and t-SNE. This data gives an unprecedented view on the developing infant gut microbiome.

In the T1D cohort, we observed a marked drop in alpha-diversity in T1D progressors in the time window between seroconversion and T1D diagnosis together with spikes in inflammation-favoring organisms, gene functions, and serum and stool metabolites. We also studied strain composition of microbiota and found clear trend of stable within-individual and variable between-individual strain profiles. Finally, we have integrated metabolomic and microbiota using canonical correlation analysis revealing potential secondary metabolites.

In the life-style cohort, we have observed several differences between the three countries. A significantly decreased microbial diversity marked the first year of Russian life, which is characterized by an increase in Bifidobacteria, a genus associated with breast feeding. Yet, after one year of age Russians have a more diverse microbiota compared to their Finnish peers. In addition, Russian infants have an overall more flexible (i.e. less rigid) microbiome throughout the whole prospective period with one exception, namely a stable presence of Actinobacteria throughout the first year. The presence of Bifidobacteria, Actinobacteria, and a more diverse microbiota has been associated with lower incidence of immune-mediated diseases and may represent a link between the hygiene hypothesis and the gut microbiome.

[1] J. Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D. Bushman, Elizabeth K. Costello, Noah Fierer, Antonio G. Pena, Julia K. Goodrich, Jeffrey I. Gordon, Gavin A. Huttley, Scott T. Kelley, Dan Knights, Jeremy E. Koenig, Ruth E. Ley, Catherine A. Lozupone, Daniel McDonald, Brian D. Muegge, Meg Pirrung, Jens Reeder, Joel R. Sevinsky, Peter J. Turnbaugh, William A. Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld, and Rob Knight. QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5):335–336, 2010.

[2] Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, and Curtis Huttenhower. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature methods, 9(8):811–814, 2012.

[3] Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M. Schubert, Jacques Izard, Brandi L. Cantarel, Beltran Rodriguez-Mueller, Jeremy Zucker, Mathangi Thiagarajan, Bernard Henrissat, Owen White, Scott T. Kelley, Barbara A. Methé, Patrick D. Schloss, Dirk Gevers, Makedonka Dautova Mitreva, Curtis Huttenhower. Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. PLoS Computational Biology 8(6), 2012

[4] Chengwei Luo, Rob Knight, Heli Siljander, Mikael Knip, Ramnik J. Xavier, and Dirk Gevers. Strain profiling and genotyping of microbial communities from metagenomic sequence data. Submitted for review.

Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data

Liang He, Janne Pitkäniemi, Antti-Pekka Sarin, Veikko Salomaa, Mikko J. Sillanpää and Samuli Ripatti
Abstract: Next generation sequencing (NGS) has led to the study of rare genetic variants, which possibly to explain the missing heritability for complex diseases. Most existing methods for rare variant (RV) association detection do not account for the common presence of sequencing errors in NGS data. The errors can largely affect the power and perturb the accuracy of association tests due to rare observations of minor alleles. We developed a hierarchical Bayesian approach to estimate the association between RVs and complex diseases. Our integrated framework combines the misclassification probability with shrinkage-based Bayesian variable selection. This allows for flexibility in handling neutral and protective RVs with measurement error, and is robust enough for detecting causal RVs with a wide spectrum of minor allele frequency (MAF). Imputation uncertainty and MAF are incorporated into the integrated framework to achieve the optimal statistical power. We demonstrate that sequencing error does significantly affect the findings, and our proposed model can take advantage of it to improve statistical power in both simulated and real data. We further show that our model outperforms existing methods, such as SKAT. Finally, we illustrate the behavior of the proposed method using a Finnish low density lipoprotein cholesterol study, and show that it identifies an RV, known as FH North Karelia in LDLR gene with 3 carriers in 1,155 individuals, which is missed by both SKAT and Granvil.

Recovering a reaction network after linear elimination of species

Meritxell Saez
Abstract: The formalism of chemical reaction network theory (CRNT) puts chemical reaction networks into a mathematical, particularly algebraic, framework. In this framework, the steady states of a chemical reaction network with mass-action kinetics are solutions to a system of polynomial equations. Even for small systems, finding the steady states of the system is a very demanding task and therefore methods that reduce the number of variables are desirable. Feliu and Wiuf in 2012 give one such method, in which so-called non-interacting species are eliminated at steady state.

Elimination of certain species in the network is closely linked to the procedure known as the quasi-steady state approximation, which is often employed to simplify the modeling equations. Under the quasi-steady state approximation, some reactions are assumed to occur at a much faster rate than other reactions, that is, there is a separation of time scales, such that a steady state effectively has been reached for the fast reactions. In this setting, the goal is to study the evolution of the species that have not reached steady state (slow variables), as a new reaction network on their own.

Following the ideas introduced in the paper by Feliu and Wiuf (2012), we give a graphical method to find the reaction network on the slow variables as well as their production rates. Our method reinterprets the system of equations obtained after substitution of the eliminated variables (fast variables), as a new reaction network with mass-action kinetics, Michaelis-Menten-like kinetics or similar. The procedure is based on the analysis of the species graph and the subgraph corresponding to the eliminated variables and, importantly, allows for computer implementation. Our procedure generalises the work done by Feliu and Wiuf (2013) where the authors consider elimination of intermediate species, like enzyme-substrate complexes, of the original system.

Obtaining Persistence from Simplified Models

Michael Marcondes de Freitas
Abstract: For many decades, reaction network theory has provided a framework to understand biochemical circuits as systems of ordinary differential equations describing how the concentrations of the involved chemical species evolve over time. Of great interest is the long-term behavior of such systems, in particular, whether or not they exhibit persistence---the property that, as long as there was a positive amount of each chemical species present in the beginning, they cannot go extinct.

It can be very difficult to determine if a system is persistent or not in general. A recent contribution was given by Angeli, De Leenheer and Sontag (2007), who provided checkable graph conditions which are sufficient for persistence in reaction networks.

In our work, we take a model simplification approach to further advance their method. After generalizing their results, we describe a process through which one can extend or simplify a reaction network by adding or removing intermediate complexes and/or catalysts, showing that these operations do not break sufficient conditions for persistence. In fact, for some important classes of reaction networks, such as the class of post-translational modification networks introduced by Thomson and Gunawardena (2009), our conditions are shown to be not only sufficient for, but equivalent to persistence.

To illustrate the scope and reach of our method, consider a two-site phosphorylation process, which can be modeled by the reaction network

E + S_0 <---> ES_0 ---> E + S_1 <---> ES_1 ---> E + S_2
F + S_2 <---> FS_2 ---> F + S_1 <---> FS_1 ---> F + S_0

where S_0, S_1, S_2 represent various forms of a substrate S_0 carrying none, one or two phosphate groups, E and F are enzymes, and the other complexes represent intermediate steps. We show that persistence for the model above is a consequence of persistence for its much simpler underlying substrate model

S_0 <---> S_1 <---> S_2

For substrate models such as this one, the aforementioned necessary and sufficient conditions for persistence are much easier to check, thus giving our model simplification approach a computational advantage.

Parameter-free methods for systems biology distinguish Wnt pathway models and guide design of experiments

Adam L MacLean, Zvi Rosen, Helen M Byrne and Heather A Harrington
Abstract: The canonical Wnt signalling pathway, mediated by beta-catenin, is crucially involved in essential cellular processes including development, adult stem cell tissue maintenance, and a host of diseases including cancer. We analyse existing mathematical models of Wnt and compare them to a new Wnt signalling model that targets spatial localisation; our aim is to distinguish between the models and distill biological insight from them.

We use Bayesian methods to infer parameters for each model from mammalian Wnt signalling data and we find that all models can fit this time course, highlighting a data--model disparity. To circumvent the lack of data, we appeal to algebraic methods (concepts from chemical reaction network theory and matroid theory) to analyse the models without recourse to specific parameter values. These approaches provide insight into aspects of Wnt regulation. We find that the new model, via control of shuttling and degradation parameters, permits multiple stable steady states corresponding to stem-like vs committed cell states in the differentiation hierarchy. Our analysis also identifies groups of variables that should be measured to fully characterise and discriminate between competing models, and thus provides guiding principles for performing minimal experiments for model comparison.

Efficient experimental design for systems biology dynamical models

Karol Nienałtowski and Michał Komorowski
Abstract: Dynamical models in quantitive biology are characterised by much more complex structures and substantially larger sets of parameters than models used in physics and engineering. Moreover available experiments usually provide limited set of data, usually corresponding only to fragments of studied systems. In consequence the reliability of used models is limited and our knowledge of studied processes misrepresented.

Potential remedy is provided by experimental design techniques. Such tools aim at selection of optimal experimental setup to maximise missing information about model parameters or model structure. As models in quantitative biology differ qualitatively from conventional models, experimental design tools need to be adapted to their specificity.

Here we present a method specifically tailored for multi-parameter models of quantitative biology. Our framework enables to determine which parameters of a given model can be identified in a given experiment and predict which experiment should be performed next to maximise the number of identifiable parameters. Our tool is different from methods developed so far as it is focused in verifying identifiability of individual parameters in large dynamical models, which contain even hundreds of parameters.

We present applicability of our tool analysing models of NF-kB and MAPK signaling. Surprisingly, we show that certain parameters are virtually impossible to estimate in intuitively designed experiments. Our method helps to guide experimental design in order to render such parameters identifiable.

Reachable state set computation of stochastic biological systems with controlled input and uncertainty

Eszter Lakatos and Michael P H Stumpf
Abstract: In this work we present a method to approximate the reachable states of arbitrary stochastic biological systems under varying levels of uncertainty. Our focus is on the inevitable levels of noise in a certain biological design, e.g. variability coupled to the production of a biological molecule. We derive a system of ODEs for the stochastic system using the Linear Noise Approximation and conduct reachability analysis using a zonotope-based representation of reachable state sets. We take into account properties of uncertain or controlled input, as well as possible uncertainty about parameter values and initial states. The method is demonstrated on the controlled stochastic gene expression system and Michaelis-Menten enyzme kinetics.

Bayesian Optimization for Synthetic Gene Design

Javier Gonzalez, Joseph Longworth, David James and Neil D. Lawrence
Abstract: Synthetic biology concerns with the design and construction of new biological elements of living systems and the redesign of the existing ones for useful purposes. In this context, there exists a current excitement in the pharmaceutical industry in the development of new scalable cell engineering methods to produce the key components of the next-generation drugs. A modern approach to this problem is the use of synthetic genes, which once `inserted' in the cells can modify their natural behavior and activate the production of useful proteins.

The gene design problem is addressed in this work by means of Bayesian optimization, which allows to explore the gene design space with a minimal experimental effort. A Gaussian process model, that uses as input a set of biologically relevant gene sequence features, is used as a emulator of the translational cell efficiency. An acquisition function is optimized to deal with the exploration-exploitation trade-off across the different gene design rules. New recombinant genes are obtained by means of an evaluation function that allows to rank sets of candidate gene sequences, selecting the gene that is the most coherent with the optimal design strategy.

In a simulation study, the efficiency of a set of low expressed genes was improved in mammalian cells based on previously published data. Although this work focuses on the optimization of the translational efficiency of the cells, this framework can be generalized to multiple synthetic biology design problems, such as the optimization of media design, or the optimization of multiple gene knock-outs.

Phylogenetic trees without a single sequence alignment

Marcin Bogusz and Simon Whelan
Abstract: Modern statistical methods for estimating trees from molecular data rely on a two-step process of multiple sequence alignment (MSA) followed by tree inference. There is substantial evidence that MSA quality affects the accuracy and reliability of phylogenetic inference, since flaws in the alignment tend to cascade downstream. MSA reconstruction uses measures of sequence similarity to obtain a point estimate of sitewise homology in sequences, which in molecular evolution is interpreted as the character has evolved from a common ancestor. Popular MSA programs use different heuristics that are prone to different types of error, leading to the downstream tree estimate being dependent on the nature of the heuristics used. This awkward interaction between MSA and tree inference has lead to several alternative solutions to the two-step process being proposed. Alignment-free methods offer a fast alternative by inferring pairwise distances between sequences from the relative occurrence of k-mers and using clustering methods to infer trees. Full statistical alignment can jointly estimate both the phylogeny and MSA from the sequences, but is computationally intensive.
Here we propose an efficient approach that uses the principles of statistical alignment to estimate pairwise distances from unaligned sequences, and then uses bioNJ to infer a tree. To estimate pairwise distances we apply pair-hidden Markov models (pair-HMMs), which describe insertion and deletion as well as substitution. Pair-HMMs allow us to account for alignment uncertainty by integrating over all possible homologies and have the potential to avoid many of the biases associated with MSA. We find that integrating across alignments in a pair-HMM can provide more accurate estimates of pairwise distances than estimates taken from inferred fixed alignments and introduce only a modest increase in the variance of that estimate. We provide a fast Maximum Likelihood approach for estimating the parameters of the pair-HMM evolutionary model and pairwise distance estimates conditional on that model. Simulations show that the bioNJ trees estimated from our pair-HMM-derived distances tend to be more accurate than those obtained using conventional pairwise distance approaches and comparable with those obtained using state-of-the-art methods in the standard phylogenetic pipeline. We also present evidence that for more divergent sequences, tree estimates from pair-HMM distances can be more accurate than those obtained from the two-step process.

Inferring rooted phylogenies via non-reversible substitution models

Svetlana Cherlin
Abstract: Most phylogenetic models assume that the evolutionary process is reversible. This
means that the root of a phylogenetic tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet, defining the root of a phylogenetic tree is a key component of phylogenetic inference, because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree.
In this talk I present two related non-reversible models and their use for inferring the root of a phylogenetic tree. The non-reversibility of both models is achieved by extending a standard reversible model to allow a non-reversible perturbation of the instantaneous rates of substitution between the nucleotides. This perturbation makes the likelihood dependant on the position of the root, enabling inference about the root directly from the sequence
alignment. The two models differ in the way the underlying reversible model is perturbed. The inference is performed in a Bayesian framework using Markov chain Monte Carlo. The performance of the models is illustrated via analyses of simulated data sets.
The models are also applied to two real biological datasets for which there is robust biological opinion about the root position: the radiation of palaeopolyploid yeasts following a whole-genome duplication and the radiation of Drosophila species in the New World. The results suggest that non-reversibility is a suitable feature for learning about root position and that non-reversible models can be useful to infer the root position from real biological data sets.

The Impact of Ancestral Population Size and Incomplete Lineage Sorting on Bayesian Estimation of Species Divergence Times

Konstantinos Angelis and Mario Dos Reis
Abstract: Estimating species divergence times is one of the most important and interesting problems in evolutionary biology. Recently, powerful Bayesian models have been developed to estimate divergence times using multi-locus sequence data and information from the fossil record. Some of the most popular models used by biologists assume that gene coalescent times coincide with the speciation events and that there are no conflicts among gene trees. However, in some cases, (for example when ancestral population sizes are very large compared to the number of generations between speciation events), those assumptions may be seriously violated and may result in biased divergence time estimates. Here, we explore the impact of ancestral polymorphism and incomplete lineage sorting on Bayesian divergence time estimates under the molecular clock, when polymorphism and incomplete lineage sorting are ignored by the model. Using a combination of mathematical analysis and computer simulation, we study the impact on two-, three- and nine-species phylogenies. The results indicate that time and rate estimates can be significantly biased, particularly in the case of shallow phylogenies with large population sizes.

Computing the Internode Certainty using partial gene trees.

Kassian Kobert, Leonidas Salichos, Antonis Rokas and Alexandros Stamatakis
Abstract: Recently Salichos and Rokas [1] proposed a set of novel measures for quantifying the confidence of bipartitions in a phylogenetic tree. These measures are the so called Internode Certainty and Tree Certainty, which are calculated for a given reference tree and a collection of other trees with the exact same taxon set.
In contrast to classical scoring schemes for the branches, such as simple bipartition support or posterior probabilities, the IC score also reflects to which degree the most favored bipartition is contested.
The underlying idea of Internode Certainty is to asses a quality score for each inner branch of a phylogenetic reference tree, by calculating Shannon`s Measure of Entropy.
The tree collection, which is the basis for the calculations, may for example, come from running multiple phylogenetic searches on the same dataset, multiple bootstrap runs, or running the analyses separately on different genes. While for the first two cases the assumption of having the same taxon set is reasonable, this often does not hold for different genes. Gene sequences may be available for different subsets of taxa, simply due to sequence availability or the absence of some genes in certain species.

We present a set of methods for allowing the calculation of Internode Certainty (and Tree Certainty) from a set of trees with a partial taxon set. Previously this was only possible from a set of full trees. The different methods accommodate for different biological assumptions about the distribution of missing taxa in the partial gene trees.
Our tests indicate that even the most simplistic model we tested yields results that are comparable to those obtained under more complex assumptions.

[1] L.Salichos, A.Rokas, Inferring ancient divergences requires genes with strong phylogenetic signals, Nature (2013)

Coalescent Theory in Tissue Growth Modelling with Lineage Tracing Applications

Patrick Smadbeck and Michael Stumpf
Abstract: The field of population genetics has evolved tremendously over the past one hundred years, especially with the recent advancements in genome sequencing technology. One key to analysing genomic data of an evolving population is the ability to take present-day data and deduce information about that population’s history. Coalescent theory [1] together with its extensions [2] is a method by which statistical information about population lineages can be deduced from present day data. Coalescent theory, in effect, reverses time and describes a set of lineages as a coalescent process by which individuals converge to a common ancestor. This process allows us to elucidate genealogical processes that cannot be observed directly; for example we can estimate the time back to the most recent common ancestor of a sample of extant lineages.

Similarly, lineage tracing in tissue growth systems is becoming a fundamental part of developmental biology. Spatial structure and patterning is now understood to be a emerging feature of morphogenesis and tissue differentiation processes. However, directly monitoring and measuring the location of cell lineages in vivo remains a difficult task. Recently, several publications [3,4] suggest the possibility of coalescence-like behaviour in tissue growth models, but few direct application of coalescent theory to developmental biology have appeared in the literature.

Our research presented here concerns the application of coalescent theory to tissue growth models. During tissue growth a specific cell’s lineage and spatial location are intimately connected. Thus, coalescent methods typically applied to the time domain can be repurposed to trace cell lines in a spatial context. The results from three computational models will be presented. The first model is a simple Moran [5] like growth in unstructured and structured tissues. The second model is borrowed from Cheeseman et al (2014) [6], and describes a realistic, boundary-layer growth model during neural crest development. And the third model is a two-dimensional tumour growth model.

We show how direct connection between time and space in tissue growth models allows for the application of coalescent theory to tissue growth and developmental processes. The coalescent perspective allows us to reconcile previously apparently disparate observations from simulation models and lineage tracing and provides a helpful perspective on both healthy and malignant tissue development, which tend to be characterized by distinct genealogical behaviour.


[1] Kingman, JFC. "The coalescent." Stochastic processes and their applications 13.3 (1982): 235-248.

[2] Rosenberg, NA, Magnus N. "Genealogical trees, coalescent theory and the analysis of genetic polymorphisms." Nature Reviews Genetics 3.5 (2002): 380-390.

[3] Hu, Z, Fu Y-X, Greenberg AJ, Wu C-I, Zhai W. "Age-dependent transition from cell-level to population-level control in murine intestinal homeostasis revealed by
coalescence analysis." PLoS genetics 9.2 (2013): e1003326.

[4] Sottoriva, A, Kang, H, Zhicheng, M, Graham, TA, Salomon, M, Zhao, J, Marjorm, P, Siegmund, K, Press, MF, Shibata, D, Curtic, C. "A Big Bang model of human
colorectal tumor growth." Nature genetics 47.3 (2015): 209-216.

[5] Wakeley, J. Coalescent theory: an introduction. Vol. 1. Greenwood Village, Colorado: Roberts & Company Publishers, 2009.

[6] Cheeseman, BL, Zhang, D, Binder, BJ, Newgreen, DF, Landman, KA, "Cell lineage
tracing in the developing enteric nervous system: superstars revealed by experiment and simulation." Journal of The Royal Society Interface 11.93 (2014): 20130815.

Alignment by numbers: sequence assembly using compressed numerical representations

Avraam Tapinos, Bede Constantinides, Douglas Kell and David Robertson
Abstract: DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational DNA sequence analysis generally consider the nucleotide-level resolution of sequences. While these approaches are sufficiently accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for demanding applications such as genome and metagenome assembly.

Increases in the length of the DNA short reads generated by the current sequencing technologies, combined with the overall size of the datasets introduce a number of analytical challenges. Due to size of the datasets (usually several gigabytes), not all reads can be loaded in RAM memory simultaneously, thus data have to be accessed from hard disk memory during the analysis, slowing down the overall execution time of the analysis process. In addition, due to the high dimensionality of DNA reads, data similarities can be overlooked, hindering the statistical significance of the analysis; a phenomenon related to the curse of dimensionality [1]. Future sequencing technologies will provide considerably longer reads (several thousand bases long), intensifying these problems.

Comparable challenges involving high dimensional sequential data have been investigated thoroughly in fields such as signal processing and time series analysis, and a number of effective dimensionality reduction methods have been proposed. Methods include the discrete Fourier transform (DFT)[2], the discrete wavelet transform (DWT)[3], and piecewise aggregate approximation (PAA)[4]. These transformation methods are commonly used for compressing sequence data to a lower dimensional space and reduce the dataset’s size prior to the analysis process. Due to the smaller size of the transformed dataset more data can be loaded in RAM memory simultaneously, hence speeding the analysis process. Furthermore, data approximations at a lower dimensional space lessen the effects introduced by the high dimensionality of the data.
Representing DNA sequences as numerical sequences, allows these application of the transformation and approximation methods (fig. 1). Proposed DNA sequences numerical representation techniques include the Voss method, the integer number representation, the real number representation (fig. 1), and DNA walk [5].
To explore the applicability of data transformation and approximation techniques for sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners [6]. Using our prototype implementation, approximate sequence representations reduced overall alignment time by up to 14-fold compared to the performance of our implementation on uncompressed sequences, and without any reduction in alignment accuracy. Despite using heavily approximated sequence representations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our prototype approach was also applied to the de novo assembly of a simulated diverse viral population.
Our approach demonstrates that full sequence resolution is not a prerequisite of accurate sequence alignment and that analytical performance may be retained or even enhanced through appropriate dimensionality reduction of sequences.

Why weight? Modelling sample and observational level variability improves power in RNA-seq analyses

Matthew Ritchie, Ruijie Liu and Gordon Smyth
Abstract: Variations in sample quality are frequently encountered in small RNA-sequencing experiments, and pose a major challenge in a differential expression analysis. Removal of high variation samples reduces noise, but at a cost of reducing power, thus limiting our ability to detect biologically meaningful changes. Similarly, retaining these samples in the analysis may not reveal any statistically significant changes due to the higher noise level. A compromise is to use all available data, but to down-weight the observations from more variable samples. We describe a statistical approach that facilitates this by modelling heterogeneity at both the sample and observational level as part of the differential expression analysis. At the sample-level this is achieved by fitting a log-linear variance model that includes common sample-specific or group-specific parameters that are shared between genes. The estimated sample variance factors are then converted to weights and combined with observational level weights obtained from the mean-variance relationship of the log-counts-per-million using voom. A comprehensive analysis involving both simulations and experimental RNA-sequencing data demonstrates that this strategy leads to a universally more powerful analysis and fewer false discoveries when compared to conventional approaches. This methodology has wide application and is implemented in the open-source limma package.

Clustering Gene Expression Time Series of Mouse Model for Speed Progression of ALS

Sura Zaki Alrashid, Muhammad Arifur Rahman, Nabeel H. Al-Aaraji, Paul R. Heath and Neil D. Lawrence
Abstract: The analysis of gene expression time series is invaluable in obtaining insight to gene regulatory network and their underlying biological mechanisms. When dealing with biological replicates, crude averaging of the expression values may lead to discarding valuable information. Amyotrophic lateral sclerosis (ALS) is an irreversible neurodegenerative disorder that kills the motor neurons and results in death within 2 to 3 years from the symptom onset. Speed of progression for different groups are heterogeneous with significant variability. Nardo et al. [2013] reported that SOD1-G93A transgenic mice from different backgrounds 129Sv and C57 showed consistent phenotypic differences for disease progression. Here we used a hierarchy of Gaussian processes of Hensman et al. [2013] to model replicate-specific and gene-specific temporal covariances. This study demonstrated some significant gene expression profiles and clustered them together from four groups of data SOD1-G93A and Ntg from 129Sv and C57 backgrounds). Further gene enrichment score analysis and ontology pathway analysis of some specified clusters for a particular group may lead toward identifying features underlying the differential speed of disease progression. Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series.

Shrinkage estimators in data analysis for comparative high-throughput sequencing experiments

Simon Anders, Michael I Love and Wolfgang Huber
Abstract: High-throughput sequencing assays are a powerful technology to identify differences between groups of biological samples at a genome-wide scale, e.g., to find differentially expressed genes in RNA-Seq data. I will report on DESeq2, a method and software tool to perform such comparisons. DESeq2 is the successor to our popular "DESeq" method, and is now based on a fast but flexible empirical-Bayes approach to cater for a wide range of experimental designs, with special considerations to deal with outliers and model misspecification.

In modern experiments, users are often overwhelmed by a too long list of "significant" genes or features, prompting a conceptual shift from "mere testing" to more quantitative analyses. This, however, is complicated by the heteroskedasticity of count data. To address this, DESeq2 provides non-statistician users with a sophisticated but easy-to-use tool to obtain robust shrinkage estimates for dispersions and fold changes, thus providing more reproducible and interpretable quantification of effect sizes and enabling more reliable downstream analyses.

Poster abstracts

Optimizing permutation methods for extended competitive gene set analysis methods

Pashupati Mishra, Petri Törönen and Liisa Holm
Abstract: Gene set analysis (GSA) is differential gene expression analysis where gene sets rather than individual genes are associated with sample phenotypes. Competitive GSA methods test whether a gene set is differentially expressed between the compared sample groups as compared to the genes outside the analyzed gene sets. Most competitive GSA methods are either parametric or permutation based depending on the method used to assign P-values to the gene set scores. Parametric gene set analysis methods are computationally faster but make strong statistical assumptions. However, the only assumption made by permutation based methods is independent and identical distribution of the data points under null hypothesis. This property makes permutation based methods more suitable to GSA. The usefulness of sample permutation in competitive GSA, in addition to explicit or implicit gene permutation has been frequently shown in GSA literature. Here we focus exclusively on sample permutation.

Current permutation based competitive gene set analysis methods are applicable only to two group dataset with at least six replicates in each group. In this talk, we address the applicability of permutation based competitive gene set analysis methods to data sets with more than two sample groups and less than six replicates per group. Despite the arguments about the lack of statistical power with fewer biological replicates, such datasets are prevalent, mostly for economical reasons. However, to our knowledge, no permutation based competitive GSA methods exist that can be applicable to such datasets. GSA of two sample groups in such dataset by permuting the sample labels of only the compared groups is unreliable as the number of unique permutations will be less than 200. However, with advanced permutation techniques, an adequate number of permutations (>200) can be generated by involving all the samples in the dataset such that the null distribution models the actual background noise in the data.

Lack of a priori knowledge of true gene sets that are affected in an experiment (positive genesets) makes the evaluation of GSA methods difficult. We propose a novel evaluation method that splits data into two parts; test data with less than six replicates per group and reference data with more than ten replicates per group. The method generates reference gene sets from the reference data using many popular GSA methods. The evaluation of GSA methods is then done by comparing the ability of the compared GSA methods to report the reference gene sets as the most significant gene sets in the test data. However, we underline that multiple evaluation methods are needed to evaluate GSA methods for a full picture of performance.

We evaluate five different permutation methods ranging from basic to advanced in terms of their ability to model the actual background noise. Our results show that the choice of a permutation method has significant effect on results of gene set analysis methods and is not as trivial as usually considered. The optimal permutation method is the one that utilizes the extended permutation space (>2 sample groups) to model the actual background noise and prevents leakage of real biological signal into the null distribution. We propose a new competitive GSA method, mGSZm, based on Gene Set Z-score and the most optimal permutation method. Our results show that mGSZm outperforms current competitive GSA methods in our evaluation setting.

Reproducible RNA-seq prognostic biomarker detection for clear cell renal cell cancer

Fatemeh Seyednasrollah, Krista Rantanen, Panu Jaakkola and Laura Elo
Abstract: Recent comprehensive assessments of RNA-seq technology support its utility in quantifying gene expression in various samples. Biomedical and molecular characterization projects are the fields which are benefiting the most from this newly developed technology. Especially, differential expression analysis across distinct biological groups of interest (e.g. healthy vs cancer) is the most common downstream goal. Although a number of advanced statistical methods have been developed, several studies demonstrate that their performance depends on the data under analysis, which compromises practical utility in real biomedical studies. As a solution, we propose to use a data-adaptive procedure that selects an optimal statistic capable of maximizing reproducibility of detections.

Mashed potato reconstituted — new technologies for plant genome assembly

Pirita Paajanen, Matt Clark, Glenn Bryan, Bernardo Clavijo and Gonzalo Garcia
Abstract: Plant genomes are older and often more complex than mammalian genomes. They have gone through several duplication events and can appear as diploid, tetraploid or even hexaploid. Some of the common crops are polyploids such as hexaploid wheat, or tetraploid potato. Such genomes presents unique challenges to assemblies, but as common food crops good reference genomes are very important. The status of genome assemblies for wheat and potato are still in permanent draft stage, because they have several areas that are very difficult to resolve due to the genomes being highly heterozygous, polyploid and having repetitive regions. In this project we are sequencing a wild potato from Mexico, called Solanum Verrucosum, that is diploid relative to S. Tuberosum that we normally eat. S. Verrucosum is self-crossing, and the plant to be sequenced is highly homozygous. The assembly uses the newest technologies of several Illumina libraries that are suitable to a variety of assemblers, including Abyss, Allpaths, Discovar, SOAP. We also use PacBio long sequnces to complete the assembly, in particular on the long repetitive regions. In this talk we will discuss the pros and cons of the different libraries and sequencers and how to combine different types of data for optimal assembly.

Multivariate Meta-Analysis of Genome-Wide Association Studies using Univariate Summary Statistics

Anna Cichonska, Juho Rousu, Pekka Marttinen, Antti J Kangas, Pasi Soininen, Terho Lehtimäki, Olli T Raitakari, Marjo-Riitta Järvelin, Veikko Salomaa, Mika Ala-Korpela, Samuli Ripatti and Matti Pirinen
Abstract: A common approach to genome-wide association studies (GWAS) is to perform univariate tests between genotype-phenotype pairs. There are many summary-level results of such analyses publicly available. Using multivariate phenotypes results in increased statistical power and richer GWAS findings. However, low sample sizes in individual studies, and public unavailability of complete multivariate data are the current limitations. Thus, the goal of this work is to establish a computational framework for a multivariate meta-analysis of GWAS using univariate summary statistics.
Our approach is based on canonical correlation analysis, which operates on pooled covariance matrices Cxy (of univariate genotype-phenotype association results), Cxx (of genotype-genotype correlations) and Cyy (of phenotype-phenotype correlations), rather than requiring the original genotypic data X and phenotypic data Y. Cxx can be estimated from a reference database representing the study population, such as the 1000Genomes database. Phenotypic correlation structure Cyy is computed from Cxy. Furthermore, we apply some shrinkage to the full covariance matrix to add robustness to the method.
A multivariate meta-analysis of two studies of nuclear magnetic resonance metabolomics by our approach shows a good agreement with the pooled analysis of the original data sets. Average F1 score (harmonic mean of precision and recall) is equal to 0.84, and root mean squared error between the p-values from our method and the original ones is 5.83 (-log10 scale). Our framework circumvents the unavailability of complete multivariate individual-level data sets, and can provide novel biological insights from already published summary statistics, upon which it is entirely based on.

Canonical Correlation Methods for Exploring Microbe-Environment Interactions in Deep Subsurface

Viivi Uurtio, Juho Rousu, Malin Bomberg and Merja Itävaara
Abstract: Microbe-environment interactions have been studied by means of canonical correlation
analysis (CCA), combinations of univariate and multivariate regression, multiple factor analysis
and principle component analysis to infer correlation patterns in the human microbiome, soil and
marine environments. These methods are limited by the assumption of linear dependencies among the variables and the fact that the resulting models are overly complicated for human interpretation. We apply and tailor the recently presented non-linear variants of CCA, kernel CCA (KCCA) [2], and primal-dual sparse CCA (SCCA) [3] to microbe-environment interaction studies. We extend the proposed visualization theories of correlation and score plots [1], [4] to SCCA and KCCA, and also present a novel way of representing the information of a correlation plot by a clustergram. The objectives of this study are to further develop the computational tools in microbe-environment interaction studies, and demonstrate them on data originating from the deep subsurface which has not been analysed in this way before.
Our two datasets originate from deep bedrock drill holes of the Fennoscandian shield in which
research has been conducted with a focus on the potential final storage of spent nuclear fuels. The datasets 1 and 2 consisted of 43 (58 bacterial and 15 geochemical variables) and 38 (58 bacterial and 23 geochemical variables) samples respectively. The samples of dataset 1 were collected from three different sites around Finland while the samples of dataset 2 were obtained from one site only. We performed KCCA and SCCA on both datasets by kernelizing the geochemical data view using both the linear and Gaussian kernel functions and the bacterial data view using the linear kernel, in the case of the KCCA. Parameter optimization was performed on SCCA only. We optimized the level of sparsity of the primal bacterial view of the data and the width of the Gaussian kernel by means of 3-fold cross validation. The optimal parameters were used for both KCCA and SCCA settings, and the statistical significance of the results were tested by permutation tests. The correlations between the variables of the two views of the data were analysed by correlation plots and clustergrams. The samples were analysed by score plots.
The major correlation pattern that was extracted from dataset 1 was found to relate the Pepto-
coccaceae family positively with salinity and negatively with the total number of cells. The salinity
seemed to also explain some of the variance among the samples of dataset 1. In dataset 2, the main correlation pattern related the order Desulfobacterales and its lower resolutions to alkalinity and salinity. The variance among the samples of dataset 2 seemed to be explained by both alkalinity and the level of silicon content. In general, the linear kernel function yielded more and greater correlations in comparison to the Gaussian kernel regardless of the CCA method. On the other hand, Gaussian kernel function yielded more correlation patterns in SCCA than in KCCA. The analysis of the score plots demonstrated that linear kernel function resulted in a more distinct separation of samples from both datasets than Gaussian kernel function when SCCA was performed. On the contrary, the Gaussian kernel function was shown to find more distinct patterns in both datasets in the case of KCCA.
The novelties of this study are the application and development of KCCA and SCCA to microbe-environment interaction studies, visualization of the results of KCCA and SCCA by correlation, score plots and clustergrams, including the analysis of the interdependencies occurring among bacteria and the geochemistry of their environment in the deep subsurface. We have extended the levels of analysis of microbe-environment interactions by advancing the practice of methods that apply kernel functions and sparsity inducing norms allowing for analysis of non-linear correlation patterns occurring among the complete set or subsets of variables.

Integrating prior biological knowledge into machine learning models for predicting drug responses

Lu Cheng, Gopal Peddinti, Muhammad Ammad-Ud-Din, Alok Jaiswal, Himanshu Chheda, Suleiman Ali Khan, Kerstin Bunte, Jing Tang, Matti Pirinen, Pekka Marttinen, Janna Saarela, Jukka Corander, Krister Wennerberg, Samuel Kaski and Tero Aittokallio
Abstract: For complex genetic diseases such as rheumatoid arthritis, treatment effects can vary significantly among different patients. We integrate prior biological knowledge into machine learning models to explain the differences. We use GWAS, CCA based selection, PharmGKB, differential gene expression analyses to select the SNPs. We use GEMMA (linear mixture model) and BEMKL (multiple kernel methods) for our predictions. Our results show that the clinical information contributes the most for explaining the differences. Genetic information contributes a relatively small amount for the predictions, which is validated by comparing with clinical only predictions and random SNP set predictions. Correctly utilizing the methods is also very important. Our results show that changing the settings within these methods can create huge differences in predictions.

Metabolite identification through multiple kernel learning

Huibin Shen, Kai Duhrkop, Sebastian Bocker and Juho Rousu
Abstract: Motivation: Metabolite identification from tandem mass spectrometric data is a key task in metabolomics. Various computational methods have been proposed for the identification of metabolites from tandem mass spectra. Fragmentation tree methods explore the space of possible ways in which the metabolite can fragment, and base the metabolite identification on scoring of these fragmentation trees. Machine learning methods have been used to map mass spectra to molecular fingerprints; predicted fingerprints, in turn, can be used to score candidate molecular structures.

Results: Here, we combine fragmentation tree computations with kernel-based machine learning to predict molecular fingerprints and identify molecular structures. We introduce a family of kernels capturing the similarity of fragmentation trees, and combine these kernels using recently proposed multiple kernel learning approaches. Experiments on two large reference datasets show that the new methods significantly improve molecular fingerprint prediction accuracy. These improvements result in better metabolite identification, doubling the number of metabolites ranked at the top position of the candidates list.

Integrative and personalized QSAR analysis in cancer by kernelized Bayesian matrix factorization

Muhammad Ammad-Ud-Din, Elisabeth Georgii, Mehmet Gönen, Tuomo Laitinen, Olli Kallioniemi, Krister Wennerberg, Antti Poso and Samuel Kaski
Abstract: We develop in-silico models to find drugs with a potential for cancer treatment. Recent large-scale drug sensitivity measurement campaigns offer the opportunity to build and test models that predict responses for more than one hundred anti-cancer drugs against several human cancer cell lines. So far, these data have been used for searching dependencies between genomic features and drug responses, addressing the personalized medicine task of predicting sensitivity of a new cell line to an a-priori fixed set of drugs. On the other hand, traditional quantitative structure- activity relationship (QSAR) approaches investigate small molecules in search for structural properties predictive of the biological activity of these molecules, against a single cell line. We extend this line of research in two directions: (1) an integrative QSAR approach, predicting the responses to new drugs for a panel of multiple known cancer cell lines simultaneously, and (2) a personalized QSAR approach, predicting the responses to new drugs for new cancer cell lines. To solve the modeling task, we apply a novel kernelized Bayesian matrix factorization method. For maximum applicability and predictive performance, the method optionally utilizes multiple side-information sources such as genomic features of cell lines and target information of drugs, in addition to chemical drug descriptors. In a case study on 116 anti-cancer drugs and 650 cell lines from Sanger Institute Wellcome Trust, we studied the usefulness of the method in several relevant prediction scenarios, differing in the amount of available information, and analyze the importance of various types of drug features for the response prediction. We showed that the use of multiple side information sources for both drugs and cell lines simultaneously improved the prediction performance. In particular, combining chemical and structural drug properties, target information, and genomic features yielded more powerful drug response predictions than drug descriptors or targets alone. Furthermore, the method achieved high performance (RMSE=0.46, R2=0.78, Rp=0.89) in predicting missing drug responses, allowing to re-construct a global map of drug responses, which is then explored to assess treatment potential and treatment range of therapeutically interesting anti-cancer drugs.

Detection of differentially expressed alternatively spliced transcripts in RNA-seq time-series experiments

Hande Topa and Antti Honkela
Abstract: Alternative splicing is an important mechanism where regions of pre-mRNAs are differentially joined in order to form different protein isoforms. It is known that alternative splicing is involved not only in the regulation of normal physiological functions but also in the development of diseases such as cancer. Recent advances in high-throughput sequencing technologies have enabled the genome-wide analysis of transcription and hence alternative splicing. Here, we investigate differential expression of the alternatively spliced transcripts in an RNA-seq time series experiment. By using a Gaussian process (GP) - based test, we identify the genes whose alternatively spliced transcripts are differentially expressed in absolute or relative expression levels during the given time period. Our method consists of quantification of transcript expression levels by BitSeq and GP modeling of the expression levels under time-dependent and time-independent models. Later, we rank the genes and transcripts according to their Bayes factors, which are the ratios of the marginal likelihoods under the different models. In a simulation study, we also compare different methods for incorporating the expression level variances into the GP models. We use the best-performing GP model in real data application, which sheds light on the temporal changes in alternative splicing regulation.

Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression

Pekka Marttinen
A typical genome-wide association study searches for associations between single nucleotide polymorphisms (SNPs) and a univariate phenotype. However, there is a growing interest to investigate associations between genomics data and multivariate phenotypes, for example, in gene expression or metabolomics studies. A common approach is to perform a univariate test between each genotype–phenotype pair, and then to apply a stringent significance cutoff to account for the large number of tests performed. However, this approach has limited ability to uncover dependencies involving multiple variables. Another trend in the current genetics is the investigation of the impact of rare variants on the phenotype, where the standard methods often fail owing to lack of power when the minor allele is present in only a limited number of individuals.

We propose a new statistical approach based on Bayesian reduced rank regression to assess the impact of multiple SNPs on a high-dimensional phenotype. Because of the method’s ability to combine information over multiple SNPs and phenotypes, it is particularly suitable for detecting associations involving rare variants. We demon strate the potential of our method and compare it with alternatives using the Northern Finland Birth Cohort with 4702 individuals, for whom genome-wide SNP data along with lipoprotein profiles comprising 74 traits are available. We discovered two genes (XRCC4 and MTHFD2L) without previously reported associations, which replicated in a combined analysis of two additional cohorts: 2390 individuals from the Cardiovascular Risk in Young Finns study and 3659 individuals from the FINRISK study.

Marttinen, P., Pirinen, M., Sarin, A.P., Gillberg, J., Kettunen, J., Surakka, I., Kangas, A.J., Soininen, P., O’Reilly, P.F., Kaakinen, M., Kähönen, M., Lehtimäki, T., Ala-Korpela, M., Raitakari, O.T., Salomaa, V., Järvelin, M.-R., Ripatti, S. and Kaski, S. (2014). Assessing multivariate gene-metabolome associations with rare variants using Bayesian reduced rank regression. Bioinformatics, 30(14):2026-34. doi: 10.1093/bioinformatics/btu140

Experimentally-based mechanistic modeling of T helper 17 cell differentiation dynamics

Jukka Intosalmi, Helena Ahlfors, Sini Rautio, Zhi Jane Chen, Riitta Lahesmaa, Brigitta Stockinger and Harri Lähdesmäki
Abstract: T helper 17 (Th17) cell differentiation is steered by extracellular cytokine signals that activate and control the lineage specific transcriptional program. Experimental studies provide us with an extensive amount of information about the Th17 lineage specific regulatory network but precise mechanistic understanding of the transcription factor dynamics is yet to be attained. In this study, we construct a detailed description for the dynamics of the core network driving the Th17 cell differentiation and use this description to implement alternative quantitative models in the form of ordinary differential equations (ODEs). The ODE models consist of two lineage specific inducing cytokine signals (TGFβ and IL6) as well as mRNA and protein levels for three key genes (STAT3, RORγt, and FOXP3). Further, we combine the ODE models with time-course RNA-seq measurements using a novel statistical framework designed specifically for sequencing data and, based on rigorous statistical modeling, quantify the evidence for alternative models. Our results show significant evidence, for instance, for inhibitory mechanisms between the transcription factors and also confirm that our description of dynamics is on a feasible level to explain the data. Besides these findings related to our application to T cell biology, we discuss how our statistical framework can be parameterized using the well-established data analysis methods for sequencing data. Further, we discuss the role of state-of-the-art computational and statistical methods, including population based MCMC and thermodynamic integration, that are used to obtain our results.

Metabolite identification using Input Output Kernel Regression

Celine Brouard and Juho Rousu
Abstract: An important problematic of metabolomics is to identify metabolites using tandem mass spectrometry data. In this work, we address the metabolite identification problem using a structured prediction method. These methods make use of structural dependencies existing between complex outputs (for example graph, tree, image) to improve the accuracy and make prediction efficient.

We use a structured prediction method, called Input Output Kernel Regression (IOKR), that requires to define a kernel in input and a kernel in output. The spectrum-metabolite mapping problem is decomposed in two tasks. The first task consist in learning a function from the input set to the output feature space using kernel ridge regression. In the second task, the metabolite identification is done by solving a pre-image problem.

We present the results obtained with the IOKR method using different output kernels: path kernel, shortest-path kernel, graphlet kernel and kernel defined on fingerprints.

A discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data

Juhani Kähärä and Harri Lähdesmäki
Abstract: Transcriptional regulation is largely controlled by DNA binding proteins called transcription factors. Understanding transcription factor binding is integral to understanding gene expression and the function of gene regulatory networks.

Currently transcription factor binding sites are determined by chromatin immunoprecipitation followed by sequencing, but this method has several limitations. To overcome these caveats, DNase I hypersensitive sites sequencing is increasingly being used for mapping gene regulatory sites. Computational tools are needed to accurately determine transcription factor binding sites from this new type of data.

In this work a novel method is developed for detecting transcription factor binding sites using DNase I hypersensitivity data. The method utilizes feature selection for choosing relevant features from DNase-seq data for optimal discrimination of bound and unbound genomic sites. The procedure is designed to ignore features resulting from the intrinsic sequence bias of the Dnase I and to choose only features that improve the binding prediction performance.

The method is applied to 57 different transcription factors in cell type K562. We demonstrate that the prediction performance of the method exceeds the performance of other existing methods. Our results indicate that DNase I hypersensitivity data should be used in multiple resolutions instead of the highest possible resolution. We also show that the binding predictions should be made separately for each
transcription factor and that the sequencing depth of currently available data sets is sufficient for binding predictions for most transcription factors. Finally, we show that models built with our method generalize between different cell types making the method a powerful tool in transcription factor binding predictions using DNase I hypersensitivity data.

Parameter inference and model selection of small gene regulatory networks with single cell data

Henrik Mannerstrom and Harri Lähdesmäki
Abstract: Time course measurements of RNA and protein expression with single
protein resolution are available for both prokaryotic and eukaryotic
cells. In order to utilise this precise data to its full extent, one
has to apply mechanistic models that capture the discreteness of gene
expression. However, the living cell is inherently noisy, so these
models must also to be stochastic. Forward simulation of such
stochastic biochemical network models has been an active research
field for over a decade, but the inverse problems of parameter
inference and model selection has received much less attention.

We have implemented a likelihood free sequential particle Markov chain
Monte Carlo to perform parameter inference and model selection on
small gene regulatory models. We show that exact inference is feasible
although computationally demanding. From protein bioluminescence time
course data we infer posterior distributions for promoter state and
abundance of mRNA and protein. The posterior samples are also used to
compute the model probabilities conditional on the data.

PoMo: An Allele Frequency-based Approach for Species Tree Estimation

Carolin Kosiol, Dominik Schrempf and Nicola De Maio
Abstract: Incomplete lineage sorting can cause incongruencies of the overall species-level phylogenetic tree with the phylogenetic trees for individual genes or genomic segments. If these incongruencies are not accounted for, it is possible to incur several biases in species tree estimation. Here, we present a simple maximum likelihood approach that accounts for ancestral variation and incomplete lineage sorting. We use a POlymorphisms-aware phylogenetic MOdel (PoMo) that we have recently shown to efficiently estimate mutation rates and fixation biases from within and between-species variation data. We extend this model to perform efficient estimation of species trees. We test the performance of PoMo in several different scenarios of incomplete lineage sorting using simulations and compare it with existing methods both in accuracy and computational speed. In contrast to other approaches, our model does not use coalescent theory but is allele-frequency based. We show that PoMo is well suited for genome-wide species tree estimation and that on such data it is more accurate than previous approaches.

A probabilistic model for cell type specific protein-DNA binding analysis

Sini Rautio and Harri Lähdesmäki
Abstract: Transcription factors (TFs) are DNA-binding proteins that regulate expression of neighbouring or distal genes. Most of the TFs bind only a small proportion of potential genomic sites as defined by their DNA binding elements. Detailed mapping of TF binding in different cell types, conditions, diseases and among individuals is central for understanding transcriptional regulation.

In some applications, measurements of DNA-protein interactions are generated from biological samples that contain multiple cell or tissue types. Tissue heterogeneity can be a major confounding factor in clinical studies. Sample heterogeneity requires manual isolation of cell type of interest from complex tissue samples, such as cell sorting or laser-capture microdissection. However, instead of manual purification we can use computational methods to estimate cell type specific measurements. Computational purification methods have emerged as a way to solve these problems. In silico purification allows us to process data that are measured from a mixture of several cell types by performing computational purification after measuring the samples.

We propose a probabilistic model to identify cell type specific transcription factor (TF) binding sites from mixed tissue chromatin immunoprecipitation sequencing (ChIP-seq) samples when proportions of different cell types in the samples are unknown. With simulations with mixed reads from two cell lines and data from breast cancer patients, measuring oestrogen receptor (ER) binding in primary breast cancer tissues, we show that the method achieves increased accuracy compared to a traditional way of identifying binding sites using multiple heterogeneous samples and estimates correct cell-type proportions.

Wasabi: an integrated platform for evolutionary sequence analysis and data visualization

Andres Veidenberg, Alan Medlar and Ari Löytynoja
Abstract: Wasabi is an open source, web--based, integrated environment for evolutionary sequence analysis. A key feature of Wasabi is to display the sequence data together with the corresponding phylogenetic tree and thus enable visualization of information, e.g., ancestral sequences, associated with specific tree nodes. Wasabi's graphical interface is user-friendly: it hides extraneous options, provides context sensitive menus and allows interactive drag-and-drop editing of the tree and the alignment; it can access local and remote files and integrates with Ensembl for easy inclusion of external data. Wasabi currently supports sequence alignment and alignment extension with phylogeny--aware methods PRANK and PAGAN and provides a plugin system to extend the environment with additional tools. The Wasabi environment was designed with collaboration and sharing in mind and features built--in functions to share data between users and to publish analysis results; during analyses, it automatically stores the results of intermediate steps and thus supports reproducible research. Wasabi runs inside a web browser and does not require any installation. One can start using the application at

Predicting the text description of protein function

Petri Törönen, Patrik Koskinen and Liisa Holm
Abstract: Sequencing projects are generating novel protein sequences at exponential rate. This sequence information on its own has very little use without some functional annotation of actual coded proteins. This annotation can consist of classification to categories, like Gene Ontology (GO) or Enzyme Catalogue categories, or it can be a free text description about the sequence. Although GO prediction is active topic, the free text description is unfortunately currently forgotten bioinformatics challenge, and there seems to be no clear definition for the task in the literature. It is still important task, as the sequences submitted to public databases must have free text description, and current methods cannot handle the high error rate in databases
Functional annotations are normally done using classification based methods. With free text descriptions classification methods are sub-optimal as descriptions do not form clear categories but rather have continuously changing similarities and dissimilarities. Therefore we propose the regression approach, where we estimate the functional similarity of the correct description and various predicted descriptions, obtained with BLAST sequence search. We propose TF-IDF, a popular text mining score, to estimate the functional similarity between different descriptions.
We use a large set of various score functions as an input for the discussed regression step. These are used to collect various signals from the BLAST output that could be used to predict description. These score functions collect the signal for each unique description, seen in the BLAST result list, mainly by summing score values of hits having the scored description or by monitoring the frequency of the description in BLAST result list. Some score functions process these results further and take into account the frequency of analyzed description in the database. We also show two alternative pipelines for creating scores for different descriptions. Various score functions from each pipeline are next used as input to our sparse regression analysis. With sparse regression we generate a weighted version of a few best performing score functions. We train the regression on training dataset to predict the TF-IDF-based similarity between the proposed description and the known correct prediction.
We evaluate the performance of our method on two separate evaluation datasets, where one consist eukaryotic sequences and the other prokaryotic sequences. We compare the method against naïve BLAST based methods, RAST annotation server, and another tool, BLANNOTATOR, that also aims to summarize the BLAST list results. Naïve BLAST methods report the best hit from BLAST results or the best informative hit (hit without words like putative, unknown etc.). Our results show clear margin to competing methods. Our work also highlights the serious weakness of naïve BLAST based methods. Discussed methods are included into our PANNZER program (

Deciphering the function of long non-coding RNAs through statistical analysis of gene expression and gene interaction network inference.

Ildem Akerman, Delphine Rolando and Jorge Ferrer
Abstract: Over the past years high throughput sequencing has identified tens of thousands of long non-coding RNAs (lncRNAs); however, in most cases their function, if any, is still unclear. For selected cases, lncRNAs have been shown to exert a gene regulatory function. LncRNA transcripts have also been demonstrated to be more tissue specific than most other genes, hinting at a possible regulatory function in the context of cell specific regulatory program. Understanding their potential role in gene expression regulation in disease-relevant tissues could therefore prove essential to therapeutic development.
A recent study identified more than 1000 active lncRNAs in Human Pancreatic Islet cells . These lncRNAs appear to be induced during beta cell differentiation, and, moreover, selected examples show dynamic regulation in response to glucose concentration changes. In the current study, we examine the function of a representative set of islet lncRNAs by analysing expression data from a panel of Human Islet samples, as well as from knock-downs of a set of islet lncRNAs and transcription factors in a human beta cell line (EndoCB-H1).
I will present an analysis of the gene interaction network that supports a regulatory function for islet lncRNAs. The analysis further exposes different classes of lncRNAs, which appear to be acting through distinct mechanisms.
These results could help understand the role of lncRNAs in the genetic regulation of pancreatic islets, and might thus help develop new drug targets for the treatment of diabetes.