Analysis of protein and RNA sequence

Bateman group (EMBL-EBI)

The flow of information from sequence to knowledge of function through classification and hypothesis generation used in the Bateman group.

Our work has centred on the idea that there are a finite number of families of protein and RNA genes. We wish to enumerate all of these families to gain an understanding of how complex biological processes have evolved from a relatively small number of components. We have produced a number of widely used biological database resources, such as Pfam, Rfam, TreeFam and MEROPS, to collect and analyse these families of molecules. Over the years we have published a large number of novel protein domains and families of particularly high interest. For example, we discovered the Paz and Piwi domains, which allowed us to identify the Dicer proteins as having an important role in RNAi several months before this was experimentally verified. More recently, we showed that the scramblase genes may act as membrane-tethered transcription factors.

Our research interests focus on how proteins and non-coding RNAs interact with each other and how these interaction networks can be rewired by disease mutations or natural variation. We are interested in how proteins have evolved through the gain and loss of protein domains. Recently we have been using Wikipedia to collect community annotation and other biological information for biological databases. Wikipedia provides an enormous opportunity for public engagement in science, and we have been encouraging scientists in a number of ways to edit it. Current research focuses on identifying non-coding RNAs and understanding their function through computational analysis.

Future projects and goals

We will continue to develop tools and databases to understand the function and evolution of RNA and proteins. Using these data and computational analyses, we aim to investigate interaction networks in two directions. Firstly, we will investigate the plasticity of the protein interaction network between individuals. To do this, we will identify natural human variation, such as SNPs and CNVs, that rewires the protein interaction network. Secondly, we will explore the large and growing set of important molecular interactions involving RNA that are currently dispersed among diverse databases and experimental studies. By bringing these data together, we wish to uncover the extent and evolution of the RNA interaction network compared to the protein interaction network. In another strand of our research, we will develop automated techniques to identify spurious protein predictions that are polluting sequence databases. We have collected thousands of examples of proteins that are unlikely to be translated. These examples will form a good training set for machine learning techniques to identify further suspicious proteins.

Selected publications

Schuster-Böckler, B. and Bateman, A. (2008) Protein interactions in human genetic diseases. Genome Biology 9 (1), R9.

Buljan, M., et al. (2012) Tissue-specific splicing of disordered segments that embed binding motifs rewires protein interaction networks. Mol Cell 46 (6), 871-883.

Buljan, M., Frankish, A. and Bateman, A. (2010) Quantifying the mechanisms of domain gain in animal proteins. Genome Biology 11 (7), R74.

Bateman, A., et al. (2009) Phospholipid scramblases and Tubby-like proteins belong to a new superfamily of membrane tethered transcription factors. Bioinformatics (Oxford, England) 25 (2), 159-162.