spacer

GeneWise - Probability Appendix

Authors Ewan Birney, Mor Amitai
Address EMBL - European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD,
England.
Email
birney@ebi.ac.uk

Contents

Intron Probabilities

Definitions and Preamble

This document describes the derivation of probabilities for the case of an intron of a known phase occuring between two match states. Precisely the same process can be used for the other phase introns and the introns occuring between two insert states or insert to match or match to insert.

The way we will model the introns is as 5 regions, two of fixed length, which can then be modeled as 3 states in the resulting HMM. Because we allow overlaps in the 'emission' of characters from different states there are conditional probabilities which require carefully handling; for this reason we explicitly expand the probability calculations and then collect relevant terms to construct the HMM.

The final HMM is slightly incorrect in that the correction terms needed to be multipled on the non-intron, coding sequence path cannot take into account the more complicated intron modeling we have provided here. However because these probabilities for introns are small, with the corresponding correction term being 1-( tex2html_wrap_inline763 intron probabilites) this effect is likely to be small.

We will use the following notations:

intron = intronV = intron of phase V.
tex2html_wrap_inline769 - l bases before the intron.
tex2html_wrap_inline771 - t bases after the intron.
tex2html_wrap_inline773 - the first m bases of the intron.
tex2html_wrap_inline777 - the central n bases of the intron.
tex2html_wrap_inline781 - the p bases of the pyrimidine tract.
tex2html_wrap_inline785 - the b bases between the pyrimidine tract and the 3' splice site.
tex2html_wrap_inline789 - the last r bases of the intron.

tex2html_wrap_inline793 is the 5' splice site information, Y the pyrimidine tract information and tex2html_wrap_inline797 is the 3' splice site information. A and B are spacers between the 5'SS and the pyrimidine tract and the pyrimidine tract and the 3'SS which will mainly provide length information.

Overview of the probabilities

We want to calculate Prob(sequence , alignment | model). In most Hidden Markov Models, this probability can be calculated in two parts:

  1. Prob(alignment | model) - The transition probabilities
  2. Prob(sequence | model,alignment) - The emission probabilities

In calculating the intron probability there are terms equivalent to the alignment path and terms equivalent to the sequence emitted, but because the same sequence is 'emitted' by two separate processes (for example, the bases before the 5'SS must be evaluted both as part of the protein sequence and as part of the splice site consensus), the probability calculation is not a mechanical decomposition of path and emissions.

So, starting from the complete probability of a intron of phase V between match states i and i+1, we will first decompse the probabilities into a product in which each term can be calculated using some biological assumptions. Following this, the different terms of the product can be placed at appropiate transitions or emissions in the expanded HMM for the best path to be calculated by standard dynamic programming methods.

Biological assumptions

The biological assumptions which we make are as follows

  1. Amino acids, or in this case, codons are independent. This assumption is made in standard HMM models of protein sequence.
  2. The different regions of the intron are independent and so the probability of an intron is simply the product of the different regions. This assumption is often used in current gene parsing methods.
  3. The intron can be adaquately modelled as a having 5 regions: 5'SS, central region, pyrimidine tract, spacer, 3'SS.
  4. The length distributions of the separate regions are independant of any other features of the introns.
  5. The length distributions of the different regions can be adquately modeled by an exponential decay or fixed lengths. This assumption can be changed at the cost of greater computional complexity.
  6. That there is no exon of less than l+t+2 bases. l and t are the splice site information which overlap the coding sequence, and are currently fixed at 3.
  7. That the number of phase 1 or 2 introns in a gene are small such that ignoring the information in the split codons in these introns (which will be necessary for calculating the probability of an intron) does not significantly change the overall probability.
Some assumptions which we have kept however in the model are
  1. The 5'SS is dependent on the previous coding sequence bases
  2. The 3'SS is dependent on the following coding sequence bases

Probability

In the following formula the terms such as tex2html_wrap_inline825 encorporates three separate parts of the probability: firstly that the length of the region is p, secondly that the bases tex2html_wrap_inline829 ... tex2html_wrap_inline831 , come from the Y region, ie the pyrimidine tract, and finally that the Y region starts at position j+m+n. Thus these terms represent both alignment and emission components.

We are interested in the probability of an intron of a defined path between positions j and h in the DNA sequence and between match state i and match state i+1 in the Hidden Markov model.

  equation66

Notice that

displaymath845

As everything is dependant on the model, we can omit explicit tex2html_wrap_inline847 terms.

displaymath849

Using

displaymath851

iteratively we can decompose this probability into a product

  equation83

  equation88

  equation94

  equation101

  equation109

  equation118

Now, we will treat each of these equations in turn, and, by using the biological assumptions stated above, decompose them into measurably quanities. For the terms 2-6 generally we will have three components

  1. the probability of entering this region from the previous region
  2. the probability of the length of this region
  3. the probability of the bases in this region

equation 2

displaymath853

displaymath855

The first and last terms (for appropiates choice of l and t), and the middle tex2html_wrap_inline861 term for V = 0 are emission probabilities (whether match or insert) from the model, and are calculated in the standard manner. For tex2html_wrap_inline865 The term tex2html_wrap_inline867 indicates the split codon made from the end of tex2html_wrap_inline869 and the beginning of tex2html_wrap_inline871 , depending on V. This codon cannot be calculated correctly without a large increased in computational complexity, and so we ignore the information in this codon.

equation 3

displaymath875

  equation151

  equation156

  equation161

The term tex2html_wrap_inline877 indicates that the intron starts at position j and ends at position h. h is defined to be the end of the intron, and we assume that all introns have an end. And so tex2html_wrap_inline885 .

displaymath887

displaymath889

displaymath891

Notice that tex2html_wrap_inline893 and similarly for 3'SS.

displaymath897

displaymath899

  equation199

tex2html_wrap_inline907 We have fixed m to be a set length (7 for the current human model), and thus this ``transition'' probability is one.

  equation211

equation 4

displaymath913

  equation229

  equation234

  equation239

tex2html_wrap_inline923 is 1 because we always have a central region following a 5'SS.
tex2html_wrap_inline925 assuming an exponetial decay distribution with parameter tex2html_wrap_inline927

  equation245

tex2html_wrap_inline933 is

  equation252

equation 5

displaymath935

  equation265

  equation270

  equation275

As the previous region, the first term ( tex2html_wrap_inline945 ) is one, whereas the other two terms are as before,

  equation280

  equation287

equation 6

displaymath951

  equation302

  equation307

  equation312

As the previous two regions, the first term ( tex2html_wrap_inline961 ) is one, whereas the other two terms are as before,

  equation317

  equation324

equation 7

displaymath967

displaymath969

  equation341

  equation347

  equation352

tex2html_wrap_inline977 is one tex2html_wrap_inline979 is also one because we have a fixed length of r (3 for the current human model) and

  equation359

Applying Probabilities to the expanded HMM

The above probabilities, although now measurable cannot be placed on the HMM architecture. We need to mulitple the terms tex2html_wrap_inline983 and then redistribute the resulting product over the HMM. In fact the central,pyrimidine tract and spacer regions (A,Y and B) behave exactly as one expects in a standard HMM. However the X and Z regions are not calculable in the HMM as the stand at must be multiplied out first

So, the equations which need to be multiplied together are:

  equation368

displaymath995

displaymath997

displaymath999

Cancelling the tex2html_wrap_inline1001 and tex2html_wrap_inline1003 terms, and rearranging the terms involving tex2html_wrap_inline869 and tex2html_wrap_inline871 together we arrive at

  equation400

  equation405

  equation410

displaymath1011

HMM transition probabilities

Despite having 5 regions in the intron, two of them, the 5'SS and the 3'SS are of fixed length. This means that these regions can be precisely modeled in the transition of states leading into them or leading from them with appropiate transition emission lengths (notice that these HMMs must 'emit' on the transitions as the probability depends on the sequence emitted). Thus the five regions of the intron can be represented by three states in the HMM, called CENTRAL, PY and SPACER

We have some freedom as where to place the term ( tex2html_wrap_inline1013 ), which must be multipled once during the parse of the intron, but could go at any non-looping transition. We will place it at the 3'SS. [NB, this term, which will be greater than one is a 'correction' term for the fact that we have to incorporate the probability of having an intron twice, once in the 5'SS calculations and once in the 3'SS calculations]. In addition some probabilities will be the same for all positions in the protein model and depends only on the DNA sequence, for example ( tex2html_wrap_inline1015 ) which indicates the consensus information at the 5'SS. These probabilities can be preprocessed on the sequence before the final dynamic programming method.

Transitions from MATCH to CENTRAL

From state MATCH to CENTRAL

  1. sequence dependant: tex2html_wrap_inline1017
  2. sequence dependant: tex2html_wrap_inline1019 [the first term in tex2html_wrap_inline1021 ] partial match information: See note

Transitions from CENTRAL

From CENTRAL to CENTRAL

  1. constant: tex2html_wrap_inline927
  2. sequence dependant: tex2html_wrap_inline1025 [the tex2html_wrap_inline1027 term in tex2html_wrap_inline1029 ]

From CENTRAL to PY

  1. constant: tex2html_wrap_inline1031
  2. sequence dependant: tex2html_wrap_inline1033 [the first term in tex2html_wrap_inline1035 ]

Transitions from PY

From PY to PY

  1. constant: tex2html_wrap_inline1037
  2. sequence dependant: tex2html_wrap_inline1039 [the tex2html_wrap_inline1027 term in tex2html_wrap_inline1043 ]

From PY to SPACER

  1. constant: tex2html_wrap_inline1045
  2. sequence dependant: tex2html_wrap_inline1047 [the first term in tex2html_wrap_inline1049 ]

Transitions from SPACER

From SPACER to SPACER

  1. constant: tex2html_wrap_inline1051
  2. sequence dependant: tex2html_wrap_inline1053 [the tex2html_wrap_inline1027 term in tex2html_wrap_inline1057 ]

From SPACER to MATCH

  1. constant: tex2html_wrap_inline1059
  2. constant: tex2html_wrap_inline1061
  3. sequence dependant: tex2html_wrap_inline1063
  4. artifical match information: See note

Partial Match information

The Match information in phase 1 and 2 must be split but the split bases can still be evaluated as cases in which only one or two bases of the codon are known. Thus some of the information in the protein HMM is retained.

Tranistion from MATCH to MATCH (in coding sequence)

The protein MATCH to MATCH transition has now become four transitions, MATCH to MATCH, and MATCH to phase 0,1,2 introns. The sum of these four probabilities must sum to the original MATCH to MATCH probability. This can be achieved by the MATCH to MATCH (and MATCH to INSERT) taking a sequence dependant multipler to its probability

Defining the term S to mean the sequence around the codon ending at j.

displaymath1069

Notice that l and m are fixed for a particular model.

displaymath1075

displaymath1077

The latter half of this product can be preprocessed on the sequence outside of the dynamic programming routine.

spacer
spacer