|
This document describes the derivation of probabilities for the
case of an intron of a known phase occuring between two match states.
Precisely the same process can be used for the other phase introns
and the introns occuring between two insert states or insert to match or
match to insert.
The way we will model the introns is as 5 regions, two of fixed length,
which can then be modeled as 3 states in the resulting HMM. Because we
allow overlaps in the 'emission' of characters from different states there
are conditional probabilities which require carefully handling; for this
reason we explicitly expand the probability calculations and then collect
relevant terms to construct the HMM.
The final HMM is slightly incorrect in that the correction terms needed to
be multipled on the non-intron, coding sequence path cannot take into
account the more complicated intron modeling we have provided here. However
because these probabilities for introns are small, with the corresponding correction
term being 1-( intron probabilites) this effect is likely to be small.
We will use the following notations:
-
- intron = intronV = intron of phase V.
-
-
- l bases before the intron.
-
-
- t bases after the intron.
-
-
- the first m bases of the intron.
-
-
- the central n bases of the intron.
-
-
- the p bases of the pyrimidine tract.
-
-
- the b bases between the pyrimidine tract
and the 3' splice site.
-
-
- the last r bases of the intron.
is the 5' splice site information, Y the pyrimidine
tract information and is the 3' splice
site information. A and B are spacers between the 5'SS and
the pyrimidine tract and the pyrimidine tract and the 3'SS which
will mainly provide length information.
We want to calculate Prob(sequence , alignment | model).
In most Hidden Markov Models, this probability can be calculated
in two parts:
- Prob(alignment | model) - The transition probabilities
- Prob(sequence | model,alignment) - The emission probabilities
In calculating the intron probability there are terms equivalent to
the alignment path and terms equivalent to the sequence emitted, but
because the same sequence is 'emitted' by two separate processes (for
example, the bases before the 5'SS must be evaluted both as part of
the protein sequence and as part of the splice site consensus), the
probability calculation is not a mechanical decomposition of path and
emissions.
So, starting from the complete probability of a intron of phase V
between match states i and i+1, we will first decompse the
probabilities into a product in which each term can be calculated
using some biological assumptions. Following this, the different terms
of the product can be placed at appropiate transitions or emissions in
the expanded HMM for the best path to be calculated by standard
dynamic programming methods.
The biological assumptions which we make are as follows
- Amino acids, or in this case, codons are independent. This assumption
is made in standard HMM models of protein sequence.
- The different regions of the intron are independent and so the
probability of an intron is simply the product of the different regions.
This assumption is often used in current gene parsing methods.
- The intron can be adaquately modelled as a having 5 regions: 5'SS, central
region, pyrimidine tract, spacer, 3'SS.
- The length distributions of the separate regions are independant of
any other features of the introns.
- The length distributions of the different regions can be adquately
modeled by an exponential decay or fixed lengths. This assumption can be
changed at the cost of greater computional complexity.
- That there is no exon of less than l+t+2 bases. l and t are
the splice site information which overlap the coding sequence, and are
currently fixed at 3.
- That the number of phase 1 or 2 introns in a gene are small such
that ignoring
the information in the split codons in these introns (which
will be necessary for calculating the probability of an intron) does not
significantly change the overall probability.
Some assumptions which we have kept however in the model are
- The 5'SS is dependent on the previous coding sequence bases
- The 3'SS is dependent on the following coding sequence bases
In the following formula the terms such as encorporates three
separate parts of the probability: firstly that the length of the region is
p, secondly that the bases ... ,
come from the Y region, ie the pyrimidine tract, and finally
that the Y region starts at position j+m+n. Thus these terms represent
both alignment and emission components.
We are interested in the probability of an intron of a defined path between
positions j and h in the DNA sequence and between match state i and
match state i+1 in the Hidden Markov model.
Notice that
As everything is dependant on the model, we can omit explicit terms.
Using
iteratively we can decompose this probability into a
product
Now, we will treat each of these equations in turn, and, by using
the biological assumptions stated above, decompose them into measurably
quanities. For the terms 2-6 generally we will have three components
- the probability of entering this region from the previous region
- the probability of the length of this region
- the probability of the bases in this region
The first and last terms (for appropiates choice of l and t), and the
middle term for V = 0 are
emission probabilities (whether match or insert) from the model, and are
calculated in the standard manner. For
The term indicates the
split codon made from the end of and the beginning of ,
depending on V. This codon cannot be calculated correctly without a
large increased in computational complexity, and so we ignore the
information in this codon.
The term indicates that the intron starts
at position j and ends at position h. h is defined to
be the end of the intron, and we assume that all introns have an end.
And so .
Notice that and similarly for 3'SS.
We have fixed m to be a set length (7 for the current human model), and thus this ``transition'' probability
is one.
is 1 because we always have a central region following a 5'SS.
assuming an exponetial decay distribution with parameter
is
As the previous region, the first term ( ) is one, whereas
the other two terms are as before,
As the previous two regions, the first term ( ) is one, whereas
the other two terms are as before,
is one
is also one because we have a fixed length of r (3 for the
current human model) and
The above probabilities, although now measurable cannot be placed on the
HMM architecture. We need to mulitple the terms
and then redistribute the
resulting product over the HMM. In fact the central,pyrimidine tract and
spacer regions (A,Y and B) behave exactly as one expects in a standard
HMM. However the X and Z regions are not calculable in the HMM as the stand
at must be multiplied out first
So, the equations which need to be multiplied together are:
Cancelling the and
terms, and rearranging the terms involving and
together we arrive at
Despite having 5 regions in the intron, two of them, the 5'SS and the
3'SS are of fixed length. This means that these regions can be precisely
modeled in the transition of states leading into them or leading from them
with appropiate transition emission lengths
(notice that these HMMs must 'emit' on the transitions as the probability
depends on the sequence emitted). Thus the five regions of the intron can
be represented by three states in the HMM, called CENTRAL, PY and SPACER
We have some freedom as where to place the term
( ), which must be multipled once during the
parse of the intron, but could go at any non-looping transition. We will
place it at the 3'SS. [NB, this term, which will be greater than one is a 'correction'
term for the fact that we have to incorporate the probability of having an intron
twice, once in the 5'SS calculations and once in the 3'SS calculations].
In addition some probabilities will be
the same for all positions in the protein model and depends only on
the DNA sequence, for example ( ) which indicates the
consensus information at the 5'SS. These probabilities can be
preprocessed on the sequence before the final dynamic programming
method.
From state MATCH to CENTRAL
- sequence dependant:
- sequence dependant:
[the first term in ]
partial match information: See note
From CENTRAL to CENTRAL
- constant:
- sequence dependant:
[the term in ]
From CENTRAL to PY
- constant:
- sequence dependant:
[the first term in ]
From PY to PY
- constant:
- sequence dependant:
[the term in ]
From PY to SPACER
- constant:
- sequence dependant:
[the first term in ]
From SPACER to SPACER
- constant:
- sequence dependant:
[the term in ]
From SPACER to MATCH
- constant:
- constant:
- sequence dependant:
- artifical match information: See note
The Match information in phase 1 and 2 must be split but the split bases can still be evaluated
as cases in which only one or two bases of the codon are known. Thus some of the
information in the protein HMM is retained.
The protein MATCH to MATCH transition has now become four transitions, MATCH to MATCH, and MATCH to phase 0,1,2 introns.
The sum of these four probabilities must sum to the original MATCH to MATCH probability. This can
be achieved by the MATCH to MATCH (and MATCH to INSERT) taking a sequence dependant multipler to its
probability
Defining the term S to mean the sequence around the codon ending at j.
Notice that l and m are fixed for a particular model.
The latter half of this product can be preprocessed on the sequence outside of the dynamic programming routine.
|