Prediction of Exact Boundaries of Exons.
T. A. Thanaraj and Alan J Robinson
European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
thanaraj@ebi.ac.uk
Tel: +44-1223-494650
Fax: +44-1223-494468
(To appear in the Briefings in Bioinformatics (2000) 1:4 – Special issue on Methods in Gene Prediction)
Abstract
It is known that while the programs used to predict genes are good at determining coding nucleotides, there are considerable inaccuracies in the determination of the gene structural elements. The most notable inaccuracy concerns the exact boundaries of exons. In order to assess this, we had earlier reviewed various programs that predict potential splice sites and exons. The results led to the following two observations: (i) A high proportion of false positive splice sites from computational predictions occur in the vicinity of real splice sites. (ii) Current algorithms are misled to predict wrong splice sites more often when the coding potential ends within ± 25 nucleotides from real sites than when it ends at farther positions. In this report, we review decision tree models for human splice sites and the resultant software tool, namely SpliceProximalCheck (available at http://www.ebi.ac.uk/~thanaraj/SpliceProximalCheck.html), that discriminates such ‘proximal’ false positives from real splice sites. An integrated system (named MZEF-SPC) with SpliceProximalCheck as a front-end tool operating on the results of MZEF, Michael Zhang’s exon finder program (Zhang, 1997, Proc. Natl. Acad. Sci., USA, 94:565-568), is available at http://www.ebi.ac.uk/~thanaraj/MZEF-SPC.html. Examination of the output of the integrated program on an illustrative gene set revealed that as many as 61 of 93 MZEF-predicted false positive exons could be eliminated by SPC for a loss of only 3 out of 33 MZEF-predicted true positive exons.
Keywords: gene prediction; exon boundaries; decision trees; splice sites; benchmarking; training and test data sets; proximal false positives; SpliceProximalCheck.
INTRODUCTION
Systematic analysis of the performance of publicly available programs for gene prediction highlighted that while the current programs perform well at predicting coding nucleotides, exon boundaries are predicted with lower accuracy levels1. The specificity of the splice-signal models currently in use is only 35% at a sensitivity threshold of 50%; however, usage of ‘in context’ information improved the prediction accuracy2. The ‘in context’ information included reading frame compatibility across splice sites and assembly of the exons onto a single optimal gene by an integrated method for gene finding. However, it is highly desirable to improve the performance of methods that predict splice sites in isolation from such integrated methods for gene finding. Such a need becomes obvious in situations where there is a lack of homology between the translated product and any known proteins, when only a minimal amount of other context information is known, or when there is ambiguity in the interpretation of the coding region predictions. Prediction of alternative gene products also depends largely on the accuracy of predicting the exact location of splice signals in isolation from the current integrated methods, which usually predict only the most probable gene.
We recently benchmarked various publicly available computational tools that can predict human splice sites3. The programs differed from one another in the degree of discriminatory information used for prediction. A clean data set of EST-confirmed human splice sites4 was used in the benchmarking studies. Results of benchmarking revealed that one in every three false positives of predicted splice sites (as obtained by programs that use coding potential information in addition to splice signals for predicting splice sites) is located in the vicinity of a real splice site (i.e. within a distance of ± 50 nucleotides). This observation persisted with programs that can predict all the potential exons (including optimal and sub-optimal). In a high proportion (greater than 50%) of the partially correct predicted exons, the incorrect ends were located in the vicinity of the real splice sites. Further analysis of the distribution of proximal false positives (in comparison with that of GT/AG dinucleotides, which could act as cryptic splice sites) indicated that the splice signals used by the algorithms are not strong enough to discriminate, in particular, those false predictions that occur within ± 25 nucleotides around the real sites. Thus the programs tend to place the exon boundaries in the regions where the coding characteristics disappear. Small shifts due to false predictions around real sites do not greatly change the characteristics that are normally associated with real splice site sequences, and current programs are not sensitive to such subtle changes. It is therefore suggested that specialised statistics that can discriminate real splice sites from such proximal false positives be additionally incorporated in gene prediction programs.
In this paper, we demonstrate use of decision trees to build models that help to discriminate such proximal fake splice sites from real splice sites. Decision trees provide an automated means of segmenting a data set according to a user-specified ‘objective’. The resultant segments are then subsequently used for predictions. The real splice sites of the learning data set were taken from a clean data set of EST-confirmed human splice sites4. The proximal false splice sites of the learning data set were generated from regions –50 to +50 nucleotides (around real splice sites) that have high confidence of not containing any other functional splice sites4. Decision trees were then built based on only the features local to these splice sites. The decision trees yielded a small set of validation rules that can be used to distinguish proximal false splice sites from real splice sites. The quality of the decision trees built with the learning data set was evaluated using a test data set comprising different EST-confirmed real and false splice sites. The computer program with the implementation of the reported decision tree model is available at http://www.ebi.ac.uk/~thanaraj/SpliceProximalCheck.html. An integrated system with SpliceProximalCheck operating on the results of Michael Zhang’s exon finder program, MZEF5, is available at
http://www.ebi.ac.uk/~thanaraj/MZEF-SPC.html.
MATERIALS AND METHODS
Derivation of the Learning Data Set
The set of EST-confirmed splice sites from our data set published previously4 was used as the source of real splice sites. Regions fifty nucleotides in length both upstream and downstream of a subset of these sites were used to generate a control set of false splice sites (the subset included those sites that did not possess potential alternative splice sites in their vicinity). All occurrences of the dinucleotide GT (or AG) in the 50 nucleotide length regions from donor (or acceptor) sites were considered as proximal false donor (or false acceptor) sites.
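The generation of proximal false sites described above can be sketched as follows. This is only an illustration, not the procedure used to build the published data set: the function name `proximal_false_sites`, the 0-based coordinate convention (the index of the first base of the dinucleotide) and the toy sequence are assumptions made for the example.

```python
def proximal_false_sites(seq, real_pos, dinuc, flank=50):
    """Return 0-based start indices of every `dinuc` within `flank` nt of a real site.

    `real_pos` is assumed to be the index of the first base of the real
    GT (donor) or AG (acceptor) dinucleotide; the real site is excluded.
    """
    seq = seq.upper()
    lo = max(0, real_pos - flank)
    hi = min(len(seq) - 2, real_pos + flank)
    return [i for i in range(lo, hi + 1)
            if i != real_pos and seq[i:i + 2] == dinuc]


# Toy example: GT occurs at indices 2, 6 and 12; index 6 plays the real donor.
toy = "AAGTAAGTCCAGGT"
decoy_donors = proximal_false_sites(toy, 6, "GT")
decoy_acceptors = proximal_false_sites(toy, 10, "AG")
```

In the toy call the two remaining GT occurrences become proximal false donor sites; with real data the real sites would first be restricted to the subset without alternative splice sites nearby, as stated in the text.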
Derivation of Test Data Set
A test data set of EST-confirmed splice sites was generated by applying the clean-up procedures described in the earlier work4 to new human gene entries that had not been published when the earlier clean data set was built. This test set included a total of 229 donor and 236 acceptor sites. The false splice sites of this test data set were generated as described above.
Sizes of the Learning and Test Data Sets
Donor Sites.
The learning data set of 2520 donor sites comprised 619 real splice sites and 1901 false splice sites. The test data set of 978 donor sites comprised 229 real splice sites and 749 proximal false splice sites.
Acceptor Sites.
The learning data set of 2960 acceptor sites comprised 623 real splice sites and 2337 false splice sites. The test data set of 1105 acceptor sites comprised 236 real splice sites and 869 proximal false splice sites.
The Decision Tree Approach
A decision tree finds rules that recursively bifurcate a data set in order to produce subsets that contain homogeneous data within subsets and heterogeneous data between subsets. These sets of rules can then be used to classify other data sets. The objective of the decision tree was to discriminate true splice sites from proximal false ones in the learning data set using properties of the nucleotide sequence around the splice sites. The decision tree implemented in the commercial decision support system Decisionhouse (Quadstone Ltd., http://www.quadstone.co.uk) was used to classify the learning data set of splice sites. There are other publicly available as well as commercial decision tree systems (see http://www.channel1.com/users/gps1/software/classification.html#Decision). The approach used in this review is general in nature and is applicable with other decision tree systems.
Determination of Local Features Characterising Splice Sites
Different properties of the nucleotide sequence local to the splice sites were evaluated to determine if they could be used to differentiate between real and fake splice sites. These properties were used as input data to the decision tree as candidates for the generation of the rules to distinguish between the real and false splice sites in the learning data set.
Positional Nucleotide Frequencies.
Nucleotide frequencies at every position in the range of –20 to +20 nucleotides around the splice sites were calculated for both real and false splice sites (data not shown). The labelling scheme used for nucleotide positions around the splice junction is as shown in Figure 1.
Comparison of nucleotide frequency distributions in real splice sites with those in false sites revealed the following observations:
The above results suggest the appropriateness of using nucleotide positions around the splice sites as analysis candidates to build decision trees.
Dinucleotide Positions.
Preference/avoidance of nucleotides at certain positions would imply a similar pattern with regard to the occurrence of particular dinucleotides at these positions. Hence, we used the sequential dinucleotides involving adjacent nucleotide positions as additional analysis candidates. Preference/avoidance of nucleotides at certain positions located on either side of the splice junction could imply a potential pattern in the occurrence of interacting pairs of mononucleotides across the splice junction. We considered 25 such long-range base pairs involving five bases on either side of the site (excluding the GT/AG positions).
Contrast across the acceptor sites.
Introns very often end with a poly-pyrimidine tract2. The nucleotides thymine and cytosine occur more frequently than adenine and guanine in the –20 to –3 intronic region, while the composition is largely random in the +1 to +20 exonic region. Consequently, the ratio of the compositional sum of thymine and cytosine to that of adenine and guanine is higher in the intronic region than in the exonic region. This was not observed with the false splice sites, irrespective of whether the sites were derived from real introns or from exons of a gene. The values for contrast for regions of different nucleotide ranges around acceptor sites are shown in Table 1.
| Region | Real sites | Proximal false sites |
| --- | --- | --- |
| -10 to –3 / +1 to +8 | 5.33 | 0.91 |
| -16 to –3 / +1 to +14 | 4.87 | 0.92 |
| -20 to –3 / +1 to +18 | 4.08 | 0.92 |
| -30 to –3 / +1 to +28 | 2.89 | 0.94 |
| -40 to –3 / +1 to +38 | 2.27 | 0.98 |
| -50 to –3 / +1 to +48 | 1.94 | 1.05 |
Table 1: Values of contrast for different ranges of regions around the acceptor sites.
It is observed that even when the range is extended as far as 40 nucleotides on either side of the acceptor site, the contrast value remains higher for true splice sites than for false sites. It was decided to use the –20 to –3 / +1 to +18 range in further analyses to calculate contrast values because such a range is of medium length and shorter ranges may not adequately represent odd splice sites. For this range of nucleotide positions, it was found that while 92% of proximal false acceptor sites have a contrast value less than 3.0, only 31% of real splice sites have such a value.
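A minimal sketch of a contrast calculation for one acceptor site is given below. The text does not give an explicit formula, so the sketch assumes that contrast is the pyrimidine-to-purine ratio of the intronic window (–20 to –3) divided by the same ratio of the exonic window (+1 to +18). The function names and the pseudocount (added only to avoid division by zero) are assumptions of this example and are not taken from the original work.

```python
def yr_ratio(segment, pseudo=0.5):
    """(T + C) / (A + G) for a segment; `pseudo` is an assumed pseudocount."""
    seg = segment.upper()
    pyrimidines = seg.count("T") + seg.count("C")
    purines = seg.count("A") + seg.count("G")
    return (pyrimidines + pseudo) / (purines + pseudo)


def contrast(intron_tail, exon_head, pseudo=0.5):
    """Assumed definition: Y/R ratio of the intronic window over that of the exonic window."""
    return yr_ratio(intron_tail, pseudo) / yr_ratio(exon_head, pseudo)


# 18 nt at -20..-3 (pyrimidine-rich) and 18 nt at +1..+18 (balanced composition).
value = contrast("TTTTCCCCTTTTCCCCTT", "ACGTACGTACGTACGTAC")
looks_like_real_site = value >= 3.0  # 3.0 is the cut-off quoted in the text
```

Under this reading, a site with a pyrimidine-rich intronic window and a compositionally balanced exonic window obtains a contrast well above 3.0, whereas two windows of identical composition give exactly 1.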
For reasons mentioned above, we chose to use the following as analysis candidates. In the case of donor sites, they are: mononucleotide positions at –7 to –1 and +3 to +7; sequential dinucleotides involving these positions; and long-range dinucleotides involving each of the mononucleotide positions at –5 to –1 and each at +3 to +7. Similar candidates for acceptor sites are: mononucleotides at -7 to –3 and +1 to +5, sequential dinucleotides involving these positions; long-range dinucleotides involving each of the positions at –3 to –7 and each at +1 to +5; and the contrast value.
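The donor-site analysis candidates listed above can be derived from a sequence window as sketched below. The sketch assumes that positions are labelled without a position 0, that –1 is the last nucleotide before the junction and that +1, +2 are the GT dinucleotide; Figure 1 defines the actual labelling. The function name and the dictionary keys are hypothetical. Only adjacent positions within each block are treated as sequential dinucleotides here, which is one reading of the text and may not match the tally used in the paper.

```python
from itertools import product


def donor_candidates(upstream7, downstream7):
    """Build mononucleotide, sequential and long-range dinucleotide candidates.

    `upstream7` covers positions -7..-1 and `downstream7` covers +1..+7
    (assumed labelling, no position 0).
    """
    up, down = upstream7.upper(), downstream7.upper()
    if len(up) != 7 or len(down) != 7:
        raise ValueError("expected 7 nt on each side of the junction")
    base = {}
    for k in range(7):
        base[k - 7] = up[k]      # positions -7 .. -1
        base[k + 1] = down[k]    # positions +1 .. +7
    mono = list(range(-7, 0)) + list(range(3, 8))   # GT at +1, +2 is skipped
    feats = {f"[{p:+d}]": base[p] for p in mono}
    for p, q in zip(mono, mono[1:]):
        if q - p == 1:                               # adjacent positions only
            feats[f"[{p:+d},{q:+d}]"] = base[p] + base[q]
    for p, q in product(range(-5, 0), range(3, 8)):  # 25 long-range pairs
        feats[f"[{p:+d},{q:+d}]"] = base[p] + base[q]
    return feats


example = donor_candidates("CCAAGAG", "GTAAGTA")
```

The keys follow the bracket notation used for the rules in the tables, so that `example["[-1,+5]"]` is the long-range dinucleotide formed by positions –1 and +5.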
RESULTS AND DISCUSSION
The Concept of Decision Trees
We start with a learning data set consisting of records, each of which has been classified as either a real splice site or a proximal false one. The tree builder utility of the Decisionhouse application builds a binary tree by splitting the data set at each node according to the function of a single analysis candidate. At each branch point of the tree, it determines which analysis candidate makes the best split to bifurcate and separate the records of different classes into different groups (whilst those of the same class remain within the same group). A more general description of decision trees as applied to gene finding algorithms can be seen from the work of Salzberg and coworkers6.
The Gini Value.
The decision tree recursively segments a starting population of mixed real and false splice sites into sub-populations. Gini (named after an Italian economist) measures the goodness of segmentation. Traditionally, Gini is a measure of inequality among different groups. It has a low value when there is a similar distribution of the records among different groups, and has a high value when there is an inequality. For an illustrative 3-layer tree (data not shown) built for acceptor sites with a set of three far-away nucleotide positions, namely +18 to +20, as analysis candidates, Gini was reported as 11.9%. Such a low value indicated that the match rates at the end nodes (viz., 16.7%, 21.1%, 21.1% and 26.3%) are not very different from one another and from that (21%) of the starting node. When a set of more relevant nucleotide positions, –3 to –5, were specified as analysis candidates, Gini was reported as 68%, indicating that the match rates of the end nodes (0.6%, 3.8%, 12.1% and 50%) are much different from that of the starting node (21%) as well as different from one another. Thus Gini can be used as a good indication of the validity and quality of the decision tree.
Decision Trees for Splice Site Classification
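The Gini measure can be illustrated on end-node populations. The formula used by Decisionhouse is not stated in the text; the sketch below assumes a Lorenz-curve style Gini, normalised by its maximum attainable value, computed over end nodes ranked by match rate. The function name is hypothetical. The node populations are those of the four-layer acceptor tree as listed in Table 3.

```python
def normalised_gini(end_nodes):
    """Normalised Gini over end nodes given as (n_real, n_total) pairs.

    Assumed definition: twice the area under the cumulative-gain curve
    minus one, divided by (1 - overall match rate).
    """
    total = sum(n for _, n in end_nodes)
    real = sum(r for r, _ in end_nodes)
    ordered = sorted(end_nodes, key=lambda node: node[0] / node[1], reverse=True)
    area = 0.0
    x_prev = y_prev = 0.0
    cum_n = cum_r = 0
    for r, n in ordered:
        cum_n += n
        cum_r += r
        x, y = cum_n / total, cum_r / real
        area += (x - x_prev) * (y + y_prev) / 2   # trapezoid rule
        x_prev, y_prev = x, y
    return (2 * area - 1) / (1 - real / total)


# End nodes 7..14 of the four-layer acceptor tree (learning set, Table 3).
acceptor_nodes = [(0, 894), (7, 356), (32, 600), (83, 294),
                  (0, 98), (15, 67), (79, 186), (407, 465)]
gini = normalised_gini(acceptor_nodes)
```

With these populations the value comes out close to 0.89, in line with the 89% reported for the four-layer acceptor tree. The agreement suggests, but does not establish, that a measure of this kind was used.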
Interpretation of Decision Trees.
Figures 2 and 3 show the four-layer decision trees built for acceptor sites and donor sites, respectively, using the learning data set. The decision tree for acceptor sites (Figure 2) has a Gini value of 89% and has eight end nodes of differing match rates, with a minimum of 0% at node 7 and a maximum of 87.5% at node 14. It used six of the possible analysis candidates as fields to achieve the best bifurcation at branch points: contrast, [-6, -5], [-7, -6], [-5, +3], [-3, +1], and [-3, +3]. Each of the end nodes can be interpreted as a class of splice sites satisfying a specific rule, which is formulated by tracing a path from the root node to the end node. The rule can be termed a ‘negative rule’ (if the end node is enriched with false splice sites) or a ‘positive rule’ (if the end node is enriched with real splice sites). Negative rules are of interest in the context of the present work, wherein the emphasis is to filter out the proximal false positives during the computational prediction of splice sites.
Inclusiveness of Decision Tree Model.
In order to examine how inclusive the tree is, we randomly divided the learning data set into two equally sized sets (subset-1 and subset-2). The decision tree obtained for subset-1 was applied to subset-2 as well as to the full data set. Changes in Gini values were examined in each case. The exercise was repeated with subset-2 being used to generate the decision tree, which was then applied to the full data set and subset-1. There was no considerable difference in the Gini values when the tree built for one set was applied to other sets (the maximum difference observed was 5%). It was also observed that the same set of analysis candidates appeared in the decision trees that were built individually for subset-1, subset-2 and the full data set. This suggests that the decision tree built for a set of records is inclusive.
How Meaningful is the Segmentation by the Decision Tree?
In order to test whether the segmentation obtained for the learning data set of real and false splice sites could also arise in a randomly labelled data set, we shuffled the records of the learning data set for acceptor sites. Thus, a subset of false acceptor sites from the learning data set were re-classified as real ones and the remaining splice sites were re-classified as false splice sites. The complete set of records in such a shuffled data set was distributed randomly into three subsets. A decision tree of four layers was built for each such subset and then applied to the other two subsets. Gini values were recorded for each. In a similar manner, we re-classified a subset of real acceptor sites from the learning data set as false sites and the exercise was repeated. For the randomly classified data sets, the average Gini value was 38%, much lower than the 89% obtained for the learning data set previously. The Gini value was reduced on average by 19% (as compared to the 5% change with the learning data set noted earlier) when the trees were applied to other subsets.
Extending the Four-Layer Decision Trees
End nodes of the four-layer decision trees (as shown in Figures 2 and 3) are of the following three types:
Each of the eight end nodes, especially those of type (iii), of the four-layer decision tree should be segmented further. This was achieved by extending the decision tree to further layers until the Gini value was maximised (attaining a value close to 100%). However, a termination criterion is needed to assess whether a split at a branch point is reasonable and is not over-fitting the data. The criterion adopted was to stop segmenting a node when its population size is less than 16 or when the split would lead to a child node of population size less than 16. A value of 16 was chosen because the dinucleotide analysis candidate could assume 16 different values. The final set of decision trees for donor and acceptor sites was built and the rules were extracted. These are shown in Table 2 for donor sites and Table 3 for acceptor sites.
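The termination criterion can be written as a small check. The function name `split_allowed` is hypothetical and the sketch covers only the population-size test stated in the text, not the choice of the best split. The example populations are taken from the donor-site table.

```python
MIN_NODE_SIZE = 16  # a dinucleotide candidate can assume 16 different values


def split_allowed(parent_size, left_size, right_size, min_size=MIN_NODE_SIZE):
    """Return True if a node may be split into the two proposed child nodes."""
    if left_size + right_size != parent_size:
        raise ValueError("child populations must add up to the parent population")
    return (parent_size >= min_size
            and left_size >= min_size
            and right_size >= min_size)


# Donor end node VIII (623 sites) splits into VIII.1 (65) and VIII.2 (558).
ok = split_allowed(623, 65, 558)
# Donor end node VI holds 21 sites, so no split can give two children of 16 or more.
blocked = split_allowed(21, 16, 5)
```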
| End nodes (Figure 3) – (and scheme of classification) | Population of end nodes of the tree for the learning set. Given in brackets are those from test set when the tree was applied. | Rules |
| --- | --- | --- |
| 1. RULES AS DERIVED FROM THE FOUR-LAYER TREE (see Figure 3) | | |
| 7 – (I) | 1/1162 [4/466] | ([+5,+6] != RT, GA) & ([-2,-1] != CG, AG) & ([+3,+4] != AN, GA). |
| 8 – (II) | 11/221 [6/97] | ([+5,+6] = RT, GA) & ([-2,-1] != CG, AG) & ([+3,+4] != AN, GA). |
| 9 – (III) | 26/166 [7/57] | ([+5,+6] != GY) & ([-2,-1] = CG, AG) & ([+3,+4] != AN, GA). |
| 10 – (IV) | 41/54 [21/26] | ([+5,+6] = GY) & ([-2,-1] = CG, AG) & ([+3,+4] != AN, GA). |
| 11 – (V) | 0/189 [1/60] | ([-5,-4] != RC) & ([-1,+5] != GN, {T,C,A}G) & ([+3,+4] = AN, GA). |
| 12 – (VI) | 3/21 [0/5] | ([-5,-4] = RC) & ([-1,+5] != GN, {T,C,A}G) & ([+3,+4] = AN, GA). |
| 13 – (VII) | 14/84 [2/34] | ([-2,+5] != NG, A{A,C,T}) & ([-1,+5] = GN, {T,C,A}G) & ([+3,+4] = AN, GA). |
| 14 – (VIII) | 523/623 [188/233] | ([-2,+5] = NG, A{A,C,T}) & ([-1,+5] = GN, {T,C,A}G) & ([+3,+4] = AN, GA). |
| 2. RULES AS DERIVED FROM THE EXTENDED TREE | | |
| I | 1/1162 [4/466] | (I) |
| II.1 | 0/143 [0/62] | ([-1,+4] != GN, YC) & (II). |
| II.2.1 | 0/34 [0/10] | ([-4,-3] = CY, TA, GT, TT, AG, AC) & ([-1,+4] = GN, YC) & (II). |
| II.2.2 | 11/44 [6/25] | ([-4,-3] != CY, TA, GT, TT, AG, AC) & ([-1,+4] = GN, YC) & (II). |
| III.1 | 0/64 [0/20] | ([-3,+5] = TN, CT, G{C,T,A}) & (III). |
| III.2.1 | 5/60 [4/19] | ([-7,-6] != CN, A{T,A,C}) & ([-3,+5] != TN, CT, G{C,T,A}) & (III). |
| III.2.2 | 21/42 [3/18] | ([-7,-6] = CN, A{T,A,C}) & ([-3,+5] != TN, CT, G{C,T,A}) & (III). |
| IV | 41/54 [21/26] | (IV) |
| V | 0/189 [1/60] | (V) |
| VI | 3/21 [0/5] | (VI) |
| VII.1 | 0/46 [1/17] | ([-4,+7] != RA, A{G,C}, C{T,A}, TG) & (VII). |
| VII.2 | 14/38 [1/17] | ([-4,+7] = RA, A{G,C}, C{T,A}, TG) & (VII). |
| VIII.1.1 | 1/33 [0/17] | ([-3,+4] != CY, {G,A,C}A) & ([-1,+6] != NT, G{G,C,A}) & (VIII). |
| VIII.1.2 | 16/32 [21/28] | ([-3,+4] = CY, {G,A,C}A) & ([-1,+6] != NT, G{G,C,A}) & (VIII). |
| VIII.2 | 506/558 [167/188] | ([-1,+6] = NT, G{G,C,A}) & (VIII). |
Table 2: Validation Rules for donor sites and their performances on the test data set.
The learning set comprises 619 real sites and 1901 false sites (2520 sites in total). The test set comprises 229 real sites and 749 false sites (978 sites in total). (i) The classification scheme (column 1) used for the end nodes (in the case of the extended tree) indicates the number of layers required to segment further the end nodes of the four-layer tree (e.g. VIII.1.2 indicates that end node 14 of the four-layer tree was segmented to two further layers). Negative rules are given in italic font while positive rules are given in normal font. (ii) The population (column 2) given as X/Y indicates that, of the Y sites, X are real sites. The population given in square brackets is that for the test set when the tree derived using the learning set was applied to the test data set. (iii) The symbols used in the last column have the following meanings: "!=" indicates the Boolean operator "not equal to"; "=" indicates "equal to"; and "&" indicates the Boolean operation "and". R = purine (A, G); Y = pyrimidine (T, C); N = any nucleotide (T, C, A, G); {X,Y}Z represents the dinucleotides XZ and YZ; X{Y,Z} represents the dinucleotides XY and XZ. (iv) VIII.2 can be split further but doing so did not improve the model.
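The rule notation explained in the notes above is simple to evaluate in code. The following sketch expands a pattern list such as "GN, {T,C,A}G" into the set of dinucleotides it denotes and checks donor rule I against a set of candidate values. The helper names (`expand`, `holds`, `rule_matches`) and the dictionary layout are hypothetical and do not describe how SpliceProximalCheck is implemented.

```python
import re
from itertools import product

CODES = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "TC", "N": "ACGT"}
SLOT = re.compile(r"\{[^}]*\}|[ACGTRYN]")   # a brace group or a single code


def expand(patterns):
    """Expand e.g. 'GN, {T,C,A}G' into the set of dinucleotides it denotes."""
    result = set()
    compact = patterns.replace(" ", "")
    # Split on commas that are not inside a {...} group.
    for token in re.split(r",(?![^{]*\})", compact):
        slots = SLOT.findall(token)
        if len(slots) != 2:
            raise ValueError(f"not a dinucleotide pattern: {token!r}")
        choices = []
        for slot in slots:
            letters = slot.strip("{}").replace(",", "")
            choices.append("".join(CODES[c] for c in letters))
        result.update(a + b for a, b in product(*choices))
    return result


def holds(dinucleotide, op, patterns):
    """Evaluate one condition, with op being '=' or '!='."""
    inside = dinucleotide.upper() in expand(patterns)
    return inside if op == "=" else not inside


def rule_matches(candidates, rule):
    """A rule is the conjunction ('&') of its conditions."""
    return all(holds(candidates[key], op, pats) for key, op, pats in rule)


# Donor rule I from the four-layer tree.
RULE_I = [("[+5,+6]", "!=", "RT, GA"),
          ("[-2,-1]", "!=", "CG, AG"),
          ("[+3,+4]", "!=", "AN, GA")]
site = {"[+5,+6]": "CC", "[-2,-1]": "TT", "[+3,+4]": "CT"}
matches_rule_one = rule_matches(site, RULE_I)
```

Rules of the extended tree refer back to a four-layer rule, for example "& (II)". They could be represented by concatenating the condition lists.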
| End nodes (Figure 2) - (and scheme of classification) | Population of end nodes of the tree for the learning set. Given in brackets are those from test set. | Rules |
| --- | --- | --- |
| 1. RULES AS DERIVED FROM THE FOUR-LAYER TREE (see Figure 2) | | |
| 7 – I | 0/894 [2/345] | ([-7,-6] != C{C,T,G}, AA, TT) & ([-3,+1] != CN, T{C,G}) & (contrast < 2.38). |
| 8 – II | 7/356 [3/144] | ([-7,-6] = C{C,T,G}, AA, TT) & ([-3,+1] != CN, T{C,G}) & (contrast < 2.38). |
| 9 – III | 32/600 [15/214] | ([-6,-5] != YY) & ([-3,+1] = CN, TC, TG) & (contrast < 2.38). |
| 10 - IV | 83/294 [38/116] | ([-6,-5] = YY) & ([-3,+1] = CN, TC, TG) & (contrast < 2.38). |
| 11 - V | 0/98 [5/36] | ([-5,+3] != {A,T}A, C{C,T,G}) & ([-3,+3] = GN, A{G,C,A}) & (contrast >= 2.38). |
| 12 - VI | 15/67 [3/29] | ([-5,+3] = {A,T}A, C{C,T,G}) & ([-3,+3] = GN, A{G,C,A}) & (contrast >= 2.38). |
| 13 - VII | 79/186 [28/51] | ([-7,-6] != AT, YY) & ([-3,+3] != GN, A{G,C,A}) & (contrast >= 2.38). |
| 14 - VIII | 407/465 [142/170] | ([-7,-6] = {A,C,T}T, YC) & ([-3,+3] != GN, A{G,C,A}) & (contrast >= 2.38). |
| 2. RULES AS DERIVED FROM THE EXTENDED TREE | | |
| I | 0/894 [2/345] | (I). |
| II.1 | 0/242 [0/104] | ([-5,+4] != RC, YT, TA) & (II). |
| II.2.1 | 0/71 [1/26] | ([+1,+2] != G{A,C,G}, {A,T}T) & ([-5,+4] = RC, YT, TA) & (II). |
| II.2.2 | 7/43 [2/14] | ([+1,+2] = G{A,C,G}, {A,T}T) & ([-5,+4] = RC, YT, TA) & (II). |
| III.1 | 1/354 [2/102] | ([-7,+1] != NG, GT, T{A,C}) & (III). |
| III.2.1 | 4/114 [0/45] | (contrast < 0.88) & ([-7,+1] = NG, GT, T{A,C}) & (III). |
| III.2.2.1 | 0/28 [2/17] | ([+5,+6] = CR, G{G,T}, AC) & (contrast >= 0.88) & ([-7,+1] = NG, GT, T{A,C}) & (III). |
| III.2.2.2.1 | 7/63 [5/28] | ([+3,+4] = C{A,C}, {A,T}G, G{A,G,C}) & ([+5,+6] != CR, G{G,T}, AC) & (contrast >= 0.88) & ([-7,+1] = NG, GT, T{A,C}) & (III). |
| III.2.2.2.2 | 20/41 [6/22] | ([+3,+4] != C{A,C}, {A,T}G, G{A,G,C}) & ([+5,+6] != CR, G{G,T}, AC) & (contrast >= 0.88) & ([-7,+1] = NG, GT, T{A,C}) & (III). |
| IV.1.1 | 0/65 [2/22] | ([-7,+3] = G{C,G,A}, AR, T{T,G}, CC) & (contrast < 1.1) & (IV). |
| IV.1.2.1 | 0/30 [0/9] | ([-4,+1] = C{G,T}, T{A,C,T}, GA) & ([-7,+3] != G{C,G,A}, AR, T{T,G}, CC) & (contrast < 1.1) & (IV). |
| IV.1.2.2 | 11/41 [1/11] | ([-4,+1] != C{G,T}, T{A,C,T}, GA) & ([-7,+3] != G{C,G,A}, AR, T{T,G}, CC) & (contrast < 1.1) & (IV). |
| IV.2.1.1 | 1/26 [3/10] | ([+2,+3] = {C,A}A, RT) & ([-4,+5] != RC, GT, TG) & (contrast >= 1.1) & (IV). |
| IV.2.1.2 | 39/89 [17/43] | ([+2,+3] != {C,A}A, RT) & ([-4,+5] != RC, GT, TG) & (contrast >= 1.1) & (IV). |
| IV.2.2 | 32/43 [15/21] | ([-4,+5] = RC, GT, TG) & (contrast >= 1.1) & (IV). |
| V | 0/98 [5/36] | (V) |
| VI.1 | 0/36 [0/15] | ([+1,+2] = TN, AR, C{A,C}) & (VI). |
| VI.2 | 15/31 [3/14] | ([+1,+2] != TN, AR, C{A,C}) & (VI). |
| VII.1.1 | 0/50 [1/7] | ([-5,+2] != A{G,T}, T{G,T,C}) & ([-6,-5] = G{A,G,T}, AR, TA, CT) & (VII). |
| VII.1.2 | 13/35 [6/11] | ([-5,+2] = A{G,T}, T{G,T,C}) & ([-6,-5] = G{A,G,T}, AR, TA, CT) & (VII). |
| VII.2 | 66/101 [21/33] | ([-6,-5] != G{A,G,T}, AR, TA, CT) & (VII). |
| VIII | 407/465 [142/170] | (VIII) |
Table 3: Validation Rules for acceptor sites and their performances on the test data set.
The learning set comprises 623 real sites and 2337 false sites (2960 sites in total). The test set comprises 236 real sites and 869 false sites (1105 sites in total). (i) The classification scheme (column 1) indicates the number of layers required to segment further the end nodes of the four-layer tree (e.g. VIII.1.2 indicates that end node 14 of the four-layer tree was segmented to two further layers). Negative rules are given in italic font while positive rules are given in normal font. (ii) The population (column 2) given as X/Y indicates that, of the Y sites, X are real sites. The population given in square brackets is that for the test set when the tree derived using the learning set was applied to the test data set. (iii) The symbols used in the last column have the following meanings: "!=" indicates the Boolean operator "not equal to"; "=" indicates "equal to"; and "&" indicates the Boolean operation "and". R = purine (A, G); Y = pyrimidine (T, C); N = any nucleotide (T, C, A, G); {X,Y}Z represents the dinucleotides XZ and YZ; X{Y,Z} represents the dinucleotides XY and XZ. (iv) VII.2 and VIII could be split further but doing so did not improve the model.
Validation Rules as Derived from the Extended Decision Trees
Donor Sites.
A set of nine negative rules (shown in italic font in Table 2) accounted collectively for 92% of the false donor sites from the learning data set with an error rate of 0.6%. On applying the decision tree to the test data set, these nine negative rules identified 89% of the false donor sites with an error rate of 1.5%. A set of six positive rules (shown by normal font in Table 2) accounted collectively for 98% of the real donor sites from the learning data set with a specificity of 79%. On applying the decision tree to the test data set, these six positive rules identified 96% of the real donor sites with a specificity of 73%.
Acceptor Sites.
A set of twelve negative rules (shown in italic font in Table 3) accounted collectively for 86% of the false acceptor sites from the learning data set with an error rate of 0.3%. On applying the decision tree to the test data set, these twelve negative rules identified 83% of the false acceptor sites from the test data set with an error rate of 2.4%. A set of ten positive rules (shown by normal font in Table 3) accounted collectively for 99% of the real acceptor sites from the learning data set with a specificity of 65%. On applying the decision tree to the test data set, these ten positive rules identified 92% of the real acceptor sites with a specificity of 60%.
The set of nine negative rules was enough to classify 92% of the proximal false donor sites from the learning data set, and a set of twelve rules was enough to classify 86% of the proximal false acceptor sites from the learning data set. These rules correctly predicted 89% of the proximal false donor sites and 83% of the false acceptor sites in a test data set. Given the earlier observations3 that one in every three false positives, as well as more than half of the wrong ends of partially correct predicted exons, occur in the vicinity of real sites, the rules presented herein can be used to help improve the prediction accuracy of the current programs. Thus these rules can either form an integral part of splice site prediction programs, where they affect the scoring system, or act as filters on the predicted splice sites. The rules are deterministic and they are simple to code in computer programs.
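The donor-site percentages quoted above can be recomputed from the end-node populations of the extended tree in Table 2. In the sketch below, the assignment of rules to the negative and positive groups is an inference, since the italic marking is not reproduced in this text. The grouping was chosen because it contains nine and six rules, respectively, and because it reproduces the quoted figures to rounding. The function and variable names are hypothetical.

```python
# (real, total) per end node: learning set first, test set second (Table 2).
DONOR_NEGATIVE = {  # inferred grouping, see lead-in
    "I": ((1, 1162), (4, 466)), "II.1": ((0, 143), (0, 62)),
    "II.2.1": ((0, 34), (0, 10)), "III.1": ((0, 64), (0, 20)),
    "III.2.1": ((5, 60), (4, 19)), "V": ((0, 189), (1, 60)),
    "VI": ((3, 21), (0, 5)), "VII.1": ((0, 46), (1, 17)),
    "VIII.1.1": ((1, 33), (0, 17)),
}
DONOR_POSITIVE = {  # inferred grouping, see lead-in
    "II.2.2": ((11, 44), (6, 25)), "III.2.2": ((21, 42), (3, 18)),
    "IV": ((41, 54), (21, 26)), "VII.2": ((14, 38), (1, 17)),
    "VIII.1.2": ((16, 32), (21, 28)), "VIII.2": ((506, 558), (167, 188)),
}


def summarise(negative, positive, which, n_real, n_false):
    """Summarise rule performance; `which` is 0 (learning set) or 1 (test set)."""
    neg_real = sum(v[which][0] for v in negative.values())
    neg_total = sum(v[which][1] for v in negative.values())
    pos_real = sum(v[which][0] for v in positive.values())
    pos_total = sum(v[which][1] for v in positive.values())
    return {
        "false_sites_identified": (neg_total - neg_real) / n_false,
        "negative_error_rate": neg_real / neg_total,
        "real_sites_identified": pos_real / n_real,
        "positive_specificity": pos_real / pos_total,
    }


learning = summarise(DONOR_NEGATIVE, DONOR_POSITIVE, 0, n_real=619, n_false=1901)
test_set = summarise(DONOR_NEGATIVE, DONOR_POSITIVE, 1, n_real=229, n_false=749)
```

The same calculation could be applied to the acceptor rules of Table 3 once the negative and positive groups are identified.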
It is to be noted that the decision tree models reported herein could identify only 92% of the false donor sites and 86% of the false acceptor sites. The probable reasons for these values are that the decision tree algorithm implemented in Decisionhouse may not be the most efficient one and that the analysis candidates used (such as the sequential and long-range dinucleotides and contrast) do not adequately describe the biologically relevant information that is part of the splicing mechanism. Characteristics delineating branch points (in the case of acceptor sites) might have served as another set of analysis candidates, but they could not be used because their annotation is lacking in the databases.
Splice Signals
Examination of the rules (as given in Tables 2 and 3) gave a strong indication of the possibility of interactions involving nucleotide positions that arise from either side of the splice sites, in addition to interactions among bases within either the upstream or downstream regions from the site. In 82% of the cases of real donor sites (as seen from rule VIII.2 in Table 2), the signals were in the form of long-range dinucleotide positions involving –2 to –1 and +5 to +6 and of the sequential dinucleotide positions at (+3, +4). In 72% of the proximal false donor sites (as seen from the rules I and II in Table 2), the signals were in the form of only the sequential dinucleotide positions (-2, -1) and (+3, +4). In 65% of the real acceptor site cases (as seen from rule VIII in Table 3), the signals were in the form of long-range dinucleotide position (-3, +3) and the sequential dinucleotide position (-7, -6). In 53% of the false acceptor site cases (as seen from rules I and II in Table 3), the signals were in the form of only the long-range dinucleotide position (-3, +1). The significant positional interactions that carry splice signals are enumerated below.
Donor Sites:
The rules (Table 2) revealed that only 4 of the 12 sequential dinucleotide positions considered carry splice signals. The most prominent dinucleotides are (+3, +4), (-2, -1) and (+5, +6), which occurred in the four-layer tree. The fourth, namely (-7, -6), occurred in the extended tree. The patterns also revealed that only 6 of the possible 25 long-range dinucleotide positions carried splice signals. The most prominent long-range dinucleotides are (-1, +5) and (-2, +5); they occurred in the four-layer tree. The remaining four, namely (-3, +5), (-4, +7), (-3, +4) and (-1, +6), occurred in the extended tree. The overall scenario for donor sites is as follows: (i) There are strong long-range interactions from position +5 to positions –1, –2 and –3. (ii) The sequential dinucleotides at (-2, -1), (+3, +4) and (+5, +6) act as distinct motifs.
Acceptor Sites:
Contrast has been identified as the primary split candidate, emphasising that the poly-pyrimidine tract present at the end of introns acts as a primary splice signal for acceptor sites. The rules revealed that six of the 10 sequential dinucleotide positions considered carry splice signals. The most prominent sequential dinucleotides are (-7, -6) and (-6, -5); they occurred in the four-layer tree. The remaining four, namely (+1, +2), (+2, +3), (+3, +4) and (+5, +6), occurred in the extended tree. The rules also revealed that 8 of the possible 25 long-range dinucleotide positions carried splice signals. The most prominent long-range dinucleotides are (-3, +1), (-3, +3) and (-5, +3); they occurred in the four-layer tree. The remaining six, namely (-4, +1), (-4, +5), (-5, +2), (-5, +3), (-7, +1) and (-7, +3), occurred in the extended tree. These 14 dinucleotide units were enough to discriminate real acceptor sites from the proximal false sites. The overall scenario for acceptor sites is as follows: (i) The poly-pyrimidine tract is a strong splice signal. (ii) There are strong long-range interactions from position –3 to positions +1 and +3. (iii) The sequential dinucleotides at (-7, -6) and (-6, -5) act as distinct motifs.
The observed possibility that nearby or long-range nucleotide positions can form dinucleotide units of splice signals has been discussed in the literature earlier7. This agrees well with the splicing mechanism of donor site selection: U1 snRNA first base pairs with the region –4 to +6 around the donor site, and the upstream and downstream parts of the same region are later recognised by U5 and U6 snRNAs. Thus the long-range nucleotide positions might depend on one another. In the case of acceptor site selection, the splicing factor U2AF35 recognises the 3’ splice site AG in a sequence-specific manner8.
AVAILABILITY OF THE PROGRAMS AND DATA SETS
The learning data set used in this work is available on the WWW at http://www.ebi.ac.uk/~thanaraj/splice.html. The test data set of EST-confirmed sites is also available from the same web site.
SpliceProximalCheck program
A computer implementation of the trees is available at
http://www.ebi.ac.uk/~thanaraj/SpliceProximalCheck.html. As input it takes the sequence on either side of a putative splice site: 7 nucleotides on each side for donor sites, or 20 nucleotides on each side for acceptor sites. The sequences are then checked against the validation rules and the putative site is marked as a proximal false site, a real splice site, or undecided. The utility of the program is clearest in the following situation: when a putative splice site (as predicted by exon prediction programs that use coding potential, among other signals) is identified as a proximal false site by SpliceProximalCheck, the user can scrutinise the nearby cryptic sites with the same program, thus improving the predictive ability of the programs.
Integration of SpliceProximalCheck with publicly available tools for exon predictions
As discussed earlier, the motivation behind developing SpliceProximalCheck is to enable further scrutiny of the exons derived by gene prediction programs. It is therefore highly desirable to offer the gene annotation community an integrated system with SpliceProximalCheck as a front-end tool to exon prediction programs. For this purpose, an integrated system with SpliceProximalCheck as a front-end tool to MZEF, a publicly available program for exon predictions5, is available at http://www.ebi.ac.uk/~thanaraj/MZEF-SPC.html.
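The integration described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the MZEF-SPC code. Exon coordinates are taken to be 1-based and inclusive. The `classify` callable stands in for the decision-tree rules. The rule for combining the two site labels into an exon label is an assumption made for this sketch. All names are hypothetical.

```python
from typing import Callable, Optional

DONOR_FLANK = 7      # nucleotides on each side of a donor site (from the text)
ACCEPTOR_FLANK = 20  # nucleotides on each side of an acceptor site (from the text)


def site_window(seq: str, junction: int, flank: int) -> Optional[str]:
    """Return `flank` bases on each side of a junction, or None near sequence ends.

    `junction` is the 0-based index of the first base downstream of the junction.
    """
    lo, hi = junction - flank, junction + flank
    if lo < 0 or hi > len(seq):
        return None
    return seq[lo:hi]


def check_exon(seq: str, start: int, end: int,
               classify: Callable[[str, str], str]) -> str:
    """Label a predicted exon (1-based, inclusive coordinates) from its two sites.

    `classify(site_type, window)` is a placeholder for the real rules and must
    return 'FALSE', 'Possibly true' or 'Undecided'.
    """
    windows = {
        "acceptor": site_window(seq, start - 1, ACCEPTOR_FLANK),
        "donor": site_window(seq, end, DONOR_FLANK),
    }
    labels = ["Undecided" if w is None else classify(kind, w)
              for kind, w in windows.items()]
    # Assumed combination rule: one rejected boundary rejects the exon.
    if "FALSE" in labels:
        return "FALSE"
    if all(label == "Possibly true" for label in labels):
        return "Possibly true"
    return "Undecided"


def toy_classify(site_type: str, window: str) -> str:
    """Toy stand-in: checks only the AG/GT consensus, NOT the decision-tree rules."""
    flank = len(window) // 2
    if site_type == "acceptor":
        ok = window[flank - 2:flank] == "AG"
    else:
        ok = window[flank:flank + 2] == "GT"
    return "Possibly true" if ok else "FALSE"


# Hypothetical sequence: intron ending in AG, a 30-base exon, intron starting with GT.
seq = "T" * 28 + "AG" + "ATGGCC" * 5 + "GTAAGT" + "A" * 24
print(check_exon(seq, 31, 60, toy_classify))  # both boundaries pass the toy check
print(check_exon(seq, 33, 60, toy_classify))  # shifted acceptor fails the toy check
```

Passing the classifier in as a parameter keeps the window extraction separate from the rules, so the same loop could be run over the exons reported by any exon prediction program.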
ILLUSTRATIVE EXAMPLES FOR MZEF-SPC AND PERFORMANCE TESTS
As discussed so far, the reported program specialises in identifying proximal false positive splice sites. It labels a given site as ‘FALSE’, ‘Possibly true’ or ‘Undecided’. Below we present two real examples of how SPC improves on the results of MZEF predictions. The emphasis is on illustrating how the false positive predictions from MZEF are identified by SPC. For this purpose, we randomly chose two human DNA sequences, namely HSERPG and HSU52852, from the EMBL nucleotide sequence database9. The results from MZEF-SPC are shown in Table 4. The predicted exons are ordered by their P-value (as determined by MZEF). The following observations were made.
| MZEF-predicted exon | MZEF P value | Acceptor: as per EMBL | Acceptor: output of SPC | Acceptor: agree? | Donor: as per EMBL | Donor: output of SPC | Donor: agree? | Exact exon: as per EMBL | Exact exon: output of SPC | Exact exon: agree? |
|---|---|---|---|---|---|---|---|---|---|---|
| I. EMBL ID : HSU52852 | | | | | | | | | | |
| 3806 – 3957 | 1.00 | T | T | YES | T | T | YES | T | T | YES |
| 3849 – 3957 | 0.99 | F | F | YES | | | | F | F | YES |
| 3906 – 3957 | 0.99 | F | F | YES | | | | F | F | YES |
| 6729 – 6837 | 0.99 | T | T | YES | F | T | NO | F | T | NO |
| 3806 – 3961 | 0.97 | | | | F | F | YES | F | F | YES |
| 4062 – 4191 | 0.96 | T | T | YES | F | T | NO | F | T | NO |
| 4062 – 4187 | 0.95 | | | | T | T | YES | T | T | YES |
| 3886 – 3957 | 0.94 | F | T | NO | | | | F | T | NO |
| 4277 – 4374 | 0.92 | T | T | YES | T | T | YES | T | T | YES |
| 6437 – 6606 | 0.92 | F | F | YES | T | T | YES | F | F | YES |
| 6426 – 6606 | 0.91 | T | T | YES | | | | T | T | YES |
| 5584 – 5710 | 0.87 | T | T | YES | T | T | YES | T | T | YES |
| 5806 – 5900 | 0.85 | T | T | YES | T | T | YES | T | T | YES |
| 3876 – 3957 | 0.82 | F | F | YES | | | | F | F | YES |
| 6770 – 6837 | 0.79 | F | F | YES | | | | F | F | YES |
| 6388 – 6606 | 0.75 | F | F | YES | | | | F | F | YES |
| 4062 – 4147 | 0.69 | | | | F | F | YES | F | F | YES |
| 2676 – 2734 | 0.67 | F | F | YES | T | T | YES | F | F | YES |
| 6481 – 6606 | 0.66 | F | F | YES | | | | F | F | YES |
| 3816 – 3957 | 0.66 | F | F | YES | | | | F | F | YES |
| 5584 – 5652 | 0.57 | | | | F | F | YES | F | F | YES |
| 3849 – 3961 | 0.57 | | | | | | | F | F | YES |
| 3915 – 3957 | 0.57 | F | T | NO | | | | F | T | NO |
| 6437 – 6620 | 0.55 | | | | F | T | NO | F | F | YES |
| 6426 – 6620 | 0.52 | | | | | | | F | T | NO |
| II. EMBL ID : HSERPG | | | | | | | | | | |
| 1596 – 1682 | 1.00 | T | T | YES | T | T | YES | T | T | YES |
| 2294 – 2473 | 1.00 | T | T | YES | T | T | YES | T | T | YES |
| 1288 – 1339 | 0.98 | F | F | YES | T | T | YES | F | F | YES |
| 512 – 627 | 0.97 | F | T | NO | T | T | YES | F | T | NO |
| 1619 – 1682 | 0.97 | F | T | NO | | | | F | T | NO |
| 375 – 627 | 0.92 | F | F | YES | | | | F | F | YES |
| 1254 – 1339 | 0.88 | F | T | NO | | | | F | T | NO |
| 351 – 627 | 0.86 | F | F | YES | | | | F | F | YES |
| 1194 – 1339 | 0.76 | T | F | NO | | | | T | F | NO |
| 2303 – 2473 | 0.72 | F | F | YES | | | | F | F | YES |
| 2387 – 2473 | 0.69 | F | T | NO | | | | F | T | NO |
| 585 – 627 | 0.67 | F | T | NO | | | | F | T | NO |
| 1301 – 1339 | 0.63 | F | T | NO | | | | F | T | NO |
| 1303 – 1339 | 0.55 | F | F | YES | | | | F | F | YES |
Table 4. Illustrative examples of the results from MZEF-SPC program.
"T" under the ‘SPC’ column means "Possibly True". The results under "ACCEPTOR" and "DONOR" columns are shown only for unique acceptor and donor sites. The analysis was extended to 8 more entries and the consolidated results are discussed in the text. ‘Exact exon’ indicates the exons for which both the boundaries are correctly predicted.
Thus, SPC brings about an improvement by helping to eliminate false positive sites. To substantiate this observation, we repeated the analysis for 8 more randomly chosen entries (namely AF101475, AF134406, AF135027, AF166330, AF184072, AF189277, AF195508 and AF228497) and consolidated the results. The following observations were made:
Thus, for a small loss in sensitivity, SPC brings about a substantial improvement in specificity: 60% to 66% of the MZEF-predicted false positives can be eliminated for a loss of only 3 out of ~33 true positive predictions.
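The trade-off can be made concrete with the consolidated counts quoted in the abstract (61 of 93 MZEF-predicted false positive exons eliminated, 3 of 33 true positive exons lost). The short calculation below is a sketch only. It assumes the gene-prediction convention in which specificity is TP / (TP + FP). Because the total number of real exons is not given here, the retained fraction is expressed relative to the MZEF true positives rather than as an absolute sensitivity. The function name is hypothetical.

```python
def summarise(tp: int, fp: int, tp_lost: int, fp_removed: int) -> dict:
    """Summarise the effect of filtering exon predictions.

    tp, fp      : true and false positive exons before filtering
    tp_lost     : true positives wrongly rejected by the filter
    fp_removed  : false positives correctly rejected by the filter
    """
    tp_after = tp - tp_lost
    fp_after = fp - fp_removed
    return {
        "fp_removed_fraction": fp_removed / fp,
        "tp_retained_fraction": tp_after / tp,
        # Assumed definition: specificity = TP / (TP + FP)
        "specificity_before": tp / (tp + fp),
        "specificity_after": tp_after / (tp_after + fp_after),
    }


# Counts quoted in the abstract for the illustrative gene set.
for name, value in summarise(tp=33, fp=93, tp_lost=3, fp_removed=61).items():
    print(f"{name}: {value:.3f}")
```

With these counts, about two thirds of the false positive exons are removed while roughly nine in ten of the true positive exons are kept.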
ACKNOWLEDGEMENTS
The authors thank Martin Senger for help with developing the MZEF-SPC web server and Alvis Brazma for discussions on decision trees.
REFERENCES