AltSplice Human Release 2 (April 2005)
Introduction
The data in this release is generated for
annotated genes from Ensembl release2_27.35a.1. The extracted nucleotide region
includes the gene region as defined in Ensembl. Such a region is
extended both at the 5' and 3' ends by 3000 bases. Human EST and mRNA
(transcript) sequences are mapped to these extended gene regions.
Transcript confirmed introns and exons are delineated from these
alignments. The matching transcript sequences are then grouped so
that each group represents one isoform splice pattern. Each group is
represented by a representative transcript structure, called a splice
pattern. Isoform peptide sequences as expressed by the splice patterns
have been delineated and are presented as part of the database.
Such isoform splice patterns are compared
with one another to delineate the alternative events. Thus the
presented data lists all the alternative splice events as seen in the
observed transcripts for a gene. The basic events that are identified
in this work are: exon isoforms (extension/truncation of an exon),
cassette exons (an exon is present in one transcript but absent in an
isoform of the transcript), alternating exons (exons are used in
alternative transcripts in a mutually exclusive manner), and intron
retention (a nucleotide region is used as an exon in a transcript while
it is an intron in an alternative transcript). The latter three events
(namely cassette exon, alternating exon, intron retention) are further
characterised as 'complex' or 'simple' depending on whether the 5'
and/or 3' flanking exons are also modified (e.g. the flanking
exon may be extended or truncated, or the exon that flanks a retained
intron may be cassetted or alternated).
Introns/exons are annotated for splice
signals such as donor/acceptor site strength, branch points, and
polypyrimidine tracts. Conserved exons/introns/events in the
orthologous genes from human and mouse have been identified and are
annotated in the database. SNP positions and alleles have been
mapped to our data and are displayed for isoform splice patterns as
well as for individual events. Annotation pertaining to the expression
states of the isoforms is being added to the data. Subtractive library
expression queries can now be made from the interfaces. We will
continue to annotate this data with further features.
In this release we have integrated AEdb with AltSplice (for both
human and mouse entries). Entries that are common to AltSplice and AEdb
are linked, and are marked as such on the result pages of queries to
AltSplice and/or AEdb. Queries can also be made for the common entries.
AltSplice exons and splice events that have experimental evidence from
AEdb are marked accordingly. In addition, we have built a wrapper that
passes queries on to both AEdb and AltSplice.
We have further implemented a
SplicePatternViewer to visualise the isoform splice patterns. AltSplice
data can also be seen from the geneview and contigview pages of Ensembl.
Documentation
Concise documentation of the
procedure followed to produce this data and the naming conventions used
is available in PDF format.
The document is a work in progress and will be updated from time to
time to reflect new developments.
Statistics
Gene set
| Start-up gene set (Ensembl 19.34b2) | 22216 human genes |
| After cleanup | 21796 genes |
| No. of genes with one or more confirmed intron/exon features | 16293 |
Grand totals of genes, transcript sequences, transcript classes, and events
| Genes | 16293 |
| EST/mRNA sequences | 915500 |
| Confirmed introns | 184731 |
| Confirmed exons | 137195 |
| Total number of transcript structures | 898295 |
| Avg. contexts per unique exon | 289646 / 137195 = 2.1 |
| Avg. contexts per unique intron | 366019 / 184731 = 2.0 |
| Genes with >1 splice pattern | 13572 |
| Genes with delineated events | 9945 |
| Total number of exon events | 33338 |
| Exon Isoform events | 7575 |
| Cassette exon events | 18815 |
| Alternating exon events | 1678 |
| Intron retention events | 5270 |
| Intron Isoform events | 13874 |
| Total number of intron events | 39637 |
| Events per gene | 4.0 |
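The event totals in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the figures from the table. The reading of "Events per gene" as the intron-event total divided by the number of genes with delineated events is an assumption that happens to reproduce the listed value; the release notes do not state how it was computed.

```python
# Figures copied from the statistics table.
exon_isoform = 7575
cassette_exon = 18815
alternating_exon = 1678
intron_retention = 5270
intron_isoform = 13874
genes_with_events = 9945

# Cassette, alternating and retention events count towards both totals.
shared = cassette_exon + alternating_exon + intron_retention
total_exon_events = exon_isoform + shared      # table lists 33338
total_intron_events = intron_isoform + shared  # table lists 39637

# Assumption: the ratio below is how "Events per gene" was derived.
events_per_gene = round(total_intron_events / genes_with_events, 1)

# Average contexts per unique exon / intron, as given in the table.
exon_contexts = round(289646 / 137195, 1)
intron_contexts = round(366019 / 184731, 1)

print(total_exon_events, total_intron_events, events_per_gene)
print(exon_contexts, intron_contexts)
```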
Distribution statistics
Various distributions are located on the distribution
statistics page (genes per classes, intron types, event
types, length of retained introns and cassette exons, effective length
change of exon isoforms).
Data files
In this release the following data files are available:
Examples and formats of the data files.
Gene file
The gene file has a standard FASTA format
with a header, and a section right after the header that lists the
sequence. The header will show the Ensembl gene identifier, and various
flags. The important flag is the 'ext' flag which indicates how many
bases we have added up- and downstream to the listed sequence.
Example:
>ENSG00000170613 chrom: 5 strand: -1 orientation: reversed ext: 3000 map_start: 3001(local) => 161664496(ensembl) map_end: 6941(local) => 161668436(ensembl)
TGCTATATTCCTGCACCTAAAACAGGGTCTGGCACAAAGTAAACAAT
TAATTATATTAATGGAGTGAATGGATAAATTTATGCTGCTTTGCATT
TGTATGTTTGTATTCTATCTGTCCTATTAGTGTCACCAGTCTAGTCC
:
GTCTTTGAAGCAGAGGAAA
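As an illustration of how the header flags might be read programmatically, here is a minimal Python sketch. The function name `parse_gene_header` and the dictionary keys are hypothetical, and the regular expression assumes the flags always appear in the order shown in the example; whitespace is normalised first so that a header wrapped over several lines is also accepted.

```python
import re

# Pattern for the header fields, in the order shown in the example.
HEADER_RE = re.compile(
    r">(?P<gene>\S+)\s+chrom:\s*(?P<chrom>\S+)\s+strand:\s*(?P<strand>-?\d+)\s+"
    r"orientation:\s*(?P<orientation>\S+)\s+ext:\s*(?P<ext>\d+)\s+"
    r"map_start:\s*(?P<start_local>\d+)\(local\)\s*=>\s*"
    r"(?P<start_ensembl>\d+)\(ensembl\)\s+"
    r"map_end:\s*(?P<end_local>\d+)\(local\)\s*=>\s*"
    r"(?P<end_ensembl>\d+)\(ensembl\)"
)

INT_FIELDS = ("strand", "ext", "start_local", "start_ensembl",
              "end_local", "end_ensembl")


def parse_gene_header(header):
    """Return the header flags as a dict, with numeric fields as int."""
    match = HEADER_RE.match(" ".join(header.split()))
    if match is None:
        raise ValueError("unrecognised gene file header")
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


example = """>ENSG00000170613
chrom: 5 strand: -1 orientation: reversed ext: 3000 map_start:
3001(local) => 161664496(ensembl) map_end: 6941(local) =>
161668436(ensembl)"""

info = parse_gene_header(example)
# The gene proper starts right after the added upstream flank.
print(info["gene"], info["ext"], info["start_local"])
```

In the example the local start (3001) is exactly `ext + 1`, which is consistent with the 3000-base upstream extension described in the introduction.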
Reference transcript structure file
The reference transcript structure file
lists the transcript structure from the Ensembl annotation that we
chose as the point of reference for numbering the features and for
comparing the new features that we found. Each gene has only one
reference transcript structure; the file gives the Ensembl gene ID,
the Ensembl transcript ID, and the complete reference transcript
structure. UFR and DFR are features added by us that denote the
upstream and downstream flanking regions that were added to the gene
sequence.
Example:
>ENSG00000167987
ENST00000301765 UFR(1..3000 3000),e1(3001..3102 102),i1(3103..7631 4529),e2(7632..7803 172),i2(7804..8501 698),e3(8502..8584 83),i3(8585..9299 715),e4(9300..11583 2284),DFR(11584..14583 3000)
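Each feature in the structure carries its name, its start and end, and its length. A small sketch of how such a line could be parsed and sanity-checked follows; the function name `parse_structure` is hypothetical, and the checks (length equals `end - start + 1`, features are contiguous) are inferred from the example rather than taken from the documentation.

```python
import re

# name(start..end length), e.g. "e1(3001..3102 102)"
FEATURE_RE = re.compile(r"(\w+)\((\d+)\.\.(\d+)\s+(\d+)\)")


def parse_structure(text):
    """Return a list of (name, start, end, length) tuples."""
    features = []
    for name, start, end, length in FEATURE_RE.findall(text):
        start, end, length = int(start), int(end), int(length)
        if end - start + 1 != length:
            raise ValueError(f"length mismatch in feature {name}")
        features.append((name, start, end, length))
    return features


line = ("ENST00000301765 UFR(1..3000 3000),e1(3001..3102 102),"
        "i1(3103..7631 4529),e2(7632..7803 172),i2(7804..8501 698),"
        "e3(8502..8584 83),i3(8585..9299 715),e4(9300..11583 2284),"
        "DFR(11584..14583 3000)")

features = parse_structure(line)
# In the example every feature starts one base after the previous one ends.
contiguous = all(b[1] == a[2] + 1 for a, b in zip(features, features[1:]))
print(len(features), contiguous)
```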
Transcript file
The transcript file lists all those ESTs
and mRNAs that were used in determining the individual introns and
exons. Besides listing the transcript identifier and version, it also
gives a pointer to the gene where it confirmed a feature, the
description of the transcript, and the alignment as we found it.
Example:
K057543.1
[ENSG00000170613] Homo sapiens cDNA FLJ32981 fis, clone TESTI3000002,
weakly similar to L.mexicana lmsap2 gene for secreted acid phosphatase
2 (SAP2). g(3004..3705)e(1..702),g(5610..6928)e(703..2021)
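The alignment at the end of the entry can be split into blocks. The sketch below assumes that `g(...)` holds coordinates on the extended gene sequence and `e(...)` the matching coordinates on the EST/mRNA; this reading is an inference from the example (the paired ranges have equal lengths), not something the text states. The function names are hypothetical.

```python
import re

# One aligned block: g(gene_start..gene_end)e(est_start..est_end)
BLOCK_RE = re.compile(r"g\((\d+)\.\.(\d+)\)e\((\d+)\.\.(\d+)\)")


def parse_alignment(text):
    """Return a list of (g_start, g_end, e_start, e_end) tuples."""
    return [tuple(int(v) for v in m) for m in BLOCK_RE.findall(text)]


def implied_introns(blocks):
    """Gene-coordinate gaps between consecutive aligned blocks."""
    return [(a[1] + 1, b[0] - 1) for a, b in zip(blocks, blocks[1:])]


alignment = "g(3004..3705)e(1..702),g(5610..6928)e(703..2021)"
blocks = parse_alignment(alignment)
introns = implied_introns(blocks)
print(blocks)
print(introns)
```

For the example entry this yields two aligned blocks and one implied intron between them.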
Intron file
The intron file lists all the introns that were
determined by matching the EST/mRNAs against the genes listed in the
gene file.
Example:
>ENSG00000128891
(3018..4965)
TYPE: GT-AG
ELM: UFR(3018..4964 4964)e1(1..1 257)
NUMT: 24
FSDE:
cgcccctcccgatttcctccgggctacaggcgacagagctgagccaagcgtttactgggcagctgttacg
FSDI:
GTAAGTGAGGAGGGGCTGGGGTGCCCAGCGTTTTGGATCTCCCACTCTGGCCCGGCCCCGGAATACCACA
FSAI:
AGCCACTGTGCTCAACCTTATGCTGTATTCTTAAAGCCAGTTCTTACTCACTTGAGCTTCTGTTTTATAG
FSAE:
ctcagattccaaatgaaaatgtttgagagcgctgactctacagccacaagatctggccaggatctctggg
CNTX: ~2940..3017,4966..5221,10621..10777,13866..~14025
CNTX: ~2977..3017,4966..5221,10621..10777,35691..~35902
CNTX: ~2958..3017,4966..5221,5986..~6012
CNTX: ~2974..3017,4966..5221,10621..~10815
BPPPT: PPT(-67, -57), PPT(-54, -38), BP(-50,4.17), BP(-36,3.3),
PPT(-29, -17), BP(-24,4.67), BP(-20,4.67), BP(-15,3.42), PPT(-13, -2),
BP(-3,3.09)
END
The first line lists the Ensembl id and the start/end of the intron.
TYPE indicates the type of the intron, which is one of GT-AG, GC-AG, or
AT-AC. ELM shows how this feature relates to the reference features; in
the above example the intron covers part of the upstream flanking
region and the first exon. NUMT is the number of transcripts that
confirmed this intron. FSDE and FSAE are the 70 bases of the flanking
exons immediately upstream and downstream of this intron, respectively.
FSDI and FSAI are the 70 bases of intronic sequence on the donor and
acceptor side of the intron, respectively. The CNTX lines show in which
contexts (read: isoform splice patterns) this intron was observed.
BPPPT lists the branch point positions and scores and the
polypyrimidine tract positions within the intron.
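The BPPPT value mixes two kinds of entries. A minimal sketch for separating them is given below, assuming that `BP(position,score)` is a branch point with its score and `PPT(start, end)` is a polypyrimidine tract, as the description suggests. The function name `parse_bpppt` is hypothetical, and the reference point of the negative positions is not stated in the text, so the code only reports them as given.

```python
import re

BP_RE = re.compile(r"BP\(\s*(-?\d+)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")
PPT_RE = re.compile(r"PPT\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")


def parse_bpppt(value):
    """Return (branch_points, tracts) parsed from a BPPPT value."""
    branch_points = [(int(pos), float(score))
                     for pos, score in BP_RE.findall(value)]
    tracts = [(int(start), int(end)) for start, end in PPT_RE.findall(value)]
    return branch_points, tracts


value = ("PPT(-67, -57), PPT(-54, -38), BP(-50,4.17), BP(-36,3.3), "
         "PPT(-29, -17), BP(-24,4.67), BP(-20,4.67), BP(-15,3.42), "
         "PPT(-13, -2), BP(-3,3.09)")

branch_points, tracts = parse_bpppt(value)
# Highest-scoring branch point; on a tie, max() keeps the first one listed.
best = max(branch_points, key=lambda bp: bp[1])
print(len(branch_points), len(tracts), best)
```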
Exon file
The exon file lists all exons that were
confirmed by the EST/mRNAs in the transcript file.
Example:
>ENSG00000174815
(20264..20442)
TYPE: GT-AG
ELM: DFR(2581..2759 3000)
NUMT: 2
FSAI:
gggattgtcctcagaaatctaggtgcagagtgggagaaagggttagcgatcatctctctgtgttctccag
FSAE:
GTCCCTATGCCTCCCCCACGTTCCTCCCGACGGCTCCGAGCTGGCACTCTGGAGGCCCTGGTCAGACACC
FSDE:
TGTCAGCCTTCCTGGCTACCCACCGGGCCTTCACCTCCACGCCTGCCTTGCTAGGGCTTATGGCTGACAG
FSDI:
gtcagagtcataagggacgcagggtagtggagtatctgcccggatttcctaaagccgcaacatcccacca
CNTX: ~19382..19826,19925..20008,20264..20442,20576..~20627
CNTX: ~19654..20008,20264..20442,20576..~20627
END
The first line lists the Ensembl id and the start/end of the exon. The
second line gives the dinucleotides from the introns at the donor (3'
end of exon) and acceptor (5' end of exon) sites. FSAI lists 70 bases
of intronic sequence at the acceptor site. FSAE lists 70 bases of
exonic sequence at the acceptor site. FSDE lists 70 bases of exonic
sequence at the donor site. FSDI lists 70 bases of intronic sequence at
the donor site. CNTX lines show in which context (read: isoform splice
pattern) this exon was observed.
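A CNTX line can be read as a comma-separated list of exon ranges, from which the introns in between follow directly. The sketch below is one way to do this. It assumes that a leading `~` marks a boundary that is an end of the transcript rather than a confirmed splice site; that meaning is an assumption, and the code only records the flag. The function names are hypothetical.

```python
def parse_context(cntx):
    """Return a list of (start, end, open_start, open_end) per exon."""
    exons = []
    for part in cntx.split(","):
        start, end = part.strip().split("..")
        exons.append((int(start.lstrip("~")), int(end.lstrip("~")),
                      start.startswith("~"), end.startswith("~")))
    return exons


def introns_between(exons):
    """Introns implied by consecutive exons of one context."""
    return [(a[1] + 1, b[0] - 1) for a, b in zip(exons, exons[1:])]


# First CNTX line of the exon example and of the intron example.
exon_ctx = parse_context(
    "~19382..19826,19925..20008,20264..20442,20576..~20627")
intron_ctx = parse_context(
    "~2940..3017,4966..5221,10621..10777,13866..~14025")

has_exon = (20264, 20442) in [(s, e) for s, e, _, _ in exon_ctx]
first_intron = introns_between(intron_ctx)[0]
print(has_exon, first_intron)
```

The exon (20264..20442) from the exon example appears in its context, and the first intron implied by the intron example's context is (3018..4965), matching the coordinates on the first line of that record.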
Splice pattern file
The different EST/mRNA sequences that map to a gene are grouped into
classes. A class is composed of all the EST/mRNAs confirming the same
splice pattern, and the longest EST/mRNA sequence in each class is
chosen as its representative. The region of overlap between two classes
may contain different introns; if so, the respective classes represent
alternative splice patterns. Classes that do not overlap with one
another represent different regions of the gene and do not represent
alternative splice patterns. In the file, every EST/mRNA identifier is
followed by the identifiers of the classes to which the sequence
belongs. A non-representative EST/mRNA can belong to more than one
class.
Example:
>ENSG00000137274
CLASS 1
X81372-1
~3081..3120,3558..3785,7930..8033,11515..11681,13318..13471,21635..21766,24659..24782,36761..~37301
BG121617-1,2
~21651..21766,24659..24782,36761..~37286
BM455725-1,2
~21705..21766,24659..24782,36761..~37206
BU933348-1,2,3 ~24663..24782,36761..~37155
BU594882-1,2,3 ~24715..24782,36761..~37181
BG714668-1 ~2649..3120,3558..~3776
CLASS 2
AJ617684-2
~3001..3120,7930..8033,11515..11681,13318..13471,21635..21766,24659..24782,36761..~37301
CLASS 3
AL832502-3
~7928..8033,11515..11681,13318..13471,21635..21766,22314..24782,36761..~37301
CLASS 4
BU902205-4
~11217..11681,13318..13471,21635..~21695
END
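To show how class membership could be recovered from such a record, here is a token-based Python sketch. It assumes the layout seen in the example: `CLASS n` lines, identifiers of the form `accession-classes` (classes comma-separated), and a structure containing `..` that follows its identifier on the same or the next line. The function name is hypothetical, and the record in the code is an excerpt of the example.

```python
def parse_splice_pattern_record(text):
    """Parse one gene record into (classes, membership, structures)."""
    classes = {}     # class number -> accessions listed under that class
    membership = {}  # accession -> class numbers named after the hyphen
    structures = {}  # accession -> structure string
    current = None
    last = None
    tokens = iter(text.split())
    for tok in tokens:
        if tok.startswith(">") or tok == "END":
            continue
        if tok == "CLASS":
            current = int(next(tokens))
            classes[current] = []
        elif ".." in tok:
            structures[last] = tok
        else:
            # Split at the last hyphen: accession before, class list after.
            last, _, numbers = tok.rpartition("-")
            membership[last] = [int(n) for n in numbers.split(",")]
            classes[current].append(last)
    return classes, membership, structures


record = """>ENSG00000137274
CLASS 1
X81372-1
~3081..3120,3558..3785,7930..8033,11515..11681,13318..13471,21635..21766,24659..24782,36761..~37301
BG121617-1,2
~21651..21766,24659..24782,36761..~37286
BU933348-1,2,3 ~24663..24782,36761..~37155
CLASS 4
BU902205-4
~11217..11681,13318..13471,21635..~21695
END"""

classes, membership, structures = parse_splice_pattern_record(record)
print(classes)
print(membership["BU933348"])
```

Splitting on whitespace rather than on lines makes the sketch indifferent to whether a structure sits on the identifier's line or on the following one.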
Splice pattern sequence file
This FASTA formatted file lists all the
sequences of the observed splice patterns, together with the structure
of the splice pattern.
Example (some sequence deleted for brevity):
>ENSG00000136114
SP:2 STRUCTURE:~6418..6558,10907..11869,30312..~31934 THSD1 (HUGO)
AGGTGTTTTTGGGGAAAAAAATCACAATCTGGACGTGAGAAAGGACATGAGGAGACTAAAG
ACCTGGGATTTTGTCAATCAGAATGAAACCAATGTTGAAAGACTTTTCAAATCTATTGTTG
.
.
TAACTATTTGTACCGTAGGACAGAATGTGAGGAGGAAGTAACACACAGAGGAGGATGTGTG
TGTATGCATGTGTTTGAATTCACAAGGAAGAAATTATTTATCTTGAGCTTTTTCCTTTGTT
ATTCAATTTCTATTGATTTATTAGTAATAACAATGATAATAAAATGTAAATGAGCAAA
Peptide sequence file
This FASTA formatted file lists all the
peptide sequences that could be derived from the observed splice
patterns, together with indication of the splice pattern number (e.g.
SP2, see events file) and a gene symbol if known.
Example (some sequence deleted for brevity):
>ALTS_HUM_PP:ENSG00000004399_SP2_17861 PLXND1
MKELLVDLIDASAAKNPKLMLRRTESVVEKMLTNWMSICMYSCLRETVGEPFFLLLCAIKQQINKGSIDAITGKARYTLS
EEWLLRENIEAKPRNLNVSFQGCGMDSLSVRAMDTDTLTQVKEKILEAFCKNVPYSQWPRAEDVDLEWFASSTQSYILRD
LDDTSVVEDGRKKLNTLAHYKIPEGASLAMSLIDKKDNTLGRVKDLDTEKYFHLVLPTDELAEPKKSHRQSHRKKVLPEI
YLTRLLSTKGTLQKFLDDLFKAILSIREDKPPLAVKYFFDFLEEQAEKRGISDPDTLHIWKTNRWRPSSPVLGEHPEEPP
VCL
>ALTS_HUM_PP:ENSG00000004469_SP3_23085 KCNK4
MRSTTLLALLALVLLYLVSGALVFRALEQPHEQQAQRELGEVREKFLRAHPCVSDQELGLLIKEVADALGGGADPETNST
SNSSHSAWDLGSAFFFSGTIITTIGYGNVALRTDAGRLFCIFYALVGIPLFGILLAGVGDRLGSSLRHGIGHIEAIFLKW
HVPPELVRVLSAMLFLLIGCLLFVLTPTFVFCYMEDWSKLEAIYFVIVTLTTVGFGDYVAGADPRQDSPAYQPLVWFWIL
LGLAYFASVLTTIGNWLRVVSRRTRAEMGGLTAQAASWTGTVTARVTQRAGPAAPPPEKEQPLLPPPPCPAQPLGRPRSP
SPPEKAQPPSPPTASALDYPSENLAFIDESSDTQSERGCPLPRAPRGRRRPNPPRKPVRPRGPGRPRDKGVPV
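The peptide headers follow a regular pattern that can be taken apart with a regular expression. In this sketch the function name `parse_peptide_header` is hypothetical, and the trailing number after the splice pattern is returned untouched as `suffix`, because its meaning is not explained here.

```python
import re

PEPTIDE_RE = re.compile(
    r">ALTS_HUM_PP:(?P<gene>ENSG\d+)_SP(?P<sp>\d+)_(?P<suffix>\d+)"
    r"(?:\s+(?P<symbol>\S+))?"
)


def parse_peptide_header(line):
    """Return (gene_id, splice_pattern_number, suffix, gene_symbol)."""
    match = PEPTIDE_RE.match(" ".join(line.split()))
    if match is None:
        raise ValueError("unrecognised peptide header")
    return (match.group("gene"), int(match.group("sp")),
            match.group("suffix"), match.group("symbol"))


header = parse_peptide_header(">ALTS_HUM_PP:ENSG00000004469_SP3_23085 KCNK4")
# A header without a gene symbol yields None in the last position.
no_symbol = parse_peptide_header(">ALTS_HUM_PP:ENSG00000004399_SP2_17861")
print(header)
print(no_symbol)
```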
Events file
The events file lists, for every gene, all the
events that were observed among the transcript classes (see documentation).
Every event is annotated with (i) the splice patterns of the isoform
transcript classes from which the event is delineated; (ii) the
different forms the event takes across all pairwise combinations of
transcript classes; and (iii) the introns/exons participating in
the event.
For each gene, the Ensembl identifier is
given, followed by a listing of the representatives of the transcript
classes, any relations between the classes, and the grouped
events themselves.
Example:
>ENSG00000170312
Class 1 ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
Class 2 ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
Class 3 ~9226..9384,10187..10310,12583..12753,16413..16576,16671..16812,18400..~18527
Class 4 ~9226..9384,10187..10310,16413..16576,16671..16812,18400..~18491
Class 5 ~12194..12753,16413..~16450
Class 6 ~9336..9384,10187..~10405
Classes with staggered overlap + same structure : (3 & 1), (3 & 2)
Classes with staggered overlap only : (2 & 1), (4 & 1), (4 & 2), (4 & 3)
Type : INTRON ISOFORM (II-5P)
Struct : 3094..4663 (intron) <=> 3192..4663 (intron)
Length change: -98, 0 (-98)
Occurs in: (2 & 1)
(2 & 1) :
3094..4663 (intron) <=> 3192..4663 (intron)
e2 (4664..4725) <=> e2 (4664..4725) [0, 0]
23 Confirm. EST's <=> 4 Confirm. EST's
TRPT-ISO1: ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
TRPT-ISO2: ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
Type : CASSETTE EXON (SCE)
Cassette exons: 12583..12753 [171 b]
Occurs in: (4 & 2) is Complete SCE, (4 & 1) is Complete SCE, (4 & 3) is Complete SCE
(4 & 2) (4 & 1) (4 & 3) :
10311..16412 (intron) <=> 10311..12582,12754..16412 (introns)
e1 (10187..10310) <=> e1 (10187..10310) [0, 0]
e2 (16413..16576) <=> e2'' (16413..~16454) [0, -122]
3 Confirm. EST's <=> 27 Confirm. EST's
TRPT-ISO1: ~9226..9384,10187..10310,16413..16576,16671..16812,18400..~18491
TRPT-ISO2: ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
TRPT-ISO2: ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
TRPT-ISO2: ~9226..9384,10187..10310,12583..12753,16413..16576,16671..16812,18400..~18527
Each event block consists of:
- a basic type (e.g. EXON ISOFORM, CASSETTE EXON, etc) and a list of
all detailed forms in which such a basic type exists
- an event specific structure, e.g. location changes for an exon
isoform, or a list of cassette exons
- for exon and intron isoforms, a length change (donor/acceptor side
and total)
- the transcript class combinations between which this event occurs,
and in exactly which form. Exon isoforms can be part of another event
(e.g. CCE complex cassette exon) as the flanking exons, part of none if
they flank only an intron isoform, or part of unknown if the intron
they flank is not completely defined in the other transcript class
- a (grouped) list of flanking features and class combinations
- a list of transcript isoforms, with the reference (isoform 1)
pertaining to the left-hand side of the structure relationship (listed
as part of the above flanking features or as part of the event specific
structure in case of exon/intron isoforms).
A more detailed description of the events file
and our naming conventions can be found in the documentation.
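The length-change figures in the example (`-98, 0 (-98)` for the intron isoform and `[0, -122]` for the truncated flanking exon) can be reproduced from the coordinates on either side of `<=>`. The rule used below, per-side change as the number of bases gained on that side when going from the left-hand to the right-hand feature, is inferred from these two cases only and is not taken from the documentation; the function name is hypothetical.

```python
def side_changes(left, right):
    """Length change at the start side, at the end side, and in total.

    left and right are (start, end) pairs in local gene coordinates,
    taken from the left-hand and right-hand side of a "<=>" relation.
    """
    (left_start, left_end), (right_start, right_end) = left, right
    start_side = left_start - right_start  # start moved right: bases lost
    end_side = right_end - left_end        # end moved left: bases lost
    return start_side, end_side, start_side + end_side


# Intron isoform: 3094..4663 (intron) <=> 3192..4663 (intron)
intron_change = side_changes((3094, 4663), (3192, 4663))
# Flanking exon: e2 (16413..16576) <=> e2'' (16413..~16454)
exon_change = side_changes((16413, 16576), (16413, 16454))
print(intron_change, exon_change)
```

The total equals the difference in feature length between the right-hand and the left-hand side of the relation.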