In this release the following data files are
available. Examples and formats of the data
files are given below.
Gene file
The gene file is in standard FASTA format: a header
line, followed directly by the sequence. The header
shows the Ensembl gene identifier and various flags.
The most important flag is 'ext', which indicates how
many bases we have added upstream and downstream of the
gene in the listed sequence.
Example:
>ENSG00000170613 chrom: 5 strand: -1 orientation: reversed ext: 3000 map_start: 3001(local) => 161664496(ensembl) map_end: 6941(local) => 161668436(ensembl)
TGCTATATTCCTGCACCTAAAACAGGGTCTGGCACAAAGTAAACAAT
TAATTATATTAATGGAGTGAATGGATAAATTTATGCTGCTTTGCATT
TGTATGTTTGTATTCTATCTGTCCTATTAGTGTCACCAGTCTAGTCC
:
GTCTTTGAAGCAGAGGAAA
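As an illustration, the header flags could be read with a few lines of Python. This is a minimal sketch, not part of the release: the function name parse_gene_header and the returned dictionary layout are assumptions made for this example, and only the flags visible in the example header above are handled.

```python
import re

def parse_gene_header(header):
    """Parse a gene-file FASTA header into a dict (hypothetical helper)."""
    # Tolerate a header that has been wrapped over several lines.
    header = " ".join(header.split())
    fields = {"gene_id": header.lstrip(">").split()[0]}
    # Simple "key: value" flags, kept as strings.
    for key in ("chrom", "strand", "orientation"):
        match = re.search(rf"\b{key}:\s*(\S+)", header)
        if match:
            fields[key] = match.group(1)
    # Number of flanking bases added on each side of the gene.
    match = re.search(r"\bext:\s*(\d+)", header)
    if match:
        fields["ext"] = int(match.group(1))
    # Coordinate pairs: (local position, Ensembl position).
    for key in ("map_start", "map_end"):
        match = re.search(
            rf"{key}:\s*(\d+)\(local\)\s*=>\s*(\d+)\(ensembl\)", header
        )
        if match:
            fields[key] = (int(match.group(1)), int(match.group(2)))
    return fields

example_header = (
    ">ENSG00000170613 chrom: 5 strand: -1 orientation: reversed ext: 3000 "
    "map_start: 3001(local) => 161664496(ensembl) "
    "map_end: 6941(local) => 161668436(ensembl)"
)
gene_info = parse_gene_header(example_header)
```

In the example, the local map_start (3001) equals ext + 1, which is consistent with 3000 flanking bases preceding the gene in the listed sequence.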
Transcript file
The transcript file lists all ESTs and mRNAs
that were used in determining the individual introns
and exons. Besides the transcript identifier
and version, it gives a pointer to the gene in which
the transcript confirmed a feature, the description of
the transcript, and the alignment as we found it.
Example:
K057543.1 [ENSG00000170613] Homo
sapiens cDNA FLJ32981 fis, clone TESTI3000002, weakly
similar to L.mexicana lmsap2 gene for secreted acid
phosphatase 2 (SAP2). g(3004..3705)e(1..702),g(5610..6928)e(703..2021)
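The alignment string at the end of a transcript entry can be split into its blocks as sketched below. The sketch assumes that g(...) holds coordinates in the gene sequence and e(...) the matching coordinates in the EST/mRNA; this reading is inferred from the example and is not stated explicitly here. The names parse_alignment and blocks_have_equal_length are hypothetical.

```python
import re

# One aligned block: g(start..end) immediately followed by e(start..end).
ALIGN_RE = re.compile(r"g\((\d+)\.\.(\d+)\)e\((\d+)\.\.(\d+)\)")

def parse_alignment(text):
    """Return a list of ((g_start, g_end), (e_start, e_end)) tuples."""
    blocks = []
    for g_start, g_end, e_start, e_end in ALIGN_RE.findall(text):
        blocks.append(((int(g_start), int(g_end)), (int(e_start), int(e_end))))
    return blocks

def blocks_have_equal_length(blocks):
    """Check that each gene range spans as many bases as its transcript range."""
    return all(g[1] - g[0] == e[1] - e[0] for g, e in blocks)

alignment_blocks = parse_alignment(
    "g(3004..3705)e(1..702),g(5610..6928)e(703..2021)"
)
```

For the example entry, both blocks span the same number of bases on the gene and on the transcript side (702 and 1319 bases).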
Reference transcript structure file
The reference transcript structure file lists the transcript
structure from the Ensembl annotation that we chose
as the point of reference for numbering
the features and for comparing the new features that we
found. Each gene has only one reference transcript structure;
the file gives the Ensembl gene ID, the Ensembl transcript ID,
and the complete reference transcript structure.
UFR and DFR are features added by us
that denote the upstream and downstream flanking regions
that were added to the gene sequence.
Example:
>ENSG00000167987
ENST00000301765 UFR(1..3000 3000),e1(3001..3102 102),i1(3103..7631 4529),e2(7632..7803 172),i2(7804..8501 698),e3(8502..8584 83),i3(8585..9299 715),e4(9300..11583 2284),DFR(11584..14583 3000)
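A structure line of this kind can be parsed and sanity-checked as in the following sketch. It assumes that the number after each range is the feature length in bases, which holds for every feature in the example above. The Feature type and the function names are illustrative only.

```python
import re
from typing import NamedTuple

class Feature(NamedTuple):
    name: str    # e.g. "UFR", "e1", "i1", "DFR"
    start: int   # first base, local coordinates
    end: int     # last base, local coordinates
    length: int  # length as listed in the file

# name(start..end length); \s+ also tolerates a wrapped line.
FEATURE_RE = re.compile(r"(\w+)\((\d+)\.\.(\d+)\s+(\d+)\)")

def parse_structure(text):
    """Extract all features from a reference transcript structure line."""
    return [
        Feature(name, int(start), int(end), int(length))
        for name, start, end, length in FEATURE_RE.findall(text)
    ]

def is_consistent(features):
    """Listed lengths match the ranges and the features are contiguous."""
    lengths_ok = all(f.end - f.start + 1 == f.length for f in features)
    contiguous = all(
        nxt.start == cur.end + 1 for cur, nxt in zip(features, features[1:])
    )
    return lengths_ok and contiguous

reference_features = parse_structure(
    "ENST00000301765 UFR(1..3000 3000),e1(3001..3102 102),"
    "i1(3103..7631 4529),e2(7632..7803 172),i2(7804..8501 698),"
    "e3(8502..8584 83),i3(8585..9299 715),e4(9300..11583 2284),"
    "DFR(11584..14583 3000)"
)
```

For the example, the nine features tile the sequence from base 1 to base 14583 without gaps or overlaps.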
Intron file
The intron file lists all the introns that were determined
by matching the ESTs/mRNAs against the genes listed
in the gene file.
Example:
>ENSG00000128891 (3018..4965)
TYPE: GT-AG
ELM: UFR(3018..4964 4964)e1(1..1 257)
NUMT: 24
FSDE: cgcccctcccgatttcctccgggctacaggcgacagagctgagccaagcgtttactgggcagctgttacg
FSDI: GTAAGTGAGGAGGGGCTGGGGTGCCCAGCGTTTTGGATCTCCCACTCTGGCCCGGCCCCGGAATACCACA
FSAI: AGCCACTGTGCTCAACCTTATGCTGTATTCTTAAAGCCAGTTCTTACTCACTTGAGCTTCTGTTTTATAG
FSAE: ctcagattccaaatgaaaatgtttgagagcgctgactctacagccacaagatctggccaggatctctggg
CNTX: ~2940..3017,4966..5221,10621..10777,13866..~14025
CNTX: ~2977..3017,4966..5221,10621..10777,35691..~35902
CNTX: ~2958..3017,4966..5221,5986..~6012
CNTX: ~2974..3017,4966..5221,10621..~10815
BPPPT: PPT(-67, -57), PPT(-54, -38), BP(-50,4.17), BP(-36,3.3),
PPT(-29, -17), BP(-24,4.67), BP(-20,4.67), BP(-15,3.42), PPT(-13, -2),
BP(-3,3.09)
END
The first line lists the Ensembl id and the start/end
of the intron. TYPE indicates the type of the intron,
which is one of GT-AG, GC-AG, or AT-AC. ELM shows how
this feature relates to the reference features: in the
above example the intron covers part of the upstream
flanking region and the first exon. NUMT is the number
of transcripts that confirmed this intron. FSDE and FSAE
are the 70 bases of the upstream and downstream flanking
exons adjacent to this intron, respectively. FSDI and FSAI
are the 70 bases of intronic sequence on the donor and
acceptor side of the intron, respectively.
The CNTX lines show in which context
(read: isoform splice pattern) this intron was observed.
BPPPT indicates the branchpoint position
and scores and the polypyrimidine tract positions
within the intron.
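Records in this layout (a header line, "KEY: value" lines, and a closing END) can be read with a small parser such as the sketch below. It is an illustration under assumptions, not a reference implementation: the function name parse_records and the dictionary layout are invented for this example, repeated keys such as CNTX are collected in lists, and a line without a key is treated as the continuation of the previous value (as with a wrapped BPPPT line). The sample record is abridged from the example above.

```python
import re

HEADER_RE = re.compile(r">(\S+)\s+\((\d+)\.\.(\d+)\)")
FIELD_RE = re.compile(r"([A-Z]+):\s*(.*)")

def parse_records(lines):
    """Parse intron/exon style records into a list of dicts."""
    records, current, last_key = [], None, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            # Start of a new record: Ensembl id plus start/end of the feature.
            current = {
                "id": header.group(1),
                "start": int(header.group(2)),
                "end": int(header.group(3)),
                "fields": {},
            }
            last_key = None
        elif current is None:
            continue  # ignore anything outside a record
        elif line == "END":
            records.append(current)
            current = None
        else:
            field = FIELD_RE.match(line)
            if field:
                last_key = field.group(1)
                current["fields"].setdefault(last_key, []).append(field.group(2))
            elif last_key:
                # Continuation of a value wrapped over several lines.
                current["fields"][last_key][-1] += " " + line
    return records

sample_lines = """>ENSG00000128891 (3018..4965)
TYPE: GT-AG
NUMT: 24
CNTX: ~2958..3017,4966..5221,5986..~6012
CNTX: ~2974..3017,4966..5221,10621..~10815
BPPPT: PPT(-67, -57), BP(-50,4.17),
PPT(-13, -2), BP(-3,3.09)
END
""".splitlines()
intron_records = parse_records(sample_lines)
```

The same function should work for the exon file, since it uses the same record layout.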
Exon file
The exon file lists all exons that were confirmed
by the ESTs/mRNAs in the transcript file.
Example:
>ENSG00000174815 (20264..20442)
TYPE: GT-AG
ELM: DFR(2581..2759 3000)
NUMT: 2
FSAI: gggattgtcctcagaaatctaggtgcagagtgggagaaagggttagcgatcatctctctgtgttctccag
FSAE: GTCCCTATGCCTCCCCCACGTTCCTCCCGACGGCTCCGAGCTGGCACTCTGGAGGCCCTGGTCAGACACC
FSDE: TGTCAGCCTTCCTGGCTACCCACCGGGCCTTCACCTCCACGCCTGCCTTGCTAGGGCTTATGGCTGACAG
FSDI: gtcagagtcataagggacgcagggtagtggagtatctgcccggatttcctaaagccgcaacatcccacca
CNTX: ~19382..19826,19925..20008,20264..20442,20576..~20627
CNTX: ~19654..20008,20264..20442,20576..~20627
END
The first line lists the Ensembl id and the start/end
of the exon. The TYPE line gives the dinucleotides
from the introns at the donor (3' end of exon) and
acceptor (5' end of exon) sites. FSAI lists 70 bases
of intronic sequence at the acceptor site. FSAE lists
70 bases of exonic sequence at the acceptor site.
FSDE lists 70 bases of exonic sequence at the donor
site. FSDI lists 70 bases of intronic sequence at
the donor site. CNTX lines show in which context (read:
isoform splice pattern) this exon was observed.
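The relation between the TYPE line and the flanking intronic sequences can be checked with a short sketch. It assumes that FSDI starts at the first intronic base after the donor site and that FSAI ends at the last intronic base before the acceptor site, which matches the examples shown here. The function name splice_site_type is hypothetical.

```python
# The three intron types named in this document.
KNOWN_TYPES = {"GT-AG", "GC-AG", "AT-AC"}

def splice_site_type(fsdi, fsai):
    """Derive the donor-acceptor dinucleotide pair from FSDI and FSAI."""
    donor = fsdi[:2].upper()      # first two intronic bases at the donor site
    acceptor = fsai[-2:].upper()  # last two intronic bases at the acceptor site
    return f"{donor}-{acceptor}"

# FSAI and FSDI lines from the exon example above.
example_fsai = "gggattgtcctcagaaatctaggtgcagagtgggagaaagggttagcgatcatctctctgtgttctccag"
example_fsdi = "gtcagagtcataagggacgcagggtagtggagtatctgcccggatttcctaaagccgcaacatcccacca"
exon_type = splice_site_type(example_fsdi, example_fsai)
```

For the exon example the derived value agrees with the TYPE line of the record.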
Events file
The events file lists for every gene all the events
that were observed among the transcript classes (see
documentation).
Every event is annotated with (i) the splice patterns
of the isoform transcript classes (from which the event
is delineated); (ii) the different manifestations of
the event as seen between all pairwise combinations of
transcript classes; and (iii) the introns/exons
participating in the event.
For each gene, the Ensembl identifier is given, followed
by a listing of representatives of the transcript
classes, any possible relation between the classes,
and the grouped events themselves.
Example:
>ENSG00000170312
Class 1 ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
Class 2 ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
Class 3 ~9226..9384,10187..10310,12583..12753,16413..16576,16671..16812,18400..~18527
Class 4 ~9226..9384,10187..10310,16413..16576,16671..16812,18400..~18491
Class 5 ~12194..12753,16413..~16450
Class 6 ~9336..9384,10187..~10405
Classes with staggered overlap + same structure : (3 & 1), (3 & 2)
Classes with staggered overlap only : (2 & 1), (4 & 1), (4 & 2), (4 & 3)
Type : INTRON ISOFORM (II-5P)
Struct : 3094..4663 (intron) <=> 3192..4663 (intron)
Length change: -98, 0 (-98)
Occurs in: (2 & 1)
(2 & 1) :
3094..4663 (intron) <=> 3192..4663 (intron)
e2 (4664..4725) <=> e2 (4664..4725) [0, 0]
23 Confirm. EST's <=> 4 Confirm. EST's
TRPT-ISO1: ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
TRPT-ISO2: ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
Type : CASSETTE EXON (SCE)
Cassette exons: 12583..12753 [171 b]
Occurs in: (4 & 2) is Complete SCE, (4 & 1) is Complete SCE, (4 & 3) is Complete SCE
(4 & 2) (4 & 1) (4 & 3) :
10311..16412 (intron) <=> 10311..12582,12754..16412 (introns)
e1 (10187..10310) <=> e1 (10187..10310) [0, 0]
e2 (16413..16576) <=> e2'' (16413..~16454) [0, -122]
3 Confirm. EST's <=> 27 Confirm. EST's
TRPT-ISO1: ~9226..9384,10187..10310,16413..16576,16671..16812,18400..~18491
TRPT-ISO2: ~3122..3191,4664..4725,9228..9384,10187..10310,12583..12753,16413..16576,16671..~16804
TRPT-ISO2: ~3016..3093,4664..4725,9228..9384,10187..10310,12583..12753,16413..~16454
TRPT-ISO2: ~9226..9384,10187..10310,12583..12753,16413..16576,16671..16812,18400..~18527
Each event block consists of:
- a basic type (e.g. EXON ISOFORM, CASSETTE EXON, etc.)
  and a list of all detailed forms in which such a
  basic type exists
- an event-specific structure, e.g. location changes
  for an exon isoform, or a list of cassette exons
- for exon and intron isoforms, a length change
  (donor/acceptor side and total)
- the transcript class combinations between which this
  event occurs, and in exactly which form. Exon isoforms
  can be part of another event (e.g. CCE, complex
  cassette exon) as its flanking exons, part of none
  if they flank only an intron isoform, or part of
  unknown if the intron they flank is not completely
  defined in the other transcript class
- a (grouped) list of flanking features and class
  combinations
- a list of transcript isoforms, with the reference
  (isoform 1) pertaining to the left-hand side of the
  structure relationship (listed as part of the above
  flanking features, or as part of the event-specific
  structure in the case of exon/intron isoforms).
A more detailed description of the events file and
our naming conventions can be found in the documentation.
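To make the relation between class splice patterns and event structures concrete, the sketch below derives the introns implied by two class lines of the example and reproduces the intron isoform reported for classes (2 & 1). It is an illustration only: the function names are hypothetical, and the "~" marker on the outer coordinates is simply stripped here without being interpreted.

```python
def parse_exons(pattern):
    """Split a splice pattern such as '~3016..3093,4664..~4725' into exons."""
    exons = []
    for part in pattern.strip().split(","):
        start, end = part.split("..")
        # Drop the '~' marker; its meaning is not interpreted in this sketch.
        exons.append((int(start.lstrip("~")), int(end.lstrip("~"))))
    return exons

def introns_between(exons):
    """Introns are the gaps between consecutive exons of one pattern."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]

class_1 = parse_exons(
    "~3122..3191,4664..4725,9228..9384,10187..10310,"
    "12583..12753,16413..16576,16671..~16804"
)
class_2 = parse_exons(
    "~3016..3093,4664..4725,9228..9384,10187..10310,"
    "12583..12753,16413..~16454"
)

first_intron_2 = introns_between(class_2)[0]  # left-hand side of the event
first_intron_1 = introns_between(class_1)[0]  # right-hand side of the event

def span(region):
    """Length of a (start, end) region in bases."""
    return region[1] - region[0] + 1

length_change = span(first_intron_1) - span(first_intron_2)
```

The two first introns are 3094..4663 and 3192..4663, and their length difference is -98, in agreement with the Struct and Length change lines of the INTRON ISOFORM event above.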
Splice pattern sequence file
This FASTA-formatted file lists all the sequences
of the observed splice patterns, together with the structure of the
splice pattern.
Example (some sequence deleted for brevity):
>ENSG00000136114 SP:2 STRUCTURE:~6418..6558,10907..11869,30312..~31934 THSD1 (HUGO)
AGGTGTTTTTGGGGAAAAAAATCACAATCTGGACGTGAGAAAGGACATGAGGAGACTAAAG
ACCTGGGATTTTGTCAATCAGAATGAAACCAATGTTGAAAGACTTTTCAAATCTATTGTTG
.
.
TAACTATTTGTACCGTAGGACAGAATGTGAGGAGGAAGTAACACACAGAGGAGGATGTGTG
TGTATGCATGTGTTTGAATTCACAAGGAAGAAATTATTTATCTTGAGCTTTTTCCTTTGTT
ATTCAATTTCTATTGATTTATTAGTAATAACAATGATAATAAAATGTAAATGAGCAAA
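The header of a splice pattern entry can be taken apart as sketched below. The field names in the returned dictionary and the function name parse_sp_header are assumptions made for this example; reading SP as the splice pattern number and the trailing text as a gene label is inferred from the example header only.

```python
import re

SP_HEADER_RE = re.compile(r">(\S+)\s+SP:(\d+)\s+STRUCTURE:(\S+)(?:\s+(.*))?")

def parse_sp_header(header):
    """Parse a splice pattern FASTA header (hypothetical helper)."""
    match = SP_HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    gene_id, sp_number, structure, label = match.groups()
    exon_lengths = []
    for part in structure.split(","):
        # Strip the '~' marker from the outer coordinates before converting.
        start, end = (int(value.lstrip("~")) for value in part.split(".."))
        exon_lengths.append(end - start + 1)
    return {
        "gene_id": gene_id,
        "sp": int(sp_number),
        "structure": structure,
        "label": label or "",
        "exon_lengths": exon_lengths,
    }

sp_info = parse_sp_header(
    ">ENSG00000136114 SP:2 "
    "STRUCTURE:~6418..6558,10907..11869,30312..~31934 THSD1 (HUGO)"
)
```

The exon lengths computed from the structure (141, 963 and 1623 bases in the example) could be compared against the length of the listed sequence, although it is not stated here that the two must be equal.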