********************************************************************************
TABLE OF CONTENTS

1. INTRODUCTION
2. ACCESS
3. CHANGES IN CURRENT RELEASE
4. FORTHCOMING CHANGES
5. COMPREHENSIVE STATISTICS
6. CONTACT DETAILS
7. CITATION
8. ACKNOWLEDGEMENTS


1. INTRODUCTION

The Alternative Splicing and Transcript Diversity (ASTD) database project is creating a database of alternative splice events and transcripts of genes from human, mouse and rat. Full-length transcripts are generated with the aim of understanding the mechanism of alternative splicing on a genome-wide scale.

The current release for human contains 16715 genes, of which 14101 have more than one splice isoform, with an average of 5.6 splice patterns per gene. 10831 transcripts are annotated as full length, with a transcription start site and a poly(A).

The current release for mouse contains 16491 genes, of which 13028 have more than one splice isoform, with an average of 4 splice patterns per gene. 6011 transcripts are annotated as full length, with a transcription start site and a poly(A).

The current release for rat contains 10424 genes, of which 6344 have more than one splice isoform, with an average of 2.6 splice patterns per gene. 1250 transcripts are annotated as full length, with a transcription start site and a poly(A).

More comprehensive statistics are available at the end of this document.

The data in this release is generated for genes from Ensembl version 41 for human and mouse and version 42 for rat. The extracted nucleotide region is the gene region as defined in Ensembl, extended by 10,000 bases at both the 5' and 3' ends for all species. EST and mRNA (transcript) sequences publicly available from the INSDC are mapped to these extended gene regions. Transcript-confirmed introns and exons are delineated from these alignments. The matching transcript sequences are further classified into groups, where each group represents an isoform splice pattern.
Each group is represented by a representative transcript structure (defined as the one with the most introns). Isoform peptide translations are also presented.

Isoform splice patterns are compared with one another to delineate the alternative events. The basic events identified in this work are: exon isoforms (extension/truncation of an exon), intron isoforms (extension/truncation of an intron), cassette exons (an exon is present in one transcript but absent in an isoform of the transcript), mutually exclusive exons (exons are used in alternative transcripts in a mutually exclusive manner) and intron retention (a nucleotide region is used as an exon in one transcript while it is an intron in an alternative transcript). The latter three events (cassette exon, mutually exclusive exon and intron retention) are further characterised as 'complex' or 'simple' depending on whether the 5' and/or 3' flanking exons also undergo modifications (e.g. the flanking exon may be extended or truncated, or the exon that flanks a retained intron is itself a cassette or mutually exclusive event).

Introns and exons are annotated for splice signals such as donor/acceptor sites, branch points and polypyrimidine tracts. Conserved exons, introns and events in the orthologous genes from human and mouse have been identified and are annotated in the database. SNP positions and alleles have been mapped to our data, and we display them for isoform splice patterns as well as for individual events.

Annotation pertaining to the expression states of the isoforms is extracted from the supporting clone libraries, using the eVOC annotation (human, mouse) or MeSH terms (rat). Subtractive library expression queries are available from the advanced query pages.

Each transcript is scrutinised for the presence of a poly(A) tail, poly(A) sites upstream of the cleavage site and a transcription start site (TSS).
The manually annotated database, AEdb, has been integrated to some extent with AltSplice (for both human and mouse entries). Entries that are common to AltSplice and AEdb are linked, and are flagged as such in the result pages returned by queries to AltSplice and/or AEdb. Queries can be restricted to common entries. AltSplice exons and splice events that have experimental evidence from AEdb are flagged as such. In addition, we have built a wrapper that passes queries on to both AEdb and AltSplice.


2. ACCESS

Access to the data from the automatic pipeline is via the simple all-text query on the home page:
  http://www.ebi.ac.uk/astd/

An advanced search is available:
  http://www.ebi.ac.uk/astd/asearch.html

Download of associated flat files is available:
  http://www.ebi.ac.uk/astd/download.html


3. CHANGES IN CURRENT RELEASE

This is the second release of the Alternative Splicing and Transcript Diversity (ASTD) database. Please feel free to send any comments concerning these improvements.

1) The TSS prediction pipeline has been greatly improved. For human, we used 1.4 million 5' expressed sequence tags (ESTs) and RIKEN 5'-end cDNA sequences from libraries built with the oligo-capping method (Suzuki and Sugano, 2003), available from DDBJ/EMBL/GenBank. For mouse, we used 722 642 5' ESTs from the FANTOM3/RIKEN full-length enriched library. We used 5' ESTs rather than the RIKEN full-length cDNAs because 5' ESTs come from more sequenced libraries and give better coverage. For rat, we used a total of 5 349 MGC full-length cDNA ORF clones from the NIH Mammalian Gene Collection (MGC) deposited in DDBJ/EMBL/GenBank (Gerhard et al. 2004). For each ASTD transcript, we extract the transcript sequence together with its UTR region (up to 10 kb) and align it against the oligo-capped cDNA sequences using NCBI BLAST.

2) We define putative promoter groups by clustering transcription start sites (TSSs), using an interval of 500 bp for human and 300 bp for mouse, so that each TSS position is assigned unambiguously to one cluster. For rat, we do not cluster the TSS positions since we observe only one TSS position per transcript. We tested several interval sizes for defining alternative promoter (AP) clusters: the distribution of TSS positions among the transcripts shows a plateau before the interval size reaches 500 bp for human and 300 bp for mouse. For each cluster, we define the centroid and the standard deviation.

3) The TSS cluster identifier takes the form of four letters and 9 integers: CTSSnnnnnnnnnn

4) Improved mapping of the ASTD transcript translations to UniProt proteins.

5) Rat transcripts are associated with developmental stage and pathology state terms extracted from cDNA library information. Rat developmental stage evidence is mapped to the Witschi embryology classification. Anatomical system and disease state evidence is mapped to the MeSH ontology.

6) Normalized transcript digital expression values, introduced by the NCBI and known as TPM (transcripts per million), are derived from cDNA library evidence for each tissue, developmental stage and pathology. For instance, the tissue TPM value of a transcript is calculated as follows: divide the number of ESTs supporting the transcript in a particular tissue by the total number of clones in the cDNA libraries related to this tissue, then normalize the resulting value to one million.

7) Tissue/development/pathology fold-change is displayed in the transcript page.

8) The significance of digital differential expression of transcripts is calculated by t-tests with a p-value cutoff of 0.05. We applied a Benjamini-Hochberg correction for a false discovery rate of less than 5%. This provides a set of transcripts whose expression differs significantly between two conditions (normal vs. cancer, for instance).
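
As an illustration of items 6) and 8), the sketch below computes a TPM value following the description in item 6) and applies a Benjamini-Hochberg adjustment to a list of p-values. It is a minimal sketch under stated assumptions, not the ASTD pipeline code: the function names tpm and benjamini_hochberg and all example numbers are hypothetical, and the t-test that would produce the p-values is not shown.

```python
from typing import List


def tpm(est_count: int, total_clones: int) -> float:
    """Transcripts per million for one transcript in one tissue.

    est_count:    number of ESTs supporting the transcript in the tissue
    total_clones: total number of clones in the cDNA libraries of that tissue
    """
    if total_clones <= 0:
        raise ValueError("total_clones must be positive")
    # Scale the per-clone frequency to one million clones.
    return est_count * 1_000_000 / total_clones


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts: 3 supporting ESTs among 150000 clones.
    print(tpm(3, 150000))

    # Hypothetical p-values, one per transcript.
    raw = [0.01, 0.04, 0.03, 0.20]
    adj = benjamini_hochberg(raw)
    significant = [i for i, q in enumerate(adj) if q < 0.05]
    print(adj, significant)
```

Transcripts whose adjusted p-value falls below 0.05 would be the ones reported as significantly different between the two conditions; with the hypothetical values above, only the first transcript passes.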

9) Integration of experimental validations of mouse poly(A) sites from the ATD consortium work described in Moucadel et al. 2007 (see the CITATION section).

10) Prediction of miRNA target sites is being studied by the Zavolan group and is publicly available at:
  http://www.mirz.unibas.ch/Computational_prediction_of_microRNA_targets.shtml
The latest EIMMo human and mouse predictions from January 2008 have been incorporated into the ASTD schema to provide additional annotations to the genomic DNA.

11) Previous ASTD releases are now archived. An archive web page enables users to trace the history of transcript features.

12) The advanced search has been refactored. The query response time has been dramatically reduced.

13) Splice events and TPM calculations are available for download on the FTP site.


4. FORTHCOMING CHANGES

1) RefSeq mRNAs will be compared to ASTD transcripts and their structure will be displayed in the genomic view.

2) Exon array data will be integrated in the next release.

3) There is no plan to include additional species in 2008.


5. COMPREHENSIVE STATISTICS

Please see the statistics page for comprehensive statistics:

Human at http://www.ebi.ac.uk/astd/statistics.html?tax=9606
Mouse at http://www.ebi.ac.uk/astd/statistics.html?tax=10090
Rat at http://www.ebi.ac.uk/astd/statistics.html?tax=10116


6. CONTACT DETAILS

For all queries please contact:

ASTD team
The EMBL Outstation - The European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 680
Telefax: (+44 1223) 494 468

We welcome any comments or questions about the data. Please go to http://www.ebi.ac.uk/support/index.php?query=ASTD with any queries.


7. CITATION

Experimental validation of poly(A) sites:
Moucadel V., Lopez F., Ara T., Benech P., Gautheret D. Beyond the 3' end: experimental validation of extended transcript isoforms. Nucleic Acids Res 2007 35: 1947-57.

If you want to cite ASD in a publication, please use the following reference:
Stamm S, Riethoven J-JM, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais NL, Thanaraj TA. ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res 2006 34: D46-D55.

If you want to cite ATD in a publication, please use the following reference:
Le Texier V, Riethoven J-J, Kumanduri V, Gopalakrishnan C, Lopez F, Gautheret D, Thanaraj TA. AltTrans: Transcript pattern variants annotated for both alternative splicing and alternative polyadenylation. BMC Bioinformatics 2006 7: 169.


8. ACKNOWLEDGEMENTS

-------------------------------------------------------------
ASD consortium members:

Stefan Stamm, Institute of Biochemistry, University Erlangen Nurenberg, 91054 Erlangen, GERMANY.
Peer Bork, Structural and Comp. Biology Dept., European Molecular Biological Laboratory, 69177 Heidelberg, GERMANY.
Roderic Guigo, Genomics Laboratory Group, IMIM/Research on Biomedical Informatics, 08003, Barcelona, SPAIN.
Laurent Bracco, Exonhit Therapeutics, 75013 Paris, FRANCE.
Hermona Soreq, Department of Biological Chemistry, The Hebrew University of Jerusalem, 91904 Jerusalem, ISRAEL.
Rolf Apweiler, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD.
Juan Valcárcel, CRG Barcelona Spain
-------------------------------------------------------------

-------------------------------------------------------------
ATD consortium members:

Peer Bork, Structural and Comp. Biology Dept., European Molecular Biological Laboratory, 69177 Heidelberg, GERMANY.
Christiane Dascher-Nadel, INSERM Transfert, France
Daniel Gautheret, INSERM, France
Roderic Guigo, Genomics Laboratory Group, IMIM/Research on Biomedical Informatics, 08003, Barcelona, SPAIN.
Winston Hide, SANBI, South Africa
Magnus von Knebel, University of Heidelberg, Germany
Jans Reich, Max-Delbruck-Centrum fur Molecular Medizin, Germany
Rolf Apweiler, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD.
Jaak Vilo, Estonian Biocenter, Estonia
-------------------------------------------------------------

-------------------------------------------------------------
Eurasnet consortium members:

Reinhard Lührmann, MPG Göttingen Germany
Göran Akusjärvi, UU Uppsala Sweden
Rolf Apweiler, EMBL Hinxton UK
Gil Ast, TAU Tel Aviv Israel
Didier Auboeuf, INSERM - Paris
Francisco Baralle, ICGEB Trieste Italy
Andrea Barta, MUW Vienna Austria
Jean Beggs, UEDIN Edinburgh UK
Giuseppe Biamonti, CNR Pavia Italy
Albrecht Bindereif, LUG Giessen Germany
Peer Bork, EMBL Heidelberg Germany
Christiane Branlant, CNRS Nancy France
John Brown, SCRI Dundee UK
Javier F. Caceres, MRC Edinburgh UK
Maria Carmo-Fonseca, IMM Lisbon Portugal
Ian Eperon, Unileic Leicester UK
Davide Gabellini, Stem Cell Research Institute - Italy
Artur Jarmolowski, AMU Poznan Poland
Jorgen Kjems, UAAR Århus Denmark
Alberto R. Kornblihtt, FCEN-UBA Buenos Aires Argentina
Angela Krämer, UNIGE Geneva Switzerland
Angus Lamond, UNIVDUN Dundee UK
Karla Neugebauer, MPG Dresden Germany
Daniel Schümperli, UNIBE Bern Switzerland
Bertrand Séraphin, CNRS Gif sur Yvette France
Chris Smith, UCAM-DBIOC Cambridge UK
Hermona Soreq, HUJI Jerusalem Israel
Stefan Stamm, UERLN Erlangen Germany
James Stévenin, IGBMC-GIE Illkirch France
Jamal Tazi, CNRS Montpellier France
Glauco Tocchini-Valentini, CNR Monterotondo Scalo Italy
Henning Urlaub, Max Planck Institute for Biophysical Chemistry - Germany
Juan Valcárcel, CRG Barcelona Spain
Mihaela Zavolan, Biozentrum, Switzerland
-------------------------------------------------------------

References

Kouichi Kimura et al., 'Diversification of transcriptional modulation: Large-scale identification and characterization of putative alternative promoters of human genes', Genome Research 16: 55-65, 2006.

Okazaki et al., 'Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs', Nature 420: 563-573, 2002.