********************************************************************************
ASTD Release 1.1 Note
12-Feb-2008
********************************************************************************

TABLE OF CONTENTS
1. INTRODUCTION
2. ACCESS
3. CHANGES IN CURRENT RELEASE
4. FORTHCOMING CHANGES
5. COMPREHENSIVE STATISTICS
6. CONTACT DETAILS
7. CITATION
8. ACKNOWLEDGEMENTS


1. INTRODUCTION
The Alternative Splicing and Transcript Diversity (ASTD) database project is creating a database of alternative splice events and transcripts of genes from human, mouse and rat. Full-length transcripts are generated with the aim of understanding the mechanism of alternative splicing on a genome-wide scale.

For human, the current release consists of:
16715 genes, of which 14101 have more than one splice isoform, with an average of 5.6 splice patterns per gene.
10831 transcripts are annotated as full length with a transcription start site and a poly(A).

For mouse, the current release consists of:
16491 genes, of which 13028 have more than one splice isoform, with an average of 4 splice patterns per gene.
6011 transcripts are annotated as full length with a transcription start site and a poly(A).

For rat, the current release consists of:
10424 genes, of which 6344 have more than one splice isoform, with an average of 2.6 splice patterns per gene.
1250 transcripts are annotated as full length with a transcription start site and a poly(A).

More comprehensive statistics are available at the end of this document.

The data in this release is generated for genes from Ensembl version 41 for human and mouse and version 42 for rat. The extracted nucleotide region includes the gene region as defined in Ensembl, extended by 10,000 bases at both the 5' and 3' ends for all species. EST and mRNA (transcript) sequences publicly available from the INSDC are mapped to these extended gene regions. Transcript-confirmed introns and exons are delineated from these alignments. The matching transcript sequences are further classified into groups, each of which represents an isoform splice pattern. Each group is represented by a representative transcript structure (defined as the one having the most introns). Isoform peptide translations are also presented.
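
As an illustration only, the region extension and the choice of a representative structure described above can be sketched in Python. This is a minimal sketch, not the ASTD pipeline: the function names, the (transcript ID, intron list) layout, the clamping of the region start at position 1 and the tie-breaking rule are all assumptions made for the example.

```python
from typing import List, Tuple

FLANK = 10_000  # bases added at each end, as stated in the text

Intron = Tuple[int, int]


def extended_region(gene_start: int, gene_end: int,
                    flank: int = FLANK) -> Tuple[int, int]:
    """Extend a gene region by `flank` bases on both sides.

    Clamping the start at position 1 is an assumption of this sketch.
    """
    return max(1, gene_start - flank), gene_end + flank


def representative(group: List[Tuple[str, List[Intron]]]) -> str:
    """Return the ID of the transcript with the most introns.

    Ties go to the first transcript in the list (an assumption;
    the text only says "the most introns").
    """
    best_id, _ = max(group, key=lambda item: len(item[1]))
    return best_id
```

For example, a gene annotated at 15000-20000 would be searched over 5000-30000 under these assumptions.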

Isoform splice patterns are compared with one another to delineate the alternative events. The basic events that are identified in this work are: exon isoforms (extension/truncation of an exon), intron isoforms (extension/truncation of an intron), cassette exons (an exon is present in one transcript but absent in an isoform of the transcript), mutually exclusive exons (exons are used in alternative transcripts in a mutually exclusive manner) and intron retention (a nucleotide region is used as an exon in one transcript while it is an intron in an alternative transcript). The latter three events (namely cassette exon, mutually exclusive exon and intron retention) are further characterised as 'complex' or 'simple' depending on whether the 5' and/or 3' flanking exons also undergo modifications (e.g. the flanking exon may be extended or truncated, or the exon that flanks a retained intron is itself a cassette or mutually exclusive event).
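
To make the 'simple' cassette exon case concrete, the comparison of two splice patterns can be sketched in Python as below. This is a hedged illustration, not the method used to build the database: exons are assumed to be (start, end) tuples sorted along the transcript, and the function names are hypothetical. An exon counts as a simple cassette exon here only when the other pattern has a single intron joining the same donor and acceptor sites of the two flanking exons.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), assumed sorted along the transcript


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Introns expressed as (end of upstream exon, start of downstream exon)."""
    return [(up[1], down[0]) for up, down in zip(exons, exons[1:])]


def simple_cassette_exons(a: List[Exon], b: List[Exon]) -> List[Exon]:
    """Exons of pattern `a` that are skipped in pattern `b`.

    Only single skipped exons whose flanking splice sites are unchanged
    in `b` are reported; modified flanks ('complex' events) are ignored.
    """
    b_introns: Set[Tuple[int, int]] = set(introns(b))
    found = []
    for left, middle, right in zip(a, a[1:], a[2:]):
        # `b` joins the flanking exons directly, skipping `middle`.
        if (left[1], right[0]) in b_introns:
            found.append(middle)
    return found
```

Runs of two or more consecutive skipped exons, and the other event types listed above, would need additional rules that this sketch does not attempt.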

Introns/exons are annotated for splice signals such as donor/acceptor sites, branch points, and polypyrimidine tracts. Conserved exons/introns/events in the orthologous genes from human and mouse have been identified and are annotated in the database. SNP positions and the alleles used have been mapped to our data, and we display them for isoform splice patterns as well as for individual events. Annotation pertaining to the expression states of the isoforms is extracted from the supporting clone libraries, using the eVOC annotation (human, mouse) or MeSH terms (rat). Subtractive library expression queries are available from the advanced query pages. Each transcript is scrutinised for the presence of a poly(A) tail, poly(A) sites upstream of the cleavage site and a transcription start site (TSS).

The manually annotated database, AEdb, has been integrated to some extent with AltSplice (for both human and mouse entries). Entries that are common to AltSplice and AEdb are linked, and this is indicated on the display pages returned by queries to AltSplice and/or AEdb. Queries can be made for common entries. AltSplice exons and splice events that have experimental evidence from AEdb are flagged as such. In addition, we have built a wrapper that passes queries to both AEdb and AltSplice.


2. ACCESS
Access to the data from the automatic pipeline is via the simple all text query on the home page:
http://www.ebi.ac.uk/astd/

An advanced search is available: http://www.ebi.ac.uk/astd/asearch.html

Download of associated flat files is available: http://www.ebi.ac.uk/astd/download.html


3. CHANGES IN CURRENT RELEASE
This is the second release of the Alternative Splicing and Transcript Diversity (ASTD) database. Please feel free to send any comments concerning these improvements.

1) The TSS prediction pipeline has been greatly improved. For human, we used 1.4 million 5' expressed sequence tags (ESTs) and RIKEN 5' end oligo-cap cDNA sequences from cDNA libraries built with the oligo-capping method (Suzuki and Sugano, 2003), available from DDBJ/EMBL/GenBank.
For mouse, we used 722,642 5' ESTs from the FANTOM3/RIKEN full-length enriched library. We used 5' ESTs rather than the RIKEN full-length cDNAs because 5' ESTs come from more sequenced libraries and give better coverage.
For rat, we used a total of 5,349 MGC full-length cDNA ORF clones from the NIH Mammalian Gene Collection (MGC) deposited in DDBJ/EMBL/GenBank (Gerhard et al. 2004).
For each ASTD transcript, we extract the transcript sequence with its UTR region up to 10 kb and align it against the oligo-capped cDNA sequences using NCBI BLAST.

2) We define putative promoter groups by clustering Transcription Start Sites (TSSs) separated by 500 bp for human and 300 bp for mouse, so that each TSS position is assigned unambiguously to one cluster. For rat, we do not cluster the TSS positions since we observe only one TSS position per transcript. We tested several intervals to define alternative promoter (AP) clusters. The statistical distribution of the TSS positions among the transcripts shows a plateau before the interval size reaches 500 bp for human and 300 bp for mouse. For each cluster, we define the centroid and the standard deviation.

3) The TSS cluster identifier takes the form of four letters and 9 integers:     CTSSnnnnnnnnnn

4) Improved mapping of the ASTD transcript translations to UniProt proteins.

5) Rat transcripts are associated with developmental stage and pathology state terms, based on information extracted from cDNA libraries. Evidence for rat developmental stages is mapped to the Witschi embryology classification. Evidence for anatomical systems and disease states is mapped to the MeSH ontology.

6) Normalized transcript digital expression values, introduced by the NCBI and known as TPM (transcripts per million), are derived for each tissue, developmental stage and pathology from cDNA library evidence. For instance, the tissue TPM value of a transcript is calculated as follows: divide the number of ESTs supporting the transcript in a particular tissue by the total number of clones from the cDNA libraries related to this tissue, then normalize the resulting value to one million.

7) Tissue/Development/Pathology fold-change is displayed on the transcript page.

8) The significance of digital differential expression of transcripts is calculated by t-tests with a p-value cutoff of 0.05. We applied a Benjamini-Hochberg correction for a false discovery rate of less than 5%. This yields a set of transcript variants whose expression differs significantly between two conditions (normal vs. cancer, for instance).

9) Integration of experimental validations of mouse PolyA sites from the ATD consortium work described in Moucadel et al. 2007 (see reference at the end of the notes in the citation section).

10) Prediction of miRNA target sites is being studied by the Zavolan group and is publicly available at:
http://www.mirz.unibas.ch/Computational_prediction_of_microRNA_targets.shtml
The latest EIMMo human and mouse predictions from January 2008 have been incorporated into the ASTD schema to provide additional annotations to the genomic DNA.

11) Previous ASTD releases are now archived. An archive web page enables users to trace the history of transcript features.

12) The advanced search has been refactored. The query response time has been dramatically reduced.

13) Splice events and TPM calculations are available for download on the FTP site.
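
The TSS clustering of item 2) can be illustrated with a short Python sketch. It is not the ASTD implementation: it assumes a simple gap rule (a sorted TSS joins the current cluster when it lies within `max_gap` bases of the previous TSS), uses the arithmetic mean as the centroid and the population standard deviation, and the function name and output layout are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def cluster_tss(positions: List[int], max_gap: int) -> List[Dict[str, object]]:
    """Group TSS positions into clusters and summarise each cluster.

    max_gap would be 500 for human and 300 for mouse according to the
    text; the gap-based rule itself is an assumption of this sketch.
    """
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: same cluster
        else:
            clusters.append([pos])     # otherwise start a new cluster
    return [
        {"positions": c, "centroid": mean(c), "stdev": pstdev(c)}
        for c in clusters
    ]
```

With `max_gap=500`, the positions 100, 150, 400, 2000 and 2100 fall into two clusters, because only the gap between 400 and 2000 exceeds the threshold.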
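
The TPM calculation of item 6) amounts to a single ratio, sketched here in Python. The function and parameter names are hypothetical, and the dictionary layout used for per-tissue counts is an assumption made for the example.

```python
from typing import Dict


def tpm(est_count: int, total_clones: int) -> float:
    """ESTs supporting a transcript in one tissue, scaled to one million clones."""
    if total_clones <= 0:
        raise ValueError("total_clones must be positive")
    return est_count * 1_000_000 / total_clones


def tissue_tpm(est_counts: Dict[str, int],
               library_sizes: Dict[str, int]) -> Dict[str, float]:
    """TPM per tissue; tissues with no supporting EST get 0.0."""
    return {
        tissue: tpm(est_counts.get(tissue, 0), size)
        for tissue, size in library_sizes.items()
    }
```

For example, a transcript supported by 5 ESTs in a tissue whose libraries total 250,000 clones has a TPM of 20.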
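
The multiple-testing correction named in item 8) can be sketched as follows. This is the standard Benjamini-Hochberg step-up adjustment written in plain Python for illustration; how the t-test p-values are obtained, and how ASTD applies the correction internally, are not shown, and the function names are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(pvalues: List[float], fdr: float = 0.05) -> List[bool]:
    """Flag tests whose adjusted p-value is below the FDR threshold."""
    return [q < fdr for q in benjamini_hochberg(pvalues)]
```

Transcripts flagged True would form the statistically significant set between the two conditions being compared.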


4. FORTHCOMING CHANGES
1) RefSeq mRNAs will be compared to ASTD transcripts and their structure will be displayed in the genomic view.

2) Integration of exon array data will happen in the next release.

3) There is no plan for 2008 to include additional species.


5. COMPREHENSIVE STATISTICS
Please see the statistics page for comprehensive statistics:

    Human at http://www.ebi.ac.uk/astd/statistics.html?tax=9606
    Mouse at http://www.ebi.ac.uk/astd/statistics.html?tax=10090
    Rat at http://www.ebi.ac.uk/astd/statistics.html?tax=10116


6. CONTACT DETAILS
For all queries please contact:

    ASTD team The EMBL Outstation - The European Bioinformatics Institute
    Wellcome Trust Genome Campus
    Hinxton Cambridge
    CB10 1SD United Kingdom
    Telephone: (+44 1223) 494 680
    Telefax: (+44 1223) 494 468

We welcome any comments/questions about the data. Please go to http://www.ebi.ac.uk/support/index.php?query=ASTD with any queries.

7. CITATION
Experimental validation of PolyA sites:

    Moucadel V., Lopez F., Ara T., Benech P., Gautheret D.
    Beyond the 3' end: experimental validation of extended transcript isoforms.
    Nucleic Acids Res 2007 35: 1947-57.

If you want to cite ASD in a publication, please use the following reference:

    Stamm S, Riethoven J-JM, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais NL, Thanaraj TA.
    ASD: a bioinformatics resource on alternative splicing.
    Nucleic Acids Res 2006 34: D46-D55.

If you want to cite ATD in a publication, please use the following reference:

    Le Texier V, Riethoven J-J, Kumanduri V, Gopalakrishnan C, Lopez F, Gautheret D, Thanaraj TA.
    AltTrans: Transcript pattern variants annotated for both alternative splicing and alternative polyadenylation.
    BMC Bioinformatics 2006 7: 169.


8. ACKNOWLEDGEMENTS
-------------------------------------------------------------
ASD consortium members:
Stefan Stamm, Institute of Biochemistry, University Erlangen Nuremberg, 91054 Erlangen, GERMANY.
Peer Bork, Structural and Comp. Biology Dept., European Molecular Biological Laboratory, 69177 Heidelberg, GERMANY.
Roderic Guigo, Genomics Laboratory Group, IMIM/Research on Biomedical Informatics, 08003, Barcelona, SPAIN.
Laurent Bracco, Exonhit Therapeutics, 75013 Paris, FRANCE.
Hermona Soreq, Department of Biological Chemistry, The Hebrew University of Jerusalem, 91904 Jerusalem, ISRAEL.
Rolf Apweiler, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD,
Juan Valcárcel, CRG Barcelona Spain

-------------------------------------------------------------

-------------------------------------------------------------
ATD consortium members:
Peer Bork, Structural and Comp. Biology Dept., European Molecular Biological Laboratory, 69177 Heidelberg, GERMANY.
Christiane Dascher-Nadel, INSERM Transfert, France
Daniel Gautheret, INSERM, France
Roderic Guigo, Genomics Laboratory Group, IMIM/Research on Biomedical Informatics, 08003, Barcelona, SPAIN.
Winston Hide, SANBI, South Africa
Magnus von Knebel, University of Heidelberg, Germany
Jens Reich, Max-Delbrück-Centrum für Molekulare Medizin, Germany
Rolf Apweiler, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD,
Jaak Vilo, Estonian Biocenter, Estonia
-------------------------------------------------------------

-------------------------------------------------------------
Eurasnet consortium members:
Reinhard Lührmann, MPG Göttingen Germany
Göran Akusjärvi, UU Uppsala Sweden
Rolf Apweiler, EMBL Hinxton UK
Gil Ast, TAU Tel Aviv Israel
Didier Auboeuf, INSERM - Paris
Francisco Baralle, ICGEB Trieste Italy
Andrea Barta, MUW Vienna Austria
Jean Beggs, UEDIN Edinburgh UK
Giuseppe Biamonti, CNR Pavia Italy
Albrecht Bindereif, LUG Giessen Germany
Peer Bork, EMBL Heidelberg Germany
Christiane Branlant, CNRS Nancy France
John Brown, SCRI Dundee UK
Javier F. Caceres, MRC Edinburgh UK
Maria Carmo-Fonseca, IMM Lisbon Portugal
Ian Eperon, Unileic Leicester UK
Davide Gabellini, Stem Cell Research Institute - Italy
Artur Jarmolowski, AMU Poznan Poland
Jorgen Kjems, UAAR Århus Denmark
Alberto R. Kornblihtt, FCEN-UBA Buenos Aires Argentina
Angela Krämer, UNIGE Geneva Switzerland
Angus Lamond, UNIVDUN Dundee UK
Karla Neugebauer, MPG Dresden Germany
Daniel Schümperli, UNIBE Bern Switzerland
Bertrand Séraphin, CNRS Gif sur Yvette France
Chris Smith, UCAM-DBIOC Cambridge UK
Hermona Soreq, HUJI Jerusalem Israel
Stefan Stamm, UERLN Erlangen Germany
James Stévenin, IGBMC-GIE Illkirch France
Jamal Tazi, CNRS Montpellier France
Glauco Tocchini-Valentini, CNR Monterotondo Scalo Italy
Henning Urlaub, Max Planck Institute for Biophysical Chemistry - Germany
Juan Valcárcel, CRG Barcelona Spain
Mihaela Zavolan, Biozentrum, Switzerland

-------------------------------------------------------------

References
Kouichi Kimura et al., 'Diversification of transcriptional modulation: Large-scale identification and characterization of putative alternative promoters of human genes', Genome Research 16:55-65, 2006.
Okazaki et al., 'Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs', Nature 420:563-573, 2002.