|Annotation||The process of attaching additional information to biological entities. Annotation can be structural (i.e. identification of the elements from a sequence, such as protein coding regions or the location of regulatory motifs) or functional (i.e. adding biological information to the identified elements, such as the biological function of a protein domain or an entire protein, or the molecular interactions or regulatory role of a nucleotide sequence). Annotation can either be applied automatically or can be manually added (in a process called 'curation') from various sources, such as the scientific literature. At EMBL-EBI, we use a combination of automatic and manual annotation to enrich our databases. |
|Application programming interface||In Ensembl, a series of APIs written in Perl acts as a middle layer that can be used to interface directly with the Ensembl databases. http://www.ensembl.org/info/docs/api/index.html |
|BAM||BAM is a common file format for next-generation sequencing and analysis tools. It is the compressed binary version of a SAM file. |
|CCDS||Consensus Coding Sequence Set. A project to identify a core set of human and mouse protein coding sequences that are agreed upon between the EBI, Wellcome Trust Sanger Institute, UCSC and NCBI. |
|CLUSTAL||A format from the CLUSTALW program that is commonly used to display sequence alignments. |
|COSMIC||Catalogue Of Somatic Mutations in Cancer - This project stores and displays information about somatic mutation relating to human cancers. |
|DAS||The Distributed Annotation System (DAS) defines a protocol that is used to exchange biological annotations for biological entities (such as genomic regions). DAS allows a single machine to collate sequence annotation information from multiple distant servers and display it to the user in a single view, on an as-needed basis. |
|ENCODE||The ENCODE project used defined regions of the human genome to test and evaluate different methods and technologies for finding various functional elements in human DNA. It aims to apply these methods to identify all functional elements in the human genome, such as promoters, enhancers, and other sequences involved in gene regulation. |
|Ensembl||Ensembl is a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute that aims to develop a system that maintains automatic annotation of large eukaryotic genomes. All the software and data are free to access without any constraints. The project is primarily funded by the Wellcome Trust. It is a comprehensive source of stable annotation with confirmed gene predictions that have been integrated from external data sources. Ensembl annotates known genes and predicts new ones, with functional annotation from InterPro, OMIM, SAGE and gene families. |
|Ensembl Genomes||The Ensembl Genomes resource is a collection of five portals for genome-scale data: Ensembl Bacteria, Protists, Fungi, Plants and Metazoa. The resource uses the Ensembl software suite for genome analysis and browsing. |
|FASTA||This tool provides sequence similarity searching against protein databases using the FASTA suite of programs.
You can find out more about FASTA on the Wikipedia page: http://en.wikipedia.org/wiki/FASTA |
|GFF||'General Feature Format' is a tab-delimited text format for describing and exchanging sequence feature information. |
|Gene ontology||Gene Ontology (GO) is a controlled vocabulary used to describe the biology of a gene product in any organism. There are 3 independent sets of vocabularies, or ontologies, that describe: the molecular function of a gene product, the biological process in which the gene product participates and the cellular component where the gene product can be found (http://www.geneontology.org). |
|Havana||The Havana team handles manual annotation of genes for vertebrate genomes, and contributes these annotations to Vega and Ensembl to form the human GENCODE set. Human, mouse, zebrafish, and other genomes are supported. http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/ |
|Homologous||Having evolved from a common ancestor [evolutionary context] |
|InterPro||The EBI’s integrated resource for protein motifs, families and domains. It provides a single, consistent interface of protein signatures contributed by ten different databases, each of which uses a slightly different method for deriving protein signatures. |
|Motif||Short segments of protein 3D structure, which are spatially close but not necessarily adjacent in the sequence |
|MySQL||MySQL is an open-source relational database management system, built around the Structured Query Language (SQL), for the design, population and provision of access to databases. It is conceptually similar to Oracle. http://www.mysql.com/ |
|NCBI Map Viewer||A genome browser housed at NCBI. |
|PDBe||The European resource for the collection, organisation and dissemination of data on biological macromolecular structures. |
|Perl||A text-oriented programming language, widely used in Internet applications. |
|Synteny||The term synteny was originally defined to mean that two gene loci share the same chromosome. In a genomic context, we refer to syntenic regions if both sequence and gene order are conserved between two (closely related) species. |
|UCSC genome browser||A multi-species genome browser housed at University of California, Santa Cruz. http://genome.ucsc.edu/ |
|UniProt||UniProt – Universal Protein Resource: The world's most comprehensive catalogue of information on proteins and a central repository of protein sequence and function, created by joining the information contained in UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, and PIR. http://www.ebi.ac.uk/uniprot/ |
|VEGA||The Vertebrate Genome Annotation database includes manual annotation of human, mouse, and zebrafish genes from the Havana team. http://vega.sanger.ac.uk/index.html |
|accession number||A unique, relatively stable identifier given to a database record, which allows you to track different versions of that record over time in a single data repository.
For example, in the ArrayExpress Archive, experiments and array designs are given unique accession numbers in the format E-XXXX-n for experiments and A-XXXX-n for array designs. XXXX is a four-letter code indicating the source of the submission and n is a number, e.g. E-MEXP-568. Some experiments also have secondary accession numbers.
In the UniProt database, proteins have unique UniProt Accession Numbers (e.g. P04637) and UniProt Protein IDs (e.g. P53_HUMAN). UniProt accessions are unique to specific protein isoforms in specific species, and are used as the standard method for uniquely referencing a protein in EBI resources. UniProt accessions cross-link the entries in various UniProt databases. Most often, researchers will find it useful to follow the UniProt accession back to an entry in UniProtKB/Swiss-Prot to view a curated summary of known information about that protein.
There is an 'ID Mapping' tool on the UniProt homepage which can be useful for converting Accession Numbers to the corresponding identifiers in other databases. |
|chordate||A member of the phylum Chordata. Chordates have a notochord (a flexible, rod-like structure that acts as the main support for the body) at some stage of development. In vertebrates, the notochord is largely replaced by the vertebral column during development. |
|dbSNP||An archive of simple sequence polymorphisms for multiple species. The collection is open to submissions and use by the public, and houses millions of sequence polymorphisms. http://www.ncbi.nlm.nih.gov/projects/SNP/index.html |
|gene regulation||The processes that cells and viruses use to regulate the way that the information in genes is turned into gene products.
Gene regulation gives the cell control over structure and function, and is the basis for cellular differentiation, morphogenesis and the versatility and adaptability of any organism. |
|single nucleotide polymorphism||A single base pair of DNA that is polymorphic (has alternate alleles) with respect to a population. |
|site||A site refers to a particular nucleotide or protein sequence residue, often including the evolutionary equivalent or orthologous position in sequences from other species. |
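
The ArrayExpress accession format described under 'accession number' (E-XXXX-n for experiments, A-XXXX-n for array designs) can be illustrated with a short Python sketch. The regular expression, the `ArrayExpressAccession` type and the `parse_accession` function are hypothetical names introduced here for illustration; the pattern is inferred only from the glossary description (four upper-case letters, then a number) and may not cover every accession in the real archive.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, taken only from the glossary text: "E-" or "A-",
# a four-letter code, a hyphen, then a number. Not an official specification.
ACCESSION_RE = re.compile(r"(?P<kind>[EA])-(?P<code>[A-Z]{4})-(?P<number>\d+)")


class ArrayExpressAccession(NamedTuple):
    kind: str    # "experiment" or "array design"
    code: str    # four-letter code from the accession
    number: int  # trailing number


def parse_accession(text: str) -> Optional[ArrayExpressAccession]:
    """Split an accession into its parts, or return None if it does not fit."""
    match = ACCESSION_RE.fullmatch(text.strip())
    if match is None:
        return None
    kind = "experiment" if match["kind"] == "E" else "array design"
    return ArrayExpressAccession(kind, match["code"], int(match["number"]))


if __name__ == "__main__":
    print(parse_accession("E-MEXP-568"))
    print(parse_accession("P04637"))  # a UniProt accession, so no match here
```

Returning `None` for anything that does not fit keeps the sketch from guessing at identifiers from other databases, such as UniProt accessions.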