|Annotation||The process of attaching additional information to biological entities. Annotation can be structural (i.e. identification of the elements from a sequence, such as protein coding regions or the location of regulatory motifs) or functional (i.e. adding biological information to the identified elements, such as the biological function of a protein domain or an entire protein, or the molecular interactions or regulatory role of a nucleotide sequence). Annotation can either be applied automatically or can be manually added (in a process called 'curation') from various sources, such as the scientific literature. At EMBL-EBI, we use a combination of automatic and manual annotation to enrich our databases. |
|Aspera||Aspera (http://asperasoft.com/) is a company owned by IBM that has produced software for the transmission of data through their patented Fast And Secure Protocol (FASP; http://asperasoft.com/technology/transport/fasp/). It is freely available through the Aspera Connect web browser plug-in (http://downloads.asperasoft.com/connect2/) which can be used for manually uploading big datasets to PRIDE, or downloading public dataset files from the PRIDE Archive. For more information, see: http://www.ebi.ac.uk/pride/help/archive/aspera |
|CDS||Coding DNA sequence - the region of a gene that codes for protein. |
|Capillary sequencing technology||Capillary sequencing technology is the part of the Sanger method used to detect the DNA during sequencing. In modern DNA sequencing, the DNA sample is applied to one end of a capillary tube filled with a viscous gel and an electric field is used to move the DNA through the capillary.
You can find out more about the Sanger method on the Wikipedia page: http://en.wikipedia.org/wiki/Chain_termination_method#Chain-termination_methods |
|Cross-reference||An instance within a database which refers to related or synonymous information in another database. Biological databases cross-reference each other using accession numbers and/or IDs as a way of linking their related knowledge together. |
|Data class||In the European Nucleotide Archive database, a data class divides entries in the database according to the type of data or method used to obtain it. One example is the WGS (whole genome shotgun) data class. |
|Digital Object Identifier||A Digital Object Identifier [no-glossary](DOI)[/no-glossary] is a unique alphanumeric string that is used to identify content. The DOI can be associated with metadata, including a URL to the document. A DOI is useful because it is permanent, whereas a document's location and other metadata may change. http://www.doi.org/ |
|EMBL format||EMBL entries in the database are structured so as to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. Some entries will not contain all of the line types, and some line types occur many times in a single entry. Each entry begins with an identification line (ID) and ends with a terminator line (//). |
|EMBL-Bank||The EBI’s database of nucleotide sequences. A member of the International Sequence Database Collaboration (www.insdc.org), EMBL-Bank exchanges data every 24 hours with the other INSDC databases to ensure that they are all comprehensive and up to date. EMBL-Bank can be accessed from www.ebi.ac.uk/embl/ |
|Ensembl||Ensembl is a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute that aims to develop a system that maintains automatic annotation of large eukaryotic genomes. All the software and data are free to access without any constraints. The project is primarily funded by the Wellcome Trust. It is a comprehensive source of stable annotation with confirmed gene predictions that have been integrated from external data sources. Ensembl annotates known genes and predicts new ones, with functional annotation from InterPro, OMIM, SAGE and gene families. |
|Ensembl Genomes||The Ensembl Genomes resource is a collection of five portals for genome-scale data: Ensembl Bacteria, Protists, Fungi, Plants and Metazoa. The resource uses the Ensembl software suite for genome analysis and browsing. |
|European Nucleotide Archive||The European Nucleotide Archive (ENA) is a comprehensive databank of primary nucleotide sequence information. ENA provides access to both assembled sequence and unassembled (raw) sequence reads, but places them in separate databases in order to optimise accessibility and analysis. http://www.ebi.ac.uk/ena/ |
|FASTA||This tool provides sequence similarity searching against protein databases using the FASTA suite of programs.
You can find out more about FASTA on the Wikipedia page: http://en.wikipedia.org/wiki/FASTA |
|FASTQ||A text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded as a single ASCII character. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has become the de facto standard for storing the output of high throughput sequencing instruments. |
|GOA||The UniProt Gene Ontology Annotation (GOA) program aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). |
|InterPro||The EBI’s integrated resource for protein motifs, families and domains. It provides a single, consistent interface of protein signatures contributed by ten different databases, each of which uses a slightly different method for deriving protein signatures. |
|Intron||A segment (nucleotide sequence) of a DNA or RNA molecule that does not code for proteins. It is removed to generate the final mature RNA product of a gene. |
|Metadata||A term used to describe data that provides additional information about a particular data set. This information can include how, when and where the data set was generated and what standards were used. In the proteomics context, the addition of metadata such as peptide and protein identifications and quantification of their expression values gives meaning to a simple collection of mass spectra output files. |
|Next generation sequencing||Next generation sequencing or high-throughput sequencing technologies parallelise the sequencing process, producing thousands or millions of sequences at once.
You can find out more about NGS/HTS on the Wikipedia page: http://en.wikipedia.org/wiki/Next-generation_sequencing#High-throughput_sequencing |
|REST||Representational state transfer (REST) is a style of software architecture for distributed hypermedia systems such as the World Wide Web.
Definition source: Wikipedia. Full reference: http://en.wikipedia.org/wiki/REST |
|Rfam||Rfam is an open access database, hosted at the Wellcome Trust Sanger Institute, containing information about RNA families.
You can find more information about Rfam on the Wikipedia page: http://en.wikipedia.org/wiki/Rfam |
|Sample||A biological material used in a study, e.g. a mouse, a tumour sample, a bacterial culture, a group of seedlings. |
|Sequence Read Archive||An archive of raw sequence reads: short, unassembled fragments of sequence generated using next generation sequencing technology. |
|Sequence Version||The sequence version appeared on the NI line in Release 47 (it did not exist before this time), with the format 'd' or 'e' or 'g' followed by digits (e.g. d12235345). This format was superseded by the SV line in Release 57, with the current format (e.g. AA123456.1). Queries are converted to uppercase by default, so users should tick the "case sensitive" box when querying by NI. |
|Taxonomic divisions||In the European Nucleotide Archive database, the entries are grouped into taxonomic divisions using NCBI taxonomy. One example is the HUM (human) division. |
|Translation||In molecular biology and genetics, translation is the third stage of protein synthesis. In translation, RNA and ribosomes work together to produce proteins.
For more information on protein translation visit the Wikipedia page: http://en.wikipedia.org/wiki/Translation_(biology) |
|UniProtKB||UniProtKB (UniProt Knowledgebase) is the central access point for extensive curated protein information, including function, classification, and cross-references. |
|XML||Extensible Markup Language (XML) defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
You can find out more about XML on the Wikipedia page: http://en.wikipedia.org/wiki/XML |
|accession number||A unique, relatively stable identifier given to a database record, which allows you to track different versions of that record over time in a single data repository.
For example, in the ArrayExpress Archive, experiments and array designs are given unique accession numbers in the format of E-XXXX-n for experiments and A-XXXX-n for array designs. XXXX is a four-letter code indicating the source of submission and n is a number, e.g. E-MEXP-568. Some experiments also have secondary accession numbers.
In the UniProt database, proteins have unique UniProt Accession Numbers (e.g. P04637) and UniProt Protein IDs (e.g. P53_HUMAN). UniProt accessions are unique to specific protein isoforms in specific species, and are used as the standard method for uniquely referencing a protein in EBI resources. UniProt accessions cross-link the entries in various UniProt databases. Most often, researchers will find it useful to follow the UniProt accession back to an entry in UniProtKB/Swiss-Prot to view a curated summary of known information about that protein.
There is an 'ID Mapping' tool on the UniProt homepage which can be useful for converting Accession Numbers to corresponding identifiers in other databases. |
|curator||A professional scientist who collects, annotates, and validates information that is disseminated by biological and model organism databases. The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database inter-operability. Curators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories. |
|exon||A nucleic acid sequence that is represented in the mature form of an RNA molecule either after portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the sequence in the DNA or its RNA transcript. |
|gene||A molecular unit of heredity of a living organism. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring. All organisms have many genes corresponding to various biological traits, some of which are immediately visible, such as eye colour or number of limbs, and some of which are not, such as blood type or increased risk for specific diseases, or the thousands of basic biochemical processes that comprise life. |
|orthologue||Genes that are found in different species that evolved from a common ancestral gene by speciation. E.g. the human gene BRCA2 and the mouse gene Brca2 are orthologues. Often, orthologues retain the same function in the course of evolution (see paralogue for comparison). |
|paralogue||Genes within the same species that have evolved by duplication. E.g. the human gene FRY and the human gene FRYL are paralogues. Paralogues have often diverged in their function because the additional copy of the gene is redundant and usually free to evolve a new role or function (see orthologue for comparison). |
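
The FASTQ entry above describes one ASCII character per sequence letter and per quality score. The Python sketch below illustrates that idea under some assumptions that are not stated in this glossary: each record occupies exactly four lines (header starting with `@`, sequence, separator starting with `+`, quality string), and quality characters use an offset of 33 (the commonly used Sanger-style encoding). The names `FastqRecord` and `parse_fastq` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumptions (not from the glossary): no line wrapping inside a record,
    and quality characters encoded as score + offset (33 by default).
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # One ASCII character per base: subtract the offset to get the score.
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up single-read example, not taken from any real data set.
EXAMPLE = """@read1
GATTACA
+
IIIIH#5
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
for record in records:
    print(record.identifier, record.sequence, record.qualities)
```

Files produced by other instruments or older pipelines may use a different offset, so the `offset` parameter is left adjustable rather than fixed.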