|Annotation||The process of attaching additional information to biological entities. Annotation can be structural (i.e. identification of the elements of a sequence, such as protein-coding regions or the location of regulatory motifs) or functional (i.e. adding biological information to the identified elements, such as the biological function of a protein domain or an entire protein, or the molecular interactions or regulatory role of a nucleotide sequence). Annotation can either be applied automatically or added manually (in a process called 'curation') from various sources, such as the scientific literature. At EMBL-EBI, we use a combination of automatic and manual annotation to enrich our databases. |
|Aspera||Aspera (http://asperasoft.com/) is a company owned by IBM that has produced software for the transmission of data through their patented Fast And Secure Protocol (FASP; http://asperasoft.com/technology/transport/fasp/). It is freely available through the Aspera Connect web browser plug-in (http://downloads.asperasoft.com/connect2/) which can be used for manually uploading big datasets to PRIDE, or downloading public dataset files from the PRIDE Archive. For more information, see: http://www.ebi.ac.uk/pride/help/archive/aspera |
|CDS||Coding DNA sequence - the region of a gene that codes for protein. |
|Capillary sequencing technology||Capillary sequencing technology is used in the Sanger method to separate and detect DNA fragments during sequencing. In modern DNA sequencing, the DNA sample is applied to one end of a capillary tube filled with a viscous gel, and an electric field is used to move the DNA through the capillary.
You can find out more about the Sanger method on the Wikipedia page: http://en.wikipedia.org/wiki/Chain_termination_method#Chain-termination_methods |
|Cross-reference||An instance within a database which refers to related or synonymous information in another database. Biological databases cross-reference each other using identifiers as a way of linking their related knowledge together. |
|Data class||In the European Nucleotide Archive database, a data class divides entries in the database according to the type of data or the method used to obtain it. One example is the WGS (whole genome shotgun) data class. |
|Digital Object Identifier||A Digital Object Identifier (DOI) is a unique alphanumeric string that is used to identify content. The DOI can be associated with metadata, including a URL to the document. A DOI is useful because it is permanent, whereas a document's location and other metadata may change. For more information, see: http://www.doi.org/ |
|EMBL format||EMBL entries in the database are structured so as to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data that make up the entry. Some entries will not contain all of the line types, and some line types occur many times in a single entry. Each entry begins with an identification line (ID) and ends with a terminator line (//). |
|EMBL-Bank||The EBI’s database of nucleotide sequences. A member of the International Sequence Database Collaboration (www.insdc.org), EMBL-Bank exchanges data every 24 hours with the other INSDC databases to ensure that they are all comprehensive and up to date. EMBL-Bank can be accessed from www.ebi.ac.uk/embl/ |
|Ensembl||Ensembl is a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute that aims to develop a system that maintains automatic annotation of large eukaryotic genomes. All the software and data are free to access without any constraints. The project is primarily funded by the Wellcome Trust. It is a comprehensive source of stable annotation with confirmed gene predictions that have been integrated from external data sources. Ensembl annotates known genes and predicts new ones, with functional annotation from InterPro, OMIM, SAGE and gene families. |
|Ensembl Genomes||The Ensembl Genomes resource is a collection of five portals for genome-scale data: Ensembl Bacteria, Protists, Fungi, Plants and Metazoa. The resource uses the Ensembl software suite for genome analysis and browsing. |
|European Nucleotide Archive||The European Nucleotide Archive (ENA) is a comprehensive databank of primary nucleotide sequence information. ENA provides access to both assembled sequence and unassembled (raw) sequence reads, but places them in separate databases in order to optimise accessibility and analysis. http://www.ebi.ac.uk/ena/ |
|FASTA||A tool that provides sequence similarity searching against protein databases using the FASTA suite of programs.
You can find out more about FASTA on the Wikipedia page: http://en.wikipedia.org/wiki/FASTA |
|FASTQ||A text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has become the de facto standard for storing the output of high-throughput sequencing instruments. |
|FTP||The File Transfer Protocol (FTP) is a standard network protocol used to transfer computer files from one host to another host over a TCP-based network, such as the Internet. |
|GOA||The UniProt Gene Ontology Annotation (GOA) program aims to provide high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). |
|Identifier||A string given to a biological data entity (to allow reference, retrieval and tracking of the entity). Identifiers are usually stable, but some entities can change 'identifier' when they are moved between databases. In such cases a stable [Accession] can be used instead to refer to the same entity regardless of database. |
|InterPro||The EBI’s integrated resource for protein motifs, families and domains. It provides a single, consistent interface of protein signatures contributed by ten different databases, each of which uses a slightly different method for deriving protein signatures. |
|Intron||A segment (nucleotide sequence) of a DNA or RNA molecule that does not code for proteins. It is removed to generate the final mature RNA product of a gene. |
|Metadata||A term used to describe data that provides additional information about a particular data set. This information can include: how, when and where the data set was generated and what standards were used. In the proteomics context the addition of metadata such as peptide and protein identifications and quantification of their expression values gives meaning to a simple collection of mass spectra output files. |
|Next generation sequencing||Next generation sequencing or high-throughput sequencing technologies parallelise the sequencing process, producing thousands or millions of sequences at once.
You can find out more about NGS/HTS on the Wikipedia page: http://en.wikipedia.org/wiki/Next-generation_sequencing#High-throughput_sequencing |
|REST||Representational state transfer (REST) is a style of software architecture for distributed hypermedia systems such as the World Wide Web.
Definition source: Wikipedia. The full reference is here: http://en.wikipedia.org/wiki/REST |
|Rfam||Rfam is an open access database, hosted at the Wellcome Trust Sanger Institute, containing information about RNA families.
You can find more information about Rfam on the Wikipedia page: http://en.wikipedia.org/wiki/Rfam |
|Sample||A biological material used in a study, e.g. a mouse, a tumour sample, a bacterial culture, a group of seedlings. |
|Sequence Read Archive||An archive of raw sequence reads: short, unassembled fragments of sequence generated using next generation sequencing technology. |
|Sequence Version||The sequence version appeared on the NI line in Release 47 (it did not exist before this time), with the format 'd' or 'e' or 'g' followed by digits (e.g. d12235345). This format was superseded by the SV line in Release 57, with the current format (e.g. AA123456.1). Queries are converted to uppercase by default, so users should tick the "case sensitive" box when querying by NI. |
|Taxonomic divisions||In the European Nucleotide Archive database, the entries are grouped into taxonomic divisions using NCBI taxonomy. One example is the HUM (human) division. |
|Translation||In molecular biology and genetics, translation is the third stage of protein synthesis. In translation, RNA and ribosomes work together to produce proteins.
For more information on protein translation visit the Wikipedia page: http://en.wikipedia.org/wiki/Translation_(biology) |
|UniProtKB||UniProtKB (UniProt Knowledgebase) is the central access point for extensive curated protein information, including function, classification and cross-references. |
|XML||Extensible Markup Language (XML) defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
You can find out more about XML on the Wikipedia page: http://en.wikipedia.org/wiki/XML |
|accession||A unique, stable string given to a biological data entity when it is added to a database (to allow reference, retrieval and tracking of the entity). |
|curator||A professional scientist who collects, annotates, and validates information that is disseminated by biological and model organism databases. The role of a biocurator encompasses quality control of primary biological research data intended for publication, extracting and organizing data from original scientific literature, and describing the data with standard annotation protocols and vocabularies that enable powerful queries and biological database inter-operability. Curators communicate with researchers to ensure the accuracy of curated information and to foster data exchanges with research laboratories. |
|exon||A nucleic acid sequence that is represented in the mature form of an RNA molecule either after portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the sequence in the DNA or its RNA transcript. |
|orthologue||Genes that are found in different species that evolved from a common ancestral gene by speciation. E.g. the human gene BRCA2 and the mouse gene Brca2 are orthologues. Often, orthologues retain the same function in the course of evolution (see paralogue for comparison). |
|paralogue||Genes within the same species that have evolved by duplication. E.g. the human gene FRY and the human gene FRYL are paralogues. Paralogues have often diverged in their function because the additional copy of the gene is redundant and usually free to evolve a new role or function (see orthologue for comparison). |
|site||A site refers to a particular nucleotide or protein sequence residue, often including the evolutionary equivalent or orthologous position in sequences from other species. |
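
The EMBL format entry above describes entries made up of typed lines, each starting with an identification line (ID) and ending with a terminator line (//). A minimal Python sketch of how such a file might be split into entries is shown below. It assumes that every line starts with a two-character line code, which is not stated in this glossary. The function names and the example entries (including the accessions X00001 and X00002) are hypothetical and for illustration only.

```python
from collections import defaultdict


def split_embl_entries(text):
    """Yield each entry as a list of lines, from its ID line to its // terminator."""
    current = []
    for line in text.splitlines():
        if line.startswith("ID") and not current:
            # An ID line opens a new entry.
            current = [line]
        elif current:
            current.append(line)
            if line.strip() == "//":
                # The terminator line closes the entry.
                yield current
                current = []


def group_by_line_type(entry_lines):
    """Map each line code to the list of line contents carrying that code."""
    grouped = defaultdict(list)
    for line in entry_lines:
        if line.strip() == "//":
            continue
        # Assumption: the first two characters of a line are its line code.
        code = line[:2]
        grouped[code].append(line[2:].strip())
    return dict(grouped)


# Hypothetical entries, written only to exercise the functions above.
EXAMPLE = """\
ID   X00001; SV 1; linear; mRNA; STD; HUM; 12 BP.
AC   X00001;
DE   Hypothetical example entry
DE   (illustration only)
SQ   Sequence 12 BP;
     acgtacgtac gt                                                            12
//
ID   X00002; SV 1; linear; mRNA; STD; HUM; 4 BP.
AC   X00002;
//
"""

entries = list(split_embl_entries(EXAMPLE))
first = group_by_line_type(entries[0])
```

In this sketch the DE line type occurs twice in the first entry and not at all in the second, which mirrors the point that some line types repeat within an entry while others may be absent.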