Primary and secondary databases
In bioinformatics, and indeed in other data-intensive research fields, databases are often categorised as primary or secondary (Table 2). Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
By contrast, secondary databases comprise data derived from the results of analysing primary data. They are often referred to as curated databases, but this is a bit of a misnomer because primary databases are also curated to ensure that the data in them are consistent and accurate.
Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies (see later section) and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of (often daunting) information on just about any gene or gene product that has been investigated by the research community. The potential for mining this information to make new discoveries is vast. Our job in this course is to lower the activation energy you need to make more of these resources in your research.
Table 2 Essential aspects of primary and secondary databases.
| ||Primary database||Secondary database|
|Synonyms||Archival database||Curated database; knowledgebase|
|Source of data||Direct submission of experimentally-derived data from researchers||Results of analysis, literature research and interpretation, often of data in primary databases|
|Examples||ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)||InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)|
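The classification in Table 2 can also be written down as a small lookup structure, which may be handy when a script needs to treat archival and curated resources differently. The following is a minimal sketch in Python, not an official registry: the `DATABASE_KINDS` mapping and the `kind_of` helper are hypothetical names, and the entries are simply those listed in Table 2.

```python
# Illustrative only: entries are copied from Table 2; all names are hypothetical.
DATABASE_KINDS = {
    "ENA": "primary",
    "GenBank": "primary",
    "DDBJ": "primary",
    "ArrayExpress": "primary",
    "GEO": "primary",
    "PDB": "primary",
    "InterPro": "secondary",
    "UniProt Knowledgebase": "secondary",
    "Ensembl": "secondary",
}


def kind_of(name: str) -> str:
    """Return 'primary', 'secondary' or 'unknown' for a database name."""
    # Compare case-insensitively and ignore surrounding whitespace.
    lookup = {key.lower(): value for key, value in DATABASE_KINDS.items()}
    return lookup.get(name.strip().lower(), "unknown")


if __name__ == "__main__":
    for db in ("GenBank", "InterPro", "Expression Atlas"):
        print(db, "->", kind_of(db))
```

A single label per resource is a simplification. A resource with both primary and secondary characteristics would need a richer representation than this one-word classification.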
Hybrid databases and families of databases
Many data resources have both primary and secondary characteristics. For example, UniProt accepts primary sequences derived from peptide sequencing experiments. However, UniProt also infers peptide sequences from genomic information, and it provides a wealth of additional information, some derived from automated annotation (TrEMBL), and even more from careful manual analysis (Swiss-Prot).
Some databases have different ‘branches’ for primary and secondary data. A good example of this is the ArrayExpress suite of data resources: ArrayExpress contains experimentally derived functional genomics data, whereas the Expression Atlas uses a subset of high-quality data from ArrayExpress to derive knowledge about gene expression patterns under different conditions.
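The relationship between an archival branch and a derived branch can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how ArrayExpress or the Expression Atlas are implemented: the `Measurement` record, the `passes_qc` flag, the `summarise` function and every value in the example are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)  # frozen: archived records are not edited after submission
class Measurement:
    accession: str      # hypothetical accession of the submitted experiment
    gene: str
    condition: str
    expression: float   # invented expression value
    passes_qc: bool     # hypothetical quality flag


def summarise(archive):
    """Derive mean expression per (gene, condition) from QC-passing records."""
    grouped = defaultdict(list)
    for record in archive:
        if record.passes_qc:  # the derived resource uses only a high-quality subset
            grouped[(record.gene, record.condition)].append(record.expression)
    return {key: mean(values) for key, values in grouped.items()}


# Toy "primary" archive: every value below is invented.
archive = [
    Measurement("EXP-1", "geneA", "control", 2.0, True),
    Measurement("EXP-2", "geneA", "control", 4.0, True),
    Measurement("EXP-3", "geneA", "treated", 9.0, True),
    Measurement("EXP-4", "geneA", "treated", 50.0, False),  # excluded by QC
]

# Toy "secondary" view, computed from the archive without modifying it.
atlas = summarise(archive)
print(atlas)
```

The archive itself is never altered here. The derived view is recomputed from it, which mirrors the idea that secondary data are layered on top of an unchanged primary record.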