Introduction to public genetic variation data

A wealth of genetic variation data is generated by the scientific community to investigate diverse subject areas, from understanding human disease to informing selective breeding of common bean varieties.

As well as generating new genetic variation data, the community needs access to previously generated data so that it can be re-evaluated and re-used to test both new and previously established hypotheses.

What EMBL-EBI databases and resources are available for sharing, exploring and understanding genetic variation data?

European Variation Archive (EVA): a database of genetic variation data. Datasets are submitted to EVA by the community to aid data sharing and reuse.

Ensembl: a genome browser that provides a single point of access to annotated genomes. It includes information about genetic variants, population genetics and tools for exploring your own variant data.

GWAS Catalog: a quality-controlled, manually curated database of published genome-wide association studies (GWAS). The GWAS karyotype diagram provides an interactive way of exploring all SNP-trait associations.

UniProt: EMBL-EBI’s resource for protein sequence and annotation data. You can use UniProt’s protein feature viewer to explore variants in relation to protein sequences, structure and function.

Standardised terminology

Each of these databases uses standardised ways of identifying and classifying variants.

For example, the Sequence Ontology (SO) provides a standard nomenclature for categorising variants based on where they fall with respect to genes and other genomic features (Figure 1). For an overview of the identifiers used by different databases, see the section on variant identifiers in part I of this course.

Figure 1 A gene model with possible variant consequences.

In addition to the Sequence Ontology terms, an IMPACT measure, agreed by Ensembl and SnpEff, provides a subjective classification of the severity of each consequence type. Terms commonly used by Ensembl to describe variants are shown in Figure 2.

Figure 2 A list of commonly used sequence ontology (SO) terms for describing variants, from Ensembl. The terms are shown in order of severity (more severe to less severe) as estimated by Ensembl. View the whole list of SO terms.

Known variants in databases are usually annotated with these terms, and tools such as the Ensembl VEP and SnpEff allow you to use them to annotate your own variants.
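To make the relationship between SO terms and the IMPACT measure concrete, the following is a minimal Python sketch, not the method used by Ensembl VEP or SnpEff. It maps a small, illustrative subset of SO consequence terms to the IMPACT rating Ensembl assigns them and picks the most severe term from a list. The names IMPACT_BY_TERM, IMPACT_ORDER and most_severe are hypothetical and exist only for this example. The sketch ranks terms by IMPACT rating only, so terms that share a rating are returned in the order they were given.

```python
# Illustrative subset of SO consequence terms and their IMPACT ratings.
# The full list is much longer; this table is an assumption for the example.
IMPACT_BY_TERM = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "splice_region_variant": "LOW",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}

# IMPACT ratings from most to least severe.
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def most_severe(terms):
    """Return (term, impact) for the most severe known term, or None.

    Terms missing from IMPACT_BY_TERM are ignored. Ties within one
    IMPACT rating are resolved by input order.
    """
    known = [term for term in terms if term in IMPACT_BY_TERM]
    if not known:
        return None
    best = min(known, key=lambda term: IMPACT_ORDER.index(IMPACT_BY_TERM[term]))
    return best, IMPACT_BY_TERM[best]


if __name__ == "__main__":
    # A variant overlapping several transcripts can carry several terms.
    print(most_severe(["intron_variant", "missense_variant", "synonymous_variant"]))
```

In practice the annotation tools report these terms and ratings for you; a lookup like this is mainly useful for filtering or summarising their output.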

There are many different starting points for exploring publicly available genetic variation data. The next section features case studies that will illustrate four different ways you can access genetic variation data using a gene, variant, phenotype or publication as a starting point.