Human genetic variation (II): exploring publicly available data

DNA & RNA, Beginner, 2 hours
Genetic variation is fundamental to the evolution of all species and is what makes us individuals. Our genes have a large influence on our lives. They affect what we look like, our personalities and preferences, and our susceptibility to disease. By studying genetic variation we hope to understand the molecular processes that contribute to life on earth. The study of genetic variation has been used to model human migration, understand the cause of human diseases, and predict disease outcomes.
This is part II of our course on human genetic variation. We will learn how to explore publicly available genetic variation data. Part I of the course [9] introduces some key concepts in the field of human genetic variation, including the types and possible effects of genetic variation, data formats and common genetic variation study types. If you are new to the field we recommend that you work through part I of this course first. The courses focus on heritable (germline) variation and will give you a taste of the resources you can use to explore genetic variation data.
Learning objectives:


Figure 3
Searching for PKD1 in the EVA variant browser (view in EVA [26]).

The European Variation Archive [27] (EVA) is a repository for genetic variation data at EMBL-EBI. To learn more about the EVA you can watch our webinar The European Variation Archive at EMBL-EBI [28].
How can Ensembl help explore genetic variations of PKD1?
Find out next.

Using Ensembl to explore genetic variations in the PKD1 locus

Finding variants
Open access [24] genetic variation datasets in the EVA are fed into Ensembl [11] where they are combined with many different data types from many different resources and services. These digestible chunks of information can be accessed via web browsers, or programmatically via the Ensembl API [29].
This means that we can use Ensembl to gain further insight into the genetic variations that are found within the human PKD1 locus.
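The programmatic route can be sketched as follows. This is a minimal sketch, assuming the public Ensembl REST server at rest.ensembl.org and its lookup/symbol and overlap/region endpoints. The helper names are hypothetical, and the field names read from the responses (seq_region_name, start, end) are assumptions that should be checked against the current REST documentation.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed: public Ensembl REST server


def lookup_url(symbol, species="homo_sapiens"):
    """Build the URL that resolves a gene symbol to genomic coordinates."""
    return f"{ENSEMBL_REST}/lookup/symbol/{species}/{urllib.parse.quote(symbol)}"


def overlap_url(chromosome, start, end, species="homo_sapiens"):
    """Build the URL that lists variants overlapping a genomic region."""
    region = f"{chromosome}:{start}-{end}"
    return f"{ENSEMBL_REST}/overlap/region/{species}/{region}?feature=variation"


def get_json(url):
    """Fetch a URL and decode its JSON body (requires network access)."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def variants_in_gene(symbol):
    """Return the variant records that overlap the named gene (hypothetical helper)."""
    gene = get_json(lookup_url(symbol))
    return get_json(overlap_url(gene["seq_region_name"], gene["start"], gene["end"]))
```

Calling variants_in_gene("PKD1") would return every variant overlapping the locus, unfiltered. The server limits the size of region that can be queried in one request, so larger loci may need to be split into windows.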
For example, let's look at the Ensembl table of variants view for PKD1 (Figure 4):

Figure 4
The Ensembl table of variants for PKD1 (view in Ensembl [30]).

Viewing traits
There are a total of six variants displayed for the human PKD1 gene (Figure 4). These have been filtered for pathogenic, exonic variants that exist at an allele [31] frequency of greater than 0.001.
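The same three criteria can be applied to variant records you have already downloaded. The sketch below is illustrative only: the record layout and the field names clinical_significance, in_exon and allele_frequency are hypothetical and would need to be mapped onto whatever your data source actually provides.

```python
def passes_filter(variant, min_frequency=0.001):
    """Check the three criteria: pathogenic, exonic, frequency above the threshold."""
    # Field names are hypothetical; adapt them to your own records.
    significance = {term.lower() for term in variant.get("clinical_significance", [])}
    frequency = variant.get("allele_frequency")
    return (
        "pathogenic" in significance
        and bool(variant.get("in_exon", False))
        and frequency is not None
        and frequency > min_frequency
    )


def filter_variants(variants, min_frequency=0.001):
    """Keep only the variants that satisfy all three criteria."""
    return [variant for variant in variants if passes_filter(variant, min_frequency)]
```

Variants with no recorded frequency are excluded rather than assumed to be rare or common, which is a design choice you may want to revisit for your own data.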
The pathogenicity of these variants within the human PKD1 locus can be further investigated with Ensembl by viewing the associated traits in the phenotypes section [32] (Figure 5).

Figure 5
Using Ensembl to find traits associated with variants in PKD1 (view in Ensembl [32]).
Here we can see that the majority of traits associated with variants in the human PKD1 gene are held privately at the Human Gene Mutation Database (HGMD). The publicly available associations are mostly to polycystic kidney disease 1.

Next, learn how to use UniProt to learn about protein structure and function.

PKD1 variants, protein structure and function
Now that we know that variants in PKD1 are associated with a disease, we can start to use EMBL-EBI protein-centric resources to understand the potential effects of variants on protein structure and function.
These resources provide detailed information on where in the protein sequence such variants lie, and whether these variants overlap with domains and/or sites, potentially affect post-translational modification residues and/or other protein structural features.
Let's take a look at UniProt's Feature Viewer display for the human PKD1 protein [34] (Figure 6). The human PKD1 protein contains a number of disease reviewed variants. Not all of these variants are known to be pathogenic; however, the UniProt Sequence Viewer shows how the known variants overlap with other protein features. As you can see, there are many disease-causing variants in PKD1. You can learn more about these variants and how they relate to the protein structure by clicking on them.
From UniProt you can also link out to PDBe [35] where you can explore available protein structures for PKD1. We will look at this in more detail in the next case study.
As you have seen, starting with a gene of interest can unearth a wide variety of information to help you understand how variants in that gene contribute to health and disease and influence protein structure and function. In the next case study, we start with a variant.

Case study 2: Search for a variant (rs334)
This case study assumes that you have a variant identifier that you want to learn more about, for example from the results of a variant calling analysis. In this example we will be using rs334, a dbSNP identifier.
One place you might search for the identifier is Ensembl [11]. There, you can find the variant alleles and their source, along with links to further information, including population allele frequencies, sample genotypes, phenotypes that have been associated with the variant, and genes and proteins affected by the variant. rs334 has four observed alleles, where T is the reference (Figure 7).
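The same information can be requested programmatically. The sketch below assumes the Ensembl REST variation endpoint and its optional pops and phenotypes flags. The helper names are hypothetical, and the response fields read in summarise (name, most_severe_consequence, mappings, allele_string) are assumptions to verify against the REST documentation.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed: public Ensembl REST server


def variation_url(rsid, species="human", pops=False, phenotypes=False):
    """Build the URL for one variant, optionally asking for extra sections."""
    flags = []
    if pops:
        flags.append("pops=1")  # population allele frequencies
    if phenotypes:
        flags.append("phenotypes=1")  # phenotype associations
    query = "?" + "&".join(flags) if flags else ""
    return f"{ENSEMBL_REST}/variation/{species}/{rsid}{query}"


def summarise(record):
    """Pick the alleles and most severe consequence out of a decoded response."""
    return {
        "name": record.get("name"),
        "consequence": record.get("most_severe_consequence"),
        "allele_strings": [m.get("allele_string") for m in record.get("mappings", [])],
    }


def fetch_variant(rsid, **options):
    """Download and decode one variant record (requires network access)."""
    request = urllib.request.Request(
        variation_url(rsid, **options), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, summarise(fetch_variant("rs334", pops=True)) would give a compact view of the alleles and consequence shown on the Ensembl variant page.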

Figure 7
Search for the rs334 variant in Ensembl (view in Ensembl [37]).
In the genes and regulation section we can see that rs334 is a missense variant in HBB, a haemoglobin subunit. It is associated with sickle cell anaemia and malaria resistance, and the phenotype-associated A allele is mostly found in African populations (you can see this in the Phenotype data and Population genetics sections). This is consistent with what we know about sickle cell anaemia: it is caused by a deformed haemoglobin protein, resulting in sickled red blood cells. The same change in the protein structure also confers malarial resistance. This is advantageous to heterozygotes if they are exposed to malaria, so the allele is most common in regions where malaria is endemic.
Published on EMBL-EBI Train online (https://www.ebi.ac.uk/training/online)
How can we find associated phenotypes?
Click next to find out.

Exploring phenotypes associated with a variant
To learn more about how rs334 is linked to phenotypes you can search for it in the GWAS Catalog [38]. A search for "rs334" returns the entry for that variant [12], containing all relevant data in the GWAS Catalog, such as studies, associations and traits associated with rs334 (Figure 8).

Figure 8
Studies that mention rs334 and associations between rs334 and traits can be found by searching the GWAS Catalog for rs334 (view in the GWAS Catalog [39]).
In the GWAS Catalog, rs334 is reported to be associated with urinary albumin-to-creatinine ratio and severe malaria. Summary information is provided for each association, including p-value [40], effect size (odds ratio or β-coefficient), risk allele, location and mapped gene. The Catalog includes information describing the GWAS in which this association was identified, including links to the publication, study design and ontology [17] terms to describe the phenotype and allow integration with data from other resources. Since GWAS are generally used to study complex traits, rather than simple Mendelian inheritance, we do not see sickle cell anaemia listed here.
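The association summaries can also be read programmatically. The sketch below assumes the version 1 GWAS Catalog REST API and its singleNucleotidePolymorphisms resource. The base URL, the _embedded wrapper and the field names pvalueMantissa, pvalueExponent, orPerCopyNum and betaNum are all assumptions that should be checked against the current API documentation, and the helper names are hypothetical.

```python
import json
import urllib.request

GWAS_API = "https://www.ebi.ac.uk/gwas/rest/api"  # assumed base URL (version 1 API)


def associations_url(rsid):
    """Build the URL listing the associations recorded for one variant."""
    return f"{GWAS_API}/singleNucleotidePolymorphisms/{rsid}/associations"


def p_value(association):
    """Rebuild the p-value from its separate mantissa and exponent fields."""
    return association["pvalueMantissa"] * 10.0 ** association["pvalueExponent"]


def effect_size(association):
    """Return ('OR', value), ('beta', value), or (None, None) if neither is given."""
    if association.get("orPerCopyNum") is not None:
        return "OR", association["orPerCopyNum"]
    if association.get("betaNum") is not None:
        return "beta", association["betaNum"]
    return None, None


def fetch_associations(rsid):
    """Download the association records for one variant (requires network access)."""
    with urllib.request.urlopen(associations_url(rsid), timeout=30) as response:
        payload = json.load(response)
    return payload.get("_embedded", {}).get("associations", [])
```

An association normally reports either an odds ratio or a β-coefficient, not both, which is why effect_size returns a label alongside the number.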

To learn more about the GWAS Catalog, have a look at our webinar The NHGRI-EBI GWAS Catalog, a curated resource of SNP-trait associations [41].
How can we explore the effects of variants on function?
Find out next.

Understanding the functional consequences of a variant
Now that we know that rs334 is a missense variant in HBB, and that it is associated with sickle cell anaemia, we can start to probe the protein structure to understand the molecular mechanisms underlying this association.

Searching UniProt for variant identifiers
As we saw in the first case study, you can explore proteins and their variants using UniProt [15]. As with genes or proteins you can search UniProt using the variant identifier (rs334).
There is one protein associated with this variant: human hemoglobin subunit B. By looking at the Pathology and Biotech section [42] for this protein we can see that the variant associated with sickle cell anaemia is p.Glu7Val and causes a change from a charged amino acid to a hydrophobic aliphatic amino acid. This is annotated as E -> V at position 7 (Figure 9).
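If you prefer to work programmatically, the variant annotation can be read from the UniProt entry itself. The sketch below assumes the UniProt REST service at rest.uniprot.org and uses the accession P68871, which is the UniProt ID for human hemoglobin beta used later in this case study. The JSON layout it reads (features, featureCrossReferences, alternativeSequence) is an assumption about the current entry format, and the helper names are hypothetical.

```python
import json
import urllib.request

UNIPROT_ENTRY = "https://rest.uniprot.org/uniprotkb/{accession}.json"  # assumed URL pattern


def variants_with_xref(entry, rsid):
    """List (position, original residue, alternative residues) for one dbSNP ID."""
    hits = []
    for feature in entry.get("features", []):
        if feature.get("type") != "Natural variant":
            continue
        xrefs = feature.get("featureCrossReferences", [])
        if any(x.get("database") == "dbSNP" and x.get("id") == rsid for x in xrefs):
            change = feature.get("alternativeSequence", {})
            hits.append((
                feature["location"]["start"]["value"],
                change.get("originalSequence"),
                change.get("alternativeSequences", []),
            ))
    return hits


def fetch_entry(accession):
    """Download one UniProtKB entry as JSON (requires network access)."""
    url = UNIPROT_ENTRY.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Under these assumptions, variants_with_xref(fetch_entry("P68871"), "rs334") should include the E to V change at position 7 described above, alongside any other alleles recorded for the same identifier.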

Figure 9
The UniProt Pathology and Biotech section for human hemoglobin subunit B shows which variants are associated with sickle cell anaemia (view in UniProt [42]).

Using UniProt's feature viewer
If we take a look at the feature viewer [43] for this protein we can see that this variant does not occur within the region of the essential heme binding residues, but does occur in an alpha-helix within a small cluster of charged residues (Figure 10).

Figure 10
The UniProt feature viewer for human haemoglobin subunit B (view in UniProt [44] for full annotations).
In the final section of this case study, view the protein's 3D structure.

Viewing proteins in 3D
Next we can use the Protein Data Bank in Europe [45] (PDBe) to find the three-dimensional structure of haemoglobin and understand how the position of this change relates to the 3D structure of the protein.
There are a number of different ways to search PDBe [46]. In this case we used the UniProt [47] ID for human hemoglobin beta (P68871) as the search term and refined the results using the word 'sickle' to find hemoglobin structures solved for sickle cell phenotypes.
The view below shows the macromolecules tab [48] in the search results for this UniProt ID, which gives a single macromolecule result, specifically the protein described in that UniProt entry (Figure 11).

Figure 11
The macromolecules tab in PDBe showing results for hemoglobin structures solved for sickle cell phenotypes. This is an interactive subsection of the original page, which can be viewed in PDBe [48].
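Structures for a UniProt accession can also be listed programmatically. The sketch below assumes the PDBe REST API's best_structures mapping and the fields pdb_id, chain_id, unp_start and unp_end in its response; these are assumptions to check against the PDBe API documentation, and the helper names are hypothetical. It does not reproduce the 'sickle' keyword refinement used in the website search.

```python
import json
import urllib.request

# Assumed URL pattern for the PDBe best_structures mapping.
PDBE_BEST = "https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/{accession}"


def chains_covering(payload, accession, position):
    """List (PDB ID, chain) pairs whose mapped range includes a UniProt position."""
    hits = []
    for entry in payload.get(accession, []):
        if entry["unp_start"] <= position <= entry["unp_end"]:
            hits.append((entry["pdb_id"], entry["chain_id"]))
    return hits


def fetch_best_structures(accession):
    """Download the structure mappings for one accession (requires network access)."""
    url = PDBE_BEST.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

For example, chains_covering(fetch_best_structures("P68871"), "P68871", 7) would list the chains that include the variant position in UniProt numbering.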

Comparing 2D and 3D structures
Suppose you want to explore the 2D and 3D structures of the hemoglobin subunit beta to understand how genetic variants influence the protein structure. You can do this by clicking on the name of this macromolecule, i.e. Hemoglobin subunit beta. This will take you through to the macromolecules page for that specific structure [49] (Figure 12). Here you are presented with sequence views for that particular protein, as well as a 2D topology graphic and a 3D structure viewer. The first section in the sequence view (Molecule) shows any sequence annotations for the protein in this structure, highlighted in orange (1D sequence annotation). Hovering over the orange bar for this example will display the change of residue 6 from glutamate (E) to valine (V), as is the case for sickle cell hemoglobin variants. Note that this is the same change that UniProt annotates at position 7: the structure numbers the mature protein, whereas the UniProt sequence also counts the initiator methionine. If you hover over residue 6 in the schematic of the Topology 2D diagram, you will find that this specific amino acid is highlighted in yellow on the surface of the hemoglobin molecule in the 3D structure view.
The change of this amino acid in the sickle cell variant is from a hydrophilic residue (glutamate) to a hydrophobic residue (valine). This change generates a 'sticky patch' on the surface of the protein because the 'water loving' amino acid has been swapped for a 'water hating' one. This causes multiple hemoglobin complexes to associate via this hydrophobic valine residue. The hemoglobin molecules consequently aggregate into fibres, producing cells with the sickle phenotype that is observed for this variant [50]. From this example you can see that by looking at the structure and understanding the type of variation involved, you can begin to draw functional conclusions about the consequences of variation.
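The size of this shift can be put into numbers. The sketch below uses the Kyte-Doolittle hydropathy scale, one commonly used scale that is not part of this course, in which more positive values mean more hydrophobic. The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (more positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_change(reference, alternative):
    """Return the hydropathy shift caused by swapping one residue for another."""
    shift = KYTE_DOOLITTLE[alternative] - KYTE_DOOLITTLE[reference]
    return round(shift, 1)  # the scale itself only has one decimal place


# Glutamate (E) to valine (V), the sickle cell substitution.
print(hydropathy_change("E", "V"))  # 7.7
```

A shift of 7.7 is close to the largest possible on this scale, whose values run from -4.5 to 4.5, which fits the description of a hydrophilic residue being replaced by a strongly hydrophobic one. A single number like this is only a rough guide; the structural context described above is what explains the aggregation.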
In the next case study, we search for a phenotype.

Case study 3: Search for a phenotype (non-melanoma skin cancer)
With a known phenotype, you may wish to find studies focusing on it as well as variants and genes that are associated with the phenotype. In this case study we look at how you can find studies, variants and genes associated with the phenotype "non-melanoma skin cancer".
To find studies looking at the phenotype, as well as associated genes and variants, you can search for "non-melanoma skin cancer" in the GWAS Catalog [38] (Figure 13).

Figure 13
Search results for non-melanoma skin cancer in the GWAS Catalog. By searching for a phenotype you can find links to variants and studies where these links were found (view in GWAS Catalog [51]).
There are several studies in the GWAS Catalog that focus on this trait. They identify many SNPs associated with non-melanoma skin cancer or more specific sub-types, including three in MC1R.
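If you download association results as a tab-separated file, you can group the SNPs for a trait by gene yourself. The sketch below assumes a file with the column headers DISEASE/TRAIT, SNPS and MAPPED_GENE, as used in GWAS Catalog association downloads; check the header of your own file before relying on it. The function name is hypothetical.

```python
import csv
import io
import re


def snps_by_gene(tsv_text, trait_keyword):
    """Map each mapped gene to the set of SNPs reported for a matching trait."""
    keyword = trait_keyword.lower()
    genes = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if keyword not in row["DISEASE/TRAIT"].lower():
            continue
        # A field may list several genes, separated by ",", ";" or " - ".
        for gene in re.split(r"\s*(?:,|;| - )\s*", row["MAPPED_GENE"].strip()):
            if gene:
                genes.setdefault(gene, set()).add(row["SNPS"])
    return genes
```

With the file contents read into a string, snps_by_gene(text, "non-melanoma skin cancer").get("MC1R") would give the set of SNPs mapped to MC1R for that trait. Matching on a keyword is deliberately loose, so it will also pick up more specific sub-types whose names contain the same phrase.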
Another place to find information about phenotypes is Ensembl (Figure 14). Since Ensembl is not restricted by method type, it includes phenotype associations from GWAS, as well as sources such as ClinVar

describes the biological and molecular functions of genes associated with the phenotypes along with references.
In case study 2 we saw how you can use Ensembl, UniProt [47] and PDBe [58] to explore a gene or variant in more detail. Open Targets [59] is another resource that you can use. It was originally designed for validating potential drug targets but is also useful for getting an overview of the data and features associated with a phenotype or gene.
As we saw in the GWAS Catalog, SNPs in MC1R are associated with non-melanoma skin cancer. If we look at this gene in Open Targets we can see that it is also associated with hair colour. Indeed, people with red hair and freckles often have certain SNPs in MC1R (Figure 15).

Figure 15
Searching for MC1R in Open Targets reveals that it is associated with hair colour. This is an interactive subsection; the full site can be viewed in Open Targets [60].
In the last case study, we search for data related to a specific publication.

Case study 4: Starting with the literature
Sometimes you might start with a specific publication. Perhaps you want to access the data that is described so that you can include it in your own analyses.
In this case study we are going to use the Nature paper [61] from the Wellcome Trust Case Control Consortium (WTCCC) as the starting point and investigate how we can access and analyse the data. The WTCCC is a consortium that was put together to help understand human variation using high-throughput technology.
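If you already know a paper's identifier, the lookup can also be scripted. The sketch below assumes the Europe PMC REST search service and its EXT_ID and SRC query fields, and uses the WTCCC paper's PubMed ID, 17554300. The response fields it reads (resultList, result, pmid, pubYear, title) are assumptions to check against the Europe PMC web service documentation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Europe PMC REST search service.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def search_url(pmid):
    """Build a search URL that matches one PubMed ID in the MEDLINE source."""
    params = {"query": f"EXT_ID:{pmid} AND SRC:MED", "format": "json"}
    return f"{EUROPE_PMC_SEARCH}?{urllib.parse.urlencode(params)}"


def summarise_hits(payload):
    """Reduce a decoded search response to (pmid, year, title) tuples."""
    results = payload.get("resultList", {}).get("result", [])
    return [(hit.get("pmid"), hit.get("pubYear"), hit.get("title")) for hit in results]


def find_paper(pmid):
    """Run the search and summarise the hits (requires network access)."""
    with urllib.request.urlopen(search_url(pmid), timeout=30) as response:
        return summarise_hits(json.load(response))
```

Calling find_paper("17554300") should return a single hit for the WTCCC paper if the identifier and query fields are as assumed.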
You can find the paper in Europe PMC [62] by searching for keywords (e.g. author names, phenotypes of interest, etc.) or known literature identifiers (e.g. PMID [63]:17554300) (Figure 16). Searching for the paper in Europe PMC is a useful way to get started as the search results include direct links to information about the genes, proteins,

Figure 18
The processed data from the WTCCC publication are available at the GWAS Catalog (view in GWAS Catalog [69]).
The data from the WTCCC GWAS can be downloaded and reused in meta-analyses. For example, Zeggini et al. combined these data with two other GWAS to uncover SNPs associated with type II diabetes [70].
It is also possible to browse specific variants using the GWAS Catalog as a starting point, as we saw in case study 2 [71].
To learn more about a particular variant we can link from the GWAS Catalog to Ensembl to further analyse the associated information. In turn, you can probe the effect of specific variants on protein structure and function using UniProt and PDBe as we did in case study 2 [71].
In the final page of this course, discover other resources to help you learn more.

Summary
In parts I and II of this course we have introduced some key topics in the field of human genetic variation. In this part of the course we've learnt about some of the public resources available for exploring genetic variation data. In the case studies we used these resources to answer specific biological questions and showed you how you can use genes, variants, phenotypes or the literature as a starting point in your research.