Exploring data and metadata from a genome-wide association study
You may like to try this online tutorial to get to know the GWAS Catalog before starting to work on this mini-project, but it is not essential.
Scenario
The GWAS Catalog is a richly annotated database of human genome-wide association studies, which analyse associations between genetic variants and a disease or other trait of interest in a sample of individuals from a particular population. The aim of the Catalog is to provide detailed information across large numbers of studies, to allow users to search, compare, visualise and download GWAS data from multiple sources in a single place.
In this mini-project, you will take the role of a GWAS Catalog curator. Start by selecting a publication, which may contain one or more individual studies. Following a modified version of our curation workflow, you’ll need to extract the most important information about each study, and decide as a group how best to represent this metadata using controlled vocabularies and ontologies. In particular, we’ll focus on representing traits and sample ancestry.
Dataset
GWAS publications
Choose a GWAS publication from the list below. All of them include the main pieces of information that curators need to extract, but some aspects will be more complex than others, depending on the publication. Use EuropePMC to find your chosen publication and get access to the full text. Take a few minutes to scan the paper, focusing on the Results and Methods sections, but don’t feel like you have to read the text in full right away.
| PMID | Title |
| 32313116 | Genome-wide association study of Buruli ulcer in rural Benin highlights role of two LncRNAs and the autophagy pathway. |
| 34775353 | Genome-wide association study of hospitalized COVID-19 patients in the United Arab Emirates. |
| 33993232 | Identification of a shared genetic risk locus for Kawasaki disease and IgA vasculitis by a cross-phenotype meta-analysis. |
| 36275661 | Genome-wide association study of SNP- and gene-based approaches to identify susceptibility candidates for lupus nephritis in the Han Chinese population. |
Project aims
Study design
- How many studies (i.e. individual GWAS analyses) are included in the publication?
- Are they comparing cases vs controls? Or are they analysing a continuous trait?
- Are there multiple stages (i.e. discovery, replication)?
Traits
GWAS traits can include anything from common diseases (e.g. type 2 diabetes), to molecular measurements (e.g. cholesterol levels) and even behavioural traits (e.g. coffee consumption). Some traits are quite broad (e.g. autoimmune disease), while others are highly specific (e.g. systemic juvenile idiopathic arthritis); and sometimes authors use quite different names to refer to the same trait. To ensure consistency across studies, we use standardised terms from the Experimental Factor Ontology (EFO).
- How does the author describe the trait(s) under investigation in each study (Reported Trait)?
- Is there just one main trait under investigation, or are several traits examined simultaneously? Is there also a background trait – one that all participants share, even though the focus of the analysis is on the main trait?
- How would you represent these trait(s) using standardised ontology terms? Use the Ontology Lookup Service (OLS) to find the most appropriate terms from EFO.
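If you would rather search OLS from a script than from the web page, the sketch below builds a search URL restricted to EFO. The endpoint address and the parameter names (`q`, `ontology`, `rows`) are assumptions about the OLS search API, and `build_efo_search_url` is a hypothetical helper, so check the OLS documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed address of the OLS search endpoint; verify against the OLS documentation.
OLS_SEARCH_URL = "https://www.ebi.ac.uk/ols4/api/search"


def build_efo_search_url(reported_trait: str, rows: int = 10) -> str:
    """Build a search URL for a reported trait, limited to the EFO ontology."""
    params = {"q": reported_trait, "ontology": "efo", "rows": rows}
    return f"{OLS_SEARCH_URL}?{urlencode(params)}"


# The reported trait here is taken from one of the paper titles in the list above.
print(build_efo_search_url("Buruli ulcer"))
```

Opening the printed URL in a browser, or fetching it with an HTTP client, should return candidate terms that you can then compare against the author's description of the trait.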
Sample ancestry
The ancestry of study participants is important to record, because genetic associations in one population do not necessarily hold true in another population with a different genetic background. Ancestry is often described in different ways by different authors, and can sometimes be difficult to infer from the information provided. The GWAS Catalog has developed its own standard framework to represent ancestry information, including a controlled vocabulary of ancestry categories to represent broad regional population groupings.
- How many individuals were included in each study? Are they divided into discovery and replication stages?
- Which standardised ancestry categories are included in the sample?
- The GWAS Diversity Monitor tracks the representation of diverse ancestries in GWAS studies across different traits. Think about the ancestry categories included in the paper you are looking at. How well are these represented in GWAS research in general, and in this trait area in particular?
Optional further work
- What is the most significant (i.e. lowest p-value) association identified in your paper? If there are both discovery and replication stages, then look out for the combined or meta-analysis p-value, which includes both stages.
- Look up your paper in the GWAS Catalog. How well do your annotations match up with the Catalog’s existing annotations?
- How many other studies have been conducted for the same trait(s)? Which ancestries are represented in them?
- How many associations have been identified for those traits overall? Which is the most significant? Is any particular region of the genome identified in multiple studies?
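If you copy the reported associations into a list, picking out the most significant one is a small comparison task. The sketch below is one possible way to do it; the variant labels and p-values are placeholders for illustration, not results from any of the listed papers. P-values are kept as strings and compared as `Decimal` numbers, because very small GWAS p-values can fall below the smallest value a floating-point number can hold.

```python
from decimal import Decimal


def most_significant(associations):
    """Return the (variant, p_value) pair with the lowest p-value.

    `associations` is a list of (variant label, p-value string) pairs,
    with p-values written as in the paper, e.g. "5e-12".
    """
    return min(associations, key=lambda pair: Decimal(pair[1]))


# Placeholder values for illustration only.
example = [
    ("variant_A", "3e-8"),
    ("variant_B", "5e-12"),
    ("variant_C", "1.2e-9"),
]

print(most_significant(example))
```

When a paper reports discovery, replication and combined p-values, pass in the combined values so that the comparison covers both stages.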
Curation sheets
Use these sheets as a guide for your annotations. You might not have to use all of the fields for your chosen paper.
Traits
| Study | Reported trait | Main EFO term | Background EFO term |
| 1 | | | |
| 2 | | | |
Samples
| Study | Stage | Number of individuals | Number of cases | Number of controls |
| 1 | Discovery | | | |
| | Replication | | | |
| 2 | Discovery | | | |
| | Replication | | | |
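If your group prefers to keep the Samples sheet in a script rather than a table, a row could be sketched as below. The class and field names are hypothetical and simply mirror the columns above; the numbers are placeholders. The check assumes that, in a case-control study, cases and controls together make up all individuals in that stage, and that both fields are left empty for a continuous trait.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRow:
    study: int
    stage: str                      # "Discovery" or "Replication"
    individuals: int
    cases: Optional[int] = None     # leave empty for a continuous trait
    controls: Optional[int] = None  # leave empty for a continuous trait

    def is_consistent(self) -> bool:
        """Check that cases and controls, if given, add up to the total."""
        if self.cases is None and self.controls is None:
            return True
        if self.cases is None or self.controls is None:
            return False
        return self.cases + self.controls == self.individuals


# Placeholder numbers for illustration only.
row = SampleRow(study=1, stage="Discovery", individuals=1000, cases=400, controls=600)
print(row.is_consistent())
```

A row that fails the check is a prompt to reread the Methods section, since papers sometimes report sample sizes before and after quality control.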