G2P VEP plugin

The G2P VEP plugin identifies likely disease-causing genes based on the knowledge encoded in the G2P database. It runs as part of the Variant Effect Predictor (VEP).

Variant Effect Predictor

The VEP computes the consequence of a variant and can further annotate the variant with additional data types, which are selected through options when running the script. In most cases the consequence is computed with respect to the transcript that overlaps the variant. If the input file contains variant data for a set of individuals, the VEP generates one line of output for each combination of variant allele and overlapping transcript per individual.

How the plugin works:

The G2P VEP plugin can add further annotation to each line of output based on the individual's genotypes and the knowledge contained in the G2P database. The plugin applies a set of filters to identify potentially causal variants (variant hits). If it counts a sufficient number of variant hits for a G2P gene, it reports the gene as likely disease-causing, together with all variants that passed the filters. The number of hits required is derived from the allelic requirement of the gene, which is stored in the G2P database.
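
The gene-level decision described above can be sketched in Python. This is an illustration only and not the plugin's own code: the function `genes_to_report`, the gene names, the variant identifiers and the required hit counts are all hypothetical, and the sketch handles a single individual and ignores zygosity.

```python
from collections import defaultdict


def genes_to_report(variant_hits, required_hits):
    """Group variant hits by gene and keep genes with enough hits.

    variant_hits:  iterable of (gene, variant_id) pairs that passed the filters
    required_hits: dict mapping gene -> number of hits required
                   (in the plugin this is derived from the allelic requirement;
                   here the numbers are supplied by the caller)
    """
    hits_by_gene = defaultdict(list)
    for gene, variant_id in variant_hits:
        hits_by_gene[gene].append(variant_id)
    return {
        gene: variants
        for gene, variants in hits_by_gene.items()
        if len(variants) >= required_hits[gene]
    }


# Made-up example data, for illustration only.
variant_hits = [
    ("GENE_A", "var1"),
    ("GENE_B", "var2"),
    ("GENE_B", "var3"),
    ("GENE_C", "var4"),
]
required_hits = {"GENE_A": 1, "GENE_B": 2, "GENE_C": 2}

reported = genes_to_report(variant_hits, required_hits)
print(reported)
```

In this example GENE_C is not reported because it has one hit but two are required.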

Filtering rules:

A variant is considered potentially causal if it passes all of the following filtering steps:

  1. The variant overlaps a G2P gene.
  2. The variant consequence is in the list of severe consequences. The default list contains the following terms: splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant.
  3. All allele frequencies from co-located variants in reference populations (1000 Genomes project, ESP, gnomAD) are below a given threshold. The default threshold is 0.005 for an allele in a biallelic gene and 0.0001 for an allele in a monoallelic gene.

The number of variant hits required is determined by the gene's allelic requirement.
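
The three filtering steps above can be sketched as a single Python function. This is a simplified illustration, not the plugin's implementation: the function `is_potentially_causal` and its arguments are hypothetical, the comparison is assumed to be strict ("below" the threshold), and a variant with no known population frequencies is assumed to pass the frequency step.

```python
# Default severe consequence terms, copied from the list above.
SEVERE_CONSEQUENCES = {
    "splice_donor_variant", "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "initiator_codon_variant",
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "coding_sequence_variant", "start_lost", "transcript_ablation",
    "transcript_amplification", "protein_altering_variant",
}

# Default allele frequency thresholds per allelic requirement (from the text above).
AF_THRESHOLDS = {"monoallelic": 0.0001, "biallelic": 0.005}


def is_potentially_causal(overlaps_g2p_gene, consequences, population_afs,
                          allelic_requirement):
    """Return True if a variant passes all three filtering steps (hypothetical helper)."""
    # Step 1: the variant must overlap a G2P gene.
    if not overlaps_g2p_gene:
        return False
    # Step 2: at least one consequence must be in the severe list.
    if not SEVERE_CONSEQUENCES.intersection(consequences):
        return False
    # Step 3: every known population frequency must be below the threshold.
    # Assumption: an empty frequency list passes.
    threshold = AF_THRESHOLDS[allelic_requirement]
    return all(af < threshold for af in population_afs)


# The same variant can pass for a biallelic gene and fail for a monoallelic one.
print(is_potentially_causal(True, ["missense_variant"], [0.001], "biallelic"))
print(is_potentially_causal(True, ["missense_variant"], [0.001], "monoallelic"))
```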

Software

Installing and running the VEP and G2P VEP plugin

For installing and running the VEP script, please refer to the VEP git repository and the VEP documentation pages. Plugins are installed and configured during the VEP installation. The G2P VEP plugin is located in the VEP plugins repository.

To run the G2P VEP plugin, add the --plugin argument to the VEP command:

./vep -i input.vcf --plugin G2P,file='DDG2P.csv'

[Figure: vep_g2p_plugin_overview]

Options are passed to the plugin as comma-separated key=value pairs:

| Key | Description | Default value |
| --- | --- | --- |
| file | Path to the G2P data file. The file needs to be uncompressed. Download from http://www.ebi.ac.uk/gene2phenotype/downloads or from PanelApp. | |
| af_monoallelic | Maximum allele frequency for inclusion for monoallelic genes. | 0.0001 |
| af_biallelic | Maximum allele frequency for inclusion for biallelic genes. | 0.005 |
| confidence_levels | Confidence levels to include: confirmed, probable, possible, both RD and IF. Separate multiple values with '&'. See https://www.ebi.ac.uk/gene2phenotype/terminology | confirmed, probable |
| all_confidence_levels | Set value to 1 to include all confidence levels: confirmed, probable and possible. | 0 |
| af_from_vcf | Set value to 1 to include allele frequencies from VCF files. The location of the VCF files is configured in ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json or ensembl-vep/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json, depending on how the ensembl-variation API was installed. | 0 |
| af_from_vcf_keys | Select VCF collections. Separate multiple values with '&'. | topmed, uk10k, gnomADe, gnomADg |
| variant_include_list | A list of variants to include even if they do not pass allele frequency filtering. The include list needs to be a sorted, bgzipped and tabixed VCF file. | |
| types | SO consequence types to include. Separate multiple values with '&'. | splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, coding_sequence_variant, start_lost, transcript_ablation, transcript_amplification, protein_altering_variant |
| log_dir | Directory for the log files that are used for writing intermediate results. The log files can be consulted for any frequency filtering decisions. | current_working_dir/g2p_log_dir_[year]_[mon]_[mday]_[hour]_[min]_[sec] |
| txt_report | Write all G2P complete genes and attributes to a txt file. | current_working_dir/txt_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].txt |
| html_report | Write all G2P complete genes and attributes to an html file. | current_working_dir/html_report_[year]_[mon]_[mday]_[hour]_[min]_[sec].html |
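
As a rough sketch of how these options combine into one --plugin argument, the Python snippet below joins key=value pairs with commas and multiple values with '&'. The helper `build_plugin_argument` is hypothetical and not part of VEP; the option names and values are taken from the table above. Because '&' is special in most shells, the argument needs quoting on the command line, which `shlex.join` does here.

```python
import shlex


def build_plugin_argument(options):
    """Join options into a 'G2P,key=value,...' string (hypothetical helper).

    List or tuple values are joined with '&', as required for multi-value options.
    """
    parts = ["G2P"]
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            value = "&".join(str(item) for item in value)
        parts.append(f"{key}={value}")
    return ",".join(parts)


options = {
    "file": "DDG2P.csv",
    "af_monoallelic": "0.0001",
    "af_biallelic": "0.005",
    "confidence_levels": ["confirmed", "probable"],
}

plugin_argument = build_plugin_argument(options)
command = ["./vep", "-i", "input.vcf", "--plugin", plugin_argument]

# shlex.join quotes the plugin argument because it contains '&'.
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` would avoid shell quoting altogether, since no shell is involved in that case.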

Allele frequencies from reference populations

The G2P plugin filters input variants on allele frequencies. The allele frequencies are retrieved from major genotyping projects such as the 1000 Genomes project, ESP and gnomAD. The VEP provides a cache which contains allele frequencies in order to speed up the variant annotation. VEP's cache currently contains frequency data only for alleles that have been submitted to dbSNP.

Allele frequencies for input variants that are not yet in dbSNP can be retrieved from VCF files. To enable this, add af_from_vcf=1 to the plugin options in the VEP command:

./vep -i input.vcf --plugin G2P,file='DDG2P_11_7_2017.csv',af_from_vcf=1

Available population allele frequency data

| Reference population short name | Description | Source |
| --- | --- | --- |
| minor_allele_freq | global allele frequency (AF) from 1000 Genomes Phase 3 data | VEP cache |
| AA | Exome Sequencing Project 6500:African_American | VEP cache |
| AFR | 1000GENOMES:phase_3:AFR | VEP cache |
| AMR | 1000GENOMES:phase_3:AMR | VEP cache |
| EA | Exome Sequencing Project 6500:European_American | VEP cache |
| EAS | 1000GENOMES:phase_3:EAS | VEP cache |
| EUR | 1000GENOMES:phase_3:EUR | VEP cache |
| SAS | 1000GENOMES:phase_3:SAS | VEP cache |
| gnomAD | Genome Aggregation Database:Total | VEP cache |
| gnomAD_AFR | Genome Aggregation Database exomes r2.1:African/African American | VEP cache |
| gnomAD_AMR | Genome Aggregation Database exomes r2.1:Latino | VEP cache |
| gnomAD_ASJ | Genome Aggregation Database exomes r2.1:Ashkenazi Jewish | VEP cache |
| gnomAD_EAS | Genome Aggregation Database exomes r2.1:East Asian | VEP cache |
| gnomAD_FIN | Genome Aggregation Database exomes r2.1:Finnish | VEP cache |
| gnomAD_NFE | Genome Aggregation Database exomes r2.1:Non-Finnish European | VEP cache |
| gnomAD_OTH | Genome Aggregation Database exomes r2.1:Other (population not assigned) | VEP cache |
| gnomAD_SAS | Genome Aggregation Database exomes r2.1:South Asian | VEP cache |
| TOPMed | Trans-Omics for Precision Medicine (TOPMed) Program | VCF file |
| ALSPAC | UK10K:ALSPAC cohort | VCF file |
| TWINSUK | UK10K:TWINSUK cohort | VCF file |
| gnomADe:afr | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:ALL | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:amr | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:asj | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:eas | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:fin | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:nfe | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:oth | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADe:sas | Genome Aggregation Database exomes r2.1 | VCF file |
| gnomADg:ALL | Genome Aggregation Database genomes v3:All gnomAD genomes individuals | VCF file |
| gnomADg:afr | Genome Aggregation Database genomes v3:African/African American | VCF file |
| gnomADg:ami | Genome Aggregation Database genomes v3:Amish | VCF file |
| gnomADg:amr | Genome Aggregation Database genomes v3:Latino/Admixed American | VCF file |
| gnomADg:asj | Genome Aggregation Database genomes v3:Ashkenazi Jewish | VCF file |
| gnomADg:eas | Genome Aggregation Database genomes v3:East Asian | VCF file |
| gnomADg:fin | Genome Aggregation Database genomes v3:Finnish | VCF file |
| gnomADg:nfe | Genome Aggregation Database genomes v3:Non-Finnish European | VCF file |
| gnomADg:sas | Genome Aggregation Database genomes v3:South Asian | VCF file |
| gnomADg:oth | Genome Aggregation Database genomes v3:Other (population not assigned) | VCF file |

Example input and output files

Remarks

PanelApp

The G2P VEP plugin accepts PanelApp data files as input. We use the following mappings to translate between the terminologies used by G2P and PanelApp.

| G2P confidence | PanelApp Gene Ratings |
| --- | --- |
| Confirmed | Green |
| Probable | |
| Possible | |

| G2P allelic requirement | PanelApp Model of inheritance |
| --- | --- |
| monoallelic | MONOALLELIC, autosomal or pseudoautosomal, not imprinted |
| monoallelic | MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted (paternal allele expressed) |
| monoallelic | MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted (maternal allele expressed) |
| monoallelic | MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown |
| monoallelic | BOTH monoallelic and biallelic, autosomal or pseudoautosomal |
| monoallelic | BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal |
| biallelic | BIALLELIC, autosomal or pseudoautosomal |
| biallelic | BOTH monoallelic and biallelic, autosomal or pseudoautosomal |
| biallelic | BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more SEVERE disease form), autosomal or pseudoautosomal |
| hemizygous | X-LINKED: hemizygous mutation in males, biallelic mutations in females |
| X-linked dominant | X-LINKED: hemizygous mutation in males, monoallelic mutations in females may cause disease (may be less severe, later onset than males) |
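
The allelic requirement mapping above can be written as a lookup table. The Python sketch below is only an illustration of the mapping, not how the plugin stores it; the names `PANELAPP_TO_G2P` and `g2p_allelic_requirements` are hypothetical, and returning an empty list for an unlisted model of inheritance is an assumption.

```python
# PanelApp "Model of inheritance" -> G2P allelic requirement(s),
# transcribed from the mapping above.
PANELAPP_TO_G2P = {
    "MONOALLELIC, autosomal or pseudoautosomal, not imprinted": ["monoallelic"],
    "MONOALLELIC, autosomal or pseudoautosomal, maternally imprinted "
    "(paternal allele expressed)": ["monoallelic"],
    "MONOALLELIC, autosomal or pseudoautosomal, paternally imprinted "
    "(maternal allele expressed)": ["monoallelic"],
    "MONOALLELIC, autosomal or pseudoautosomal, imprinted status unknown": ["monoallelic"],
    "BIALLELIC, autosomal or pseudoautosomal": ["biallelic"],
    # The two "BOTH" models map to both G2P allelic requirements.
    "BOTH monoallelic and biallelic, autosomal or pseudoautosomal":
        ["monoallelic", "biallelic"],
    "BOTH monoallelic and biallelic (but BIALLELIC mutations cause a more "
    "SEVERE disease form), autosomal or pseudoautosomal":
        ["monoallelic", "biallelic"],
    "X-LINKED: hemizygous mutation in males, biallelic mutations in females":
        ["hemizygous"],
    "X-LINKED: hemizygous mutation in males, monoallelic mutations in females "
    "may cause disease (may be less severe, later onset than males)":
        ["X-linked dominant"],
}


def g2p_allelic_requirements(model_of_inheritance):
    """Translate a PanelApp model of inheritance into G2P terms.

    Assumption: models that are not listed above yield an empty list.
    """
    return PANELAPP_TO_G2P.get(model_of_inheritance, [])


print(g2p_allelic_requirements("BIALLELIC, autosomal or pseudoautosomal"))
print(g2p_allelic_requirements(
    "BOTH monoallelic and biallelic, autosomal or pseudoautosomal"))
```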