IPD - MHC Database

Help with the Sequence Alignments

The sequence alignment form contains the following options:

Select Species or Group - this option allows the user to choose the species or group of species from which to select an alignment. Some groups, like NHP, provide sequences for many different species; other groups, like DLA, provide only a single set of alignments to cover all species.

Select Locus - this option allows the user to choose which of the HLA or related genes to align. The locus is selected from the drop-down menu. The menu also includes a number of specialist choices, such as multiple alignments for all the DRB1, 3, 4 & 5 alleles or the DRB pseudogenes. The selection of a locus automatically determines the types of sequence available to align.

Select the feature to align - this option provides a list of alignments available for the locus selected. The types of alignment include CDS alignments, individual exons or combined regions. If an option does not appear in the list then it is either not possible or currently unavailable.

Enter any specific sequences required - allows the user to perform specific sequence alignments by entering either common nomenclature or a list of allele names. For example, to align DRB1*0101, DRB1*010201 and DRB1*010202, you could enter 01 or 010 in the box as the common nomenclature, or you could enter 0101, 010201, 010202, separating each allele name with a comma. Wildcards (*) may be used in the allele name.
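
As an illustration of how such an entry might be interpreted, the following Python sketch filters a list of allele names. It assumes, which the text above does not confirm, that each comma-separated term is matched as a prefix of the numerical part of the name and that * matches any run of characters. The function name select_alleles and the sample list are hypothetical and are not part of the alignment tool.

```python
from fnmatch import fnmatchcase


def select_alleles(allele_names, query):
    """Return the allele names whose numerical part matches any query term.

    Assumption: a term such as "01" is treated as a prefix, so it is
    matched as the pattern "01*"; an explicit "*" matches any characters.
    """
    terms = [term.strip() for term in query.split(",") if term.strip()]
    patterns = [term if term.endswith("*") else term + "*" for term in terms]
    selected = []
    for name in allele_names:
        # "DRB1*010201" -> numerical part "010201"
        digits = name.partition("*")[2]
        if any(fnmatchcase(digits, pattern) for pattern in patterns):
            selected.append(name)
    return selected


# Hypothetical list of available alleles
alleles = ["DRB1*0101", "DRB1*010201", "DRB1*010202", "DRB1*0301"]
print(select_alleles(alleles, "01"))
print(select_alleles(alleles, "0101, 010201, 010202"))
```

Under these assumptions both calls select the same three DRB1*01 alleles, which mirrors the two ways of entering the request described above.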

Enter the reference sequence - the alignment tool allows the user to select an alternative reference sequence. This is optional; if none is entered, the tool uses the default sequence. The alternatives are a user-specified sequence or a consensus sequence. To use an alternative reference sequence, enter its numerical code in full in the box provided. Please note that incorrect codes will cause errors in the alignment: 0101 is not a valid code for A*010101, so the full numerical code must be entered. A consensus sequence can be used by typing "consensus" into the reference box. The consensus is derived only from the alleles selected for the alignment, not from all alleles at the selected locus.
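
The text does not say how the consensus is calculated. Purely as an illustration of a consensus built only from the selected alleles, the Python sketch below assumes a simple majority rule applied to each column of already-aligned, equal-length sequences. The function name consensus and the example sequences are invented for this sketch.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most frequent character in each column of equal-length sequences.

    Assumptions: a simple majority rule is used, and unknown positions
    ("*") are ignored unless a column contains nothing else.
    """
    result = []
    for column in zip(*aligned_sequences):
        counts = Counter(base for base in column if base != "*")
        result.append(counts.most_common(1)[0][0] if counts else "*")
    return "".join(result)


# Invented sequences standing in for the alleles selected for alignment
selected = ["CGGGACC", "CGGGCCC", "CGGGCC*"]
print(consensus(selected))  # CGGGCCC
```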

Select how you wish to view any mismatches - this option selects whether to display the full sequence or to highlight the mismatches. The full sequence shows every base for all sequences; highlighting mismatches shows only the bases that differ between each sequence and the reference sequence used. Examples of both options are shown below.

Show mismatches between sequences:

A*010101 CGGGGGCCCT GGCCCTGACC
A*0102   ---------- -------C--

Show all bases:

A*010101 CGGGGGCCCT GGCCCTGACC
A*0102   CGGGGGCCCT GGCCCTGCCC
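
The relationship between the two displays above can be sketched in a few lines of Python. This is a minimal illustration that handles substitutions only; the function name show_mismatches is hypothetical and is not taken from the alignment tool.

```python
def show_mismatches(reference, sequence):
    """Replace every base that is identical to the reference with a hyphen."""
    out = []
    for ref_base, base in zip(reference, sequence):
        if base == " ":
            out.append(" ")      # keep the gap between blocks of 10
        elif base == ref_base:
            out.append("-")      # identical to the reference
        else:
            out.append(base)     # mismatch: show the base itself
    return "".join(out)


reference = "CGGGGGCCCT GGCCCTGACC"   # A*010101
allele = "CGGGGGCCCT GGCCCTGCCC"      # A*0102
print(show_mismatches(reference, allele))  # ---------- -------C--
```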

Select how the alignment will be numbered - depending on the type of sequence selected, different numbering styles are available. For nucleotide sequences, the alignments can be displayed in blocks of 10 nucleotides or in amino acid codons. Protein sequences are always displayed in blocks of 10 amino acids. For both formats it may be necessary to increase the width of your browser to view the sequence fully. Full details of how sequences are numbered are given below.
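
The two layouts for nucleotide sequences can be illustrated with a short Python sketch. The function name in_blocks is hypothetical, and the reading frame used in the codon example (starting at the first base of the fragment) is an assumption made only for illustration.

```python
def in_blocks(sequence, size=10):
    """Split a sequence into space-separated blocks of `size` characters."""
    return " ".join(sequence[i:i + size] for i in range(0, len(sequence), size))


fragment = "CGGGGGCCCTGGCCCTGACC"
print(in_blocks(fragment))      # CGGGGGCCCT GGCCCTGACC
print(in_blocks(fragment, 3))   # CGG GGG CCC TGG CCC TGA CC
```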

Do you want to omit alleles unsequenced for this region - because some alignments contain a large number of alleles, you can omit those alleles that are not sequenced over the region of interest. This reduces the time taken to perform the alignment and the space required to display the output. Where possible, select only the sequences needed; this reduces processing time and makes the alignments easier to view.

Select type of output - to aid printing of the alignments, you can select a text-only version of the output. This removes all interactive tags and is easier to cut and paste into applications like Microsoft Word.

Sequence Alignment Display Options

The alignment files produced use the following nomenclature and numbering conventions. These conventions are based on the recommendations published for Human Gene Mutations, which were prepared by a nomenclature working group looking at how to name and store sequences for human allelic variants. The recommendations can be found in Human Mutation 11:1-3, 1998.

  • Only alleles officially recognised by the various Nomenclature Committees are included in the sequence alignments.
  • As recommended for all human gene mutations, a standard reference sequence should be used for all alignments.
  • The reference sequence will always be associated with the same (original) accession number, unless this sequence is shown to be in error.
  • All alleles are aligned to the reference sequences.
  • Naming of the sequence is based upon the previously published naming conventions.

In the sequence alignments the following conventions are used.

  • The entry for each allele is displayed with respect to the reference sequence.
  • Where identity to the reference sequence is present the base will be displayed as a hyphen (-).
  • Non-identity to the reference sequence is shown by displaying the appropriate base at that position.
  • Where an insertion or deletion has occurred this will be represented by a period (.).
  • If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (*).
  • In protein alignments for null alleles, the 'Stop' codons will be represented by an X.
  • In protein alignments, sequence following the termination codon will not be marked and will appear blank.
  • These conventions are used for both nucleotide and protein alignments.
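
As a sketch of how these display conventions can be read back, the Python function below rebuilds the displayed sequence of an allele from its row in a mismatch-style alignment. The function name expand_row is hypothetical and the example row is invented for illustration.

```python
def expand_row(reference, row):
    """Rebuild an allele's displayed sequence from its mismatch-style row.

    A hyphen takes the reference character at that position. Every other
    character (a differing base, "." for an insertion or deletion, "*" for
    unknown sequence, or a blank) is kept exactly as displayed.
    """
    return "".join(ref if ch == "-" else ch for ref, ch in zip(reference, row))


reference = "CGGGGGCCCT GGCCCTGACC"
row = "---------- -------C.*"         # invented example row
print(expand_row(reference, row))     # CGGGGGCCCT GGCCCTGC.*
```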

Numbering of the Sequence Alignment

In order to provide standardised sequences for any locus, the following numbering system has been established that accurately represents the sequence at both the nucleotide and protein level. We have looked at the HUGO Gene Nomenclature Committee recommendations proposed for the numbering of genomic sequences, and use a similar model for the sequences held in the IPD - MHC Database. Many of their proposals already match our current strategy. HUGO recommends that for all nomenclature systems a standard reference sequence should be used for each locus. The remaining recommendations for nucleotide sequences are as follows:

Nucleotide Sequence Numbering

  • The numbering of the nucleotides in the reference sequence should remain constant.
  • For both gDNA and cDNA the A of the ATG initiator Methionine codon has been denoted nucleotide +1. (Currently used for cDNA sequence)
  • The nucleotide immediately preceding the A of the ATG initiator Methionine codon has been denoted nucleotide -1. Note that there is no nucleotide 0.
  • cDNA sequences are numbered consecutively from the A of the ATG initiator Methionine codon.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
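
The rule that numbering runs from -1 directly to +1 can be sketched in Python as follows. The function name nucleotide_number and the short example sequence are invented for illustration.

```python
def nucleotide_number(index, atg_index):
    """Number of the base at 0-based `index`.

    `atg_index` is the 0-based index of the A of the ATG initiator codon.
    That A is +1, the base before it is -1, and there is no nucleotide 0.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


sequence = "GCCATGGCC"                # invented example
atg = sequence.index("ATG")           # 3
print([nucleotide_number(i, atg) for i in range(len(sequence))])
# [-3, -2, -1, 1, 2, 3, 4, 5, 6]
```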

The following recommendations are used for describing mutations in nucleotide sequences:

  • Nucleotide substitutions are designated using the nucleotide number, followed by the substitution. For example, 997G>T denotes a substitution of G to T at position 997 of the DNA sequence.
  • Deletions are designated by 'del' after the nucleotide number. For example, 997delT denotes the deletion of a T at position 997 of the DNA. A deletion of several consecutive bases should be described in the form 997-998delTG, which denotes a deletion of TG at positions 997 and 998 of the DNA.
  • Insertions are designated by 'ins' after the nucleotide numbers bordering the insertion. For example, 997-998insT represents an insertion of T between bases 997 and 998 of the DNA. In the alignments produced, this will be represented by a period (.), but the numbering of the reference sequence will not be altered to include the inserted base. Insertions of multiple bases are designated using the same form: 997-998insTG denotes an insertion of TG between positions 997 and 998 of the DNA.
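
The three notations above are regular enough to be read with a single pattern. The Python sketch below is an illustration only: the function name parse_nucleotide_mutation and the returned dictionary layout are hypothetical, and negative (upstream) positions are deliberately not handled.

```python
import re

# position or position range, followed by "X>Y", "del<bases>" or "ins<bases>"
NUCLEOTIDE_MUTATION = re.compile(
    r"^(?P<start>\d+)(?:-(?P<end>\d+))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])|(?P<kind>del|ins)(?P<bases>[ACGT]+))$"
)


def parse_nucleotide_mutation(text):
    """Parse descriptions such as 997G>T, 997delT, 997-998delTG, 997-998insT."""
    m = NUCLEOTIDE_MUTATION.match(text)
    if m is None:
        raise ValueError(f"unrecognised mutation description: {text!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if m["kind"] is None:
        if m["end"]:
            raise ValueError("a substitution is given at a single position")
        return {"type": "substitution", "position": start,
                "ref": m["ref"], "alt": m["alt"]}
    return {"type": m["kind"], "start": start, "end": end, "bases": m["bases"]}


for example in ("997G>T", "997delT", "997-998delTG", "997-998insT"):
    print(example, parse_nucleotide_mutation(example))
```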

Protein Sequence Numbering

  • For amino acid-based systems, the start codon of the mature protein is labeled codon 1.
  • The codon 5' to this is numbered -1.
  • All numbering is based on the reference sequence.
  • The single letter amino acid code is used in all protein alignments.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
  • To avoid confusion with the nucleotide numbering p. may be added to the nomenclature to denote a protein sequence.
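
To show how codon numbering relates to nucleotide numbering, the Python sketch below converts a coding-sequence nucleotide number (A of the ATG = 1) into a codon number. The function name codon_number is hypothetical, and the number of codons before the mature protein (leader_codons) is an assumed input that must be supplied for the locus in question; the value 24 in the example is only a placeholder.

```python
def codon_number(nucleotide, leader_codons):
    """Codon number for a coding-sequence nucleotide (A of the ATG = 1).

    `leader_codons` is the number of codons that precede the first codon
    of the mature protein. The first mature codon is 1 and the codon 5'
    to it is -1, so no codon is numbered 0.
    """
    if nucleotide < 1:
        raise ValueError("expected a coding-sequence position of 1 or more")
    offset = (nucleotide - 1) // 3 - leader_codons
    return offset + 1 if offset >= 0 else offset


# Placeholder value: 24 codons before the mature protein
print(codon_number(1, 24))    # -24
print(codon_number(72, 24))   # -1
print(codon_number(73, 24))   # 1
```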

Mutations in protein sequences follow a similar format:

  • For amino acid nomenclature the reference amino acid is listed first, followed by the codon and then the mutation. For example, Y97S represents the substitution of the Tyrosine at codon 97 by a Serine.
  • Stop codons are always designated by X. For example, T97X represents the substitution of the Threonine at codon 97 by a stop codon.
  • Deletions are again designated using 'del'. For example, T97del is the deletion of a Threonine at codon 97.
  • Insertions again follow the 'ins' convention. For example, T97-98ins represents a Threonine inserted between codons 97 and 98.
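
The protein notations listed above can be read in the same way. The Python sketch below is an illustration only: the function name parse_protein_mutation and the returned dictionary layout are hypothetical. It accepts the optional p. prefix and, following the example above, treats the letter in an 'ins' description as the inserted residue.

```python
import re

# optional "p.", one-letter amino acid, codon number, then the change
PROTEIN_MUTATION = re.compile(
    r"^(?:p\.)?(?P<residue>[A-Z])(?P<codon>\d+)"
    r"(?:(?P<alt>[A-Z])|(?P<deletion>del)|-(?P<next>\d+)(?P<insertion>ins))$"
)


def parse_protein_mutation(text):
    """Parse descriptions such as Y97S, T97X, T97del and T97-98ins."""
    m = PROTEIN_MUTATION.match(text)
    if m is None:
        raise ValueError(f"unrecognised mutation description: {text!r}")
    codon = int(m["codon"])
    if m["alt"]:
        # X always designates a stop codon
        kind = "stop" if m["alt"] == "X" else "substitution"
        return {"type": kind, "codon": codon, "ref": m["residue"], "alt": m["alt"]}
    if m["deletion"]:
        return {"type": "deletion", "codon": codon, "ref": m["residue"]}
    return {"type": "insertion", "between": (codon, int(m["next"])),
            "inserted": m["residue"]}


for example in ("Y97S", "T97X", "T97del", "p.T97-98ins"):
    print(example, parse_protein_mutation(example))
```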

Further Information

For information regarding IPD please contact IPD Support

 

