IPD - The Immuno Polymorphism Database provides specialist databases for the study of polymorphism in genes of the immune system.

IMGT/HLA Database

Guidelines On Sequence Alignments and Nomenclature

The following guidelines detail how the IMGT/HLA Sequence Database produces the official HLA sequence alignments. The sequences included in these alignments can be found in previous reports from the WHO Nomenclature Committee for Factors of the HLA System (1), at the Anthony Nolan Research Institute website (2), and in the IMGT/HLA Sequence Database (3,4,5).

The alignment files use the following nomenclature and numbering conventions. These conventions follow the recommendations published for human gene mutations, which were prepared by a nomenclature working group considering how to name and store sequences for human allelic variants. The recommendations can be found in Human Mutation 11:1-3, 1998 (6).

  • Only alleles officially recognised by the WHO HLA Nomenclature Committee for Factors of the HLA System are included in the sequence alignments.
  • As recommended for all human gene mutations, a standard reference sequence should be used for all alignments. A complete list of the reference sequences for each locus is given below.
  • The reference sequence will always be associated with the same (original) accession number, unless this sequence is shown to be in error.
  • All alleles are aligned to the reference sequences.
  • Naming of the sequence is based upon the previously published naming conventions (1).

Official Reference Sequences

Locus      Allele     Acc. No.
HLA-A      01010101   HLA00001
HLA-B      070201     HLA00132
HLA-C      010201     HLA00401
HLA-E      01010101   HLA00934
HLA-F      01010101   HLA01096
HLA-G      01010101   HLA00939
HLA-H      01010101   HLA02546
HLA-J      01010101   HLA02626
HLA-K      01010101   HLA02654
HLA-L      01010101   HLA02655
HLA-P      01010101   HLA02742
HLA-V      01010101   HLA02801
HLA-DMA    0101       HLA00485
HLA-DMB    0101       HLA00489
HLA-DOA    010101     HLA00494
HLA-DOB    01010101   HLA01098
HLA-DPA1   010301     HLA00499
HLA-DPB1   010101     HLA00514
HLA-DQA1   010101     HLA00601
HLA-DQB1   050101     HLA00638
HLA-DRA    0101       HLA00662
HLA-DRB1   010101     HLA00664
HLA-DRB2   0101       HLA01028
HLA-DRB3   010101     HLA00886
HLA-DRB4   01010101   HLA00905
HLA-DRB5   010101     HLA00915
HLA-DRB6   0101       HLA00929
HLA-DRB7   010101     HLA00932
HLA-DRB8   0101       HLA01029
HLA-DRB9   0101       HLA01030
MICA       001        HLA01013
MICB       001        HLA02033
TAP1       0101       HLA00953
TAP2       0101       HLA00959

Constructing the Virtual Sequence

The procedure for inclusion of an allele into the sequence alignments is described below.

  • The sequence of the allele is derived from all sequence entries submitted to the IMGT/HLA Sequence Database. These entries come from generalist databanks such as EMBL, GenBank and DDBJ.
  • A "virtual sequence" is constructed for each allele from all the individual sequence entries in the IMGT/HLA Sequence Database. The sequence entries are all expertly annotated and checked before being aligned using ClustalW (7). The sequence produced from this alignment is termed the "virtual sequence".

Figure: Alignment of component sequences to form the "virtual sequence".

  • The virtual sequence is then aligned against the reference sequence for that locus.
  • Padding characters, shown as periods (.), are inserted into the virtual sequence where needed to keep it aligned to the reference sequence.
  • If the new allele has an insertion that causes the reference sequence to be amended then all the other sequences are realigned against the reference sequence. This is avoided whenever possible and the reference sequence remains standardised.
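
The merging step can be illustrated with a minimal sketch. It is not the database's actual procedure, which relies on expert annotation and ClustalW. It assumes the component entries have already been aligned to a common length, and it uses an asterisk for positions a component does not cover; the function name `merge_components` is hypothetical.

```python
from typing import List


def merge_components(rows: List[str]) -> str:
    """Merge pre-aligned component sequences into a single sequence.

    Assumption for this sketch: every row has the same length and uses
    '*' for positions that the component entry does not cover.
    """
    if not rows:
        raise ValueError("at least one component sequence is required")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("component sequences must be aligned to equal length")

    merged = []
    for column in zip(*rows):
        # Collect the distinct characters reported by components covering this column.
        bases = {char for char in column if char != "*"}
        if len(bases) > 1:
            raise ValueError(
                f"conflicting bases {sorted(bases)} at column {len(merged) + 1}"
            )
        # Keep '*' when no component covers the column.
        merged.append(bases.pop() if bases else "*")
    return "".join(merged)


components = ["ATGGC*****", "***GCCGT**", "******GTAA"]
virtual_sequence = merge_components(components)
```

Raising an error on conflicting columns, rather than picking a majority base, reflects the fact that disagreements between entries need manual review in this simplified model.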

The finalised sequence alignments are provided at a number of web sites. These alignments follow a number of conventions for displaying identity and evolutionary events, as well as for numbering the alignments. These conventions are explained below.

Numbering of the Sequence Alignment

In order to provide standardised sequences for any locus, the following numbering system has been established; it accurately represents the sequence at both the nucleotide and protein level. We have reviewed the HUGO Gene Nomenclature Committee (1) recommendations for the numbering of genomic sequences and use a similar model for the HLA sequences held in the IMGT/HLA Sequence Database. Many of their proposals already match our current strategy. HUGO recommends that all nomenclature systems use a standard reference sequence for each locus. For HLA sequences a standard reference sequence is already established for each gene. The remaining recommendations for nucleotide sequences are as follows:

Nucleotide Sequence Numbering

  • The numbering of the nucleotides in the reference sequence should remain constant.
  • For both gDNA and cDNA the A of the ATG initiator Methionine codon has been denoted nucleotide +1. In some non-expressed genes this codon is not present and in these cases the first base of the reference sequence has been denoted as nucleotide +1.
  • The nucleotide immediately preceding the A of the ATG initiator Methionine codon has been denoted nucleotide -1. Note that there is no nucleotide 0.
  • cDNA sequences are numbered consecutively from the A of the ATG initiator Methionine codon.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
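
The absence of a nucleotide 0 is easy to get wrong when converting between string offsets and nucleotide numbers. The following is a minimal sketch of that conversion; the function names and the `atg_index` parameter (the 0-based offset of the A of the ATG in the reference sequence) are assumptions made for illustration.

```python
def nucleotide_number(index: int, atg_index: int) -> int:
    """Map a 0-based offset in the reference sequence to a nucleotide number.

    atg_index is the 0-based offset of the A of the ATG initiator codon.
    For a non-expressed gene without this codon, pass 0 so that the first
    base of the reference sequence becomes nucleotide +1.
    """
    offset = index - atg_index
    # Offsets at or after the A become +1, +2, ...; offsets before it
    # become -1, -2, ..., so the value 0 is never produced.
    return offset + 1 if offset >= 0 else offset


def index_from_number(number: int, atg_index: int) -> int:
    """Inverse of nucleotide_number; rejects the non-existent position 0."""
    if number == 0:
        raise ValueError("there is no nucleotide 0")
    return atg_index + (number - 1 if number > 0 else number)
```

For example, with the A of the ATG at offset 5, offset 5 is nucleotide +1 and offset 4 is nucleotide -1.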

The following recommendations are used for describing mutations in nucleotide sequences:

  • Nucleotide substitutions are designated using the nucleotide number, followed by the substitution. For example, 997G>T denotes a substitution of G to T at position 997 of the DNA sequence.
  • Deletions are designated by 'del' after the nucleotide number. For example, 997delT denotes the deletion of a T at position 997 of the DNA. A deletion of several consecutive bases is described in the form 997-998delTG, which denotes a deletion of TG at positions 997 and 998 of the DNA.
  • Insertions are designated by 'ins' after the nucleotide numbers bordering the insertion. For example, 997-998insT represents an insertion of T between bases 997 and 998 of the DNA. In the alignments produced this will be represented by a period (.), but the numbering of the reference sequence will not be altered to include this base. Insertions of multiple bases are designated using the same form: 997-998insTG denotes an insertion of TG between positions 997 and 998 of the DNA.
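
The three notations above are regular enough to be parsed mechanically. The sketch below is one possible parser, not an official tool; the `NucleotideChange` type and `parse_nucleotide_change` function are hypothetical names, and the sketch assumes positive positions and the bases A, C, G and T only.

```python
import re
from typing import NamedTuple


class NucleotideChange(NamedTuple):
    kind: str   # "substitution", "deletion" or "insertion"
    start: int  # first nucleotide number involved
    end: int    # last nucleotide number involved
    ref: str    # reference base(s) replaced or deleted, "" for insertions
    alt: str    # new base(s), "" for deletions


_SUB = re.compile(r"(\d+)([ACGT])>([ACGT])")
_DEL = re.compile(r"(\d+)(?:-(\d+))?del([ACGT]+)")
_INS = re.compile(r"(\d+)-(\d+)ins([ACGT]+)")


def parse_nucleotide_change(text: str) -> NucleotideChange:
    """Parse strings such as 997G>T, 997delT, 997-998delTG or 997-998insT."""
    match = _SUB.fullmatch(text)
    if match:
        pos = int(match.group(1))
        return NucleotideChange("substitution", pos, pos, match.group(2), match.group(3))

    match = _DEL.fullmatch(text)
    if match:
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        bases = match.group(3)
        # The number of deleted bases must match the stated range.
        if end - start + 1 != len(bases):
            raise ValueError(f"range and deleted bases disagree in {text!r}")
        return NucleotideChange("deletion", start, end, bases, "")

    match = _INS.fullmatch(text)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        # An insertion is described by the two bases that border it.
        if end != start + 1:
            raise ValueError(f"insertion must lie between adjacent bases in {text!r}")
        return NucleotideChange("insertion", start, end, "", match.group(3))

    raise ValueError(f"unrecognised nucleotide change: {text!r}")
```

Calling `parse_nucleotide_change("997-998delTG")` yields a deletion spanning positions 997 to 998 with `ref` set to `"TG"`.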

Protein Sequence Numbering

  • For amino acid-based systems, the start codon of the mature protein is labeled codon 1.
  • The codon 5' to this is numbered -1.
  • All numbering is based on the reference sequence.
  • The single letter amino acid code is used in all protein alignments.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
  • To avoid confusion with the nucleotide numbering p. may be added to the nomenclature to denote a protein sequence.
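
As an illustration of codon numbering, the sketch below splits a cDNA string that begins at the ATG into numbered codons. The `leader_codons` parameter, the number of codons that precede the first codon of the mature protein, is a hypothetical input for this example, and the sequence used is a toy string rather than a real allele.

```python
from typing import List, Tuple


def numbered_codons(cdna: str, leader_codons: int) -> List[Tuple[int, str]]:
    """Split a cDNA string starting at the ATG into (codon number, codon) pairs.

    The first codon of the mature protein is numbered 1 and the codon
    5' to it is numbered -1, so no codon is numbered 0.
    """
    usable = len(cdna) - len(cdna) % 3  # ignore a trailing partial codon
    result = []
    for start in range(0, usable, 3):
        offset = start // 3 - leader_codons
        number = offset + 1 if offset >= 0 else offset
        result.append((number, cdna[start:start + 3]))
    return result


# Toy example: two leader codons followed by two mature-protein codons.
codons = numbered_codons("ATGGCCGGCTCC", leader_codons=2)
```

Here the initiator ATG is codon -2 and GGC is codon 1.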

Mutations in protein sequences follow a similar format:

  • For amino acid nomenclature the reference amino acid is listed first, followed by the codon and then the mutation. For example, Y97S represents the substitution of the Tyrosine at codon 97 by a Serine.
  • Stop codons are always designated by X. For example, T97X represents the Threonine at codon 97 being replaced by a stop codon.
  • Deletions are again designated using 'del'. For example, T97del is the deletion of a Threonine at codon 97.
  • Insertions again follow the 'ins' convention. For example, T97-98ins represents a Threonine inserted between codons 97 and 98.
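
The protein notations can be parsed in the same spirit. This is a minimal sketch under stated assumptions: the names `ProteinChange` and `parse_protein_change` are hypothetical, only positive codon numbers are handled, and an optional leading "p." is accepted and ignored.

```python
import re
from typing import NamedTuple


class ProteinChange(NamedTuple):
    kind: str         # "substitution", "stop", "deletion" or "insertion"
    codon: int        # codon affected (for insertions, the codon before the insert)
    residue: str      # reference residue, or the inserted residue for insertions
    replacement: str  # new residue, "X" for a stop codon, "" otherwise


_DEL = re.compile(r"([A-Z])(\d+)del")
_INS = re.compile(r"([A-Z])(\d+)-(\d+)ins")
_SUB = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_protein_change(text: str) -> ProteinChange:
    """Parse strings such as Y97S, T97X, T97del or T97-98ins."""
    body = text[2:] if text.startswith("p.") else text

    match = _DEL.fullmatch(body)
    if match:
        return ProteinChange("deletion", int(match.group(2)), match.group(1), "")

    match = _INS.fullmatch(body)
    if match:
        first, second = int(match.group(2)), int(match.group(3))
        # The insertion is described by the two codons that border it.
        if second != first + 1:
            raise ValueError(f"insertion must lie between adjacent codons in {text!r}")
        return ProteinChange("insertion", first, match.group(1), "")

    match = _SUB.fullmatch(body)
    if match:
        replacement = match.group(3)
        # X always designates a stop codon.
        kind = "stop" if replacement == "X" else "substitution"
        return ProteinChange(kind, int(match.group(2)), match.group(1), replacement)

    raise ValueError(f"unrecognised protein change: {text!r}")
```

For instance, `parse_protein_change("T97X")` is reported with kind `"stop"` rather than as an ordinary substitution.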

Some tools provide sequence alignments where identity and mismatches are highlighted. In these tools, the following conventions are used.

  • The entry for each allele is displayed in respect to the reference sequences.
  • Where identity to the reference sequence is present the base will be displayed as a hyphen (-).
  • Non-identity to the reference sequence is shown by displaying the appropriate base at that position.
  • Where an insertion or deletion has occurred this will be represented by a period (.).
  • If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (*).
  • In protein alignments for null alleles, the 'Stop' codons will be represented by an X.
  • In protein alignments, sequence following the termination codon will not be marked and will appear blank.
  • These conventions are used for both nucleotide and protein alignments.
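
The display conventions for identity, mismatch, gaps and unknown sequence can be sketched as a small function that renders one allele row against the reference. This is an illustration rather than the code used by any of the alignment tools; `display_row` is a hypothetical name, and both inputs are assumed to be already aligned to the same length, with periods and asterisks in place. The blanking of protein sequence after a stop codon is not covered.

```python
def display_row(reference: str, allele: str) -> str:
    """Render an aligned allele relative to the aligned reference sequence."""
    if len(reference) != len(allele):
        raise ValueError("reference and allele must be aligned to equal length")

    shown = []
    for ref_char, char in zip(reference, allele):
        if char in ".*":
            # Insertion/deletion padding and unknown sequence are shown unchanged.
            shown.append(char)
        elif char == ref_char:
            # Identity to the reference is shown as a hyphen.
            shown.append("-")
        else:
            # Non-identity is shown as the allele's own base or residue.
            shown.append(char)
    return "".join(shown)


row = display_row("ATGGCC.GTA", "ATGACC.G**")
```

In this toy example the allele differs from the reference at the fourth position and is unsequenced at the last two positions.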

References

  1. Bodmer JG, Marsh SGE, Albert ED, Bodmer WF, Bontrop RE, Dupont B, Erlich HA, Hansen JA, Mach B, Mayr WR, Parham P, Petersdorf EW, Sasazuki T, Schreuder GMTh, Strominger JL, Svejgaard A, Terasaki PI
    Nomenclature for factors of the HLA System, 1998.
    Tissue Antigens (1999) 53:4, 407-46
  2. HLA Informatics Group, Anthony Nolan Research Institute. (http://www.anthonynolan.org/HIG/)
  3. Robinson J, Bodmer JG, Malik A, Marsh SGE
    Development of the International Immunogenetics HLA Database
    Human Immunology (1998) 59 Supp. 1 17
  4. Robinson J, Marsh SGE, Bodmer JG
    The IMGT/HLA Sequence Database
    European Journal of Immunogenetics (1999) 26 75
  5. Robinson J, Bodmer JG, Marsh SGE
    The IMGT/HLA Sequence Database
    Human Immunology (1999) 60 S1
  6. Antonarakis SE and the Nomenclature Working Group
    Recommendations for a Nomenclature System for Human Gene Mutations
    Human Mutation (1998) 11 1-3
  7. Thompson JD, Higgins DG, Gibson TJ
    CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
    Nucleic Acids Research (1994) 22 4673-4680

Information

For more information about the database, IMGT/HLA queries (including website) or to subscribe to the IMGT/HLA mailing list please contact IMGT/HLA Support.
