IMGT/HLA Database
Guidelines
On Sequence Alignments and Nomenclature
The following guidelines detail how the IMGT/HLA
Sequence Database produces the official HLA sequence alignments.
The sequences included in these alignments can be found in previous
reports from the WHO Nomenclature Committee for Factors of the HLA
System (1), at the Anthony Nolan Research
Institute website (2), and in the IMGT/HLA Sequence
Database (3,4,5).
The alignment files use the following
nomenclature and numbering conventions. These conventions are based
on the recommendations published for human gene mutations, which
were prepared by a nomenclature working group considering how to
name and store sequences for human allelic variants. The recommendations
can be found in Human Mutation 11:1-3, 1998 (6).
- Only alleles officially recognised by the WHO
Nomenclature Committee for Factors of the HLA System are included
in the sequence alignments.
- As recommended for all human gene mutations, a
standard reference sequence is used for all alignments.
A complete list of reference sequences for each locus can be
seen below.
- The reference sequence will always be associated
with the same (original) accession number, unless this sequence
is shown to be in error.
- All alleles are aligned to the reference sequences.
- Naming of the sequence is based upon the previously
published naming conventions (1).
Official Reference Sequences
Constructing the Virtual Sequence
The procedure for inclusion of an allele into
the sequence alignments is described below.
- The sequence of the allele is derived from all
sequence entries submitted to the IMGT/HLA Sequence Database.
These entries come from generalist databanks such as EMBL/GenBank/DDBJ.
- A "virtual sequence" is constructed
for each allele. This is produced using all the individual sequence
entries in the IMGT/HLA Sequence Database. The sequence entries
are all expertly annotated and checked before being aligned using
ClustalW (7). The sequence produced from this
alignment is termed the "virtual sequence".
Alignment of component sequences to
form "virtual sequence".
- The virtual sequence is then aligned against the
reference sequence for that locus.
- Padding characters, shown as periods (.), are inserted into the virtual
sequence to keep it aligned to the reference sequence.
- If the new allele has an insertion that causes
the reference sequence to be amended, then all the other sequences
are realigned against the reference sequence. This is avoided
whenever possible so that the reference sequence remains standardised.
The finalised sequence alignments are provided
at a number of web sites. These alignments use a number of conventions
for displaying identity and evolutionary events, as well as for the numbering
of the alignments. These conventions are explained below.
Numbering of the Sequence Alignment
In order to provide standardised
sequences for any locus, the following numbering system has been
established that accurately represents the sequence at both the
nucleotide and protein level. We have looked at the HUGO Gene Nomenclature
Committee (1) recommendations proposed for the numbering of genomic
sequences, and use a similar model for the HLA sequences held in
the IMGT/HLA Sequence Database. Many of their proposals already
match our current strategy. HUGO recommends that for all
nomenclature systems a standard reference sequence should be used
for each locus. In the case of HLA sequences a standard reference
sequence is already established for each gene. The remaining recommendations
for nucleotide sequences are as follows:
Nucleotide Sequence Numbering
- The numbering of the nucleotides in the reference
sequence should remain constant.
- For both gDNA and cDNA the A of the ATG initiator
Methionine codon has been denoted nucleotide +1. In some non-expressed genes this codon is not present and in these cases the first base of the reference sequence has been denoted as nucleotide +1.
- The nucleotide immediately preceding the A of the
ATG initiator Methionine codon has been denoted nucleotide -1.
Note that there is no nucleotide 0.
- cDNA sequences are numbered consecutively from
the A of the ATG initiator Methionine codon.
- Nucleotide sequences may be displayed in codons;
in this case the numbering follows that for protein sequences.
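The numbering rules above can be sketched in a few lines of Python. This is only an illustration, not code used by the database: the function names and the idea of working from a 0-based offset relative to the A of the ATG are assumptions made for the example.

```python
def nucleotide_number(offset_from_atg: int) -> int:
    """Convert a 0-based offset from the A of the ATG into a nucleotide number.

    Offset 0 is the A of the ATG (nucleotide +1) and offset -1 is the base
    immediately before it (nucleotide -1). No offset maps to 0.
    """
    return offset_from_atg + 1 if offset_from_atg >= 0 else offset_from_atg


def offset_from_number(number: int) -> int:
    """Inverse of nucleotide_number; rejects the non-existent nucleotide 0."""
    if number == 0:
        raise ValueError("there is no nucleotide 0")
    return number - 1 if number > 0 else number


def label_positions(sequence: str, atg_index: int) -> list:
    """Pair every base with its nucleotide number.

    atg_index (hypothetical parameter) is the 0-based index of the A of the
    ATG within the sequence string.
    """
    return [
        (nucleotide_number(i - atg_index), base)
        for i, base in enumerate(sequence)
    ]
```

For a toy sequence "CCATGG" with the ATG starting at index 2, label_positions numbers the bases -2, -1, 1, 2, 3, 4, skipping 0 as required.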
The following recommendations are used for
describing mutations in nucleotide sequences:
- Nucleotide substitutions are designated using the
nucleotide number, followed by the substitution. For example,
997G>T denotes a substitution of G to T at position 997 of
the DNA sequence.
- Deletions are designated by 'del' after the nucleotide
number. For example, 997delT denotes the deletion of a T at position
997 of the DNA. A deletion of several consecutive bases
is described in the form 997-998delTG, which denotes
a deletion of TG at positions 997 and 998 of the DNA.
- Insertions are designated by 'ins' after the nucleotide
numbers bordering the insertion. For example, 997-998insT represents
an insertion of T between bases 997 and 998 of the DNA. In the
alignments produced this will be represented by a period (.),
and the numbering of the reference sequence will not be altered
to include this base. Insertions of multiple bases are designated
using the same form: 997-998insTG denotes an insertion of TG between
positions 997 and 998 of the DNA.
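As an illustration of the three nucleotide formats above, the following Python sketch parses them with regular expressions. It is not part of any IMGT/HLA tool; the names NucleotideChange and parse_nucleotide_change are hypothetical, and for brevity it assumes positive nucleotide numbers and unambiguous bases (A, C, G, T) only.

```python
import re
from typing import NamedTuple


class NucleotideChange(NamedTuple):
    kind: str   # "substitution", "deletion" or "insertion"
    start: int  # first nucleotide number involved
    end: int    # last nucleotide number involved
    ref: str    # reference base(s) replaced or deleted, "" for insertions
    alt: str    # new base(s), "" for deletions


_SUB = re.compile(r"^(\d+)([ACGT])>([ACGT])$")         # e.g. 997G>T
_DEL = re.compile(r"^(\d+)(?:-(\d+))?del([ACGT]+)$")   # e.g. 997delT, 997-998delTG
_INS = re.compile(r"^(\d+)-(\d+)ins([ACGT]+)$")        # e.g. 997-998insT


def parse_nucleotide_change(text: str) -> NucleotideChange:
    m = _SUB.match(text)
    if m:
        pos = int(m.group(1))
        return NucleotideChange("substitution", pos, pos, m.group(2), m.group(3))
    m = _DEL.match(text)
    if m:
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        bases = m.group(3)
        # The number of deleted bases must match the stated range.
        if end - start + 1 != len(bases):
            raise ValueError(f"range and bases disagree in {text!r}")
        return NucleotideChange("deletion", start, end, bases, "")
    m = _INS.match(text)
    if m:
        start, end = int(m.group(1)), int(m.group(2))
        # An insertion is described by the two bases that border it.
        if end != start + 1:
            raise ValueError(f"insertion must lie between adjacent bases: {text!r}")
        return NucleotideChange("insertion", start, end, "", m.group(3))
    raise ValueError(f"unrecognised nucleotide change: {text!r}")
```

Parsing "997-998delTG" with this sketch gives a deletion spanning 997 to 998 with reference bases "TG", while "997-999delTG" is rejected because the range and the bases disagree.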
Protein Sequence Numbering
- For amino acid-based systems, the first codon
of the mature protein is labelled codon 1.
- The codon 5' to this is numbered -1.
- All numbering is based on the reference sequence.
- The single letter amino acid code is used in all
protein alignments.
- Nucleotide sequences may be displayed in codons;
in this case the numbering follows that for protein sequences.
- To avoid confusion with the nucleotide numbering,
'p.' may be added to the nomenclature to denote a protein sequence.
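The relationship between cDNA nucleotide numbers and codon numbers can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the leader_codons parameter (the number of codons that precede the first codon of the mature protein) are hypothetical, and the value 24 used in the usage note is chosen only for the example.

```python
def codon_number(nucleotide: int, leader_codons: int) -> int:
    """Return the codon number containing a given cDNA nucleotide number.

    nucleotide is numbered from the A of the ATG (+1). leader_codons is the
    number of codons before the first codon of the mature protein. The first
    mature codon is codon 1, the codon 5' to it is -1, and there is no codon 0.
    """
    if nucleotide < 1:
        raise ValueError("cDNA nucleotide numbers start at 1")
    codon_index = (nucleotide - 1) // 3    # 0-based codon index from the ATG
    shifted = codon_index - leader_codons  # 0 for the first mature codon
    return shifted + 1 if shifted >= 0 else shifted
```

With an assumed leader of 24 codons, nucleotides 1 to 3 fall in codon -24, nucleotide 72 falls in codon -1, and nucleotide 73 is the first base of codon 1.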
Mutations in protein sequences follow a similar
format:
- For amino acid nomenclature the reference amino
acid is listed first, followed by the codon and then the mutation.
For example, Y97S represents the substitution of the Tyrosine at
codon 97 by a Serine.
- Stop codons are always designated by X. For example,
T97X represents a Threonine replaced by a stop codon.
- Deletions are again designated using 'del'. For
example, T97del is the deletion of a Threonine at codon 97.
- Insertions again follow the 'ins' convention. For
example, T97-98ins represents a Threonine inserted between codons
97 and 98.
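The protein formats above can be read mechanically in the same way. The sketch below turns each form into a short description; the function name is hypothetical, the optional 'p.' prefix is accepted, and only positive codon numbers are handled, which is a simplification for the example.

```python
import re

_SUB = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")     # e.g. Y97S, T97X
_DEL = re.compile(r"^(?:p\.)?([A-Z])(\d+)del$")         # e.g. T97del
_INS = re.compile(r"^(?:p\.)?([A-Z])(\d+)-(\d+)ins$")   # e.g. T97-98ins


def describe_protein_change(text: str) -> str:
    """Describe a protein change written in the single letter amino acid code."""
    m = _SUB.match(text)
    if m:
        ref, codon, new = m.groups()
        # X always designates a stop codon.
        target = "a stop codon" if new == "X" else new
        return f"{ref} at codon {codon} replaced by {target}"
    m = _DEL.match(text)
    if m:
        ref, codon = m.groups()
        return f"{ref} at codon {codon} deleted"
    m = _INS.match(text)
    if m:
        residue, left, right = m.groups()
        return f"{residue} inserted between codons {left} and {right}"
    raise ValueError(f"unrecognised protein change: {text!r}")
```

For instance, describe_protein_change("T97X") returns "T at codon 97 replaced by a stop codon".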
Some tools provide sequence alignments where identity
and mismatches are highlighted. In these tools, the following conventions
are used.
- The entry for each allele is displayed with respect
to the reference sequence.
- Where identity to the reference sequence is present
the base will be displayed as a hyphen (-).
- Non-identity to the reference sequence is shown
by displaying the appropriate base at that position.
- Where an insertion or deletion has occurred this
will be represented by a period (.).
- If the sequence is unknown at any point in the
alignment, this will be represented by an asterisk (*).
- In protein alignments for null alleles, the 'Stop'
codons will be represented by an X.
- In protein alignments, sequence following the
termination codon will not be marked and will appear blank.
- These conventions are used for both nucleotide
and protein alignments.
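The identity and mismatch conventions above can be illustrated with a short Python sketch. It assumes the reference and allele strings are already aligned to the same length, with periods and asterisks already present in the allele where appropriate; the function name is hypothetical, and the handling of stop codons and blank trailing sequence in protein alignments is not covered.

```python
def render_against_reference(reference: str, allele: str) -> str:
    """Render an aligned allele relative to its reference sequence.

    Identity is shown as a hyphen, a mismatch as the allele's own character,
    and periods (indels) and asterisks (unknown sequence) are kept as they are.
    """
    if len(reference) != len(allele):
        raise ValueError("sequences must be aligned to the same length")
    rendered = []
    for ref_char, char in zip(reference, allele):
        if char in ".*":
            rendered.append(char)   # indel or unknown sequence
        elif char == ref_char:
            rendered.append("-")    # identical to the reference
        else:
            rendered.append(char)   # differs from the reference
    return "".join(rendered)
```

For the toy reference "ATGGCC.GTC" and allele "ATGACC.G**", the rendered line is "---A--.-**".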
References
(1) Bodmer JG, Marsh SGE, Albert ED, Bodmer WF, Bontrop
RE, Dupont B, Erlich HA, Hansen JA, Mach B, Mayr WR, Parham P,
Petersdorf EW, Sasazuki T, Schreuder GMTh, Strominger JL, Svejgaard
A, Terasaki PI. Nomenclature for factors of the HLA System, 1998.
Tissue Antigens (1999) 53:4, 407-46
(2) HLA Informatics Group, Anthony Nolan Research
Institute. (http://www.anthonynolan.org/HIG/)
(3) Robinson J, Bodmer JG, Malik A, Marsh SGE. Development
of the International Immunogenetics HLA Database.
Human Immunology (1998) 59 Supp. 1 17
(4) Robinson J, Marsh SGE, Bodmer JG. The IMGT/HLA
Sequence Database. European Journal of Immunogenetics (1999) 26 75
(5) Robinson J, Bodmer JG, Marsh SGE. The IMGT/HLA
Sequence Database. Human Immunology (1999) 60 S1
(6) Antonarakis SE and the Nomenclature Working Group.
Recommendations for a Nomenclature System for Human Gene Mutations.
Human Mutation (1998) 11 1-3
(7) Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Research (1994) 22 4673-4680
Information
For more information about the database, IMGT/HLA
queries (including website) or to subscribe to the IMGT/HLA mailing
list please contact IMGT/HLA Support.