IMGT/HLA Database

User Manual

This document describes the format and conventions used in the IMGT/HLA flat files. These files are included in the FTP directory and also in the SRS browser on the EBI website. The formatting of the flat files and of the documentation below is based on guidelines provided by the EMBL Nucleotide User Manual.

The IMGT/HLA Database is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed or reported in the literature. In some cases, entries have been assembled from several papers reporting overlapping sequence regions. Conversely, a single paper often provides data for several entries.

Structure of an Entry

The entries in the database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English, and the symbols and formatting employed for the base sequences themselves have been chosen for readability. Wherever possible, symbols familiar to molecular biologists have been used. At the same time, the structure is systematic enough to allow computer programs to read, identify, and manipulate the various types of data easily.

Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. In general, fixed format items have been kept to a minimum, and a more syntax-oriented structure adopted for the lines. The two exceptions to this are the sequence data lines and the feature table lines, for which a fixed format was felt to offer significant advantages to the user. Users who write programs to process the database should not assume anything about the column placement of items on lines other than these two: all other line types are free-format.

Note that each line begins with a two-character line code, which indicates the type of information contained in the line.
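Because every line starts with a two-character line code, a program can separate the code from the rest of the line before deciding how to interpret it. The following is a minimal sketch of that idea, not an official parser. The function names `split_line` and `group_by_code` are illustrative, and the spacing after the line code in the sample lines is an assumption.

```python
from collections import defaultdict


def split_line(line):
    """Split one flat-file line into its two-character code and its content."""
    line = line.rstrip("\n")
    code = line[:2]             # e.g. "ID", "FT", "//", or two blanks for sequence data
    content = line[2:].strip()  # free-format remainder; no column positions assumed
    return code, content


def group_by_code(lines):
    """Collect the content of each line under its line code, in file order."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # ignore completely empty lines
        code, content = split_line(line)
        grouped[code].append(content)
    return dict(grouped)


# Spacing after the line code is an assumption made for this illustration.
sample = [
    "ID   HLA00001   standard; DNA; HUM; 3554 BP.",
    "XX",
    "AC   HLA00001;",
    "//",
]

if __name__ == "__main__":
    for code, contents in group_by_code(sample).items():
        print(repr(code), contents)
```

Stripping the content is only safe for the free-format line types. The sequence data and feature table lines are fixed-format, so a program that needs their columns should work from the unstripped line.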
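The currently used line types, along with their respective line codes, are listed below:

ID - identification (begins each entry)
AC - accession number
DT - date
DE - description
KW - keyword
OS - organism species
OC - organism classification
RN - reference number
RP - reference position
RX - reference cross-reference
RA - reference author(s)
RT - reference title
RL - reference location
DR - database cross-reference
FH - feature table header
FT - feature table data
SQ - sequence header
(two blanks) - sequence data
XX - spacer line
CC - comments
// - termination line (ends each entry)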

A sample entry is shown below:

ID HLA00001 standard; DNA; HUM; 3554 BP.

The ID Line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID entryname dataclass; molecule; division; sequence length BP

Entryname: The entry name is a unique identifier generated by the IMGT/HLA Database. It identifies each allele and serves as its accession number. EMBL accession numbers are not used: although the sequence is derived from an EMBL entry, the data are not the same, so new identifiers have been provided for the IMGT/HLA Database. The identifier takes the form 'HLA' followed by a five-digit code, as in HLA00001.

Dataclass: The class of each entry is indicated on the first (ID) line of the entry. Entries distributed and made publicly available are of dataclass 'standard'.

Molecule Type: All entries are of the molecular type "DNA".

Database division: This indicates to which division the entry belongs. All entries are of the division "HUM".

Sequence length: The last item on the ID line is the length of the sequence (the total number of bases in the sequence). This number includes base positions reported as present but undetermined (coded as "N").

An example of a complete identification line is shown below:

ID HLA00001 standard; DNA; HUM; 3554 BP.

The AC Line

The AC (ACcession number) line lists the accession number associated with the entry. An example of an accession number line is shown below:

AC HLA00001;

Each accession number is terminated by a semicolon. Accession numbers are the primary means of identifying sequences, providing a stable way of identifying entries from release to release. Accession numbers allow unambiguous citation of database entries. An accession number is dropped from the database only when the data to which it was assigned have been completely removed from the database.

The DT Line

The DT lines list when the allele was first assigned an official name. This corresponds to a date in the previous HLA DB.
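As an illustration of the ID line format described earlier in this section, the sketch below reads the five fields with a whitespace-tolerant regular expression, since the ID line is free-format. It is a rough sketch under stated assumptions: the names `ID_PATTERN` and `parse_id_line` are hypothetical, and the pattern assumes the entry name is 'HLA' followed by digits and that the trailing period is optional.

```python
import re

# Assumed shape: ID entryname dataclass; molecule; division; length BP.
ID_PATTERN = re.compile(
    r"^ID\s+(?P<entryname>HLA\d+)\s+(?P<dataclass>\w+);\s*"
    r"(?P<molecule>\w+);\s*(?P<division>\w+);\s*"
    r"(?P<length>\d+)\s+BP\.?\s*$"
)


def parse_id_line(line):
    """Return the ID line fields as a dict, with the length as an int."""
    match = ID_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognisable ID line: {line!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])  # total bases, including any "N"
    return fields


if __name__ == "__main__":
    # Spacing between items is an assumption; the pattern accepts any whitespace.
    print(parse_id_line("ID   HLA00001   standard; DNA; HUM; 3554 BP."))
```

Matching on whitespace and semicolons rather than on column positions follows the manual's advice not to assume column placement outside the sequence and feature table lines.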
The DT lines also record the latest updates to an allele. There are two kinds of update: sequence and annotation. The DT line records when each of these was last updated and is displayed as follows:

DT DD-MON-YYYY (Rel. #, Created, Version #)

The DE Line

The DE (Description) lines contain general descriptive information about the sequence stored. This may include the designations of genes for which the sequence codes, the region of the genome from which it is derived, or other information which helps to identify the sequence. This is derived from the EMBL description line, but a standard format is used for all entries. The format for a DE line is:

DE description

The KW Line

The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence. The IMGT/HLA keywords include:
The format for a KW line is:

KW keyword[; keyword ...].

The OS and OC Lines

The OS and OC lines are set for all entries. The database currently contains only human sequences, so these lines are preset. The format of the OS and OC lines is:

OS Homo sapiens (human)
OC Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates;
OC Catarrhini; Hominidae; Homo.
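
The KW and OC lines share the same shape: items separated by semicolons, possibly continued over several lines, with a period after the last item. A minimal sketch of reading such lists follows. The helper name `parse_list_lines` is illustrative, and the spacing after the line code in the sample is an assumption.

```python
def parse_list_lines(lines, code):
    """Collect the semicolon-separated items from all lines with the given code."""
    # Join continuation lines first so an item list can span several lines.
    text = " ".join(line[2:].strip() for line in lines if line.startswith(code))
    text = text.rstrip(".")  # the last item is terminated by a period
    return [item.strip() for item in text.split(";") if item.strip()]


# The OC lines from the manual; spacing after the code is assumed.
oc_lines = [
    "OC   Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates;",
    "OC   Catarrhini; Hominidae; Homo.",
]

if __name__ == "__main__":
    print(parse_list_lines(oc_lines, "OC"))
```

The same function can be applied to KW lines by passing `"KW"` as the code.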

The Reference Lines (RN, RP, RX, RA, RT, RL)

The references cited for an entry should be considered a pointer to the literature and not as assigning scientific credit for the elucidation of the sequence. These lines comprise the literature citations within the database. The citations provide access to the papers from which the data have been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RP, RX, RA, RT, RL. Example of a reference:

RN [1]
RP 1-1098
RX PUBMED; 3375250.
RA Parham P, Lomen CE, Lawlor DA, Ways JP, Holmes N, Coppin HL, Salter RD,
RA Wan AM, Ennis PD;
RT "Nature of polymorphism in HLA-A, -B, and -C molecules";
RL PNAS USA 85:4005-4009(1988).

The RN (Reference Number) line gives a unique number to each reference citation within an entry. The reference number is always enclosed in square brackets.

The RP (Reference Position) line indicates which part(s) of the sequence are covered by the reference. Note that the numbering scheme is for the sequence as presented in the database entry (i.e. from 5' to 3' starting at 1), not the scheme used by the authors in the reference should it differ.

The RX (reference cross-reference) line type is an optional line type which contains a cross-reference to an external citation or abstract database. For example, if a journal citation exists in the PUBMED database, there will be an RX line pointing at the relevant PUBMED identifier.

The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank followed by initial(s) with periods. The author names are separated by commas and terminated by a semicolon. As many RA lines as necessary are included for each reference.

The RT (Reference Title) lines give the title of the paper (or other work) as exactly as is possible given the limitations of computer character sets.
Note that the form used is that which would be used in a citation rather than that displayed at the top of the published paper. The title is enclosed in double quotes, and may be continued over several lines as necessary. The title lines are terminated by a semicolon.

The RL (Reference Location) line contains the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question. They include the journal, volume number, page range and year for each paper. Journal names are abbreviated according to existing ISO standards (International Standard Serial Number). The format for the location lines is:

RL journal vol:pp-pp(year).

The DR Line

The DR (Database Cross-reference) line cross-references other databases which contain information related to the entry in which the DR line appears. For example, if the protein translation of a sequence exists in the UniProtKB/Swiss-Prot database, there will be a DR line pointing to the relevant UniProtKB/Swiss-Prot entry. The format of the DR line is as follows:

DR database_identifier; primary_identifier; secondary_identifier.

The first item on the DR line, the database identifier, is the abbreviated name of the data collection to which reference is made. The second item on the DR line, the primary identifier, is a pointer to the entry in the external database to which reference is being made. The third item on the DR line is the secondary identifier, if available, from the referenced database.

Feature Table Definitions

The feature table contains information about genes and gene products, as well as regions of biological significance reported in a sequence. It contains information on regions of the sequence that code for proteins and RNA molecules. The IMGT/HLA format is based on the EMBL flat file format. Currently the database provides only feature keys and qualifiers already used by the EMBL system.
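
Returning to the RL and DR line formats given earlier, both can be split into their named parts with a little string handling. The sketch below is illustrative only: `RL_PATTERN`, `parse_rl_line` and `parse_dr_line` are hypothetical names, the spacing after the line code is assumed, and the DR sample in the usage section uses made-up placeholder identifiers because the manual gives no concrete DR example.

```python
import re

# Assumed shape: RL journal vol:pp-pp(year).
RL_PATTERN = re.compile(
    r"^RL\s+(?P<journal>.+?)\s+(?P<volume>\w+):"
    r"(?P<first_page>\w+)-(?P<last_page>\w+)\((?P<year>\d{4})\)\.?\s*$"
)


def parse_rl_line(line):
    """Split an RL line into journal, volume, page range and year."""
    match = RL_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognisable RL line: {line!r}")
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


def parse_dr_line(line):
    """Split a DR line into (database, primary, secondary or None)."""
    body = line[2:].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a recognisable DR line: {line!r}")
    # The secondary identifier is only present "if available".
    secondary = parts[2] if len(parts) > 2 and parts[2] else None
    return parts[0], parts[1], secondary


if __name__ == "__main__":
    print(parse_rl_line("RL   PNAS USA 85:4005-4009(1988)."))
    # Placeholder identifiers, invented for this illustration.
    print(parse_dr_line("DR   SOMEDB; PRIMARY1; SECONDARY1."))
```

The journal name is matched lazily so that multi-word abbreviations such as "PNAS USA" stay together while the volume is taken from the token immediately before the colon.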

The FH Line

The first two lines of the feature table in the IMGT/HLA entries are feature header (FH) lines, specific to the EMBL flat file format. The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs. The format of these lines is always the same:

FH Key Location/Qualifiers
FH

The first line provides column headings for the feature table, and the second line serves as a spacer. If an entry contains no feature table (i.e. no FT lines - see below), the FH lines will not appear.

The FT Line

The FT (Feature Table) lines provide a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table. In general, the features in the feature table represent signals or other characteristics reported in the cited references. In some cases, ambiguities or features noted in the course of data preparation have been included. The feature table is subject to expansion or change as more becomes known about a given sequence.

Features appear on FT lines. The line type code FT appears in columns 1-2 and columns 3-5 are blank. The feature key begins in column 6 and may be no more than 15 characters in length. The location begins in column 26. Feature qualifiers begin on subsequent FT lines at column 26. Location, qualifier, and continuation lines may extend from column 26 to 80. Each qualifier is added on a new line.

The first item on an FT line is the feature key. It starts in column 6 and can continue to column 24. The features provided in the first release of IMGT/HLA flat files contain only a small amount of information. Further feature keys will be added as annotation of entries progresses. Currently the feature keys listed are the source, CDS and exon information.

The second item on the FT line designates the location of the feature in the sequence. The location begins at column 26. Several conventions are used to indicate sequence location. Base numbers in locations refer to the numbering in the entry, which may differ from the official alignment sequences. The first base in the presented sequence is numbered base 1. Sequences are presented in the 5' to 3' direction. Locations can be described by either a single base or a contiguous span of bases. A span is indicated by separating the start and end positions with two periods (e.g., 23..79).

Feature qualifiers provide additional information about the individual feature key.
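
The key and location columns described above can be read by slicing, since FT lines are fixed-format. The following is a minimal sketch that takes the column positions stated in this manual at face value (key from column 6, location from column 26, line width 80); real files should be checked against these positions before relying on them. The names `parse_ft_line` and `parse_location` are illustrative, and the sample line is constructed for the example rather than copied from a database entry.

```python
KEY_START = 5        # column 6, as a zero-based index
LOCATION_START = 25  # column 26, as a zero-based index
LINE_END = 80        # lines may extend to column 80


def parse_ft_line(line):
    """Return (feature_key, rest); the key is empty on qualifier or continuation lines."""
    if not line.startswith("FT"):
        raise ValueError(f"not an FT line: {line!r}")
    key = line[KEY_START:LOCATION_START].strip()
    rest = line[LOCATION_START:LINE_END].strip()
    return key, rest


def parse_location(text):
    """Turn '23..79' into (23, 79) and a single base '42' into (42, 42)."""
    if ".." in text:
        start, end = text.split("..", 1)
        return int(start), int(end)
    position = int(text)
    return position, position


# A constructed line: "FT", three blanks, key padded up to column 25, location at column 26.
sample = "FT   " + "exon".ljust(20) + "23..79"

if __name__ == "__main__":
    key, location = parse_ft_line(sample)
    print(key, parse_location(location))
```

Positions returned by `parse_location` are 1-based and inclusive, matching the numbering of the entry, so a span `(start, end)` covers `end - start + 1` bases.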
The qualifiers take the form of a slash (/) followed by a name and, if applicable, an equal sign (=) and the qualifier value. The qualifiers can convey many types of information. Text qualifier values are enclosed in double quotation marks. Citation or reference numbers for an entry are enclosed in square brackets ([]) to distinguish them from other numbers. A literal sequence of bases (e.g., "atgcatt") is enclosed in quotation marks. Literal sequences are distinguished from free text by context. Qualifiers that take free text as their values do not take literal sequences, and vice versa.

The '/label=' qualifier takes a feature label as its value. Although feature labels are optional, they allow unambiguous references to features. The feature label identifies a feature within an entry; when combined with the accession number and the name of the data bank from which it came, it is a unique tag for that feature.

The first release of the IMGT/HLA Database used only those qualifiers found in the EMBL feature qualifier definitions. These qualifiers will remain in use until additional qualifiers specific to the IMGT/HLA Database are implemented. The first release of the IMGT/HLA Database used the following feature qualifiers:

The SQ Line

The nucleotide sequence data are generally present in the database as they have been submitted or published, subject to some conventions which have been adopted for the database as a whole. The sequences are always listed in the direction 5' to 3', regardless of the published order. Bases are numbered sequentially beginning with 1 at the 5' end of the sequence. The nucleotide sequence provided is a CDS sequence, constructed from a number of different sequence entries.

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a summary of its content. An example is:

SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;

As shown, the line contains the length of the sequence in base pairs followed by its base composition. Bases other than A, C, G and T are grouped together as "other".

The sequence data line has a line code consisting of two blanks. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning at position 6 of the line. Columns 73-80 of each sequence line contain base numbers for easier reading and quick location of regions of interest. The numbers are right justified and indicate the number of the last base on each line. An example of a data line is:
aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60
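
The SQ header and the data line shown above can be checked against each other: the base counts on the SQ line should add up to the stated length, and the number at the end of a data line should match the running count of bases. A minimal sketch of both checks follows. The names `SQ_PATTERN`, `parse_sq_line` and `parse_sequence_line` are illustrative, and the leading blanks and spacing in the sample lines are assumptions about the unflattened layout.

```python
import re

# Assumed shape: SQ Sequence <n> BP; <n> A; <n> C; <n> G; <n> T; <n> other;
SQ_PATTERN = re.compile(
    r"^SQ\s+Sequence\s+(?P<length>\d+)\s+BP;\s*(?P<A>\d+)\s+A;\s*(?P<C>\d+)\s+C;"
    r"\s*(?P<G>\d+)\s+G;\s*(?P<T>\d+)\s+T;\s*(?P<other>\d+)\s+other;\s*$"
)


def parse_sq_line(line):
    """Return the sequence length and base composition as a dict of ints."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognisable SQ line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


def parse_sequence_line(line):
    """Return (bases, last_base_number) for one sequence data line."""
    tokens = line.split()
    last_base = None
    if tokens and tokens[-1].isdigit():
        last_base = int(tokens.pop())  # right-justified number of the last base
    return "".join(tokens), last_base


if __name__ == "__main__":
    header = parse_sq_line("SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;")
    composition = sum(header[base] for base in ("A", "C", "G", "T", "other"))
    print(header["length"], composition)

    data = ("     aaacaaacca aatatggatt ttattgtagc catatttgct "
            "ctgtttgtta ttagctcatt        60")
    bases, last_base = parse_sequence_line(data)
    print(len(bases), last_base)
```

Splitting on whitespace keeps the sketch independent of exact column positions; a stricter reader could instead slice the fixed columns that the manual describes.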

The XX Line

The XX (spacer) line contains no data or comments. Its purpose is to make an entry easier to read on a page or terminal screen by setting off the various types of information in appropriate groupings. XX is used instead of blank lines to avoid confusion with the sequence data lines. The XX lines can always be ignored by computer programs.

The CC Line

CC lines are free text comments about the entry, and may be used to convey any sort of information thought to be useful.

The // Line

The // (terminator) line also contains no data or comments. It designates the end of an entry.

Information

For more information about the database, for IMGT/HLA queries (including the website), or to subscribe to the IMGT/HLA mailing list, please contact IMGT/HLA Support. Please see our licence for our terms of use.