The IPD-KIR Database provides an FTP site for the retrieval of sequences. The sequences are provided in FASTA and PIR formats. Descriptions of each file type are available below.
The FTP directory is available at the following address:
Release Archive
Previous releases are archived as a git repository and available at https://github.com/anhig/IPDKIR. This repository contains a branch for each database release and a Latest branch which contains the most recent files as well as all compressed archives.
The following descriptions detail the sequence formats available at the FTP site. The FASTA and PIR files contain raw sequence only: all inserts (.) and spaces (*) have been removed from the sequence. All files have been generated using "ReadSeq", a freely available sequence format conversion program written by D. Gilbert.
Sequences in FASTA/Pearson format are represented by two main line types. The first line always begins with a "greater than" (>) sign and contains sequence information. In the files provided, the sequence information line includes the unique IPD accession number and the allele name. The remaining lines contain plain text representing the nucleotide or protein sequence. There can be any number of these sequence lines, of any length, to represent the sequence. Please note that FASTA files contain no alignment information, so the first base of each sequence may not correspond to the same position when the sequences are aligned.
Example KIR2DL1*001 in FASTA format:
>IPD:KIR00001 KIR2DL1*001, 1047 bases, 9EB285B5 checksum.
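The two line types described above can be read with a few lines of Python. The following is a minimal sketch, not the tooling used by IPD-KIR: the function name parse_fasta is hypothetical, and it assumes the header layout shown in the example (accession number first, then the allele name followed by a comma). The sequence lines in the sample are the first 60 bases of KIR2DL1*001, truncated for brevity.

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each accession number to (allele_name, sequence).

    Assumes headers look like:
    >IPD:KIR00001 KIR2DL1*001, 1047 bases, 9EB285B5 checksum.
    """
    records: Dict[str, Tuple[str, str]] = {}
    accession = None
    allele = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                records[accession] = (allele, "".join(chunks))
            fields = line[1:].split()
            accession = fields[0]
            allele = fields[1].rstrip(",") if len(fields) > 1 else ""
            chunks = []
        elif accession is not None:
            # Sequence lines may be of any length; drop any spacing.
            chunks.append(line.replace(" ", ""))
    if accession is not None:
        records[accession] = (allele, "".join(chunks))
    return records


# Truncated sample: only the first 60 bases of the allele are included.
example = [
    ">IPD:KIR00001 KIR2DL1*001, 1047 bases, 9EB285B5 checksum.",
    "ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA",
    "GGGGGCCTGG",
]
records = parse_fasta(example)
allele_name, sequence = records["IPD:KIR00001"]
print(allele_name, len(sequence))
```

Because the FASTA files carry no alignment information, the returned sequences should not be compared position by position without aligning them first.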
The MSF file format is the only format provided that includes the alignment information. These files have been provided for use in the GeneDoc program.
The file may begin with as many lines of comment or description as required; this can be seen in MSF files that have been saved in GeneDoc. The first mandatory line, and the first that is recognised as part of the MSF file, is the line containing "MSF:". This line also includes the sequence length, type and date, plus an internal checksum value. The next line is a mandatory blank line inserted before the sequence names. There then follows one line per sequence, giving the sequence name, length, checksum and a weight value. Only one name per line is allowed; the qualifier "Name: " is followed by the sequence name. Names are restricted to 10 characters or fewer. Extra characters between the sequence name and "Len: " are acceptable if they contain no blank characters. Another blank line is added, followed by a line starting with two slashes ("//"), which indicates the end of the name list. There then follows another blank line. Some MSF formats contain two lines at this point, with the second line containing the positions of the sequence elements.
The sequence lines follow. Each starts with the sequence name, followed by two spaces and 50 bases of the sequence. Each block of 10 elements has to be separated by a space. This is repeated for every sequence. Between each block of sequences is a blank line; again, a second line may be added to include positional numbering. The last block of sequences may contain fewer than 50 elements. It is important that all the sequences have the same length, including gaps. To conform to this file format, all inserts and spaces are marked by a period (.).
Example KIR2DL1 MSF File
temp.msf1 MSF: 1047 Type: N January 01, 1776 12:00 Check: 7515 ..
Name: KIR2DL1*001 Len: 1047 Check: 9282 Weight: 1.00
Name: KIR2DL1*002 Len: 1047 Check: 9451 Weight: 1.00
Name: KIR2DL1*00301 Len: 1047 Check: 9879 Weight: 1.00
Name: KIR2DL1*00302 Len: 1047 Check: 9759 Weight: 1.00
KIR2DL1*001 ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
KIR2DL1*002 ATGTCGCTCT TGTTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
KIR2DL1*00301 ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
KIR2DL1*00302 ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
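The MSF layout described above (an "MSF:" line, "Name:" lines, a "//" separator, then blocks of sequence lines) can be read as sketched below. This is an illustrative reader under stated assumptions, not the parser used by GeneDoc or ReadSeq: the names parse_msf and ungap are hypothetical, the sample includes the blank lines and "//" line as the description specifies them, and only the first 50-base block of two alleles is shown.

```python
from typing import Dict, Iterable, List


def parse_msf(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence name: aligned sequence} from the lines of an MSF file."""
    names: List[str] = []
    blocks: Dict[str, List[str]] = {}
    seen_msf_line = False
    in_header = True
    for raw in lines:
        line = raw.strip()
        if in_header:
            if "MSF:" in line:
                # Anything before this line is free-form comment.
                seen_msf_line = True
            elif seen_msf_line and line.startswith("Name:"):
                name = line.split()[1]
                names.append(name)
                blocks[name] = []
            elif seen_msf_line and line.startswith("//"):
                in_header = False  # end of the name list
            continue
        fields = line.split()
        # Blank lines and positional-numbering lines do not start with a
        # known sequence name, so they are skipped here.
        if fields and fields[0] in blocks:
            blocks[fields[0]].extend(fields[1:])
    return {name: "".join(blocks[name]) for name in names}


def ungap(aligned: str) -> str:
    """Remove the period (.) characters that mark inserts and spaces."""
    return aligned.replace(".", "")


example = """\
temp.msf1 MSF: 1047 Type: N January 01, 1776 12:00 Check: 7515 ..

Name: KIR2DL1*001 Len: 1047 Check: 9282 Weight: 1.00
Name: KIR2DL1*002 Len: 1047 Check: 9451 Weight: 1.00

//

KIR2DL1*001  ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
KIR2DL1*002  ATGTCGCTCT TGTTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
""".splitlines()

aligned = parse_msf(example)

# All aligned sequences must have the same length, gaps included.
assert len({len(seq) for seq in aligned.values()}) == 1

# Because positions line up, alleles can be compared column by column.
pairs = zip(aligned["KIR2DL1*001"], aligned["KIR2DL1*002"])
differences = [pos for pos, (a, b) in enumerate(pairs, start=1) if a != b]
print(differences)
```

Column-wise comparison of this kind is only meaningful for the MSF files, since they are the only format provided that retains the alignment.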
The PIR/NBRF format is more complex. The first line of each sequence entry begins with a "greater than" sign (>). This is immediately followed by a two-character sequence type specifier; for these sequences this is "DL", meaning DNA linear. The fourth character must be a semicolon, and the sequence name or identification code begins at the fifth character. The second line of each sequence entry contains a brief description including the accession number, allele name, sequence length, and an internal checksum for PIR files. The nucleic acid sequence begins on the third line. The sequence is free format; however, to aid in reading the sequences, the nucleotides have been arranged in blocks of 10 bases. The last character is an asterisk (*), which acts as a termination character.
Example KIR2DL1*001 in PIR format.
IPD:KIR00001 KIR2DL1*001, 1047 bases, 9EB285B5 checksum.
ATGTCGCTCT TGGTCGTCAG CATGGCGTGT GTTGGGTTCT TCTTGCTGCA
GGGGGCCTGG CCACATGAGG GAGTCCACAG AAAACCTTCC CTCCTGGCCC
ACCCAGGTCC CCTGGTGAAA TCAGAAGAGA CAGTCATCCT GCAATGTTGG
TCAGATGTCA TGTTTGAACA CTTCCTTCTG CACAGAGAGG GGATGTTTAA
CGACACTTTG CGCCTCATTG GAGAACACCA TGATGGGGTC TCCAAGGCCA
ACTTCTCCAT CAGTCGCATG ACGCAAGACC TGGCAGGGAC CTACAGATGC
TACGGTTCTG TTACTCACTC CCCCTATCAG GTGTCAGCTC CCAGTGACCC
TCTGGACATC GTGATCATAG GTCTATATGA GAAACCTTCT CTCTCAGCCC
AGCCGGGCCC CACGGTTCTG GCAGGAGAGA ATGTGACCTT GTCCTGCAGC
TCCCGGAGCT CCTATGACAT GTACCATCTA TCCAGGGAAG GGGAGGCCCA
TGAACGTAGG CTCCCTGCAG GGCCCAAGGT CAACGGAACA TTCCAGGCTG
ACTTTCCTCT GGGCCCTGCC ACCCACGGAG GGACCTACAG ATGCTTCGGC
TCTTTCCATG ACTCTCCATA CGAGTGGTCA AAGTCAAGTG ACCCACTGCT
TGTTTCTGTC ACAGGAAACC CTTCAAATAG TTGGCCTTCA CCCACTGAAC
CAAGCTCCAA AACCGGTAAC CCCCGACACC TGCACATTCT GATTGGGACC
TCAGTGGTCA TCATCCTCTT CATCCTCCTC TTCTTTCTCC TTCATCGCTG
GTGCTCCAAC AAAAAAAATG CTGCGGTAAT GGACCAAGAG TCTGCAGGAA
ACAGAACAGC GAATAGCGAG GACTCTGATG AACAAGACCC TCAGGAGGTG
ACATACACAC AGTTGAATCA CTGCGTTTTC ACACAGAGAA AAATCACTCG
CCCTTCTCAG AGGCCCAAGA CACCCCCAAC AGATATCATC GTGTACACGG
AACTTCCAAA TGCTGAGTCC AGATCCAAAG TTGTCTCCTG CCCATGA*
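The three-part PIR entry described above (type and name line, description line, sequence terminated by an asterisk) can be read as in the following sketch. The function name parse_pir is hypothetical, and the ">DL;KIR2DL1*001" header line in the sample is an assumption built from the description of the format, since the exact identifier used in the distributed files is not shown here. Only the last sequence line of the allele is included in the sample.

```python
from typing import Dict, Iterable, List


def parse_pir(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per PIR entry: type, name, description, sequence."""
    records: List[Dict[str, str]] = []
    it = iter(lines)
    for raw in it:
        line = raw.strip()
        if not line.startswith(">"):
            continue
        # ">DL;NAME": two-character type, a semicolon, then the name.
        type_code, _, name = line[1:].partition(";")
        # The second line of an entry is the free-text description.
        description = next(it, "").strip()
        chunks = []
        for seq_raw in it:
            seq_line = seq_raw.strip().replace(" ", "")
            if seq_line.endswith("*"):
                # The asterisk terminates the sequence and is not part of it.
                chunks.append(seq_line[:-1])
                break
            chunks.append(seq_line)
        records.append({
            "type": type_code,
            "name": name,
            "description": description,
            "sequence": "".join(chunks),
        })
    return records


example = [
    ">DL;KIR2DL1*001",  # assumed header line, see the note above
    "IPD:KIR00001 KIR2DL1*001, 1047 bases, 9EB285B5 checksum.",
    "AACTTCCAAA TGCTGAGTCC AGATCCAAAG TTGTCTCCTG CCCATGA*",
]
entries = parse_pir(example)
entry = entries[0]
print(entry["type"], entry["name"], len(entry["sequence"]))
```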