Introduction
Sequence formats are simply the ways in which an amino acid or DNA sequence is recorded in a computer file. Different programs expect different formats, so to submit a job successfully it is important to understand what the various formats look like and what their basic structure is. The job submission forms are fairly flexible but cannot cope with too much inconsistency.
You can submit sequences to the search and analysis programs in any of the formats listed in the options of your chosen tool.
If you are submitting sequences to ClustalW2 or Pratt, you may use the normal format, as described below, making sure that the sequences follow each other and are separated by the format's separator. In the case of EMBL format this would be '//'.
To convert sequences to an appropriate format, please use the following link: READSEQ.
Examples of Sequence Formats:
A complete list of sequence formats supported by EMBOSS applications is available in the EMBOSS documentation.
ALN/ClustalW2 format:
ALN format originated in the alignment program ClustalW2. The file starts with the word "CLUSTAL", followed by information about which Clustal program was run and which version was used.
e.g. "CLUSTAL W (2.1) multiple sequence alignment"
The type of clustal program is "W" and the version is 2.1.
The alignment is written in blocks of 60 residues.
Every block starts with the sequence names, obtained from the input sequence, and a count of the total number of residues is shown at the end of the line.
The information about which residues match is shown below each block of residues:
"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.
":" means that conserved substitutions have been observed.
"." means that semi-conserved substitutions are observed.
An example is shown below.
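The block structure described above can be read with a few lines of Python. The following is a minimal sketch, not the parser used by any Clustal tool: the function name `parse_clustal` is hypothetical, and it assumes that names and residues are separated by white space and that match-symbol lines start with a blank.

```python
def parse_clustal(text):
    """Return (header, {name: sequence}) from Clustal ALN text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("file does not start with the word CLUSTAL")
    header = lines[0]
    sequences = {}
    for line in lines[1:]:
        # Blank lines separate blocks; lines starting with a blank hold
        # the "*", ":" and "." match symbols, not residues.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        name, residues = fields[0], fields[1]
        # A trailing residue count (fields[2]) is optional and ignored here.
        sequences[name] = sequences.get(name, "") + residues
    return header, sequences


aln_example = """CLUSTAL W (2.1) multiple sequence alignment

FOSB_MOUSE      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS 60
FOSB_HUMAN      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS 60
                ********************************.***************:*.**:******
"""

aln_header, aln_seqs = parse_clustal(aln_example)
print(aln_header)
for aln_name, aln_residues in aln_seqs.items():
    print(aln_name, len(aln_residues))
```

Residues from successive blocks are appended under the same name, so an alignment longer than 60 columns is reassembled into one string per sequence.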
AMPS Block file format:
The first part of a block-file contains the identifier codes of the sequences that are to follow. Each code is prefixed by the > symbol, and codes must not contain spaces, e.g.
>HAHU
>Trypsin
>A0046
>Seq1
etc.
The ">" symbols at the beginning of the file are counted until a * symbol is found. The * signals the beginning of the multiple alignment, which is stored VERTICALLY: columns are individual sequences, whilst rows are aligned positions. The * symbol must lie over the first sequence. A further star in the same column signals the end of the alignment. Software then uses the number of ">" symbols at the beginning of the file to work out how many columns to read from the * position. It is therefore important that the only ">" symbols in the file are those that define the identifiers, and the only * symbols are those defining the start and end of the multiple alignment. A simple, small block-file is shown below.
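The reading procedure described above (count the ">" lines, find the first *, then read that many columns from the * position until a second * appears in the same column) could be sketched in Python as follows. This is an illustration only: `parse_blockfile` is a hypothetical name, the residue rows are made-up data, and padding short rows with blanks is an assumption.

```python
def parse_blockfile(text):
    """Return {identifier: sequence} from an AMPS block-file."""
    lines = text.splitlines()
    ids = []
    i = 0
    # Collect identifiers until the line holding the first "*".
    while i < len(lines) and "*" not in lines[i]:
        if lines[i].startswith(">"):
            ids.append(lines[i][1:].strip())
        i += 1
    if i == len(lines):
        raise ValueError("no * line found")
    start = lines[i].index("*")  # column of the first sequence
    columns = [[] for _ in ids]
    for row in lines[i + 1:]:
        if row[start:start + 1] == "*":  # second star: end of alignment
            break
        # Assumption: short rows are padded with blanks so that every
        # sequence receives exactly one character per aligned position.
        cells = row[start:start + len(ids)].ljust(len(ids))
        for column, char in zip(columns, cells):
            column.append(char)
    return {name: "".join(col) for name, col in zip(ids, columns)}


# Made-up block-file: three identifiers, four aligned positions.
block_example = ">HAHU\n>Trypsin\n>A0046\n*\nVIV\nLVL\nSGS\nPPA\n*\n"
block_seqs = parse_blockfile(block_example)
print(block_seqs)
```

Because the alignment is stored vertically, each row contributes one character to every sequence; the sketch transposes rows into per-sequence strings.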
Codata Format:
The first line starts with the text "ENTRY". The end of a sequence is delineated by "///". The "SEQUENCE" line specifies the beginning of the sequence lines (starting on the next line), and no sequence is assumed to appear in the entry if the "SEQUENCE" line is missing.
EMBL Format:
The EMBL entries (as below) in the database are structured so as to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. Some entries will not contain all of the line types, and some line types occur many times in a single entry. Each entry begins with an identification line (ID) and ends with a terminator line (//). Consult the EMBL user manual for a more comprehensive guide.
- The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:
- The XX line contains no data or comments. It is used instead of blank lines to avoid confusion with the sequence data lines.
- The AC (Accession Number) line lists the accession numbers associated with this entry.
- The DT (DaTe) line shows when an entry first appeared in the database and when it was last updated.
- The DE (DEscription) lines contain general descriptive information about the sequence stored.
- The KW (KeyWord) lines provide information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence, and will be expanded as work with the database continues. Often several KW lines are necessary for a single entry.
- The OS (Organism Species) line specifies the preferred scientific name of the organism which was the source of the stored sequence.
- The OC (Organism Classification) lines contain the taxonomic classification of the source organism.
- The RN (Reference Number) line gives a unique number to each reference citation within an entry.
- The RC (Reference Comment) line type is an optional line type which appears if the reference has a comment.
- The RP (Reference Position) line type is an optional line type which appears if one or more contiguous base spans of the presented sequence can be attributed to the reference in question.
- The RX (Reference Cross-reference) line type is an optional line type which contains a cross-reference to an external citation or abstract database.
- The RA (Reference Author) lines list the authors of the paper (or other work) cited.
- The RT (Reference Title) lines give the title of the paper (or other work).
- The RL (Reference Location) line contains the conventional citation information for the reference.
- The DR (Database Cross-Reference) line cross-references other databases which contain information related to the entry in which the DR line appears.
- The CC lines are free text comments about the entry, and may be used to convey any sort of information thought to be useful.
- The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs.
- The FT (Feature Table) lines provide a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table.
A complete and definitive description of the feature table is given here.
- The SQ (SeQuence header) line marks the beginning of the sequence data and gives a summary of its content.
- The sequence data lines have a line code consisting of two blanks. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning in position 6 of the line. The direction listed is always 5' to 3'.
- The // (terminator) line also contains no data or comments. It designates the end of an entry.
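Since every line of an entry carries a two-character line code, an entry can be read by grouping lines on that code. The sketch below is a rough illustration under stated assumptions: `parse_embl` is a hypothetical name, the entry is made up and heavily simplified (in particular, its ID and AC contents are placeholders and do not follow the real ID line form), and only the line-code logic described above is implemented.

```python
def parse_embl(text):
    """Yield one dict per entry: line code -> list of contents, plus 'sequence'."""
    entry, residues = {}, []
    for line in text.splitlines():
        code = line[:2]
        if code == "//":  # terminator line: the entry is complete
            entry["sequence"] = "".join(residues)
            yield entry
            entry, residues = {}, []
        elif code == "XX" or not line.strip():  # spacer lines carry no data
            continue
        elif code == "  ":  # sequence data lines start with two blanks
            # Keep letters only: drops the separating blanks and any count.
            residues.append("".join(c for c in line if c.isalpha()))
        else:
            entry.setdefault(code, []).append(line[2:].strip())


# Made-up, simplified entry for illustration only.
embl_example = """ID   EXAMPLE1
XX
AC   X00001;
XX
DE   Made-up entry for illustration only
XX
SQ   Sequence 24 BP;
     gatcctccat atacaacggt atct                                        24
//
"""

embl_entries = list(parse_embl(embl_example))
print(embl_entries[0]["DE"], embl_entries[0]["sequence"])
```

Because the generator emits an entry at each // line, a file holding several entries one after another (as required when submitting multiple sequences) yields one dict per entry.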
FASTA Format:
- This format contains a single header line providing the sequence name, and optionally a description, followed by lines of sequence data.
- Sequences in FASTA formatted files are preceded by a line starting with a ">" symbol.
- The first word on this line is the name of the sequence. The rest of the line is a description of the sequence.
- The remaining lines contain the sequence itself, usually formatted to 60 characters per line.
- Depending on the application, blank lines in a FASTA file are ignored or treated as terminating the sequence.
- Depending on the application, spaces or other non-sequence symbols (dashes, underscores, periods) in a sequence are either ignored or treated as gaps.
- FASTA files containing multiple sequences are just the same, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.
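The rules above translate into a short reader. This is a minimal sketch with a hypothetical function name, `parse_fasta`; since applications differ, it assumes that blank lines are ignored and that spaces inside sequence lines are dropped, while other symbols such as dashes are kept unchanged.

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            # First word is the name, the rest of the line the description.
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


# Made-up sequences for illustration.
fasta_example = ">seq1 first example sequence\nACGTACGT\nACGT\n\n>seq2\nTTGA-CA\n"
fasta_records = parse_fasta(fasta_example)
for record in fasta_records:
    print(record)
```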
GCG/MSF Format:
- The file may begin with as many lines of comment or description as required.
- The comments are terminated with a line starting with two slashes.
- The first mandatory line that is recognised as part of the MSF file is the line containing the text "MSF:". This line also includes the sequence length, type and date, plus an internal checksum value.
- The next line is a mandatory blank line inserted before the sequence names.
- There then follows one line per sequence describing the sequence name, length, checksum and a weight value. Only one name per line is allowed; the qualifier "Name: " is followed by the sequence name. Names are restricted to 10 characters or fewer. Extra characters between the sequence name and "Len: " are acceptable if they contain no blank characters. Another blank line is added, followed by a line starting with two slashes ("//"), which indicates the end of the name list.
- There then follows another blank line.
- Sequences are interleaved on separate lines with gaps represented by periods. Each sequence line starts with the sequence name which is separated from the aligned sequence residues by white space.
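The layout above (a name list, a "//" line, then interleaved sequence lines) can be read roughly as follows. This is a sketch under assumptions: `parse_msf` is a hypothetical name, the example file is made up, and its "Check:" values are placeholders rather than real checksums. A "//" line that appears before any "Name:" line is ignored, so a comment terminator does not start the alignment early.

```python
def parse_msf(text):
    """Return {name: aligned sequence} from MSF text; gaps stay as periods."""
    seqs = {}
    in_alignment = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if not in_alignment:
            if fields[0] == "Name:" and len(fields) > 1:
                seqs[fields[1]] = ""
            elif line.startswith("//") and seqs:
                in_alignment = True  # end of the name list
        elif fields[0] in seqs:
            # Each sequence line starts with its name; join the residue groups.
            seqs[fields[0]] += "".join(fields[1:])
    return seqs


# Made-up example; checksum values are placeholders.
msf_example = """Made-up example for illustration
 example.msf  MSF: 20  Type: P  Check: 0  ..

 Name: seqA  Len: 20  Check: 0  Weight: 1.00
 Name: seqB  Len: 20  Check: 0  Weight: 1.00

//

seqA  MKTAYIAKQR QISFVKSHFS
seqB  MKTA..AKQR QISFVK.HFS
"""

msf_seqs = parse_msf(msf_example)
print(msf_seqs)
```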
GDE Format:
GDE format is a tagged field format used for storing all available information about a sequence. The format matches very closely the GDE internal structures for sequence data. The format consists of text records starting and ending with braces ('{}'). Between the open and close braces are several tagged field lines specifying different pieces of information about a given sequence. The tag values can be wrapped in double quote characters ('"') as needed. If quotes are not used, the first white-space-delimited string is taken as the value. Any fields that are not specified are assumed to have their default values. Offsets can be negative as well as positive. GenBank entries written out in this format will have all (") converted to ('), and all ({}) converted to ([]) to avoid confusion in the parser. Leading and trailing gaps are removed prior to writing each sequence. This format is deliberately verbose in order to be simple to duplicate.
GenBank Format:
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. Although there is daily exchange of information with the EMBL Nucleotide Sequence Database, it has its own sequence format, shown below. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance, such as transcription units, sites of mutations or modifications, and repeats. Protein translations for coding regions are included in the feature table. Bibliographic references are included along with a link to the Medline unique identifier for all published sequences. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry.
- LOCUS: Short name for this sequence (Maximum of 32 characters).
- DEFINITION: Definition of sequence (Maximum of 80 characters).
- ACCESSION: accession number of the entry.
- VERSION: Version of the entry.
- DBSOURCE: Shows the source, the date of creation and last modification of the database entry.
- KEYWORDS: Keywords for the entry.
- AUTHORS: Authors for the work.
- TITLE: Title of the publication.
- JOURNAL: Journal reference for the entry.
- MEDLINE: Medline ID.
- COMMENT: Lines of comments.
- SOURCE ORGANISM: The organism from which the sequence was derived.
- ORGANISM: Full name of organism (Maximum of 80 characters).
- AUTHORS: Authors of this sequence (Maximum of 80 characters).
- ACCESSION: ID Number for this sequence (Maximum of 80 characters).
- FEATURES: Features of the sequence.
- ORIGIN: Beginning of sequence data.
- // End of sequence data.
NBRF/PIR Format:
- The PIR format is similar to FASTA format.
- The first line of each sequence entry begins with a "greater than" (>) sign.
- Each sequence starts with a sequence type code (described in the table below), then a semicolon.
- On the next line the sequence name and a description appear.
- The sequence is on the following line and is ended with an asterisk (*).
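The steps above can be sketched in Python as below. The function name `parse_pir` is hypothetical and the entries are made up. Two details are assumptions not stated above: the text after the semicolon is treated as an identifier code, and "P1" is used as an example type code.

```python
def parse_pir(text):
    """Return a list of dicts with type, id, description and sequence."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # ">" is followed by the type code, a semicolon, then an identifier.
        type_code, _, identifier = line[1:].partition(";")
        description = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            residues.append("".join(seq_line.split()).rstrip("*"))
            if seq_line.rstrip().endswith("*"):  # asterisk ends the sequence
                break
        records.append({
            "type": type_code,
            "id": identifier.strip(),
            "description": description,
            "sequence": "".join(residues),
        })
    return records


# Made-up entries for illustration only.
pir_example = (
    ">P1;EXAMPLE1\n"
    "Made-up protein, illustration only\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS*\n"
    ">P1;EXAMPLE2\n"
    "Second made-up entry\n"
    "MKV*\n"
)
pir_records = parse_pir(pir_example)
for pir_record in pir_records:
    print(pir_record["id"], pir_record["sequence"])
```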
Pfam/Stockholm Format:
The "Pfam/Stockholm" format is a system for marking up features in a multiple alignment. These mark-up annotations are preceded by a 'magic' label, of which there are four types.
Header:
The first line in the file must contain a format and version identifier, currently:
# STOCKHOLM 1.0
The sequence alignment:
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
.
.
//
<seqname> stands for "sequence name", typically in the form "name/start-end" or just "name".
The "//" line indicates the end of the alignment.
Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
Wrap-around alignments are allowed in principle, mainly for historical reasons, but are not used in e.g. Pfam. Wrapped alignments are discouraged since they are much harder to parse.
The alignment mark-up:
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.
#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>
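The header, sequence lines and four mark-up types described above can be separated with a small reader. The following is a sketch only: `parse_stockholm` is a hypothetical name and the alignment is made-up data. Sequence lines with a repeated name are concatenated, so wrap-around alignments are tolerated.

```python
def parse_stockholm(text):
    """Return ({seqname: aligned sequence}, {markup type: list of tuples})."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    # Number of leading fields before the free text or per-column string.
    splits = {"GF": 1, "GC": 1, "GS": 2, "GR": 2}
    seqs = {}
    markup = {kind: [] for kind in splits}
    for line in lines[1:]:
        if line.startswith("//"):  # end of the alignment
            break
        if not line.strip():
            continue
        if line.startswith("#="):
            kind = line[2:4]
            if kind in splits:
                markup[kind].append(tuple(line[4:].split(None, splits[kind])))
        elif line.startswith("#"):
            continue  # any other comment line
        else:
            name, aligned = line.split()[:2]
            seqs[name] = seqs.get(name, "") + aligned
    return seqs, markup


# Made-up alignment for illustration only.
sto_example = """# STOCKHOLM 1.0
#=GF ID EXAMPLE
#=GS seq1/1-7 DE made-up_sequence
seq1/1-7   ACDE.GHK
seq2/1-8   ACDEFGHK
#=GC SS_cons HHHH.EEE
//
"""

sto_seqs, sto_markup = parse_stockholm(sto_example)
print(sto_seqs)
print(sto_markup["GC"])
```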
Example:
Phylip Format:
- The first line of the input file contains the number of species (that is, the number of sequences) and their length (in characters), separated by blanks.
- The next line contains the sequence name, followed by the sequence in blocks of 10 characters.
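A reader for this layout might look like the sketch below. It rests on assumptions: `parse_phylip` is a hypothetical name, the header is taken to hold two integers (number of sequences and their length), names are separated from the residues by white space, and any further blocks list residues only, in the same order as the first block. The data is made up.

```python
def parse_phylip(text):
    """Return {name: sequence} and check lengths against the header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    count, length = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        fields = line.split()
        if i < count:  # first block: name, then residue groups
            names.append(fields[0])
            seqs.append("".join(fields[1:]))
        else:  # later blocks: residues only, same order
            seqs[i % count] += "".join(fields)
    if any(len(s) != length for s in seqs):
        raise ValueError("sequence length does not match the header")
    return dict(zip(names, seqs))


# Made-up alignment: 2 sequences of 20 characters.
phylip_example = (
    " 2 20\n"
    "seqA      ACGTACGTAC GTACGTACGT\n"
    "seqB      ACGTTCGTAC GTACGAACGT\n"
)
phylip_seqs = parse_phylip(phylip_example)
print(phylip_seqs)
```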
Raw Format:
Like text/plain format except that it removes any white space or digits, accepts only alphabetic characters and rejects anything else. This means that it is safer to use this format than plain format. If you have digits and spaces or TAB characters, these are removed and ignored. If you have other non-alphabetic characters (for example, punctuation characters), then the sequence will be rejected as erroneous.
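The cleaning rule described above is simple enough to express directly. The sketch below uses a hypothetical function name, `read_raw`, and treats "alphabetic" as ASCII letters, which is an assumption.

```python
def read_raw(text):
    """Strip white space and digits; reject anything that is not a letter."""
    cleaned = "".join(c for c in text if not (c.isspace() or c.isdigit()))
    # Assumption: only ASCII letters count as alphabetic characters.
    if not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError("sequence contains non-alphabetic characters")
    return cleaned


raw_clean = read_raw("1 ACGT acgt\t22\nTTAA")
print(raw_clean)

try:
    read_raw("ACG-T")  # punctuation is rejected rather than removed
except ValueError as error:
    print("rejected:", error)
```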
RSF Format:
RSF means rich sequence format and it is created by the Editor in SeqLab. The format is recognised by the word !!RICH_SEQUENCE at the beginning of the file. It contains one or more sequences that may or may not be related. In addition to the sequence data, each sequence can be annotated with descriptive sequence information such as:
- Creator/author of the sequence
- Sequence weight
- Creation date
- One-line description of the sequence
- Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
- Known sequence features
UniProtKB/Swiss-Prot Format:
UniProtKB/Swiss-Prot is an annotated protein sequence database. The UniProtKB/Swiss-Prot protein knowledgebase consists of sequence entries. For standardization purposes the format of UniProtKB/Swiss-Prot follows as closely as possible that of the EMBL Nucleotide Sequence Database. Consult the UniProtKB/Swiss-Prot user manual for details. The entries in the UniProtKB/Swiss-Prot database are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used. Each sequence entry is composed of lines. Different types of lines, each with their own format, are used to record the various data that make up the entry.
- The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:
- The AC (ACcession number) line lists the accession number(s) associated with an entry.
- The DT (DaTe) lines show the date of creation and last modification of the database entry.
- The DE (DEscription) lines contain general descriptive information about the sequence stored.
- The GN (Gene Name) line contains the name(s) of the gene(s) that code for the stored protein sequence.
- The OS (Organism Species) line specifies the organism(s) which was (were) the source of the stored sequence.
- The OG (OrGanelle) line indicates if the gene coding for a protein originates from the mitochondria, the chloroplast, a cyanelle, or a plasmid.
- The OC (Organism Classification) lines contain the taxonomic classification of the source organism.
- The OX (Organism taxonomy Cross-Reference) line is used to indicate the identifier to a specific organism in a taxonomic database.
- The RN (Reference Number) line gives a sequential number to each reference citation in an entry.
- The RP (Reference Position) line describes the extent of the work carried out by the authors of the reference cited.
- The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited.
- The RX (Reference Cross-Reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database.
- The RA (Reference Author) lines list the authors of the paper (or other work) cited.
- The RT (Reference Title) lines give the title of the paper (or other work) cited.
- The RL (Reference Location) lines contain the conventional citation information for the reference.
- The CC lines are free text comments on the entry, and are used to convey any useful information.
- The DR (Database cross-Reference) lines are used as pointers to information related to UniProtKB/Swiss-Prot entries and found in other data collections.
- The KW (KeyWord) lines provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories.
- The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists posttranslational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references.
- The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content.
- The sequence data line has a line code consisting of two blanks rather than the two-letter codes used until now. The sequence counts 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line.
- The // (terminator) line contains no data or comments and designates the end of an entry.
Known biosequence format Extensions
CLUSTAL W 2.1 multiple sequence alignment

FOSB_MOUSE      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGS 60
FOSB_HUMAN      ITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGS 60
                ********************************.***************:*.**:******