Jonathan Kans, Colombe Chappey, Jinghui Zhang, Tatiana Tatusov, and James Ostell
National Center for Biotechnology Information, National Library of Medicine, NIH
Sequin is a stand-alone software tool developed by the NCBI for submitting and
updating sequences to the GenBank, EMBL, and DDBJ databases. Sequin has the
capacity to handle long sequences and sets of sequences (segmented entries, as well
as population, phylogenetic, and mutation studies). It also allows sequence editing
and updating, and provides complex annotation capabilities. In addition, Sequin
contains a number of built-in validation functions for enhanced quality assurance.
This overview is intended to provide a quick guide to Sequin's capabilities,
including automatic annotation of coding regions, the graphical viewer, quality
control features, and editing features. We suggest that you read this entire
document before beginning your Sequin submission. More detailed instructions on
these and other functions can be found in Sequin's on-screen Help file,
also available on the World-Wide Web at the EBI:
http://www.ebi.ac.uk/~sterk/sqndocs/
or the Sequin home page at the NCBI:
http://www.ncbi.nlm.nih.gov/Sequin/
E-mail help is also available from:
http://www.ebi.ac.uk/support/ or
info@ncbi.nlm.nih.gov
Basic Sequin Organization
Sequin is organised into a series of forms to (1) enter submitting authors,
(2) enter organism and sequences, (3) enter information such as strain,
gene, and protein names, (4) view the complete submission, and (5) edit
and annotate the submission. The goal is to go quickly from raw sequence data
to an assembled record that can be viewed, edited, and submitted to your
database of choice.
Advance through the pages that comprise each form by clicking on labeled
folder tabs or the Next Page button. After the basic information forms
have been completed and the sequence data imported, Sequin provides a complete
view of your submission, in your choice of text or graphic format. At this
point, any of the information fields can be easily modified by double-clicking
on any area of the record, and additional biological annotations can be entered
by selecting from a menu.
Sequin has an on-screen Help file that is opened automatically when you
start the program. Because it is context-sensitive, the Help text will
change and follow your steps as you progress through the program. A
"Find" function is also provided.
Before You Begin
Preparing Nucleotide and Amino Acid Data
Sequin normally expects to read sequence files in FASTA format. Note that most
sequence analysis software packages include FASTA or "raw" as one of
the available output formats. Population, phylogenetic, and mutation studies can
also be entered in PHYLIP, NEXUS, or FASTA+GAP formats.
See
http://www.ebi.ac.uk/~sterk/sqndocs/faq.html#Orgnameforphyl
for detailed examples of each of the various input data formats.
Prepare your sequence data files using a word processor or some other text
editor, and save in ASCII text format (plain text). If your nucleotide sequence
encodes one or more protein products, Sequin expects two files, one for the
nucleotides and one for the proteins.
FASTA format is simply the raw sequence preceded by a definition line. The
definition line begins with a > sign, and is followed immediately by a name for
the sequence (your own local identification code, or local ID) and a title. During
the submission process, indexing staff at the database to which you are submitting
will change your local ID to an accession number. You can embed other important
information in the title, and Sequin uses this information to construct a
record. Specifically, you can enter organism and strain or clone information in
the nucleotide definition line, and gene and protein information in the protein
definition line. Examples for the nucleotide and protein files are shown here:
>eIF4E [org=Drosophila melanogaster] [strain=Oregon R] Drosophila ...
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA ...
>4E-I [gene=eIF4E] [prot=eukaryotic initiation factor 4E-I]
MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGN ...
>4E-II [gene=eIF4E] [prot=eukaryotic initiation factor 4E-II]
MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGNTATTTAPAGDD ...
The ability to embed this information in the definition line is provided as a
convenience to the submitter. If these annotations are not present, they can be
entered in subsequent forms. Sequin is designed to use this information, and
that provided in the initial forms, to build a properly-structured record. In
many cases, the final submission can be completely prepared from these data, so
that no additional manual annotation is necessary once the record is
displayed.
See
http://www.ebi.ac.uk/~sterk/sqndocs/faq.html#Biosrcmod
for a complete list of definition line modifiers.
In this example, we show alternative splicing, where a single gene produces
multiple messenger RNAs, which encode two similar but distinct protein products.
Note that the [ and ] brackets actually appear in the text. (Brackets are
sometimes used in computer documentation to denote optional text.
This convention is not followed here.)
Also, please note that there must be a line break (carriage return) between the
definition line and the first line of sequence. Some word processors will break a
single line onto two lines without actually adding a carriage return. (This
feature is known as "word wrapping.") If you are unsure whether there is a
carriage return, you can either set up your word processor so it shows invisible
characters like carriage returns, or view the file in a text editor which does
not create artificial line breaks. The definition line itself must not have a
line break within it, since the second line would then be misinterpreted as the
beginning of the sequence data. The actual sequence is usually broken every
50 to 80 characters, but this is not necessary for Sequin to be able to read it.
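The definition line layout described above can be illustrated with a short Python sketch. This is not Sequin's own parser: the function name parse_defline and the regular expression are assumptions made for illustration, and the second example line (ID1 and its title) is hypothetical.

```python
import re

# Hypothetical helper, not part of Sequin: split a FASTA definition line
# into the local ID, the [key=value] modifiers, and the remaining title.
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")

def parse_defline(line):
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must begin with '>'")
    body = line[1:].strip()
    # The local ID follows the '>' immediately and ends at the first space.
    local_id, _, rest = body.partition(" ")
    modifiers = {key.strip(): value.strip() for key, value in MODIFIER.findall(rest)}
    # Whatever is left after removing the bracketed modifiers is the title.
    title = " ".join(MODIFIER.sub(" ", rest).split())
    return local_id, modifiers, title

# Protein definition line taken from the example above (it carries no title).
print(parse_defline(">4E-I [gene=eIF4E] [prot=eukaryotic initiation factor 4E-I]"))

# Hypothetical nucleotide definition line with a made-up ID and title.
print(parse_defline(">ID1 [org=Mus musculus] [strain=BALB/c] sample title"))
```

Running a check like this on your own files before import is one way to confirm that each definition line sits on a single line and that every bracketed modifier is closed.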
Segmented Nucleotide Sets
A segmented nucleotide entry is a set of non-contiguous genomic DNA sequences,
for example, encoding exons along with fragments of their flanking introns.
Segmented sets apply only to incomplete genomic DNA sequences, not ungapped
genomic DNA sequences or mRNA sequences. In order to import nucleotides in a
segmented set, each individual sequence must be in FASTA format with an
appropriate definition line, and all sequences should be in the same file. The
file containing the sequences is imported into Sequin as described later.
Population, Phylogenetic or Mutation Studies
For phylogenetic studies, the scientific name of each organism should be encoded
in each FASTA definition line, e.g., [org=Mus musculus]. For population studies,
you can encode strain, clone, and isolate information in the definition line,
e.g., [strain=BALB/c]. For these studies, PHYLIP, NEXUS, and FASTA+GAP formats
can also be read. (All three of these formats require a unique sequence
identifier, i.e., a local ID.) A sample FASTA+GAP file for a phylogenetic study
is shown here:
>Dmel28S [org=Drosophila melanogaster]
AUUCUGGUUAACUCUAAGCGGUGGAUCACUCGGCUCAUGGGUCGAUGAAGAACGCAGC--
AAACUGUGCGUCAUCGUGUGAACUGCAGGACACAU-GAACAUCGACAUUUUGAACGCAUA
UCG-----------CAGU-------CCAUGCU-GUUAUA----------UACAACCUCAA
>Xlae28S [org=Xenopus laevis]
-----UCGCGACUCUUAGCGGUGGAUCACUCGGCUCGUGCGUCGAUGAAGAACGCAGC--
UAGCUGCGAGAAUUAGUGUGAAUUGCAGA-CACAUUGAUCAUCGACACUUCGAACGCACC
UUGCGGCCCCGGGUUCCUCCCGGGGCCACGCCUGUCUGAGGGUCGCUCCUCAGACCUCAG
>Mmus28S [org=Mus musculus]
--------CGACUCUUAGCGGUGGAUCACUCGGCUCGUACGUCGAUGAAGAACGCAGC--
UAGCUGCGAGAAUUAAUGUGAAUUGCAGA-CACAUUGAUCAUCGACACUUCGAACGCACU
U-GCGGCCCCGGGUUCCUCCCGGGGCUACGCCUGUCUGAGCGUCGCUUC-GCGACCUCAG
>Hsap28S [org=Homo sapiens]
--------CGACUCUUAGCGGUGGAUCACUCGGCUCGUGCGUCGAUGAAGAACGCAGCGC
UAGCUGCGAGAAUUAAUGUGAAUUGCAGA-CACAUUGAUCAUCGACACUUCGAACGCACU
U-GCGGCCCCGGGUUCCUCCCGGGGCUACGCCUGUCUGAGCGUCGCUUC-GCGACCUCAG
One could add "28S ribosomal RNA gene, partial sequence" as a title to
all sequences using the Annotation page of the Organism and
Sequences form, and could also ask that the correct organism name be
prefixed to each title. (Sequin will convert the U (uracil) characters into T
(thymine) automatically.)
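As a rough illustration of how a FASTA+GAP file is organised, the following Python sketch reads aligned records, applies the same U-to-T conversion mentioned above, and reports the aligned length of each sequence. The helper names and the short sample records (seqA, seqB) are hypothetical, and the sketch does not reproduce Sequin's own reader.

```python
def read_fasta_gap(text):
    """Return {local_id: aligned_sequence} from FASTA+GAP text (hypothetical helper)."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The local ID is the first word after the '>' sign.
            current = line[1:].split()[0]
            if current in records:
                raise ValueError(f"duplicate local ID: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before any definition line")
        else:
            # Mirror the U-to-T conversion that the guide says Sequin performs.
            records[current] += line.upper().replace("U", "T")
    return records

def ungapped(sequence):
    """Remove the '-' gap characters to recover the raw sequence."""
    return sequence.replace("-", "")

# Made-up sample data, much shorter than a real alignment.
sample = """>seqA [org=Mus musculus]
AUUC--GG
>seqB [org=Homo sapiens]
A-UCGGGG
"""
alignment = read_fasta_gap(sample)
print(alignment)
print({name: len(seq) for name, seq in alignment.items()})
print({name: ungapped(seq) for name, seq in alignment.items()})
```

Comparing the aligned lengths is a quick way to spot a row that lost or gained a gap character during manual editing.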
Sets of Segmented Sequences
If the sequences in a phylogenetic study are really segmented (e.g., ITS1 and
ITS2 of ribosomal RNA without the intervening 5.8S rRNA sequence), the individual
segments from a single organism can be grouped within square brackets. Subsequent
segments are detected by the presence of a FASTA definition line. For example:
[
>QruITS1 [org=Quercus rubra]
CGAAAACCTGCACAGCAGAAACGACTCGCAAACTAGTAATAACTGACGGAGGACGGAGGG ...
>QruITS2
CATCATTGCCCCCCATCCTTTGGTTTGGTTGGGTTGGAAGTTCACCTCCCATATGTGCCC ...
]
[
>QsuITS1 [org=Quercus suber]
CAAACCTACACAGCAGAACGACTCGAGAACTGGTGACAGTTGAGGAGGGCAAGCACCTTG ...
>QsuITS2
CATCGTTGCCCCCCTTCTTTGGTTTGGTTGGGTTGGAAGTTGGCCTTCCATATGTGCCCT ...
]
...
FASTA+GAP format can also use this convention for encoding sets of segmented
sequences.
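The bracket convention can be sketched in Python as follows. This is only an illustration of the grouping rule, under the assumption that each [ and ] sits on a line of its own as in the example above; the function name and the short sample sequences are hypothetical.

```python
def read_segmented_sets(text):
    """Group FASTA records by the [ ... ] lines enclosing each organism's segments."""
    groups = []
    current_group = None
    current_record = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "[":
            if current_group is not None:
                raise ValueError("nested '[' is not expected")
            current_group = []
            current_record = None
        elif line == "]":
            if current_group is None:
                raise ValueError("']' without a matching '['")
            groups.append(current_group)
            current_group = None
            current_record = None
        elif line.startswith(">"):
            # Each definition line starts a new segment within the group.
            if current_group is None:
                raise ValueError("definition line outside a [ ... ] group")
            current_record = {"defline": line[1:], "sequence": ""}
            current_group.append(current_record)
        else:
            if current_record is None:
                raise ValueError("sequence data before a definition line")
            current_record["sequence"] += line
    if current_group is not None:
        raise ValueError("unclosed '[' at end of input")
    return groups

# Made-up, shortened sample data following the layout shown above.
sample = """[
>segA1 [org=Quercus rubra]
CGAAAACC
>segA2
CATCATTG
]
[
>segB1 [org=Quercus suber]
CAAACCTA
>segB2
CATCGTTG
]
"""
for group in read_segmented_sets(sample):
    print([record["defline"] for record in group])
```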
Creating a Submission
The sequence data we will use for this example is the genomic sequence of the
Drosophila melanogaster eukaryotic initiation factors 4E-I and 4E-II (GenBank
accession number U54469).
Welcome to Sequin Form
Once you have finished preparing the sequence files, you are ready to start
the Sequin program. Sequin's first window asks you to indicate the database
to which the sequence will be submitted, and prompts you to start a new project
or continue with an existing one. Once you choose a database, Sequin will remember
it in subsequent sessions. In general, each sequence submission should be
entered as a separate project. However, segmented DNA sequences, population
studies, phylogenetic studies, and mutation studies should be submitted together
as one project. This feature also eliminates the need to save Sequin information
templates for each sequence.
See
http://www.ebi.ac.uk/~sterk/sqndocs/netaware.html
for instructions on how to make Sequin "network-aware". When connected
to the Internet, Sequin also allows you to download a record from Entrez. If you
are the original submitter of this record, you can use Sequin to edit the sequence
or add further annotations to the record.
Submitting Authors Form
The pages in the Submitting Authors form ask you to provide the release
date, a working title, names and contact information of submitting authors, and
affiliation information. To create a personal template for use in future
submissions, use the File/Export option after completing each page of this
form.
The Submission page asks for a tentative title for
a manuscript describing the sequence, and will
initially mark the manuscript as being unpublished.
When the article is published, the database staff
will update the sequence record with the new citation.
This page also lets you indicate that a record
should be held until it is published, although
the preferred policy is to release the record
immediately into the public databases.
The Contact page asks for the name, phone number, and e-mail address of
the person responsible for making the submission. Database staff members will
contact this person if there are any questions about the record.
The Sfx (suffix) field should be filled with personal name
suffixes (e.g., Jr., Sr., or 3rd), not with a
person's academic degrees (e.g., M.D. or Ph.D.).
It is not necessary to type periods after
initials or suffixes.
In the Authors page, enter the names of the people who should get
scientific credit for the sequence presented in this record. These will become
the authors for the initial (unpublished) manuscript.
Authors are entered in a spreadsheet. As soon as anything
is typed in the last row, a new (blank) row is
added below it. Use the tab key to move between
fields. Tabbing from the Sfx (suffix) column automatically
moves to the First Name column in the next row.
Again, suffix is for personal suffixes (e.g.,
Jr. or Sr.), not academic degrees.
The Affiliation page asks for the institutional affiliation
of the primary author.
Sequence Format Form
With Sequin, the actual sequence data are imported from an outside data file. So
before you begin, prepare your sequence data files using a word processor or
perhaps a text editor associated with your laboratory sequence analysis
software (see "Before you Begin").
The traditional submission is of a single nucleotide sequence containing
one or more genes and encoding one or more proteins. Data files should be
prepared in FASTA format. Segmented sequences, e.g., where several exons
have been sequenced but the complete introns are not yet sequenced, may also be
submitted, with the individual nucleotide segments in FASTA format combined into
one file.
Sequin will also accept population, phylogenetic and mutation
studies, in which genes from many individuals or organisms are sequenced.
The data for these can be entered in FASTA format, but they can also be entered
in PHYLIP, NEXUS, or FASTA+GAP formats, which contain alignment information
as well as the sequences. For this example, FASTA format will be used.
The batch submission choice is provided to accommodate
submission of unrelated sequences, where no alignment
is present or should be calculated.
One powerful feature of Sequin is that the program can automatically annotate
the name of the organism, strain, gene, protein product, and other information
on your sequence and coding regions. You can supply this information in the
FASTA definition line between square brackets, e.g., [org=Mus musculus]
[strain=BALB/c] for the nucleotide and [gene=ins] [prot=insulin] for the
protein.
It is much easier to produce the final submission if you let Sequin work
for you in this manner.
Organism and Sequences Form
The first page of this form requests information regarding the scientific name of
the organism from which the sequence was derived, if it was not already encoded
in the nucleotide FASTA file. Organism information is most easily entered by
selecting the appropriate organism from the scrollable list. As you begin typing
the scientific name, the list will jump to the right alphabetical location. Click
on the list to finish the selection.
Once you select an organism from the list, the corresponding
scientific and common names and genetic code are
filled out automatically. If you then choose Mitochondrion
as the sequence location, the alternative genetic
code for that organism will be used. If your organism
is not on the list, Sequin will simply accept
the scientific name you have typed; you should
then manually set the
genetic code used for translation. This will
later be verified by the database staff using
a more complete taxonomic database. (Sequin only
holds the top 800 organisms, while more than 30,000
are actually present in GenBank.)

Note: For phylogenetic studies, you may annotate each FASTA definition line with
[org=scientific name] information
(see "Before you Begin"),
and you would not need to fill in the Organism page. When phylogenetic
study has been selected as the sequence format, in fact, this page is replaced
by a notice explaining how to place the organism name in the definition line. In
that case you should enter the location of the sequence and the default genetic
code. Similarly, in a single sequence, if you embed [org=scientific name] in
the definition line (usually along with other modifiers, such as strain or
clone), you can skip this page.
Importing Nucleotide FASTA Files
To import the nucleotide sequence data, click on the Nucleotide folder tab
or the Next Page button to advance to the next page. Select molecule type
and topology, check any additional boxes that apply, then click on Import
Nucleotide FASTA and select the appropriate file.
It is most convenient to place the multiple segments of a segmented sequence, or
the individual sequences of a population, phylogenetic, or mutation study, in a
single file. However, they can also be kept in separate files. In this case you
would repeat the import step for each file. (In addition to importing from a file,
sequences can also be read by pasting from the computer's "clipboard".)
When the sequence file import is complete, a box will appear showing the number
of nucleotide segments imported, the total length in nucleotides of the
sequences entered, and the local ID(s) you designated. The actual sequence
data are not shown. If any of this information is missing or incorrect,
check the file containing the sequence data for proper FASTA format, choose Clear
from the Edit menu, then reimport the sequence(s).
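If you would like to predict what the import summary should report before opening Sequin, a small Python sketch along these lines can count the segments, total the nucleotides, and list the local IDs in a FASTA file. The function name and sample data are hypothetical, and the sketch is independent of Sequin's own import code.

```python
def summarize_fasta(text):
    """Count records, total residues, and local IDs in FASTA text (hypothetical helper)."""
    ids = []
    lengths = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The local ID is the first word after the '>' sign.
            ids.append(line[1:].split()[0])
            lengths.append(0)
        elif not ids:
            raise ValueError("sequence data found before any definition line")
        else:
            lengths[-1] += len(line)
    return {"segments": len(ids), "total_length": sum(lengths), "local_ids": ids}

# Made-up sample with two short segments.
sample = """>seg1 [org=Mus musculus]
ACGTACGT
ACGT
>seg2
GGCC
"""
print(summarize_fasta(sample))
```

If the numbers differ from those Sequin displays, a likely cause is a definition line that was wrapped onto two lines, so that its second half was read as sequence data.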
The format for annotating the nucleotide FASTA definition line is shown below:
>ID [org=scientific name] [strain=strain name] title
Sequin has extracted the organism and strain names from the FASTA definition
line in this example, eliminating the need to fill out the Organism page.
Importing Protein FASTA Files
If you have specified a single sequence or segmented nucleotide sequence, and if
it encodes one or more proteins, you can enter the sequences of the protein
products in this page. To import the amino acid sequence, click on the
Proteins folder tab and proceed in the same manner as for nucleotide data.
In this example, we imported two protein sequences. These are the alternative
splice products of the same gene. Both protein sequences were in the same data
file, but each had its own definition line with its own unique local ID.
The format for annotating the protein FASTA definition line is shown below:
>ID [gene=gene name] [prot=protein name] title
Sequin has extracted the gene and protein names from the FASTA definition lines,
and will use these to construct the initial sequence record.
Annotating Population/Phylogenetic/Mutation Sets
If you have specified a population, phylogenetic, or mutation
study, the Annotation folder tab replaces
the Proteins folder tab. This page allows
you to add an rRNA or CDS feature to the entire
length of all sequences in the set. In addition,
you can add a title to any sequences that didn't
obtain them from a FASTA definition line. It is
much easier to add these in bulk at this step
than to add individual rRNA or CDS features to
each sequence after the record is constructed.
It is customary in a nucleotide record to format titles for sequences containing
coding region features in the following way:
Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.
The choice of "mRNA" or "gene" depends upon the molecule
type (use "mRNA" for mRNA or cDNA, and "gene" for genomic DNA).
Use "partial" for incomplete features. The proper organism name in a
phylogenetic study can be added to the beginning of each title automatically by
checking the Prefix title with organism name box.
However, for records containing CDS, rRNA, or tRNA features, Sequin can
generate the definition line automatically by computing on the features
(see "Generating the Definition Line").
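The customary title format can be expressed as a small Python sketch. The function build_title and its parameters are hypothetical and do not reflect how Sequin computes definition lines; the example values come from the [org=Mus musculus], [gene=ins], and [prot=insulin] modifiers used earlier in this guide.

```python
def build_title(organism, protein, gene, molecule, complete):
    """Assemble a title in the customary coding-region format (illustrative only)."""
    # "mRNA" is used for mRNA or cDNA, and "gene" for genomic DNA.
    kind = {"mRNA": "mRNA", "cDNA": "mRNA", "genomic DNA": "gene"}[molecule]
    completeness = "complete" if complete else "partial"
    return f"{organism} {protein} ({gene}) {kind}, {completeness} cds."

print(build_title("Mus musculus", "insulin", "ins", "mRNA", complete=False))
print(build_title("Mus musculus", "insulin", "ins", "genomic DNA", complete=True))
```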
More complex situations, such as a population study of HIV sequences, can
include multiple CDS features in each sequence. In this case, do not use the
Annotation page to create features. (You can still use it for a common
title, however.) After the initial submission has been created, you would
manually annotate features onto one of the sequences. Feature propagation
through an alignment can then be used to annotate the same features at the
equivalent locations on the remaining sequences.
Entering Gene and Protein Information
In the protein FASTA definition line you can embed [gene=...]
and [prot=...]. If this information was not entered
in the definition line, Sequin will display the
following form for each protein sequence, allowing
you to fill in the missing information manually.
Entering Organism and Source Modifiers
In the nucleotide FASTA definition line you can embed [org=scientific name]
and modifiers from the following list (all in the format [modifier=...]):
cell-line, cell-type, chromosome, clone, clone-lib, country, cultivar, dev-stage,
haplotype, isolate, lab-host, map, natural-host, plasmid-name, plastid-name, sex,
strain, sub-species, tissue-lib, tissue-type, transposon-name, and variety. For
example, [org=Homo sapiens] [cell-line=HeLa].
Some population studies are a mixture of integrated provirus and excised virion.
These can be indicated by molecule and location qualifiers. For example,
[molecule=dna] [location=proviral] or [molecule=rna] [location=virion].
If you're unsure of which modifier to use, use [note=...], and database staff will
determine the appropriate modifier to use.
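A misspelled modifier name is easy to overlook, so it can help to check definition lines against the names listed above before importing. The following Python sketch does this; the function name is hypothetical, and treating the names as exact lowercase matches is an assumption, since this guide does not state how Sequin handles case.

```python
import re

# Modifier names taken from the list above, plus org, molecule, location,
# and note, which are also mentioned in this section.
KNOWN_MODIFIERS = {
    "org", "cell-line", "cell-type", "chromosome", "clone", "clone-lib",
    "country", "cultivar", "dev-stage", "haplotype", "isolate", "lab-host",
    "map", "natural-host", "plasmid-name", "plastid-name", "sex", "strain",
    "sub-species", "tissue-lib", "tissue-type", "transposon-name", "variety",
    "molecule", "location", "note",
}

def unknown_modifiers(defline):
    """Return the modifier names in a definition line that are not in the list."""
    names = re.findall(r"\[([^=\[\]]+)=", defline)
    return sorted({name.strip() for name in names} - KNOWN_MODIFIERS)

print(unknown_modifiers(">id1 [org=Homo sapiens] [cell-line=HeLa]"))
print(unknown_modifiers(">id1 [org=Homo sapiens] [cellline=HeLa]"))
```

An empty list means every modifier name was recognised; any name that is returned is a candidate for correction or for replacement with [note=...].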
For population, phylogenetic, or mutation studies, Sequin
presents a form to allow this information to be
edited (if it had been entered in the FASTA file),
or entered in the first place, before it constructs
the sequence record. In addition to organism name,
you can enter any kind of source modifier in this
form.

Viewing Your Submission
GenBank View
After you have completed importing
the data files, Sequin will display your full
submission information in the GenBank format (or
EMBL format if you chose EMBL as the database
for submission in the first form).
Based on information provided in your DNA and amino acid sequence files, any
coding regions will be automatically identified and annotated for you. The
figure shows only the top portion of the GenBank record, but you can see the
first of two coding region (CDS) features. The vertical bar to the left of the
paragraph indicates that the CDS has been selected by clicking with the computer's
mouse.
There are also three mRNA features (not shown in the figure) that were generated
by copying the cDNA feature intervals and editing them to include the
5' and 3' UTRs. Also, the journal citation, originally listed as
"Unpublished", has been updated now that the article has been
published. Chromosome and map information have also been added to the
biological source feature. These changes were initiated by double clicking on the
appropriate paragraphs in the GenBank display format. Finally, Sequin was asked to
generate the definition line
by computing on the annotated features.
Graphical View
To get a graphical view, use
the Display Format pop-up menu to change
from GenBank to Graphic. Reviewing your submission
in Graphic format allows you to visually confirm
expected location of exons, introns, and other
features in multiple interval coding regions.
The Graphic view in our eukaryotic initiation
factor example illustrates how the coding region
intervals for the two protein products are spatially
related to each other.
The Duplicate View item in the File menu will launch a second
viewer on the record. The display format on each viewer can be independently
set, allowing you to see a graphical view and a GenBank text report
simultaneously. This is useful for getting an overall view of the features and
seeing the details of annotation.
Sequence View
Sequence view is a static version
of the sequence and alignment editor. It shows
the actual nucleotide sequence, with feature intervals
annotated directly on the sequence. Protein translations
of CDS features are also shown, as are all features
shown in the graphical view. Intronic regions
within protein translations are shown as a series
of tilde (~) characters.
Editing and Annotating Your Submission
At this point, Sequin could process your entry based on what you have entered so
far, and you could send it to your nucleotide database of choice (as set in the
initial form). However, to optimise usefulness of your entry for the scientific
community, you may wish to provide additional information to indicate biologically
significant regions of the sequence. But first, save the entry so that if you make
any unwanted changes during the editing process you can revert to the original copy.
Additional information may be in the form of Descriptors or Features.
(Descriptors are annotations that apply to an entire sequence or set of
sequences. Features are annotations that apply to a specific sequence interval.)
Sequin provides two convenient methods to modify your entry: (1) to edit
existing information, double click on the text or graphic area you wish to
modify, and Sequin will display forms requesting needed information; or (2) to
add new information, use the Annotate menu and select from the list of
available annotations.
Sequence Editor
Additional sequence data can also be added using Sequin's powerful sequence
editor. Sequin will automatically adjust feature intervals when editing the
sequence. Prior to Sequin, it was usually easier to annotate everything from
scratch when the sequence changed. But an even easier way to update sequences
is described in the following section.
Updating the Sequence
Sequin can also read in a replacement sequence or an overlapping sequence
extension, and perform the alignment and feature propagation calculations
necessary to adjust feature intervals, even though the individual editing
operations were not done with the sequence editor.
The Update Sequence submenu within the Edit menu has several
choices. These are for use by the original submitter of a record.
You can read a FASTA file or raw sequence file. This can be a replacement
sequence, or it can overlap the original sequence at the 5' or 3' end. After
Sequin aligns the two sequences, you then select replace or merge, as appropriate,
and the sequence in your record is updated, with all feature intervals adjusted
properly.
You can also update with an existing sequence record that contains features.
This can be obtained from a file, or retrieved from Entrez either via an
accession number or by selecting an alignment after running PowerBLAST. The
latter two choices require the network-aware version of Sequin. Once it gets the
new record, Sequin aligns the
two sequences as before. This is typically used either to merge two records that
overlap, or to copy features from database records onto a new large contig.
Generating the Definition Line
The Generate Definition Line item in the Annotate menu can make the
appropriate titles once the record has been annotated with features. The general
format for sequences containing coding region features is:
Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.
Exceptional cases, where this automatic function is unable to generate a reasonable
definition line, will be edited by the database staff to conform to the style
conventions.
The new definition line will replace any previous title, including that originally
on the FASTA definition line.
Validation
Once you are satisfied that
you have entered all the relevant information,
save your file! Then select Validate under
the Search menu. You will either receive
a message that the validation test succeeded or
see a screen listing the validation errors and
warnings. Just double click on an error item to
launch the appropriate editor for making corrections.
See the Sequin Help text for more information
on correcting errors. The validator includes checks
for such things as missing organism information,
incorrect coding region lengths, internal stop
codons in coding regions, inconsistent genetic
codes, mismatched amino acids, and non-consensus
splice sites.
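To give a feel for two of the checks named above (coding region length and internal stop codons), here is a minimal Python sketch. It is not Sequin's validator: the function name is hypothetical, it assumes the standard genetic code only, and it expects the already spliced coding sequence as input. A partial coding region can legitimately fail the length check.

```python
# Stop codons of the standard genetic code; other genetic codes differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(cds):
    """Report simple problems in a spliced coding sequence (illustrative only)."""
    problems = []
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # Every codon except the last is checked for a premature stop.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"internal stop codon {codon} at codon {index + 1}")
    return problems

print(check_cds("ATGAAATAA"))
print(check_cds("ATGTAAAAATAA"))
print(check_cds("ATGAA"))
```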
Submitting the Entry
When the entry is properly formatted and error-free, click the Done
button or select Prepare Submission under the File menu. You will
be prompted to save your entry and e-mail it to the database you selected. The
address for GenBank is gbsub@ncbi.nlm.nih.gov. The address for
EMBL is datasubs@ebi.ac.uk. The address for DDBJ is
ddbjsub@ddbj.nig.ac.jp.
Advanced Topics
Feature Editor Design
Sequin uses a common structure for all feature editor forms, with (usually)
three top-level folder tabs. One folder tab page is specific to the given
feature type (biological source and publications have more). The
Properties and Location pages are common to all features. Some of
these pages may have subpages, accessed by a secondary set of smaller folder
tabs. This organization allows editors for complex data structures to fit in a
reasonably small window size. The most important information in a given section
is always presented in the first subpage.
Coding Region Page
The coding region editor is perhaps the most complicated form in Sequin. Within
the Coding Region page, the General subpage displays such values
as the genetic code used for translation and the reading frame. (Please note
that there are currently 13 different genetic codes present in Sequin. For
more information on these, see
http://www.ncbi.nlm.nih.gov/Taxonomy/.)
The Exceptions subpage allows you to
indicate translation exceptions to the normal genetic code, such as insertion
of selenocysteine, or suppression of terminator codons by a suppressor tRNA.
The Protein subpage lets you set the name (or, if not
known, a description) of the protein product.
The Product subpage lets you predict the
coding region intervals from the protein sequence
or translate the protein sequence from the location.
(Importing a protein sequence from a file will
also interpret the [gene=...] and [prot=...] definition
line information and automatically attempt to
predict the coding region intervals.)
Additional annotation on the protein product might include a leader peptide,
transmembrane regions, disulfide bonds, or binding sites. These can be added
after setting the Target Sequence popup on the sequence viewer to the
desired protein sequence. You can also launch a duplicate view, already targeted
to the appropriate protein, from the Product subpage.
Properties Page
All features have a number of fields in common. Check the Partial box to
indicate that the range of the feature extends beyond the length of the
sequence. (You would usually also check the 5' Partial or 3'
Partial boxes in the Location page.) Exception means that the
sequence of the protein product doesn't match the translation of the DNA
sequence because of some known biological reason (e.g., RNA editing).
Evidence can be experimental or non-experimental. For a coding region,
the isolation of an mRNA or cDNA in the laboratory is sufficient to justify
setting this box to Experimental.
In addition, nucleotide features (other than genes themselves) can reference a
gene feature. This is frequently done by overlap. (The overlapping gene will
show up on the feature as a /gene qualifier in GenBank format.) Extension of the
feature location will automatically extend the gene that is selected in the
editor. In rare cases, you may want to set a gene by cross-reference. For
example, the mRNA for the lac operon in E. coli could indicate the
lacIZYA gene by cross-reference, and the four coding regions would overlap the
lacI, lacZ, lacY and lacA genes.
The Comment subpage allows text to be associated with
a feature. In GenBank format, this appears as
a /note qualifier. The Citations subpage
attaches citations to the feature. (The citations
should first be added to the record using items
in the Publication submenu of the Annotate
menu, whereupon they will appear in the REFERENCE
section.) For example, an article that justifies
a non-obvious or controversial biological conclusion
would be cited here. In GenBank format, for example,
if the publication is listed as Reference 2, the
feature citation appears as /citation=[2]. Cross-Refs
are cross-references to other databases. The contents
of this subpage may only be changed by the GenBank,
EMBL, or DDBJ database staff.
Location Page
All features are required to have a location, i.e., one or
more intervals on a sequence coordinate. The Location
page provides a spreadsheet for entering and editing
this information. An arbitrary number of lines
can be entered. In this coding region example,
the intervals correspond to the exons. For an
mRNA, the intervals would be the exons and UTRs.
The 5' and 3' Partial check
boxes will show up as < or > in front of
a feature coordinate in the GenBank flat file,
indicating partial locations.
The GenBank flat file view of this location would be:
join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
If the 5' Partial or 3' Partial boxes were checked, < and > symbols
would appear at the appropriate end of the join statement:
join(<201..224,1550..1920,1986..2085,2317..2404,2466..>2629)
If the sequence was reverse complemented (based on a length of 2881 nucleotides),
the Strand popups would all indicate Minus, and the join statement
for the resulting feature location would be as follows:
complement(join(253..416,478..565,797..896,962..1332,2658..2681))
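The arithmetic behind these location strings can be checked with a short Python sketch. The function names are hypothetical and the code is not taken from Sequin; it simply reproduces the coordinates of this example, using the rule that position p on a sequence of length L maps to L - p + 1 on the reverse complement. It assumes a feature with more than one interval and does not handle partial markers on the minus strand.

```python
def format_join(intervals, partial_5=False, partial_3=False):
    """Format plus-strand intervals as a GenBank-style join statement."""
    parts = [[str(start), str(stop)] for start, stop in intervals]
    if partial_5:
        parts[0][0] = "<" + parts[0][0]    # marks the 5' end as partial
    if partial_3:
        parts[-1][1] = ">" + parts[-1][1]  # marks the 3' end as partial
    return "join(" + ",".join("..".join(part) for part in parts) + ")"

def reverse_complement_intervals(intervals, length):
    """Map intervals onto the reverse-complemented sequence, in ascending order."""
    flipped = [(length - stop + 1, length - start + 1) for start, stop in intervals]
    return sorted(flipped)

exons = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

print(format_join(exons))
print(format_join(exons, partial_5=True, partial_3=True))
print("complement(" + format_join(reverse_complement_intervals(exons, 2881)) + ")")
```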
NCBI DeskTop
The NCBI DeskTop is a window that directly displays the internal structure of
the record being viewed in Sequin. It can be understood as a Venn diagram. It is
only meant for advanced users.
As with other views on a record, the DeskTop indicates selected items, and lets
you select items by clicking. Selected items can then be the target of actions
chosen from the Filter menu.
In this example, Sequin was given the genomic nucleotide,
cDNA, and protein sequences for Drosophila eukaryotic
initiation factor 4E. It then determined the mRNA
and coding region intervals, and built an initial
structure. The organism (BioSource descriptor)
is at the nuc-prot set, and thus applies to both
the nucleotide and protein sequences.
Additional Information
The Sequin home page at the
EBI
http://www.ebi.ac.uk/~sterk/sqndocs/
has a Frequently Asked Questions section, and
more detailed instructions on using the capabilities
of network-aware Sequin.