Help - ClustalW2 FAQ
- Why use ClustalW2?
- How can I use ClustalW2?
- What can I do with ClustalW2?
- What type of sequences can ClustalW2 align?
- What input formats does ClustalW2 accept?
- What output formats does ClustalW2 produce?
- How can I save my alignment to a file?
- Is there a limit on the number of sequences or the size of the file that I submit to ClustalW2?
- What do the file extensions mean that I get in my results?
- How are the pairwise alignment scores generated?
- How can I get the colour version of the alignment?
- Why can I not see the guide tree in my browser?
- What is Jalview?
- Why does Jalview not work for me?
- How can I save the guide tree image or the jalview alignment?
- What is the difference between a cladogram and a phylogram?
- Which method is used to draw the guide tree?
- Is there an alternative tree drawing program that can be used for large numbers of sequences?
- Is it possible to perform bootstrap analysis or to get the Bootstrap values along with the nodes of each branch of the phylogram?
- Can we have distance scale (as bar) based on the substitution rates along with the phylogram?
- What are the default parameters used by ClustalW2?
- How can I see the parameter values that are used by the ClustalW2 program?
- How long are ClustalW2 results stored at the EBI?
- Does my job have an identifier?
- Where can I download the ClustalW software from?
- How do I reference use of this service?
- What is the difference between a guide tree and a true phylogenetic tree?
- Where can I find detailed information about ClustalW2?
- How does ClustalW2 Work (very simple explanation)?
Why use ClustalW2?
Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the identification of conserved sequence regions. This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins and in identifying new members of protein families.
How can I use ClustalW2?
The ClustalW2 web form is available at http://www.ebi.ac.uk/Tools/clustalw2/. There are two ways to use this service at the EBI. The first is interactively (default) and the second is by email. Using it interactively, the user must wait for the results to be displayed in the browser window. The email option means that the results will not be displayed in the browser window but will be sent by email. The email option is the better one to take when submitting large amounts of data.
ClustalW is also available from within SRS (http://srs.ebi.ac.uk).
What can I do with ClustalW2?
The program ClustalW2 can be used for two purposes:
1. It can be used to produce a multiple sequence alignment. Using the web form the user need only input or upload a file of the sequences that they want to align in an accepted format. The other options on the form are set to the default values for producing a multiple alignment. The user can use the defaults or they can make some changes on the form to customise their run. A multiple sequence alignment of the sequences submitted will be returned to the user (.aln file).
2. It can be used to produce a true phylogenetic tree. In order to use this option, the user must input or upload a multiple alignment of sequences in one of the standard multiple alignment formats (.aln file). Then, in the phylogenetic tree section of the form, they must choose one of the tree type options: NJ, Phylip or Dist. These are methods for producing phylogenetic trees. This time the user will retrieve a .ph file (always) and .dst and/or .nj files (depending on the options chosen), which will contain the phylogenetic trees.
By default, the form is set to produce a multiple alignment.
What type of sequences can ClustalW2 align?
It can align either nucleotide or protein sequences. In the case of nucleotide sequences, it will align them as they are input - the program does not provide the option of specifying DNA strands. The EMBOSS tool revseq can be used to reverse and/or complement nucleotide sequences.
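As a rough illustration of what reversing and complementing a nucleotide sequence means, here is a minimal Python sketch. The function name reverse_complement is hypothetical and this is not the EMBOSS revseq implementation; it only handles the plain DNA letters A, C, G, T and N.

```python
# Hypothetical helper for illustration only; not the EMBOSS revseq code.
# Only A, C, G, T and N (upper or lower case) are complemented;
# any other character is passed through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCGT"))  # prints ACGCAT
```

Sequences prepared this way (or with revseq) can then be submitted so that all of them are on the same strand.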
What input formats does ClustalW2 accept?
The program accepts sequences in the following formats:
NBRF/PIR, EMBL/UniProt, Pearson (FASTA), GDE, ALN/ClustalW, GCG/MSF, RSF (see the Clustal help pages for details about formats).
The sequences can either be pasted into the web form or uploaded to the web form in a file. It is very important that each of the sequences has a unique name. If they do not, the program will fail. There must be no empty lines, white spaces or control characters between sequences or at the top of the file. This will also cause the program to fail.
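The two requirements above (unique names, no empty lines) can be checked before submitting. The following Python sketch assumes the input is in Pearson (FASTA) format and that the sequence name is the first word after the ">" character; the function name check_fasta is hypothetical and is not part of ClustalW2.

```python
from collections import Counter


def check_fasta(text: str) -> list:
    """Return a list of problems found in FASTA-formatted text.

    Assumptions (not taken from the ClustalW2 source): a sequence name
    is the first whitespace-delimited word after '>', and one trailing
    newline at the very end of the file is acceptable.
    """
    problems = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the single newline that ends the file
    names = []
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number}: empty line")
        elif line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"name {name!r} appears {count} times")
    return problems


if __name__ == "__main__":
    sample = "\n>seq1\nACGT\n\n>seq1\nACGA\n"
    for problem in check_fasta(sample):
        print(problem)
```

An empty list means that neither of the two problems was found; it does not guarantee that the program will accept the file.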
What output formats does ClustalW2 produce?
There are a number of options provided as output for the user:
aln with numbers (default), aln without numbers, gcg MSF, phylip, pir and gde.
The user can specify which of these they want on the web form in the OUTPUT section. There is also an option to specify the order that the sequences appear in the alignment: aligned (default) or in the order in which they were input. The alignment will appear on the results page along with details of scores and guide trees. The alignment can be obtained on its own by clicking on the alignment file option at the top (.aln). This file can be opened in a separate window and/or saved to a file.
How can I save my alignment to a file?
The alignment will appear on the results page along with details of scores and guide trees. The alignment can be obtained on its own by clicking on the alignment file option at the top (.aln). This file can be opened in a separate window or saved to a file.
Is there a limit on the number of sequences or the size of the file that I submit to ClustalW2?
The input for ClustalW2 is limited to a maximum of 500 sequences or to a 10MB file (whichever is smaller). When the input file or the number of sequences is large, ClustalW can run for days and in some cases may not finish at all. If you plan to input large amounts of data/sequences, you should use the "RESULTS: email" option.
Email jobs are allowed to run for more than 24 hours and the results are kept for a week.
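A quick local check against these limits might look like the Python sketch below. It assumes FASTA input (one ">" line per sequence) and that "10MB" means 10 * 1024 * 1024 bytes; both are assumptions, and the name within_limits is hypothetical.

```python
MAX_SEQUENCES = 500
MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" read as 10 MiB


def within_limits(fasta_text: str) -> bool:
    """Return True if the FASTA text is within both stated limits."""
    n_sequences = sum(
        1 for line in fasta_text.splitlines() if line.startswith(">")
    )
    n_bytes = len(fasta_text.encode("utf-8"))
    return n_sequences <= MAX_SEQUENCES and n_bytes <= MAX_BYTES


if __name__ == "__main__":
    print(within_limits(">seq1\nACGT\n>seq2\nACGA\n"))  # prints True
```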
What do the file extensions mean that I get in my results?
On our ClustalW2 submission page, when you submit a number of sequences using the default parameters, you retrieve a .aln and a .dnd file. The .aln file is the alignment and the .dnd file is a guide tree - it is not a phylogenetic tree.
To get an accurate phylogenetic tree, you need to use the .aln file as input and put this back into the ClustalW2 form. This time you need to choose one of the tree options - nj, phylip or dist (all methods for making phylogenetic trees). This time you will retrieve a .ph (always), .dst and/or .nj (depending on options), which are phylogenetic trees.
The .input is your input and the .output is the results that are output.
How are the pairwise alignment scores generated?
A pairwise score is calculated for every pair of sequences that are to be aligned. These scores are presented in a table in the results. Pairwise scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). The scores are initially calculated as percent identities and are converted to distances by dividing by 100 and subtracting from 1.0, which gives the number of differences per site. We do not correct for multiple substitutions in these initial distances.
As the pairwise score is calculated independently of the matrix and gaps chosen, it will always be the same value for a particular pair of sequences.
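The calculation described above can be sketched in a few lines of Python. This is an illustration of the stated formula only, not code from ClustalW2; the function names are hypothetical and the two sequences are assumed to be already aligned, with "-" marking gaps.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity of two aligned sequences, skipping gap positions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap positions are excluded from the comparison
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * identical / compared


def identity_to_distance(pid: float) -> float:
    """Convert a percent identity to differences per site."""
    return 1.0 - pid / 100.0


if __name__ == "__main__":
    pid = percent_identity("ACGT-A", "ACCTTA")
    print(pid, identity_to_distance(pid))
```

In the example, five positions are compared (one is skipped because of the gap) and four of them are identical, so the score is 80 percent and the distance is roughly 0.2 differences per site.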
The alignment scores are calculated from separate pairwise alignments, which can be computed in two ways: by dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).
References
- Wilbur, W. J. and Lipman, D. J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA, 80: 726-730
- Myers, E. W. and Miller, W. (1988) Optimal alignments in linear space. Comput. Applic. Biosci., 4: 11-17
How can I get the colour version of the alignment?
The alignment will appear on the results page in black and white. The 'show colours' option displays the alignment in colour according to the physicochemical characteristics of the amino acids.
Why can I not see the guide tree in my browser?
You must have Java enabled to see the guide tree. The guide trees are drawn by a Java applet (which requires the Java runtime plugin). To check that you have enabled Java applets go to Preferences, Advanced, and "Enable Java".
What is Jalview?
Jalview is a multiple alignment editor that is written entirely in Java. It is provided as an option when you retrieve a multiple alignment from ClustalW2. To use it, just click on the Jalview gif. It allows you to do things like:
- Use many different colour schemes
- Read and write alignments in a variety of formats
- Order the alignment according to different criteria
- Draw UPGMA and NJ trees depending on percent identity criteria
- Remove gapped columns
- Perform Smith-Waterman alignment of selected sequences
- Insert/delete gaps using the mouse
- Cluster sequences using principal component analysis (PCA)
Why does Jalview not work for me?
If Jalview is not available with your results, this is because the program requires that your browser supports Java applets (provided by Java runtime plugin). To check that you have enabled Java applets go to Preferences, Advanced, and "Enable Java".
How can I save the guide tree image or the jalview alignment?
The only way this can be done is by making a screen shot. This is because applets are not printed out with the html pages. You will need to:
- Use the "Print Screen" button in the top right corner of your keyboard.
- Open an imaging application like Paintshop Pro or Photoshop.
- Go "File>New" from the menu or "Control+N" from the keyboard to create a new image. Go "Edit>Paste" from the menu or "Control+V" from the keyboard to paste your screen capture.
- Then use the crop function to trim the image (e.g. "Image>Crop").
- Then save or print the image.
What is the difference between a cladogram and a phylogram?
A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny. The branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length. Therefore, cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa. It is possible to see the tree distances by clicking on the diagram to get a menu of options. The options available allow you to do things like changing the colours of lines and fonts and showing the distances.
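If a tree is available as New Hampshire (Newick) text, the difference can be seen directly: a phylogram needs the branch lengths, while a cladogram only needs the branching order. The Python sketch below removes the ":length" annotations from such a string. The example tree and the function name strip_branch_lengths are made up for illustration, and quoted labels that contain colons are not handled.

```python
import re


def strip_branch_lengths(newick: str) -> str:
    """Remove ':length' annotations from a Newick string.

    Simplification: assumes unquoted labels with no ':' inside them.
    """
    return re.sub(r":[0-9.eE+-]+", "", newick)


if __name__ == "__main__":
    phylogram = "((A:0.1,B:0.2):0.05,C:0.3);"  # made-up example tree
    print(strip_branch_lengths(phylogram))  # prints ((A,B),C);
```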
Which method is used to draw the guide tree?
The method named PHYLIP is equivalent to the New Hampshire format tree representation. All ClustalW phylogenetic calculations are based on the neighbor-joining method of Saitou and Nei.
Is there an alternative tree drawing program that can be used for large numbers of sequences?
If the number of sequences is very large, the default tree drawing program can generate an image that is too large to capture with print screen. In these situations a number of other programs may help to scale the image down:
http://pearl.cs.pusan.ac.kr/phylodraw/
http://pfaat.sourceforge.net/
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
You will need to install the application on your PC. Then save your ClustalW tree file (.ph) and use it with the application.
Is it possible to perform bootstrap analysis or to get the Bootstrap values along with the nodes of each branch of the phylogram?
No - bootstrap analysis is too CPU intensive, so we do not allow it via the website.
If you do wish to do this you will need to download the software (available from the clustal page) and run this locally.
Can we have distance scale (as bar) based on the substitution rates along with the phylogram?
We do not produce a bar, but distances can be displayed when you right click on the Java applet.
What are the default parameters used by ClustalW2?
When "def" values are used, we let ClustalW (1.82) use its own default values:
- DNA Gap Open Penalty = 15.0
- DNA Gap Extension Penalty = 6.66
- DNA Matrix = Identity
- Protein Gap Open Penalty = 10.0
- Protein Gap Extension Penalty = 0.2
- Protein matrix = Gonnet
- Protein/DNA ENDGAP = -1
- Protein/DNA GAPDIST = 4
How can I see the parameter values that are used by the ClustalW2 program?
If you submit your job by email, you will receive two emails. The first one is a confirmation mail that lists the parameters that you have chosen. The second mail contains a link to the ClustalW2 result page. It is not possible to show the submission parameters on the result page, because ClustalW2 does not include them in the ClustalW2 output.
How long are ClustalW2 results stored at the EBI?
If you run an interactive job, the results will be available for 24 hours. The results of an email job are available for 2 weeks. Some big files are removed after 15 minutes due to space constraints.
Does my job have an identifier?
Yes. You will find it on the results page. The job identifier has the job name, the date and a random number. If you want to report any errors or queries about a job, please tell us the job identifier.
Where can I download the ClustalW2 software from?
Both ClustalW2 and ClustalX can be downloaded from the EBI ftp site.
We do not distribute the CGI script for the web interface. The CGI.pm module (available from many perl sites on the internet) is needed to build a cgi interface for the command-line version of ClustalW2.
How do I reference use of this service?
ClustalW and ClustalX version 2.
Bioinformatics 2007 23(21): 2947-2948.
What is the difference between a guide tree and a true phylogenetic tree?
A guide tree is calculated based on the distance matrix that is generated from the pairwise scores. The output can be found in the .dnd file. A phylogenetic tree is calculated based on the multiple alignment that it receives. The distances between the sequences in the alignment are calculated and can be found in the .ph file. These distances are then used by the method chosen (nj, phylip, dist) to make the phylogenetic tree (.nj, .ph, .dst file).
Where can I find detailed information about ClustalW2?
There are a number of places where additional help on ClustalW2 can be found:
- Multiple sequence alignment with the Clustal series of programs
- ClustalW2 Improving Sensitivity
- 2can tutorial explains details of ClustalW2 and how to use it
- EBI ClustalW2 help pages
How does ClustalW2 Work (very simple explanation)?
1. Determine all pairwise alignments between sequences and the degree of similarity between them.
2. Construct a similarity tree.
3. Combine the alignments from 1 in the order specified in 2 using the rule "once a gap, always a gap".
In stage 1:
1.1. ClustalW2 uses a pairwise alignment method to compute all pairwise alignments.
1.2. Using the alignments from 1.1, it computes a distance for each pair.
1.2.1. The distance is commonly calculated by looking at the non-gapped positions and counting the number of mismatches between the two sequences. This value is then divided by the number of non-gapped pairs to give the distance. Once the distances for all pairs have been calculated, they go into a matrix, which is used in stage 2.
In stage 2:
2. Using the matrix from 1.2.1 and Neighbor-Joining, ClustalW2 constructs the similarity tree. The root is placed in the middle of the longest chain of consecutive edges.
In stage 3:
3. Combine the alignments, starting from the most closely related groups (going from the tips of the tree towards the root).
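Stage 1 can be illustrated with a short Python sketch. It is a simplification under stated assumptions and not the ClustalW2 code: the pairwise alignment is a basic Needleman-Wunsch with match/mismatch scores and a linear gap penalty (ClustalW2 itself uses substitution matrices and separate gap open and extension penalties), and all function names are hypothetical. Stages 2 and 3 (Neighbor-Joining and the progressive alignment) are not covered.

```python
from itertools import combinations


def _sub(x, y, match, mismatch):
    return match if x == y else mismatch


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch)):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def distance(aligned_a, aligned_b):
    """Mismatches divided by the number of non-gapped pairs (step 1.2.1)."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # arbitrary choice for this sketch: nothing to compare
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(seqs):
    """Distances for all pairs, stored as a nested dict keyed by name."""
    names = list(seqs)
    matrix = {name: {name: 0.0} for name in names}
    for p, q in combinations(names, 2):
        d = distance(*global_align(seqs[p], seqs[q]))
        matrix[p][q] = matrix[q][p] = d
    return matrix


if __name__ == "__main__":
    example = {"s1": "ACGT", "s2": "ACGT", "s3": "ACCT"}  # made-up sequences
    for name, row in distance_matrix(example).items():
        print(name, row)
```

The resulting matrix is the input that a tree-building method such as Neighbor-Joining would use in stage 2.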
