Help - ClustalW2 FAQ
- What is ClustalW?
- Why is ClustalW useful?
- How can I use ClustalW?
- What version of ClustalW is run at EMBL-EBI?
- How do I reference use of ClustalW at EMBL-EBI?
- What inputs does ClustalW accept?
- How do I input multiple sequences into a single box?
- Is there a limit on the number of sequences or the size of the file that I submit to ClustalW?
- What does a 'stream closed' error mean?
- Where can I find information on the different parameters/options?
- Why can't I select a specific matrix (eg BLOSUM 62)?
- What are the best settings for aligning two sequences?
- Why do I get a 'minimum 2 sequences required' error?
- Why do I get a 'Two sequences cannot share the same identifier' error?
- What does 'Entry found which does not contain a sequence' mean?
- Why does ClustalW truncate my identifier?
- Why does my input disappear when using the back button from my browser?
- What outputs does ClustalW produce?
- How do I download the alignment?
- How can I view my alignment after I've downloaded it?
- How are the pairwise scores calculated?
- How do I obtain a phylogenetic tree for an alignment?
- How do I download/save the phylogenetic tree?
- How can I view my tree after I've downloaded it?
- How long are results stored at EMBL-EBI?
- What do the consensus symbols mean in the alignment?
- What do the colours mean when I show them on protein alignments?
- Why do I get a 'Raw Tool Output' page?
- Why do I get a 'Job not found' page?
- Why do I see 'Java is required'?
Getting further help
ClustalW is a tool to align three or more sequences together in a computationally efficient manner.
Aligning multiple sequences highlights areas of similarity which may be associated with specific features that have been more highly conserved than other regions. These regions in turn can help to classify sequences or to inform experiment design.
Multiple sequence alignment is also an important step for phylogenetic analysis, which aims to model the substitutions that have occurred over evolution and derive the evolutionary relationships between sequences.
The ClustalW multiple sequence alignment web form is available at http://www.ebi.ac.uk/Tools/msa/clustalw2/. There are two ways to use this service at EMBL-EBI. The first is interactively (default) and the second is by email. Using it interactively, the user must wait for the results to be displayed in the browser window. The email option means that the results will not be displayed in the browser window but instead a link to the results will be sent by email. The email option is the better one to take when submitting large amounts of data or a job that might take a long time to run.
For more detailed help for using ClustalW please see the tool help documentation http://www.ebi.ac.uk/Tools/msa/clustalw2/help/index.html.
We run ClustalW version 2 (ClustalW2) at EMBL-EBI. For the precise version number, go to the submission details tab of your job results.
Please cite use of ClustalW at our website with the following:
ClustalW and ClustalX version 2 (2007) Bioinformatics 2007 23(21): 2947-2948. doi:10.1093/bioinformatics/btm404
A new bioinformatics analysis tools framework at EMBL-EBI (2010) Nucleic acids research 2010 Jul, 38 Suppl: W695-9 doi:10.1093/nar/gkq313
The program accepts nucleic acid or protein sequences, in the following multiple sequence formats:
- Pearson (FASTA)
- RSF (see the Clustal help pages for details about formats)
Please note GenBank (from NCBI) or raw sequence data (just the sequence with no recognised formatting) will not work with ClustalW. If using sequences from NCBI be sure to save them as FASTA format first. Sequence format conversion tools are available at http://www.ebi.ac.uk/Tools/sfc/.
The sequences can either be pasted into the web form or uploaded to the web form in a file. It is very important that each of the sequences has a unique name. If they do not, the program will fail. There must be no empty lines, white spaces or control characters between sequences or at the top of the file. This will also cause the program to fail.
Example protein input (FASTA format): sequence12.txt
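The layout rules above can be checked before submitting. The following is a minimal sketch, assuming FASTA (Pearson) input held in a string; the function name `find_layout_problems` is hypothetical and is not part of ClustalW or the EMBL-EBI service. It only looks for empty lines and for text before the first header line, so it is a pre-check rather than a full format validator.

```python
def find_layout_problems(fasta_text):
    """Return a list of layout problems found in FASTA-style input text."""
    problems = []
    lines = fasta_text.splitlines()
    # The file must begin with a header line, with nothing above it.
    if not lines or not lines[0].startswith(">"):
        problems.append("line 1: input does not start with a '>' header line")
    # Empty or whitespace-only lines anywhere in the input are reported.
    for number, line in enumerate(lines, start=1):
        if line.strip() == "":
            problems.append(f"line {number}: empty or whitespace-only line")
    return problems


if __name__ == "__main__":
    example = "\n>seq1\nMKT\n\n>seq2\nMKV\n"
    for problem in find_layout_problems(example):
        print(problem)
```

An empty list means that none of the problems this sketch looks for were found.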
Paste the sequences in an accepted format into the same box. ClustalW accepts multiple sequence formats as input; these formats allow multiple sequences to be placed in the same file or input box, because each one gives ClustalW a way to tell where a new sequence starts. For this reason it is important to use correctly formatted sequences.
The input for ClustalW is limited to a maximum of 500 sequences or to a 1MB file (whichever is smaller). When the input file or the number of sequences is large, ClustalW can run for days and in some cases may not finish at all. If you plan to input large amounts of data/sequences you should use the email results option. For batch runs of ClustalW see our web services tools: http://www.ebi.ac.uk/Tools/webservices/
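The limits above can be checked locally before submitting. This is a minimal sketch, assuming FASTA input in a string and assuming that "1MB" means 1,048,576 bytes (the exact byte count used by the service is not stated here); the constant and function names are hypothetical.

```python
MAX_SEQUENCES = 500
MAX_BYTES = 1024 * 1024  # assumption: "1MB" is taken to mean 1,048,576 bytes


def check_input_limits(fasta_text):
    """Count sequences and bytes and compare them with the stated limits."""
    # In FASTA format every sequence starts with a '>' header line.
    n_sequences = sum(
        1 for line in fasta_text.splitlines() if line.startswith(">")
    )
    n_bytes = len(fasta_text.encode("utf-8"))
    return {
        "sequences": n_sequences,
        "bytes": n_bytes,
        "within_limits": n_sequences <= MAX_SEQUENCES and n_bytes <= MAX_BYTES,
    }


if __name__ == "__main__":
    print(check_input_limits(">a\nACGT\n>b\nACGA\n"))
```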
A 'stream closed' error sometimes occurs if you try to upload a very large input to our tools, far in excess of the ClustalW input limits (500 sequences or a 1MB file). Make sure your input falls within these limits.
To view parameter settings for ClustalW click the 'More options' buttons. Clicking on the parameter name will take you to the relevant help information.
We only allow you to select a series of matrices for ClustalW, BLOSUM for example. ClustalW then selects the exact matrix based on its internal calculations of which one is best to use. If you wish to specify an exact matrix then you will need to download and run the command-line version of ClustalW.
Trick question! Multiple sequence alignment tools like ClustalW are designed to align three or more sequences. To align two sequences you should use our pairwise alignment tools: http://www.ebi.ac.uk/Tools/psa/
The 'minimum 2 sequences required' error occurs when something is wrong with your sequence input. It is often caused by using a sequence format not supported by ClustalW, or by a formatting problem, for example sequence data not starting on a new line after the sequence header line. ClustalW requires sequences to be formatted, not just raw sequence data.
Information about sequence formats can be found at the EMBOSS sequence formats guide. If you're not sure, try using FASTA format. You can also use tools like Readseq to convert between sequence formats.
ClustalW needs unique sequence identifiers, which it defines as the first word on the sequence identifier line. Check that you've not got a duplicate identifier somewhere in the input, that you're not using spaces or tabs in your identifiers, and that the first 30 characters of your identifier are unique if using Clustal format files.
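The identifier rule described above can be sketched as follows, assuming FASTA input where the identifier is the first word after the '>' character. The function name `duplicate_identifiers` and its `limit` parameter are hypothetical; pass `limit=30` to mimic the 30-character limit mentioned for Clustal format.

```python
from collections import Counter


def duplicate_identifiers(fasta_text, limit=None):
    """Return identifiers that occur more than once, optionally truncated."""
    identifiers = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The identifier is the first word on the header line.
            words = line[1:].split()
            identifier = words[0] if words else ""
            # Slicing with limit=None keeps the full identifier.
            identifiers.append(identifier[:limit])
    counts = Counter(identifiers)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    text = ">sp|P1 human\nMK\n>sp|P1 mouse\nMK\n>sp|P2\nMK\n"
    print(duplicate_identifiers(text))
```

In the usage example the two headers differ only after a space, so both reduce to the same identifier and would clash.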
ClustalW produces the 'Entry found which does not contain a sequence' error when it can't find the sequence data to go with a sequence identifier. This can happen if sequence data is missing from the input or the data is on the same line as the header, so make sure the sequence data is on a new line.
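Entries of this kind can be located before submitting. The following is a minimal sketch, assuming FASTA input in a string; `entries_without_sequence` is a hypothetical helper, not a ClustalW function. It treats anything on the header line as part of the header, so sequence data written on the same line as the identifier counts as missing.

```python
def entries_without_sequence(fasta_text):
    """Return identifiers of entries that have no sequence lines below them."""
    empty = []
    current_id = None
    has_sequence = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous entry; record it if it was empty.
            if current_id is not None and not has_sequence:
                empty.append(current_id)
            words = line[1:].split()
            current_id = words[0] if words else ""
            has_sequence = False
        elif line.strip():
            has_sequence = True
    # The last entry has no following header, so check it separately.
    if current_id is not None and not has_sequence:
        empty.append(current_id)
    return empty


if __name__ == "__main__":
    print(entries_without_sequence(">a MKT\n>b\nMKV\n>c\n"))
```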
The default output format (ALN/Clustal format) truncates sequence identifiers to 30 characters - this may also affect the ability of the job to run if the first 30 characters are not unique. To solve this, change the output format to a type that does not have a limit, such as pearson/FASTA.
If your input disappears when you use the back button, you're probably using Firefox, which has recently changed behaviour and no longer retains form information when you use the back button. If you want to do a lot of repeat analysis on the same input/parameters then we'd recommend using Web Services to access this service programmatically. Alternatively, you could use a different browser.
ClustalW produces several outputs, depending on the options you selected when submitting the job. By default the main output is the alignment file. Other outputs can be viewed/downloaded in the results summary tab.
The quickest way to download the alignment is to click the 'Download Alignment File' button in the alignments tab of the results. You can view all the files that are produced on the 'Results Summary' tab, which includes the tool output and any guide tree files as well as the alignment file. Colours are not saved as part of the alignment.
The alignment will be in the format you specified in the input section, or Clustal by default if you didn't specify anything. All of the formats are plain text, so the file can be opened in any text viewing program that can deal with Unix text characters (end of line characters for example). In Windows you can use WordPad.
More specialist tools are available to investigate alignments and allow you to mark them up with colours for example. Genedoc and Jalview are two such tools.
The Scores Table shows the pairwise scores calculated for every pair of sequences that is to be aligned. Pairwise scores are simply the number of identities between the two sequences, divided by the length of the alignment, and represented as a percentage. This alignment is only a precursor to the full multiple alignment and might not be preserved.
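The score definition above can be written out as a short calculation. This is a sketch under stated assumptions: both sequences are given as already-aligned strings of equal length with '-' for gaps, the alignment length counts every column including gapped ones, and a gap never counts as an identity. How ClustalW treats gapped columns internally is not stated here, so treat `percent_identity` as an illustration rather than a reimplementation.

```python
def percent_identity(aligned_a, aligned_b):
    """Identities divided by alignment length, expressed as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both sequences carry the same residue (not a gap).
    identities = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * identities / len(aligned_a)


if __name__ == "__main__":
    # 4 identical columns out of 6 aligned columns.
    print(round(percent_identity("ACGT-A", "ACCTTA"), 2))
```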
To create a phylogenetic tree you can either send the results to ClustalW_Phylogeny using the button in the Alignment tab, or simply go to the Phylogenetic Tree tab where a simple Neighbour Joining tree will be created.
The tree data can be saved by clicking the 'View Phylogenetic Tree file' button or by clicking the tree link in the Results Summary tab. Using this data you can recreate the tree in any tree viewing software that takes Newick format tree data.
The tree image cannot be saved directly as it is a dynamic Java-driven interface. However, you can take a screenshot and save it in an image editing program or, as mentioned above, use the tree data to recreate the tree in another tree viewing program and save it from there.
The tree data is in the widely used Newick format. Several online and standalone tree viewing programs can take this data and recreate the tree from it.
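To give an idea of what Newick data looks like, the sketch below lists the sequence (leaf) names found in a small, made-up tree string. It assumes simple unquoted names with no spaces or brackets, and `leaf_names` is a hypothetical helper; for real work a dedicated tree library or viewer is the safer choice.

```python
import re


def leaf_names(newick):
    """Return leaf names from a Newick string with simple, unquoted labels."""
    # A leaf name directly follows '(' or ','; labels after ')' belong to
    # internal nodes and are therefore not matched.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


if __name__ == "__main__":
    # Made-up example: names are followed by ':' and a branch length.
    tree = "((seq1:0.1,seq2:0.2):0.05,seq3:0.3);"
    print(leaf_names(tree))
```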
Results are stored at EMBL-EBI for a week from the date of job submission.
An * (asterisk) indicates positions which have a single, fully conserved residue.
A : (colon) indicates conservation between groups of strongly similar properties - scoring > 0.5 in the Gonnet PAM 250 matrix.
A . (period) indicates conservation between groups of weakly similar properties - scoring <= 0.5 in the Gonnet PAM 250 matrix.
Note that the same symbols display for DNA/RNA alignments, so while * (asterisk) characters are still useful, the other characters should be ignored for DNA/RNA alignments.
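The asterisk rule can be illustrated with a short sketch. It assumes the alignment is a list of equal-length strings with '-' for gaps, and it marks only fully conserved columns; the ':' and '.' symbols depend on Gonnet PAM 250 residue groups that are not listed here, so they are left out. The function name `conserved_columns` is hypothetical.

```python
def conserved_columns(aligned_sequences):
    """Return a line with '*' under every fully conserved column."""
    if not aligned_sequences:
        return ""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    # zip(*...) walks the alignment column by column.
    for column in zip(*aligned_sequences):
        residues = set(column)
        fully_conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    alignment = ["MKTA", "MKSA", "MK-A"]
    for seq in alignment:
        print(seq)
    print(conserved_columns(alignment))
```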
This protein-only option colours the residues according to their physicochemical properties:
| Residue | Colour | Property |
| AVFPMILW | Red | Small (small + hydrophobic (incl. aromatic - Y)) |
| DE | Blue | Acidic |
| RK | Magenta | Basic - H |
| STYHCNGQ | Green | Hydroxyl + sulfhydryl + amine + G |
| Others | Grey | Unusual amino/imino acids etc |
The 'Raw Tool Output' page appears when we can't parse the results of your ClustalW job, usually because the job has failed for some reason. Have a look at the links on the page, especially the Tool Error Details, for clues as to why this might have happened, and check your input for errors.
The 'Job not found' page appears when we can't find the job that you have requested. Normally it's a result of trying to look at the page over a week after the job was submitted, so the data is no longer available at EMBL-EBI. If you've copied the URL from a recent email job, check for any mistakes in the link.
Some of the options for ClustalW results pages require Java to work properly. If Java isn't installed with your browser or it doesn't have permission to run (for example it has been disabled due to being out of date) then there may be some results options missing, for example the button to launch Jalview or the display of guide trees.
To update your version of Java go to the Java.com download page.
More help can be found in the help documentation (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/index.html).
If you have a specific problem with ClustalW at EMBL-EBI, or any feedback on it, please feel free to contact our helpdesk (http://www.ebi.ac.uk/support/), ideally including full details about your problem and the URL of the page you are contacting us about.