Dotlet Analysis Essay


Sequence alignment

Pairwise alignment


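The easiest way to align two sequences is to use a dotplot. In its most straightforward implementation, the two sequences to be aligned are written along the two coordinate axes, and a dot is placed wherever the residues at the corresponding positions are identical. Regions of similarity then appear as diagonal lines.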

In more realistic implementations a window of 5 to 20 nucleotides or amino acids is slid along one of the axes (i.e., sequences) and compared to every possible window on the other axis (sequence). The dot intensity is adjusted to reflect the percent identity (or similarity) in the two windows.
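The windowed comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the way Dotlet itself is implemented: the function names, the window size, and the threshold used for drawing are assumptions chosen for the example, and identity is used instead of a similarity matrix.

```python
def window_identity(a, b):
    """Fraction of positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dotplot(seq_x, seq_y, window=5):
    """Return a matrix of window identities.

    Entry [i][j] compares the window starting at position i of seq_y
    with the window starting at position j of seq_x.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [window_identity(seq_y[i:i + window], seq_x[j:j + window])
         for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix, threshold=0.8):
    """Draw cells at or above the threshold as '*', all others as '.'."""
    return "\n".join(
        "".join("*" if value >= threshold else "." for value in row)
        for row in matrix
    )


if __name__ == "__main__":
    # Comparing a sequence with itself gives an unbroken main diagonal.
    print(render(dotplot("ACGTACGTTGCA", "ACGTACGTTGCA", window=4)))
```

A real dotplot program maps the identity value to a gray scale rather than applying a single threshold; the threshold here only keeps the text output readable.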

See the Dotlet exercises from last Friday.

Optimal global and local alignments.

There are many different algorithms to calculate pairwise sequence alignments. For two sequences it is "easy" to calculate an optimal global alignment. (According to the motto: "It can be easily shown" -- see here). The so-called Needleman-Wunsch algorithm is widely used. It maximizes an alignment score; a related (and under some conditions equivalent) approach is to minimize the differences between the two sequences.
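As an illustration of how an optimal global alignment can be calculated, here is a minimal sketch of Needleman-Wunsch dynamic programming. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example; real programs use substitution matrices and usually affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # a[i-1] against a gap
                score[i][j - 1] + gap,                     # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Several alignments can share the same optimal score; the traceback above simply returns the first one it finds.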

Multiple Sequence Alignments


Usually global alignments are the easiest to calculate (for local alignments, see the discussion of BLAST).

One of the easiest to use, most sophisticated, and most versatile alignment programs is clustalw.

(Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237-244;
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22, 4673-4680)

Clustalw runs on all common platforms (unix, mac, pc), it is part of most multiprogram packages, and it is also available via different web interfaces (for example here, and here).

Clustalw uses a very simple menu-driven command-line interface, and you can also run it entirely from the command line (i.e., it is easy to incorporate into scripts).
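A script might call clustalw roughly as follows. This is a sketch under assumptions: the executable name (clustalw2 here) depends on your installation, the file names are placeholders, and the options shown (-INFILE, -ALIGN, -OUTFILE, -OUTPUT) should be checked against the help output of your own clustalw version.

```python
import shutil
import subprocess


def build_clustalw_command(infile, outfile, output_format="PHYLIP",
                           executable="clustalw2"):
    """Assemble the argument list for a non-interactive clustalw run.

    The executable name is an assumption; on some systems it is "clustalw".
    """
    return [
        executable,
        f"-INFILE={infile}",       # unaligned sequences, e.g. in FASTA format
        "-ALIGN",                  # perform a full multiple alignment
        f"-OUTFILE={outfile}",     # where to write the alignment
        f"-OUTPUT={output_format}",
    ]


def run_clustalw(infile, outfile, **kwargs):
    """Run clustalw and return the output file name."""
    command = build_clustalw_command(infile, outfile, **kwargs)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on the PATH")
    subprocess.run(command, check=True)
    return outfile


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(build_clustalw_command("sequences.fasta", "sequences.phy")))
```

Keeping the command construction separate from the call makes it easy to print and inspect the command before running it on many files.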

Clustalx uses the same algorithms as clustalw. However, it has a much nicer interface, it displays information on the level of similarity, and it uses color in the alignment. Especially for amino acids, the use of color greatly enhances the ability to recognize conservative replacements. Clustalx 2.1 is available for different platforms at the EBI's ftp site (follow your platform; clustalx is stored in the clustalw folders).

Clustal reads and writes most formats used by different programs.  The easiest format is the FASTA format:

> name of sequence or any other information goes in the first line. This line starts with ">". The line can be longer than 80 characters. The first line ends with the first paragraph sign (line break).
The second line contains the sequence itself; numbers and other non-standard characters are ignored. Be careful if you download sequences. Often the transfer programs introduce paragraph signs every 100 characters, and the end of the first (comment) line frequently ends up as the beginning of the sequence.
All sequences to be read should be in a single file.
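The format is simple enough to read with a few lines of code. The following is a minimal sketch, not the parser clustal uses: the function name is made up for the example, and the choice to keep only letters and the gap character "-" is an assumption that mirrors the rule that numbers and other non-standard characters are ignored.

```python
def read_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Keep letters and gaps; drop digits, spaces and other characters.
            chunks.append("".join(ch for ch in line if ch.isalpha() or ch == "-"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 example protein\n1 MKVLA 5\nGG-TT\n>seq2\nMKV\n"
    for name, sequence in read_fasta(example):
        print(name, sequence)
```

Note that a header line broken in two by a transfer program cannot be detected this way: the second half contains letters and would be read as sequence, which is exactly the problem described above.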

(sample clustalw input file)

(sample clustalw output file)

Clustal also reads aligned sequences.  If you input aligned sequences you can go directly to the tree section.
!! Be careful: if you make a mistake and the sequences are not aligned, your tree will look strange !!

Clustal is also useful for reformatting and editing alignments. It is very forgiving in reading formats; e.g., you can open a file in clustal format (*.aln) in a text editor, delete columns, reload the file into clustalw, and output it in any of the other available formats.

For calculating an alignment, you can select different substitution matrices and gap penalties (end gaps can be treated differently!).

Clustal is better than its reputation. It does a great job in handling gaps, especially terminal gaps, and it makes good use of different substitution matrices.

To align sequences, clustal performs the following steps (also known as progressive alignment):

1) Pairwise distance calculation
2) Clustering analysis of the sequences
3) Iterated alignment of the two most similar sequences or groups of sequences.

It is important to realize that the second step is the most important. The relationships found here will create a serious bias in the final alignment. The better your guide tree, the better your final alignment. You can load a guide tree into clustal. This tree will then be used instead of the neighbor-joining tree that clustalw calculates by default. (The guide tree needs to be in standard parenthesis (Newick) notation WITH branch lengths).
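To make the clustering step concrete, here is a small sketch that turns a pairwise distance matrix into a guide tree in parenthesis notation with branch lengths. Note the assumptions: it uses UPGMA (average linkage) as a simpler stand-in for the neighbor-joining method mentioned above, the distances are supplied by hand instead of being calculated from pairwise alignments, and all names are made up for the example.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a guide tree by UPGMA clustering.

    names: list of sequence labels.
    distances: dict mapping (label, label) pairs to a distance.
    Returns (newick_string, joins), where joins lists the pairs of
    clusters in the order in which they were merged.
    """
    # Each cluster is keyed by its own subtree string: (size, height).
    clusters = {name: (1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in distances.items()}
    joins = []
    while len(clusters) > 1:
        # Pick the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        merged = f"({a}:{height - height_a:.3f},{b}:{height - height_b:.3f})"
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
        joins.append((a, b))
    (newick,) = clusters
    return newick + ";", joins


if __name__ == "__main__":
    tree, order = upgma(
        ["A", "B", "C"],
        {("A", "B"): 0.2, ("A", "C"): 0.6, ("B", "C"): 0.6},
    )
    print(tree)
    print(order)
```

The order of the joins is what drives step 3: the two closest sequences are aligned first, and every later sequence or group is aligned to what has already been fixed, which is why errors in the guide tree propagate into the final alignment.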

Other programs often used for multiple sequence alignment
(We will not use these programs in this course; if you are already confused by the information provided, skip to the assignments):

A program available via the www is SAM (sequence alignment and modeling system) by Richard Hughey, Anders Krogh, Christian Barrett, & Leslie Grate at UCSC. The input consists of a multiple sequence file (aligned or not aligned) in FASTA format. The program uses secondary structure predictions, neighboring sites, etc. to place gaps. The program can be accessed at

If your sequences are not very similar, and if you are not able to generate a trustworthy multiple sequence alignment, you can calculate distance trees based on pairwise alignments only. The best program for this purpose is statalign from Jeff Thorne (Thorne JL, Kishino H (1992) Freeing phylogenies from artifacts of alignment. Mol Biol Evol 9:1148-1162). It runs under standard UNIX. It is only worth your effort if you are getting gray hairs because of a data set you cannot reliably align. Very out of fashion these days.

MUSCLE is the current alignment program of choice. It is thought to give better alignments than clustal, it is faster, and it works with larger datasets. The program is available through a web server at the EBI, and as a command-line program to download here.

Most multiple sequence alignment programs produce alignments that are pleasing to the human eye by placing only a few large gaps into the sequences. However, for many applications it is better to align a particular amino acid to gaps in the other sequences, if one is not certain about the homology of the position. These programs that introduce more gaps are at present underutilized. An example is PRANK.

In case of divergent sequences, a popular program that combines phylogenetic reconstruction and multiple sequence alignment is SATe.

If you need to use MSA in your work, the current recommendation is to use muscle, and to test whether an alignment calculated with PRANK gives similar results. Also, if you use fewer than 100 sequences, try SATe2. (Note that the time needed for computation is very different!)

In order to avoid artifacts reflecting the guide tree used for the alignment, many prefer to filter the alignment, using only sites that are reliably aligned. One such approach is Gblocks (implemented in seaview, see below; web server is here); another is GUIDANCE from Tal Pupko's lab at TAU (I like this one, because it allows you to remove positions from poorly aligned individual sequences, not only complete columns).
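The idea of filtering columns can be illustrated with a deliberately simple rule. The sketch below is not the Gblocks or GUIDANCE algorithm; it only drops columns in which the fraction of gaps exceeds a cutoff, and the function name and the cutoff of 0.5 are assumptions made for the example.

```python
def filter_columns(alignment, max_gap_fraction=0.5):
    """Remove gap-rich columns from an alignment.

    alignment: dict mapping sequence names to aligned sequences of equal length.
    Returns (filtered_alignment, kept_column_indices).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all sequences must have the same aligned length")
    keep = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    filtered = {
        name: "".join(row[i] for i in keep)
        for name, row in zip(names, rows)
    }
    return filtered, keep


if __name__ == "__main__":
    example = {"s1": "AC-GT", "s2": "A--GT", "s3": "ACTG-"}
    print(filter_columns(example))
```

Returning the indices of the retained columns makes it possible to map positions in the filtered alignment back to the original one.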

One useful sequence editor is seaview. It runs on PC and most unix flavors (including Macs). The latest version (4.3) includes phylogenetic reconstruction using PhyML and parsimony, multiple sequence alignment using clustalw and muscle, and filtering of poorly aligned regions using Gblocks.


Problem: Step two of the progressive alignment (the clustering that produces the guide tree) can create a strong bias that is recovered as "signal" in later analyses of the multiple sequence alignment.




