Sequence Alignment

Sequence alignment is used to determine the degree of similarity between two or more nucleic acid or amino acid sequences (DNA, RNA or proteins). Sequence alignments were initially performed by hand: researchers looked for stretches of nucleotide or amino acid residues that appeared similar between two sequences, lined the sequences up next to one another and evaluated how "good" the alignment was.

A "good" or optimal alignment is one in which the greatest number of identical residues are paired between the sequences.

X-A-T-G-C-C-A-C-X
A-T-G-C-C-A-C-X-X

[Figure labels: OPTIMAL, MISALIGNED. X represents any nucleotide.]
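The idea of counting paired identical residues can be sketched in a few lines of Python. This is a minimal illustration, not a real alignment tool: the function name `count_identities` and its `offset` parameter are hypothetical, and the sketch assumes that X is an uninformative placeholder that is never counted as a match. It compares the two sequences shown above, first stacked column by column and then with the second sequence shifted one position to the right.

```python
def count_identities(a, b, offset=0):
    """Count identical residues when b is shifted right by `offset` columns.

    Assumption: 'X' (any nucleotide) is a placeholder and is never counted.
    """
    matches = 0
    for i, residue in enumerate(b):
        j = i + offset  # column of sequence a that this residue is paired with
        if 0 <= j < len(a) and residue != "X" and a[j] == residue:
            matches += 1
    return matches


seq1 = "XATGCCACX"
seq2 = "ATGCCACXX"

for offset in (0, 1):
    print(offset, count_identities(seq1, seq2, offset))
```

Under these assumptions, the shifted arrangement pairs all seven informative residues, while the unshifted arrangement pairs only one.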

Pairwise and multiple sequence alignment

The example above shows two sequences in a pairwise alignment. A pairwise alignment seeks the optimal alignment between only two sequences, such that identical or similar residues are paired. Researchers also align several sequences at once in a multiple sequence alignment (MSA), where the optimal alignment for the whole group is sought rather than the optimal alignment for any two sequences. A number of methods exist for performing automated pairwise and multiple sequence alignments. The overarching goal is the same: align the sequences so as to maximize identical and similar residue matches across them.

Alignment Methods

Automated methods for pairwise alignments can generally be categorized as global or local alignment methods. Global alignments seek an optimal alignment across the full length of the sequences, whereas local alignments search for the best-matching regions within the sequences rather than aligning their full lengths. Global and local alignments can be further classified by whether they allow gaps within a sequence: gapped and ungapped alignments.
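The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. The text above does not name specific algorithms; the code below follows the general shape of the well-known Needleman-Wunsch (global) and Smith-Waterman (local) approaches. The function name `align_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, and the function returns only the best score, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best gapped alignment score by dynamic programming.

    local=False: global alignment over the full length of both sequences.
    local=True:  local alignment, the best-scoring pair of subsequences.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq = "GGGATGCCACTTT"
short_seq = "ATGCCAC"
print("global:", align_score(long_seq, short_seq))
print("local: ", align_score(long_seq, short_seq, local=True))
```

With these made-up sequences, the local score rewards the shared ATGCCAC region, while the global score is pulled down by the six flanking residues that must be paired with gaps.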

MSA Tools

CLUSTAL-W is currently one of the most popular automated multiple sequence alignment tools. CLUSTAL-W calculates a distance matrix for the sequences that are to be aligned. The distance matrix is then used to generate a phylogenetic tree that is used to "guide" the series of global alignments needed to create the multiple alignment. This is referred to as progressive alignment.
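The first two steps of progressive alignment, a distance matrix followed by a guide order, can be sketched as below. This is only an illustration of the idea and not CLUSTAL-W's actual procedure: it assumes equal-length sequences, uses the fraction of differing positions as the distance, and swaps in simple average-linkage clustering to decide the merge order. The names `p_distance` and `guide_order` and the example sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_order(seqs):
    """Return the cluster merges, closest pair first.

    seqs maps a name to a sequence. Clusters are tuples of names, and the
    distance between two clusters is the average of their pairwise distances.
    """
    # Pairwise distance "matrix", keyed by the unordered pair of names.
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


example = {"s1": "ATGCCAC", "s2": "ATGCCAT", "s3": "TTGACAT"}
for left, right in guide_order(example):
    print(left, "+", right)
```

In a progressive aligner, each merge in this order would correspond to one alignment step: the closest sequences are aligned first, and more distant sequences are added to the growing alignment afterwards.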

Multiple sequence alignments may also be created by hand and may involve gapped or ungapped sequences. Typically, gapped alignments are used for full protein sequences, whereas ungapped alignments may be used to identify protein domains or motifs (see the BLOCKS database).

Other multiple sequence alignment methods include DIALIGN, T-Coffee, and POA (Lassmann and Sonnhammer, 2002).

Gaps

The example given above is an ungapped alignment: no insertions or deletions of residues are considered in the process of generating the alignment. A gap is introduced into an alignment when a residue has been deleted from or inserted into one of the sequences.

X-A-T-G-C-C-A-C-X    (identical)
X-A-T-C-C-A-C-X      (deleted G)

Gap inserted to achieve optimal (identical) alignment:

X-A-T-G-C-C-A-C-X
X-A-T-_-C-C-A-C-X

X-A-T-G-C-C-A-C-X      (identical)
X-A-T-G-C-T-C-A-C-X    (inserted T)

Gap inserted to achieve optimal (identical) alignment:

X-A-T-G-C-T-C-A-C-X
X-A-T-G-C-_-C-A-C-X

Additional computing operations must be performed to determine if the presence of gaps will enhance the alignment.
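One simple way to decide whether a gap improves an alignment is to score each candidate arrangement and compare the totals. The sketch below is an assumption-laden illustration, not a description of any particular program: the function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are hypothetical, and the flanking X positions from the figure are omitted for simplicity. It scores the deleted-G example twice, once with the gap pushed to the end of the shorter sequence and once with the gap placed where the G was deleted.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Score two already-aligned rows of equal length; '_' marks a gap.

    Scoring values are illustrative assumptions.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "_" or y == "_":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Gap at the end: the residues after the deleted G are out of register.
print("gap at end:    ", alignment_score("ATGCCAC", "ATCCAC_"))
# Gap at the deleted G: the remaining residues pair up again.
print("gap at deletion:", alignment_score("ATGCCAC", "AT_CCAC"))
```

Under this scoring, placing the gap at the deletion gives the higher total, which is the kind of comparison an automated method has to make for every candidate gap position.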

Similar sequences: Homologs

Homologs are genes derived from a common ancestor, and they are expected to have similar sequences. These sequences diverge from their ancestor as residues are substituted, inserted and deleted. Evolutionary relationships at the molecular level are predicted based on the degree of similarity or difference between sequences. Proteins that are believed to be homologous at the sequence level commonly have similar or identical functions.

January 24, 2006