Which algorithm or algorithms are considered the *standard* or *state-of-the-art* for multiple sequence alignment?

How big is the need for better algorithms? How many sequences need to be aligned in a typical test? I am trying to understand how important this problem is in bioinformatics.

My vote goes to **Mafft(insi)**, as it has ~86% accuracy and produces results in ~1.2 hours. The fastest, though, is **kalign**, which takes only ~3 minutes to finish, with an accuracy of 74.3%.

For testing:

For each of the 218 reference alignments in the benchmark, we applied eight alignment programs, resulting in a total of 1744 automatically constructed MSAs. The overall quality of these automatic alignments was measured using the Column Score (CS) described in Methods.

FIGURE 1: **Overall alignment performance for each of the MSA programs tested.**

(A) Overall Accuracy

(B) Total run time for constructing all alignments (a log10 scale is used for display purposes).

doi:10.1371/journal.pone.0018093.g003

## Compared Tools

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0018093.t001&representation=PNG_L

Source and Photo Credits:

A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives

PS: This is from an old paper (2011). If you want newer statistics, you can always run the tests yourself by following the process described in the source paper.

The PRANK and PAGAN algorithms have both come out of the Loytynoja lab in Finland, and are stirring up the pot a bit. They use inferred phylogenetic relationships as a parameter, and tend to yield a much more 'gappy' alignment, supposedly due to more accurate handling of indels. For easy alignments the method doesn't matter so much, but if the sequences are highly divergent it might be worthwhile checking out PAGAN and PRANK.

Clustal has reinvented itself as Clustal Omega using Hidden Markov Models, and is particularly suited to the alignment of very many sequences.

## Background and motivation

Phylogenetic analyses of molecular sequences are an integral part of many modern molecular and evolutionary biology studies. With the increasing pace of methodological developments it becomes a challenge for those authors that merely apply statistical methods to make sufficiently educated choices of what models and methods are most suitable for their data and purposes. As editors, we regularly come across submissions in which outdated methods are used with no apparent reason, undermining the reliability of reported findings. For example, most of the time no justification is provided for the use of alignment methods, typically with default settings followed by subjective manual intervention. Other common issues include the use of overly simplistic substitution models or reliance on basic pairwise comparisons when multiple homologous sequences are available. In particular, with no justification, some authors rely solely on distance-based tree reconstruction and, worryingly, statistical support for inferred clades is not properly evaluated. Further downstream, selection or dating analyses are common, but again, they often suffer from the use of outdated methods that are based on pairwise comparisons or make overly simplistic assumptions.

While researchers in the field are somewhat critical of outdated methods, in fact, many of them made and still make a profound contribution to the development of methodologies for computational molecular evolution, which explains their frequent usage. However, the field has since moved on and now boasts an overwhelming variety of more advanced models and methods, which were shown to be either better (more accurate) than previous methods in general, or to deal better with data-specific features. Appropriate application of this existing variety, nevertheless, requires a better understanding of the fundamental principles of the various methods and models, their underlying assumptions, and how they are implemented in various programs and web-servers. Looking forward, methods and strategies that are currently the state-of-the-art are likely to become outdated as well, so it is equally important to think broadly about the analysis performed. The field of molecular evolution is extremely interdisciplinary, bridging mathematics and statistics, computer science, ecology, evolutionary biology and population genetics, molecular biology, biochemistry, and physical chemistry. Few researchers have expertise in all of these areas, yet an analysis in molecular evolution is ultimately interdisciplinary, making assumptions across several areas, which may not be fully comprehended by a researcher undertaking the analysis. We appreciate that often, model and method choice is not a trivial task even for method developers. As a consequence, there has been a lot of recent effort in evaluating methods and models

## Progressive Multiple Sequence Alignment with the Poisson Indel Process

Sequence alignment lies at the heart of many evolutionary and comparative genomics studies. However, the optimal alignment of multiple sequences is NP-hard, so that exact algorithms become impractical for more than a few sequences. Thus, state of the art alignment methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogenetic tree. Changes between homologous characters are typically modelled by a continuous-time Markov substitution model. In contrast, the dynamics of insertions and deletions (indels) are not modelled explicitly, because the computation of the marginal likelihood under such models has exponential time complexity in the number of taxa. Recently, Bouchard-Côté and Jordan [PNAS (2012) 110(4):1160–1166] have introduced a modification to a classical indel model, describing indel evolution on a phylogenetic tree as a Poisson process. The model, termed PIP, allows the joint marginal probability of a multiple sequence alignment and a tree to be computed in linear time. Here, we present a new dynamic programming algorithm to align two multiple sequence alignments by maximum likelihood in polynomial time under PIP, and apply it in a progressive algorithm. To our knowledge, this is the first progressive alignment method using a rigorous mathematical formulation of an evolutionary indel process and with polynomial time complexity.

## State of the art: refinement of multiple sequence alignments

**Background:** Accurate multiple sequence alignments of proteins are very important in computational biology today. Despite the numerous efforts made in this field, all alignment strategies have certain shortcomings, resulting in alignments that are not always correct. Refinement of an existing alignment can prove to be an intelligent choice, considering the increasing importance of high quality alignments in large scale high-throughput analysis.

**Results:** We provide an extensive comparison of the performance of the alignment refinement algorithms. The accuracy and efficiency of the refinement programs are compared using the 3D structure-based alignments in the BAliBASE benchmark database as well as manually curated high quality alignments from Conserved Domain Database (CDD).

**Conclusion:** Comparison of performance for refined alignments revealed that, despite the absence of dramatic improvements, our refinement method, REFINER, which uses conserved regions as constraints, performs better in improving the alignments generated by different alignment algorithms. In most cases REFINER produces a higher-scoring, modestly improved alignment that does not deteriorate the well-conserved regions of the original alignment.

## MATERIALS AND METHODS

### Algorithm overview

Given *m* protein sequences *S* = <*S*_{1}, … , *S*_{m}> and a minimum alignment length *L*_{min}, ProDA returns a set of aligned regions with length at least *L*_{min}. The algorithm consists of seven steps (Figure 1):

Overview of the ProDA algorithm. (**A**) Compute all-versus-all PLAs. (**B**) Identify possible repeats by computing non-overlapping PLAs. (**C**) Generate a block of possibly alignable regions from the PLAs. (**D**) Compute a guide tree for the block. (**E**) Progressively align sequences in the block and eliminate spurious alignments using heuristic filters. (**F**) Trim the block alignment. (**G**) Discard used PLAs, and repeat the alignment process until no more alignable blocks of length at least *L*_{min} can be found. The end result is a set of aligned regions with length at least *L*_{min}.

*Step 1: Generation of pairwise local alignments (PLAs)*

Compute the best local alignment using either a variant of posterior decoding (Algorithm Details) or the Viterbi algorithm. Stop if the best local alignment is shorter than *L*_{min}. Otherwise, store the alignment found.

Mark cells in the dynamic programming matrix corresponding to the best alignment found in substep 1, disallowing them from contributing to future alignments.

Pair HMM for local alignment of two sequences *x* and *y*. State *M* emits two letters, one from each sequence, and corresponds to the two letters being aligned together. State *I*_{x} emits a letter in sequence *x* that is aligned to a gap, and similarly state *I*_{y} emits a letter in sequence *y* that is aligned to a gap. States *LF*_{x} and *LF*_{y} emit two unaligned flanking subsequences on the left of the local alignment. Similarly, states *RF*_{x} and *RF*_{y} emit two unaligned flanking subsequences on the right of the local alignment.

*Step 2: Inference of repeats from pairwise alignments*

Some local alignments found in Step 1 may simultaneously overlap in both sequences, indicating the presence of repeats. Break overlapping alignments into shorter non-overlapping PLAs that putatively correspond to individual repeats. This step can be seen as a post-processing step for Step 1 and does not guarantee finding all real repeats.

*Step 3: Generation of a block of alignable sequence fragments*

Two sequence fragments are alignable if they align to each other in one PLA or both of them align to a third fragment in two different PLAs. A block of alignable fragments is a set of sequence fragments in which at least one fragment is alignable to all the other fragments. Compute the block *B* with the maximum number of alignable fragments. Define the boundaries of each fragment within the block by averaging the boundary positions of PLAs corresponding to the block fragments.

*Step 4: Construction of guide tree and adjustment of the block*

For every pair of sequence fragments *b, c* ∊ *B* compute the posterior probabilities *P*(*b*_{i} ∼ *c*_{j} ∣ *b*, *c*) that letters *b*_{i} ∊ *b* and *c*_{j} ∊ *c* are paired in an alignment generated by the pair HMM of Figure 3.

Retain only fragments belonging to tree *T* and remove all the other fragments from *B*.

Example of alignments produced by ProDA. Here we show the alignments of four proteins with SWISSPROT ids GRB2_HUMAN, MATK_HUMAN, CRKL_HUMAN and ABL1_HUMAN, respectively. This set was used previously to demonstrate POA ( 20) and ABA ( 12). In (i) we show each sequence as a line to scale such that one residue always has the same length. Above the line is the PFAM ( 21, 42) annotation, below the line are the regions aligned by ProDA. These regions are assigned arbitrary letters that refer to the detailed alignments in (**A**) through (**D**). ABL1_HUMAN is truncated as it is significantly longer than the other sequences (1130 residues, compared with 507 for MATK_HUMAN), and there are no further PFAM annotations or aligned ProDA regions. In this set we observe different domain structures in each protein including a tandem repeat (SH3 in CRKL_HUMAN) and a rearrangement (SH2-SH3 in CRKL_HUMAN and SH3-SH2 in ABL1_HUMAN). In this example, ProDA successfully reconstructs the known domain organization, although the tyrosine kinase domain is split into two segments (C and D).

*Step 5: Progressive alignment of the block*

For the block and corresponding tree built in Step 4 progressively align fragments according to the order specified in the tree. Alignments are scored using a sum-of-pairs scoring function in which aligned residues are assigned the match quality scores *P*(*b*_{i} ∼ *c*_{j} ∣ *b, c*) and gap penalties are set to zero.

*Step 6: Extraction of final alignments from block alignment*

Extract the longest fraction of the block alignment that begins and ends with columns containing no gaps.

*Step 7: Removal of used PLAs*

Remove PLAs corresponding to the fragments of the block to prevent them from contributing to subsequent blocks. If no PLAs remain, stop the algorithm. Otherwise go to Step 3.

The end result of the algorithm is a set of aligned regions with length at least *L*_{min}. An example alignment generated via the ProDA algorithm is shown in Figure 3.

### Algorithm details

In this section, we provide a detailed description of each step in the ProDA algorithm.

#### 1. Pairwise local alignment

The ProDA algorithm begins by performing all-versus-all pairwise local alignments (PLAs). For each pair of sequences *x* and *y* in *S*, ProDA computes a set of high-scoring local alignments using an iterative procedure. In each step of the iterative procedure, a single PLA is found using either a variant of posterior decoding (described below) or the Viterbi algorithm.

In practice, computing all-versus-all local alignments is the most computationally demanding part of the algorithm. On a Pentium IV 3.6 GHz system with 2 GB memory, ProDA processes all 86 reference sets of proteins within BAliBASE reference 6 in ∼2.5 h when using posterior decoding, and in <1 h when using Viterbi decoding for pairwise alignment.

#### Pairwise alignment using posterior decoding

Let *Z* denote a set of pairs of positions from *x* and *y* that we do not allow to be aligned. Initially, let *Z* be the empty set.

If *A* ∗ is shorter than *L*_{min} then stop. Otherwise, store the local alignment found and proceed to (e).

Using our new augmented set *Z* of disallowed residue pairs, go back to Step (b) and recompute the posterior probabilities, noting that we disallow all HMM paths which attempt to align pairs of positions in *Z*.

Marking of used positions in the alignment matrix. If a pair of letters *x*_{i} and *y*_{j} are paired in the current local alignment, they are marked (with a closed circle) to denote their exclusion from future local alignments. In addition, all letter pairs within *L*_{min} residues of (*x*_{i}, *y*_{j}) in either sequence *x* or sequence *y* are also marked (with dotted lines) in order to prevent ProDA from identifying repeats of length shorter than *L*_{min}.

Finding all local pairwise alignments between two sequences takes time *O*(*nL*^{2}), where *L* is the length of each sequence and *n* is the number of local alignments. For *m* sequences, the entire all-versus-all PLA computation takes *O*(*nm*^{2}*L*^{2}) time. The above algorithm can be used to align a sequence against itself to find repeats by first marking the diagonal as described above. The current implementation of ProDA does not use this option, but instead finds repeats during Step 2.

#### Pairwise alignment using Viterbi decoding

Alternatively, a set of local alignments between sequences *x* and *y* can be computed using Viterbi decoding, by performing steps (a) through (f) above except that we now skip Step (b) and replace Step (c) with the following:

(c′) Compute the local alignment *A*∗ that maximizes the alignment probability *P*(*A* ∣ *x, y, Z*) using the Viterbi algorithm ( 16).

We note that there is no need to recompute the entire Viterbi dynamic programming tables in each iteration of procedure above since only a portion of the tables will be affected when marking positions to be added to *Z*. The Waterman–Eggert algorithm (34) uses this observation to compute the set of top-scoring Viterbi parses efficiently. If each local alignment has length *l* < *L* and there are *n* alignments for each sequence pair, finding all local alignments between two sequences takes time *O*(*L*^{2}) + *O*(*nl*^{2}). Thus, the local alignment step for *m* sequences takes time *O*(*m*^{2}*L*^{2}) + *O*(*nm*^{2}*l*^{2}).

#### 2. Inference of repeats from pairwise alignments

When two sequences share several nearby homologous repeats, a local alignment algorithm will generally find long local alignments spanning several repeat copies, along with shorter alignments containing fewer repeat copies. When two local alignments overlap in both sequences, ProDA breaks them into shorter alignments before proceeding to the subsequent steps ( Figure 5).

Breaking long local alignments that span several copies of a repeat. The two local alignments AD and MN overlap in both sequence *x* and *y*. Thus, we split AD into AB, BC and CD in order to obtain a set of non-overlapping PLAs.

If there are originally *n* local alignments of two sequences, the number of resulting non-overlapping PLAs is at most *O*(*n*^{2}) [e.g. *n* copies of a perfect tandem repeat yield *n*(*n* − 1)/2 PLAs]. For a local alignment *A*, computing the set of sequence *x* boundaries for all alignments that overlap *A* in sequence *y* takes *O*(*n*) time and splitting *A* takes *O*(*l*) time, so the total running time for repeat inference is *O*(*nm*^{2}*l*).

#### 3. Blocks of alignable sequence fragments

From an arbitrary set of PLAs, ProDA forms blocks, or sets of aligned fragments in which at least one fragment is alignable to all others either directly or via a third fragment, using a greedy iterative procedure. In each pass, ProDA selects the largest possible set of sequences to form a block. Then ProDA attempts to determine boundaries for each of the sequences in the block. This latter step can be difficult since PLAs have different lengths and boundaries ( Figure 6A), and moreover, the pairwise alignments within a set of PLAs may be inconsistent with each other ( Figure 6B).

Challenges in determining fragment boundaries. (**A**) PLAs have different overlaps. (**B**) Different alignments are inconsistent making it hard to decide whether the third fragment should begin from the first or the second ‘L’.

ProDA uses a simple heuristic for fragment boundary determination. Recalling the definition of a block, let *b* denote a sequence fragment alignable to all other sequences within the current block whose boundaries we wish to determine. After filtering for outliers (see below), ProDA computes the average index of the first and last residue of sequence *b* in each of the PLAs. These averaged beginning and end coordinates, together with their projections to the other sequences, form the boundaries of the new block.

Long PLAs that contain many repeats or short PLAs that are similar to parts of real homologous fragments may cause dramatic skews in the average beginning and end index for the block with respect to sequence *b*. To prevent this from occurring, we filter out PLAs whose beginning or ending residue position in sequence *b* differs from the mean by >1 SD, and recompute average beginning and end indices for each block after this filtering has been done. For example, in Figure 6A, point 1 is an outlier so the left boundary of the block is computed by averaging points 2, 3 and 4.
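The outlier filter described above can be sketched as follows. This is a minimal illustration, not ProDA's own code: the function name `robust_boundary` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def robust_boundary(positions):
    """Average boundary positions after dropping values > 1 SD from the mean.

    `positions` holds the index of the first (or last) residue of fragment b
    in each PLA. Population SD is an assumption made for this sketch.
    """
    mu = mean(positions)
    sd = pstdev(positions)
    # Keep only positions within one standard deviation of the mean.
    kept = [p for p in positions if abs(p - mu) <= sd]
    # Recompute the average on the filtered positions.
    return mean(kept)

# Hypothetical start positions: the first PLA starts far to the left of the rest.
print(robust_boundary([10, 50, 52, 54]))  # 52
```

With these made-up values, the first position lies more than one SD from the mean and is discarded, so the boundary is the average of the remaining three.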

#### 4. Guide tree construction

ProDA uses the same procedure as in ProbCons to build a tree with high expected alignment reliability. In particular, given a set of sequence fragments from a block, define the similarity function *E*(*b*,*c*) to be the expected accuracy of aligning two fragments *b* and *c*. Initially, each fragment is placed in its own cluster. Then, the two most similar clusters are merged to form a new cluster *bc*. The similarity between *bc* and any other cluster *d* is defined as *E*(*b*,*c*)[*E*(*b*,*d*) + *E*(*c*,*d*)]/2.

ProDA stops merging when the similarity between all pairs of clusters drops below some threshold (0.5 in the current implementation) or when only one cluster remains. The tree corresponding to the largest cluster formed so far is returned. All fragments not belonging to the tree are removed from the block. This early termination removes unrelated fragments that were mistakenly added to the blocks because of errors made in previous steps.
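The merging rule and the early termination can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the `accuracy` dictionary keyed by fragment pairs, and the example values are all hypothetical, and the sketch returns the members of the largest cluster rather than an explicit tree.

```python
def largest_reliable_cluster(fragments, accuracy, threshold=0.5):
    """Greedy clustering with similarity E(b,c)[E(b,d) + E(c,d)]/2.

    `accuracy[frozenset((b, c))]` is the assumed expected accuracy E(b, c)
    for a pair of single fragments. Returns the largest cluster formed.
    """
    members = {i: {f} for i, f in enumerate(fragments)}
    sim = {}
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            sim[(i, j)] = accuracy[frozenset((fragments[i], fragments[j]))]
    largest = set(members[0]) if members else set()
    next_id = len(fragments)
    while len(members) > 1:
        (b, c), best = max(sim.items(), key=lambda item: item[1])
        if best < threshold:
            break  # every remaining pair of clusters is too dissimilar
        merged = members.pop(b) | members.pop(c)
        new_sim = {}
        for d in members:
            s_bd = sim[tuple(sorted((b, d)))]
            s_cd = sim[tuple(sorted((c, d)))]
            new_sim[(d, next_id)] = best * (s_bd + s_cd) / 2
        # Drop similarities that involve the two merged clusters.
        sim = {k: v for k, v in sim.items() if b not in k and c not in k}
        sim.update(new_sim)
        members[next_id] = merged
        if len(merged) > len(largest):
            largest = merged
        next_id += 1
    return largest

# Hypothetical block: fragment "d" is unrelated to the other three.
E = {frozenset(p): v for p, v in [
    (("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.7),
    (("a", "d"), 0.1), (("b", "d"), 0.1), (("c", "d"), 0.1),
]}
print(sorted(largest_reliable_cluster(["a", "b", "c", "d"], E)))  # ['a', 'b', 'c']
```

In this made-up example, "a" and "b" merge first, "c" joins with similarity 0.675, and "d" is left out because its similarity to the merged cluster falls below 0.5.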

For each pair of sequences from the adjusted block, ProDA applies the probabilistic consistency transformation as used by ProbCons. As a default, ProDA uses two iterated applications of this transformation, which work well in practice (10). As with ProbCons, the tree construction and probabilistic consistency transformation steps require *O*(*m*_{b}^{3}*cL*_{b}) time, where *m*_{b} and *L*_{b} are the number of fragments and the length of each fragment of block *b*, and *c* is the average number of non-zero elements in posterior probability matrices.

#### 5. Progressive alignment

This step is similar to the progressive alignment step of ProbCons. For each progressive alignment step, we run a profile–profile Needleman–Wunsch alignment procedure in which the score for matching a column containing *n*_{1} non-gap letters to one with *n*_{2} non-gap letters is computed by summing *n*_{1}*n*_{2} values from the corresponding pairwise posterior matrices. No gap penalties are used.
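The column-matching score can be illustrated with a short sketch. The data layout is an assumption made for this example only: each column is a list of `(sequence_id, residue_index)` pairs with gaps omitted, and `posterior` maps a pair of sequence ids to a sparse dictionary of pairwise posterior values.

```python
def column_match_score(col_a, col_b, posterior):
    """Sum the n1 * n2 pairwise posterior values for two profile columns.

    col_a, col_b: lists of (sequence_id, residue_index), gaps left out.
    posterior: {(seq_a, seq_b): {(i, j): probability}}; missing entries
    are treated as 0, which is an assumption of this sketch.
    """
    total = 0.0
    for seq_a, i in col_a:
        for seq_b, j in col_b:
            total += posterior.get((seq_a, seq_b), {}).get((i, j), 0.0)
    return total

# Hypothetical posteriors for a 2-letter column matched to a 1-letter column.
posterior = {
    ("s1", "s3"): {(3, 2): 0.75},
    ("s2", "s3"): {(5, 2): 0.5},
}
print(column_match_score([("s1", 3), ("s2", 5)], [("s3", 2)], posterior))  # 1.25
```

Because gap penalties are zero, columns made only of gaps contribute nothing, and the score depends only on the non-gap letters.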

#### 6. Final alignment extraction

The start and end columns of the multiple alignment from Step 5 often contain gaps as in the example shown in Figure 7. These gaps correspond to errors in the aligned fragment boundaries and should not be present in the final alignment. Thus, ProDA extracts and returns the longest aligned region whose initial and final columns contain no gaps (the region inside the rectangle in Figure 7).
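A minimal sketch of this extraction, assuming the block alignment is given as a list of equal-length strings with `-` for gaps (the function name `trim_block` is hypothetical):

```python
def trim_block(alignment):
    """Return the longest sub-alignment whose first and last columns are gap-free."""
    if not alignment:
        return []
    n_cols = len(alignment[0])
    # Indices of columns in which no sequence has a gap.
    gap_free = [c for c in range(n_cols)
                if all(row[c] != "-" for row in alignment)]
    if not gap_free:
        return []
    start, end = gap_free[0], gap_free[-1]
    return [row[start:end + 1] for row in alignment]

block = ["--ACG-T-",
         "TTAC-GTA",
         "-GACGGT-"]
print(trim_block(block))  # ['ACG-T', 'AC-GT', 'ACGGT']
```

The longest such region always runs from the first gap-free column to the last one, so interior gaps are kept and only the ragged ends are removed.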

An alignment with gaps in the start and end columns. Block boundaries are determined by extracting the longest subalignment that begins and ends with columns not containing gaps.

#### 7. Removing used PLAs

If two fragments belong to an alignment formed in Step 6, their aligned portions should not occur together in subsequent alignments returned by ProDA. To guarantee this, ProDA identifies all PLAs that contain pairs of fragments from the final alignment above, and removes the used portions of these PLAs from the set of candidate PLAs.

This is illustrated in Figure 8, in which the middle section of a PLA belongs to a final alignment generated in Step 5. ProDA removes this middle part, and retains the remaining left and right portions only if they are at least *L*_{min} in length.
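The adjustment can be sketched in one dimension. This is a simplification: a real PLA spans two sequences, whereas the hypothetical function below only tracks half-open residue intervals `[start, end)` along one of them.

```python
def remaining_portions(pla_start, pla_end, used_start, used_end, l_min):
    """Split a PLA around its used middle part and keep pieces of length >= l_min.

    All coordinates are half-open intervals on a single sequence, an
    assumption made to keep the sketch short.
    """
    left = (pla_start, min(used_start, pla_end))
    right = (max(used_end, pla_start), pla_end)
    return [(s, e) for s, e in (left, right) if e - s >= l_min]

# Hypothetical PLA covering residues 0..99 whose middle (40..69) was used.
print(remaining_portions(0, 100, 40, 70, l_min=20))  # [(0, 40), (70, 100)]
print(remaining_portions(0, 100, 40, 70, l_min=35))  # [(0, 40)]
```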

Adjustment of a PLA containing two fragments from a final block alignment.

If there are no more PLAs, the algorithm stops; otherwise, it goes back to Step 3.

### Algorithm evaluation

For global alignments, the sum-of-pairs (SP) score is the most common measure of aligner accuracy. For alignments of proteins with rearrangements, however, using the SP score in isolation gives a poor 1D view of the aligner's overall ability to recover the domain organization of sets of related sequences. To evaluate the empirical performance of ProDA for this task, we define measures that assess the sensitivity and specificity of an aligner at the residue, domain and cluster levels.

#### 1. Residue-level accuracy

#### 2. Domain-level accuracy

We now characterize the ability of an aligner to recapitulate known domain boundaries (endpoint agreement) and domain positions (midpoint agreement).

#### Endpoint agreement

#### Midpoint agreement

#### 3. Cluster-level accuracy

Finally, we define a measure that describes how well domains are clustered into globally alignable sets. In particular, our measure assumes that the reference annotation is complete and correct. An aligner should not fail to cluster homologous sequences: segments belonging to the same reference cluster should appear in the same predicted cluster. Conversely, an aligner should not overpredict homology: segments belonging to the same predicted cluster should belong to the same reference cluster.

## Contents

This algorithm can be used for any two strings. This guide will use two small DNA sequences as examples as shown in the diagram:

### Constructing the grid

First construct a grid such as one shown in Figure 1 above. Start the first string in the top of the third column and start the other string at the start of the third row. Fill out the rest of the column and row headers as in Figure 1. There should be no numbers in the grid yet.

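|   |   | G | C | A | T | G | C | U |
|---|---|---|---|---|---|---|---|---|
|   |   |   |   |   |   |   |   |   |
| G |   |   |   |   |   |   |   |   |
| A |   |   |   |   |   |   |   |   |
| T |   |   |   |   |   |   |   |   |
| T |   |   |   |   |   |   |   |   |
| A |   |   |   |   |   |   |   |   |
| C |   |   |   |   |   |   |   |   |
| A |   |   |   |   |   |   |   |   |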

### Choosing a scoring system

Next, decide how to score each individual pair of letters. Using the example above, one possible alignment candidate might be:

12345678

GCATG-CU

G-ATTACA

The letters may match, mismatch, or be matched to a gap (a deletion or insertion (indel)):

- Match: The two letters at the current index are the same.
- Mismatch: The two letters at the current index are different.
- Indel (INsertion or DELetion): The best alignment involves one letter aligning to a gap in the other string.

Each of these scenarios is assigned a score, and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores; some have been outlined in the Scoring systems section below. For now, the system used by Needleman and Wunsch [1] will be used:

- Match: +1
- Mismatch or indel: −1

For the Example above, the score of the alignment would be 0:

GCATG-CU

G-ATTACA
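As a quick check of that total, the candidate can be scored column by column. This is a small sketch assuming the scheme used throughout this guide (match = 1, mismatch = −1, indel = −1); the function name is only illustrative.

```python
# Scores assumed from the guide's basic scheme.
MATCH, MISMATCH, INDEL = 1, -1, -1

def alignment_score(top, side):
    """Sum the per-column scores of two equal-length gapped strings."""
    if len(top) != len(side):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, side):
        if a == "-" or b == "-":
            total += INDEL      # a letter aligned to a gap
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

# 4 matches, 2 mismatches and 2 indels: 4 - 2 - 2 = 0
print(alignment_score("GCATG-CU", "G-ATTACA"))  # 0
```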

### Filling in the table

Start with a zero in the second row, second column. Move through the cells row by row, calculating the score for each cell. The score is calculated by comparing the scores of the cells neighboring to the left, top or top-left (diagonal) of the cell and adding the appropriate score for match, mismatch or indel. Calculate the candidate scores for each of the three possibilities:

- The path from the top or left cell represents an indel pairing, so take the scores of the left and the top cell, and add the score for indel to each of them.
- The diagonal path represents a match/mismatch, so take the score of the top-left diagonal cell and add the score for match if the corresponding bases (letters) in the row and column are matching or the score for mismatch if they do not.

The resulting score for the cell is the highest of the three candidate scores.
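The rule for a single cell can be written as a small function. This is a sketch assuming the guide's scheme (match = 1, mismatch = −1, indel = −1); the name `cell_score` and the direction labels are hypothetical.

```python
MATCH, MISMATCH, INDEL = 1, -1, -1  # the guide's basic scheme

def cell_score(diag, top, left, row_letter, col_letter):
    """Return (best score, list of neighbours that reach it) for one cell."""
    candidates = {
        "diag": diag + (MATCH if row_letter == col_letter else MISMATCH),
        "top": top + INDEL,
        "left": left + INDEL,
    }
    best = max(candidates.values())
    origins = [name for name, score in candidates.items() if score == best]
    return best, origins

# Neighbours 0 (diagonal), -1 (top), -1 (left) and the letter pair G/G.
print(cell_score(0, -1, -1, "G", "G"))  # (1, ['diag'])
```

Returning every neighbour that reaches the best score keeps the information needed later, when ties lead to several equally good alignments.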

Since there are no 'top' or 'top-left' cells for the second row, only the existing cell to the left can be used to calculate the score of each cell. Hence −1 is added for each shift to the right, as this represents an indel from the previous score. This results in the first row being 0, −1, −2, −3, −4, −5, −6, −7. The same applies to the first column, as only the existing score above each cell can be used. Thus the resulting table is:

|   |   | G | C | A | T | G | C | U |
|---|---|---|---|---|---|---|---|---|
|   | 0 | −1 | −2 | −3 | −4 | −5 | −6 | −7 |
| G | −1 |   |   |   |   |   |   |   |
| A | −2 |   |   |   |   |   |   |   |
| T | −3 |   |   |   |   |   |   |   |
| T | −4 |   |   |   |   |   |   |   |
| A | −5 |   |   |   |   |   |   |   |
| C | −6 |   |   |   |   |   |   |   |
| A | −7 |   |   |   |   |   |   |   |

The first case with existing scores in all 3 directions is the intersection of our first letters (in this case G and G). The surrounding cells are below:

This cell has three possible candidate sums:

- The diagonal top-left neighbor has score 0. The pairing of G and G is a match, so add the score for match: 0+1 = 1
- The top neighbor has score −1 and moving from there represents an indel, so add the score for indel: (−1) + (−1) = (−2)
- The left neighbor also has score −1, represents an indel and also produces (−2).

The highest candidate is 1 and is entered into the cell:

The cell which gave the highest candidate score must also be recorded. In the completed diagram in figure 1 above, this is represented as an arrow from the cell in row and column 3 to the cell in row and column 2.

In the next example, the diagonal step for both X and Y represents a mismatch:

|   |   | G | C |
|---|---|---|---|
|   | 0 | −1 | −2 |
| G | −1 | 1 | X |
| A | −2 | Y |   |

For both X and Y, the highest score is zero:

|   |   | G | C |
|---|---|---|---|
|   | 0 | −1 | −2 |
| G | −1 | 1 | 0 |
| A | −2 | 0 |   |

The highest candidate score may be reached by two or all neighboring cells:

In this case, all directions reaching the highest candidate score must be noted as possible origin cells in the finished diagram in figure 1, e.g. in the cell in row and column 7.

Filling in the table in this manner gives the scores of all possible alignment candidates; the score in the cell on the bottom right represents the alignment score for the best alignment.
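The filling procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming the match = 1, mismatch = −1, indel = −1 scores of the walkthrough and the two example sequences from the table; the function name `fill_score_table` is hypothetical.

```python
def fill_score_table(top, side, match=1, mismatch=-1, indel=-1):
    """Fill the score table; `top` labels the columns, `side` labels the rows."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and first column: only indels are possible.
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + indel
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + indel
    # Every other cell takes the best of its three candidate scores.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if side[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + step,  # diagonal: match or mismatch
                table[i - 1][j] + indel,     # from the top: indel
                table[i][j - 1] + indel,     # from the left: indel
            )
    return table


table = fill_score_table("GCATGCU", "GATTACA")
print(table[1][1], table[1][2], table[2][1])  # G/G cell, X and Y: 1 0 0
print(table[-1][-1])                          # best alignment score: 0
```

The sketch records scores only; the origin of each maximum still has to be tracked (or recomputed) for the traceback.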

### Tracing arrows back to origin

Mark a path from the cell on the bottom right back to the cell on the top left by following the direction of the arrows. From this path, the sequence is constructed by these rules:

- A diagonal arrow represents a match or mismatch, so the letter of the column and the letter of the row of the origin cell will align.
- A horizontal or vertical arrow represents an indel. A horizontal arrow pairs the letter of the column (the "top" sequence) with a gap ("-") in the "side" sequence; a vertical arrow pairs the letter of the row (the "side" sequence) with a gap in the "top" sequence.
- If there are multiple arrows to choose from, they represent a branching of the alignments. If two or more branches all belong to paths from the bottom right to the top left cell, they are equally viable alignments. In this case, note the paths as separate alignment candidates.
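These rules can be sketched as follows. This is a minimal illustration, assuming the same 1 / −1 / −1 scores as the walkthrough; instead of storing arrows it re-derives them from the scores, and where several arrows exist it follows only one branch (diagonal first, then horizontal, then vertical), so it returns a single alignment candidate. The function name `align` is hypothetical.

```python
def align(top, side, match=1, mismatch=-1, indel=-1):
    """Return one optimal global alignment (top, side) and its score."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + indel
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + indel
        for j in range(1, cols):
            step = match if side[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + step,
                              table[i - 1][j] + indel,
                              table[i][j - 1] + indel)

    # Walk from the bottom-right cell back to the top-left cell.
    i, j = len(side), len(top)
    top_out, side_out = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if side[i - 1] == top[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + step:
                # Diagonal arrow: the two letters align.
                top_out.append(top[j - 1])
                side_out.append(side[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and table[i][j] == table[i][j - 1] + indel:
            # Horizontal arrow: column letter against a gap.
            top_out.append(top[j - 1])
            side_out.append("-")
            j -= 1
        else:
            # Vertical arrow: row letter against a gap.
            top_out.append("-")
            side_out.append(side[i - 1])
            i -= 1
    return "".join(reversed(top_out)), "".join(reversed(side_out)), table[-1][-1]


aligned_top, aligned_side, score = align("GCATGCU", "GATTACA")
print(aligned_top)
print(aligned_side)
print(score)
```

Collecting every branch instead of the first one would enumerate all equally viable alignments.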

Following these rules, the steps for one possible alignment candidate in figure 1 are:

### Basic scoring schemes

The simplest scoring schemes simply give a value for each match, mismatch and indel. The step-by-step guide above uses match = 1, mismatch = −1, indel = −1. Under this system a high score is desirable: the lower the alignment score, the larger the edit distance. Another scoring system might be:

For this system the alignment score will represent the edit distance between the two strings. Different scoring systems can be devised for different situations, for example if gaps are considered very bad for your alignment you may use a scoring system that penalises gaps heavily, such as:

### Similarity matrix

More complicated scoring systems attribute values not only for the type of alteration, but also for the letters that are involved. For example, a match between A and A may be given 1, but a match between T and T may be given 4. Here (assuming the first scoring system) more importance is given to the Ts matching than the As, i.e. the Ts matching is assumed to be more significant to the alignment. This weighting based on letters also applies to mismatches.

In order to represent all the possible combinations of letters and their resulting scores a similarity matrix is used. The similarity matrix for the most basic system is represented as:

| | A | G | C | T |
|---|---|---|---|---|
| A | 1 | −1 | −1 | −1 |
| G | −1 | 1 | −1 | −1 |
| C | −1 | −1 | 1 | −1 |
| T | −1 | −1 | −1 | 1 |

Each score represents a switch from one of the letters the cell matches to the other. Hence the matrix covers all possible matches and mismatches (for an alphabet of ACGT). Note that all the matches lie along the diagonal. Not all of the table needs to be filled in: one triangle suffices, because the scores are reciprocal (score for A → C = score for C → A). If the T-T = 4 rule from above is implemented, the following similarity matrix is produced:

| | A | G | C | T |
|---|---|---|---|---|
| A | 1 | −1 | −1 | −1 |
| G | −1 | 1 | −1 | −1 |
| C | −1 | −1 | 1 | −1 |
| T | −1 | −1 | −1 | 4 |
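Because the scores are reciprocal, only one triangle of the matrix has to be stored. A minimal Python sketch, assuming the T-T = 4 variant of the basic matrix; the names `triangle` and `similarity` are hypothetical.

```python
letters = "AGCT"

# Store only the upper triangle (including the diagonal): 10 entries, not 16.
triangle = {}
for i, a in enumerate(letters):
    for b in letters[i:]:
        triangle[(a, b)] = 1 if a == b else -1
triangle[("T", "T")] = 4  # the T-T = 4 rule


def similarity(a, b):
    """Look up S(a, b), using S(a, b) = S(b, a) for the missing triangle."""
    if (a, b) in triangle:
        return triangle[(a, b)]
    return triangle[(b, a)]


print(similarity("A", "C"), similarity("C", "A"))  # -1 -1
print(similarity("T", "T"))                        # 4
```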

Different scoring matrices have been statistically constructed which give weight to different actions appropriate to a particular scenario. Having weighted scoring matrices is particularly important in protein sequence alignment due to the varying frequency of the different amino acids. There are two broad families of scoring matrices, each with further alterations for specific scenarios:

### Gap penalty

When aligning sequences there are often gaps (i.e. indels), sometimes large ones. Biologically, a large gap is more likely to occur as one large deletion as opposed to multiple single deletions. Hence two small indels should have a worse score than one large one. The simple and common way to do this is via a large gap-start score for a new indel and a smaller gap-extension score for every letter which extends the indel. For example, new-indel may cost -5 and extend-indel may cost -1. In this way an alignment such as:

which has multiple equal alignments, some with multiple small alignments will now align as:

or any alignment with a 4 long gap in preference over multiple small gaps.
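The effect of the two-part gap score can be checked with a small calculation. This is a sketch under one assumed convention: the new-indel score (−5) is charged for the first gapped letter and the extend-indel score (−1) for each further letter of the same gap. Other conventions charge the extension score for every letter, including the first. The function name `gap_score` is hypothetical.

```python
def gap_score(length, gap_open=-5, gap_extend=-1):
    """Score of a single gap of `length` letters (0 means no gap)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


one_long_gap = gap_score(4)        # -5 + 3 * (-1) = -8
two_short_gaps = 2 * gap_score(2)  # 2 * (-5 + (-1)) = -12
print(one_long_gap, two_short_gaps)  # -8 -12
```

With these values a single gap of length 4 scores better than two gaps of length 2, which is the preference described above.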

Scores for aligned characters are specified by a similarity matrix. Here, *S*(*a*, *b*) is the similarity of characters *a* and *b*. The algorithm uses a linear gap penalty, here called *d*.

For example, if the similarity matrix was

| | A | G | C | T |
|---|---|---|---|---|
| A | 10 | −1 | −3 | −4 |
| G | −1 | 7 | −5 | −3 |
| C | −3 | −5 | 9 | 0 |
| T | −4 | −3 | 0 | 8 |

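and the gap penalty was *d* = −5, an alignment consisting of three aligned pairs, a three-letter gap and five further aligned pairs would have the following score: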

*S*(A,C) + *S*(G,G) + *S*(A,A) + (3 × *d*) + *S*(G,G) + *S*(T,A) + *S*(T,C) + *S*(A,G) + *S*(C,T) = −3 + 7 + 10 − (3 × 5) + 7 + (−4) + 0 + (−1) + 0 = 1
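The same sum can be reproduced programmatically. A minimal Python sketch, assuming the similarity matrix and *d* = −5 given above; the aligned pairs are taken from the terms of the sum, and each gapped column is represented by the placeholder `GAP` (a hypothetical name) because only the number of gapped columns matters for the score.

```python
S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
d = -5      # linear gap penalty
GAP = None  # placeholder for a column in which one sequence has a gap

columns = [("A", "C"), ("G", "G"), ("A", "A"),
           GAP, GAP, GAP,
           ("G", "G"), ("T", "A"), ("T", "C"), ("A", "G"), ("C", "T")]


def alignment_score(columns, S, d):
    """Sum S(a, b) over aligned pairs and d over gapped columns."""
    return sum(d if col is GAP else S[col[0]][col[1]] for col in columns)


print(alignment_score(columns, S, d))  # 1
```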

The pseudo-code for the algorithm to compute the F matrix therefore looks like this:
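A minimal Python sketch of that computation (not the original pseudo-code), assuming sequences `A` and `B`, a similarity matrix `S` stored as a nested dictionary, and a linear gap penalty `d`; the function name `compute_F` is hypothetical.

```python
def compute_F(A, B, S, d):
    """Fill F, where F[i][j] is the best score for aligning A[:i] with B[:j]."""
    F = [[0] * (len(B) + 1) for _ in range(len(A) + 1)]
    for i in range(1, len(A) + 1):
        F[i][0] = d * i  # A[:i] aligned against gaps only
    for j in range(1, len(B) + 1):
        F[0][j] = d * j  # B[:j] aligned against gaps only
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            match = F[i - 1][j - 1] + S[A[i - 1]][B[j - 1]]
            delete = F[i - 1][j] + d  # A[i-1] aligned to a gap
            insert = F[i][j - 1] + d  # B[j-1] aligned to a gap
            F[i][j] = max(match, delete, insert)
    return F


S = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}
F = compute_F("AGT", "AT", S, d=-5)
print(F[-1][-1])  # 13: A-A (10), G against a gap (-5), T-T (8)
```

The bottom-right entry of `F` is the score of the best global alignment; both time and memory grow with the product of the two sequence lengths.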

The original purpose of the algorithm described by Needleman and Wunsch was to find similarities in the amino acid sequences of two proteins. [1]

Needleman and Wunsch describe their algorithm explicitly for the case when the alignment is penalized solely by the matches and mismatches, and gaps have no penalty (*d* = 0). The original publication from 1970 suggests the recursion

$$F_{ij} = \max_{h<i,\,k<j} \left\{ F_{h,j-1} + S(A_i, B_j),\; F_{i-1,k} + S(A_i, B_j) \right\}.$$

The corresponding dynamic programming algorithm takes cubic time. The paper also points out that the recursion can accommodate arbitrary gap penalization formulas:

A penalty factor, a number subtracted for every gap made, may be assessed as a barrier to allowing the gap. The penalty factor could be a function of the size and/or direction of the gap. [page 444]

A better dynamic programming algorithm with quadratic running time for the same problem (no gap penalty) was first introduced [5] by David Sankoff in 1972. Similar quadratic-time algorithms were discovered independently by T. K. Vintsyuk [6] in 1968 for speech processing ("time warping"), and by Robert A. Wagner and Michael J. Fischer [7] in 1974 for string matching.

Needleman and Wunsch formulated their problem in terms of maximizing similarity. Another possibility is to minimize the edit distance between sequences, introduced by Vladimir Levenshtein. Peter H. Sellers showed [8] in 1974 that the two problems are equivalent.

The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. However, the algorithm is expensive with respect to time and space, both proportional to the product of the lengths of the two sequences, and hence is not suitable for long sequences.

Recent development has focused on improving the time and space cost of the algorithm while maintaining quality. For example, in 2013 the Fast Optimal Global Sequence Alignment Algorithm (FOGSAA) [9] was proposed, which is reported to align nucleotide/protein sequences faster than other optimal global alignment methods, including the Needleman–Wunsch algorithm. The paper claims that, compared to the Needleman–Wunsch algorithm, FOGSAA achieves a time gain of 70–90% for highly similar nucleotide sequences (with > 80% similarity), and 54–70% for sequences having 30–80% similarity.

## The choice of sequence homologs included in multiple sequence alignments has a dramatic impact on evolutionary conservation analysis

**Motivation:** The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The overwhelming proportion of false positives results in reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).

**Results:** The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.

**Supplementary information:** Supplementary data are available at Bioinformatics online.

## DISCUSSION

An aim of this work is to determine the boundaries between when pure sequence alignment methods perform well and when augmentation of the alignment with structure is necessary. We wish to highlight that the benchmarks based purely upon structural protein alignments do not adequately test all the uses of sequence alignment. In addition, we are pleased to note that our two independent measures of alignment fitness (SCI and SPS) produce similar results.

In some cases, we found that altering algorithm parameters produced a dramatic improvement over the defaults (e.g. T-Coffee performance improves using Clustal to generate a library of pairwise alignments and POA performance improves dramatically using a combination of the global and the progressive modes).

We find that the conclusions of previous studies based upon structural protein alignments do not necessarily hold for the alignment of structural ncRNA. For example, DIALIGN, identified as a method which performed well for low-homology protein alignment, did not generally improve (relative to the alternative methods) on low-homology datasets (32). Another surprising discovery was that T-Coffee, touted as an excellent method for high-homology datasets, did not perform well (again, relative to the alternative methods) on the ncRNA datasets (32). Another surprise was that the supposedly outdated, yet still widely used, method ClustalW performed consistently well across all homology datasets. This is possibly a consequence of the fact that more recent algorithms are heavily optimized for protein alignment. The relatively new methods ProAlign, POA (gp) and MUSCLE also performed consistently well. ProAlign, in particular, produced (comparatively) reliable alignments and ranked in the top 5 across all homology ranges. This is possibly due to the fact that ProAlign is one of the few algorithms to use a scoring scheme derived from reliable nucleic acid sequence alignments. The performance of POA (gp) is also remarkable, not only because it employs a very fast method [said to accurately align 5000 EST sequences in 4 h on a Pentium II (44)] but also because it performed consistently well over all test sets.

Another conclusion of this work is that the ‘twilight zone’ of ncRNA alignment (the homology range where little to no structural information of predicted alignments, using the current state-of-the-art algorithms, for structurally homologous sequences is retained) is in the 50–60% sequence-identity range. This is dramatically higher than that of the protein sequences, which is 10–20% (29). Much of this difference is, of course, due to the different alphabet sizes and the generally limited models and the score matrices for nucleotide alignment.

It is interesting to note that three of the structural methods (Dynalign, Foldalign and PMcomp), for a short homology range (40–60% sequence identity), have higher SCI scores than the reference alignment and that in the same regions there is a dip in the performance when Dynalign, Foldalign and PMcomp performance is measured using SPS. This suggests that the reference alignments themselves may be improved upon in this homology range.

Based upon these results the Foldalign score routines seem to have optimized the delicate balance between the sequence and the structure-based scores. This implementation of Sankoff's algorithm employs a light-weight energy model (13, 41, 45, 46) in concert with the substitution matrices similar to those of RIBOSUM (47) and BLOSUM (48), which seem to produce excellent predictions. However, the computational complexity of this algorithm is still an issue: in practice, global alignment is restricted to sequences of ∼200 nt or less. Further optimization may increase this bound, however.

The profile-based approach of Hofacker *et al.* [pmcomp --fast (15, 49)] holds promise for producing fast and reasonably accurate alignments in satisfactory time across all homology ranges. It by no means produces ‘optimal’ alignments in terms of sequence or structure, but is a reasonable compromise between the sequence- and the structure-based methods in terms of improved accuracy for the former and dramatically reduced computational requirements for the latter. This method is in the process of being re-implemented in C with affine gap costs and an adjustable sequence-weighting parameter. This is available as ‘RNApaln’ with the Vienna package version 1.5 or greater (I. Hofacker, personal communication).

## 2 Methods

### 2.1 Centre star strategy

The centre star and progressive tree methods are two basic strategies for MSA. The centre star method runs faster, and it is therefore suitable for the MSA of similar DNA sequences (Zou *et al.*, 2012). The main approach underlying the centre star method is to transform MSA into pairwise alignment based on a ‘centre sequence’. This centre sequence is selected, and other sequences are pairwise aligned to the centre sequence. Then, all of the inserted spaces are summed to obtain the final MSA result.

The majority of the running time in pairwise sequence alignment is due to the dynamic programming employed. If the input is similar DNA sequences, long common substrings can be rapidly extracted from the pairwise sequences. Therefore, we only need to align the remaining short regions. The extraction of common substrings can be quickly implemented based on a trie tree data structure, which will greatly reduce the dynamic programming running time for similar DNA sequences.

A trie tree can improve the efficiency of the MSA algorithm. However, it cannot solve the ‘big data’ problem, due to the fact that the trie trees are always stored in the memory. If the input data increase, the size of trie trees will also markedly increase. Therefore, parallelization is the only fundamental solution for massive data.

The selection of a centre sequence and pairwise alignment consumes most of the running time. We observed that pairwise alignment can be implemented in parallel, and the selection of the centre sequence would cause little difference in performance for the MSA of similar DNA sequences. Therefore, we suggest that parallel Hadoop programming can reduce running time and solve the scalability problem for the MSA of large sequence sets.

### 2.2 Acceleration with trie trees for similar sequences

The key strategy underlying our method is to detect common substrings before pairwise alignment. The detection process runs in linear time and can reduce the quadratic running time required by dynamic programming. Once we find common substrings for a pair of DNA sequences, only the remaining regions need to be aligned.

The centre star strategy involves three steps: centre star sequence selection, pairwise alignment and subtotalling the inserted spaces.

In the centre star sequence selection step, every sequence is partitioned into several disjoint segments. The segments from every DNA sequence are collected and used to construct a trie tree. The trie tree is similar to a dictionary storing the segments and contains certain indexes. If we want to search for any appearance of any segment in a long sequence, we simply search the long sequence in the trie tree with linear running time, instead of searching each segment individually. Therefore, this process will decrease the amount of time consumed.

When the trie tree is constructed, we search all of the segments in every input DNA sequence. The sequence that contains the most segments will be chosen as the centre star sequence, meaning that it is the sequence that is most similar to all of the others.
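The selection step can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the implementation described here: segments are fixed-length, non-overlapping pieces (the segment length `k` is an assumed parameter), the trie is a nested dictionary, every start position of a sequence is walked through the trie (which is not the linear-time search the text refers to), and "contains the most segments" is interpreted as the total number of segment occurrences. All function names are hypothetical.

```python
END = None  # key marking the end of a stored segment


def split_segments(seq, k):
    """Partition seq into disjoint segments of length k (a shorter tail is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def build_trie(segments):
    root = {}
    for seg in segments:
        node = root
        for ch in seg:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def count_hits(seq, trie):
    """Count occurrences of stored segments in seq, trying every start position."""
    hits = 0
    for start in range(len(seq)):
        node = trie
        for pos in range(start, len(seq)):
            node = node.get(seq[pos])
            if node is None:
                break
            if END in node:
                hits += 1
    return hits


def choose_centre(sequences, k):
    """Return the index of the sequence containing the most segment occurrences."""
    trie = build_trie(s for seq in sequences for s in split_segments(seq, k))
    hits = [count_hits(seq, trie) for seq in sequences]
    return hits.index(max(hits))


sequences = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
print(choose_centre(sequences, k=4))  # 1
```

In this toy input the second sequence is chosen because it contains three segment occurrences (ACGT once, TTTT twice, counting overlaps), while the other two contain two each.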

After the centre star sequence is chosen, the centre star sequence is pairwise aligned to the other sequences. The matched segments from the trie trees are recorded, and the regions with matched segments need not be aligned. We simply align the remaining regions, as shown in Figure 1. Then, all of the inserted spaces are summed to obtain the final MSA result, which is the same as the original centre star strategy. The flow is shown in Figure 2. There are two points of improvement. First, we partition the DNA sequences and employ a trie tree to accelerate searching, and the centre star sequence can therefore be selected rapidly. Second, matching segments are omitted in the pairwise alignment step, which can reduce the dynamic programming running time.