Needleman-Wunsch algorithm

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem (e.g. the full sequence) into a series of smaller problems and uses the solutions to the smaller problems to reconstruct a solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance.

This algorithm can be used for any two strings. This guide will use two small DNA sequences as examples as shown in the diagram:

First construct a grid such as one shown in Figure 1 above. Start the first string in the top of the Third column and start the other string at the start of the 3rd row. Fill out the rest of the column and row headers as in Figure 1. There should be no numbers in the grid yet.

Next we need to decide how to score how each individual pair of letters match up. Just by looking at our two strings you may be able to see one possible best alignment:

We can see that letters may match, mismatch, be deleted or inserted (indel):

There are various ways to score these three scenarios. These have been outlined in the Scoring Systems section below. For now we will use the simple system used by Needleman and Wunsch; matches are given +1, mismatches are given -1 and indels are given -1.

Start with a zero in the second row, second column. Move through the cells row by row, calculating the score for each cell. The score is calculated as the best possible score (i.e. highest) from existing scores to the left, top or top-left (diagonal). When a score is calculated from the top, or from the left this represents an indel in our alignment. When we calculate scores from the diagonal this represents the alignment of the two letters the resulting cell matches to. Given there is no 'top' or 'top-left' cells for the second row we can only add from the existing cell to the left. Hence we add -1 for each shift to the right as this represents an indel from the previous score. This results in the first row being 0, -1, -2, -3, -4, -5, -6, -7. The same applies to the second column as we only have existing scores above. Thus we have:

...
Wikipedia