*** Welcome to piglix ***

Knuth–Morris–Pratt algorithm


In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The algorithm was conceived in 1970 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977. Independently, Matiyasevich · got in 1969 a similar algorithm, coded by a two -dimentional Turing machine, by studying a string pattern matching recognition problem.

A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[].

The most straightforward algorithm is to look for a character match at successive values of the index m, the position in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which case the search is said to "fail". At each position m the algorithm first checks for equality of the first character in the word being searched, i.e. S[m] =? W[0]. If a match is found, the algorithm tests the other characters in the word being searched by checking successive values of the word position index, i. The algorithm retrieves the character W[i] in the word being searched and checks for equality of the expression S[m+i] =? W[i]. If all successive characters match in W at position m, then a match is found at that position in the search string.

Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 262 (1 in 676). So if the characters are random, then the expected complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after about one billion character comparisons.


...
Wikipedia

...