HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Description:
The Finite Automata based pattern matching algorithm used the fact that pattern matching need not start from beginning of the pattern every time a non-matching character is read from the Text. It can start from the nth position in the Pattern, if all previous n positions are the latest n positions that have been matched.
To illustrate this: consider a pattern "abababaca" being searched in a block of text. assume that the current character being matched is "b" after "ababa" have matched. If a mismatch occurs at this point (i.e. the read character is "c", for example). The figure below (first three tables) shows the processing for such as string. The T(i) is the text being searched through for the Pattern P() and three "markers" or "pointers" are maintained for this matching. The "S" is the shift position, i.e. the start position in the T where the matching starts using two points Tp and Pp within T and P respectively. So the characters matched are S+Tp and Pp respectively at each iteration stating at S.
The KMP relies on the computation of a Pi function at each character of P, which indicates the position in P where the matching process should resume, should a mismatch occur at the current location. This relies on the fact (just like the Finite Automata? ) that during matching with the current character, what is the maximum number of characters that can be skipped to restart the matching process. This length is derived from the longest Prefix of P that is also a suffix of P(x-1) if the current position being read in xth and a mismatch occurs. So the second table below indicates that having read the matching pattern "ababa" (the longest matching Prefix/Suffix being "aba", a mismatch occurs at b <-> c matching. At this point the P can be shifted right by three positions to align with the matching suffix "aba" preceeding the current position. Hence, the shift S changes position to right by two positions (or left of current position by three positions) and the matching resumes.
The third table illustrates the current match as "aba" and the longest Prefix/Suffix is "a", so the S can be shifted to one position to the left of current mismatched position to resume the matching.
The next two tables (fourth/fifth) shows significant gain that can be achieved by the usage of pi function of KMP. The matched pattern is "abababa" and the longest matching Prefix/Suffix is "ababa" which means that the shift position S can shift five positions to the left of the current (mismatching) position. This is illustrated in the second table indicating that one next three comparisons need to be made before the current shift position is evaluated completely, instead of eight in the normal (Naive) case.
Observations:
- The KMP does not evaluate the largest matching Prefix/Suffix for any letter in the alphabet (as is done by the Finite Automata?), but only for the actual letters in the pattern. This is a big advantage of the method.
- The Implementation of KMP is quite straightforward: The function kmp iterates the pattern over the entire text and adjusting the shift "q" by the overlap function.
- The overlap functions builds prefixNest[] list on the first invocation of the overlap function using the lenPrefix function (line 27 in the second code listing above). The implementation of overlap function does not use a separate "shift" but adjusts the pointer position Pp in the Pattern during matching.
Algorithm Analysis:
Very simply the computation of the prefixNest takes only O(m) time and the main loop (lines 6-19) of function kmp take O(n) iterations, the worst case behaviour of kmp is O(m+n).
maintainer: rakesh.mawa@hsc.com
Categories: PatternMatching Software
<< Main.PatternMatching-FiniteAutomata | PatternMatching-Trail | Main.PatternMatching-Rabin-Karp >>
Comments