HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Patterns as State Machine:
The essential idea behind the Finite Automata based pattern matching is that the pattern to be searched is a series of triggers which results into a state changes, the number of states is presumed to be m+1 for a pattern P[m]. So the current state reflects the "extent" of the match of the read characters from the text with the pattern. The algorithm is implemented in two stages: a preprocessing stage, where a state transition matrix is built for the given pattern and then the main algorithm reads characters 1..n-m+1 from T[n] and at any position when the state m is reached (in line 11), a pattern match is said to have occurred.
Example:
Consider the pattern P[m] = "aabababaca", The state machine corresponding to the pattern is:
The Pattern P[m] is considered as a state machine with m+1 states and the read characters are considered as one of the paths (with no self loops) in the state machine. So a pattern "aba" will generate a state machine with 0..3 states and the transition triggers are 0 -> "read a" -> 1, 1 -> "read b" -> 2, 2 -> "read c" -> 3. If the pattern has unique subpatterns the
The state transition matrix indicates the state transition based on the character read, as long as the alphabet set is small, the transition matrix and pre-processing required to generate it is relatively small. However for a "normal ASCII", the entire set of 127 possibilities must be considered at each state. Computing this can be a crippling overhead if this mechanism of pattern searching is adopted for a transaction based system where the pattern keeps changing and the set of alphabets is large.
The transition function is defined as follows: where of x is the length of the largest Prexif of P which is a suffix of x.
For example if P[m] = abababaca, then as at q = 5, when the input character is b, the Pqa = ababab and is P4 = abab.
Implementation Notes:
1. The algorithm treats each character as the trigger in its state machine and the stopping state is triggered by a full pattern match. In effect the state machine maintains a state based on the read triggers and the possibilities of arriving at a state without starting at the beginning of the pattern. So it finds the largest "prefix" of current (in state q) Pq and the read character "a" which is also a suffix of the Pq"a". In other words, the method finds maximum possible characters that are already read and are at the beginning of the pattern, thereby skipping re-reading these characters.
2. The implementation of the State Transition table "buildSm" is straightforward, we compute the state transition for each character in the ascii printable set for the given position in the pattern. This matrix is sparse and for a small alphabet, is quite useful.
3. The computeState function composes a PrefixString of q+1 characters and then checks the largest suffix of the substring P[q+1] in the isSuffix function. It starts with the longest possible suffix i.e. of q+1 characters and then reduces the suffix length. It is quite inefficient way of doing this, but it achieves the purpose and has no run-time performance impact.
Algorithm Analysis:
1. The time complexity of the "finiteSm" function is O(n) as the worst case.
2. The "buildSm" function takes m* iterations (lines 5-7) and each "ComputeState" for each ( lines 18-21) takes m*m (for isSuffix function) calls. So the worst case complexity is .
3. The Pre-processing can be improved to giving an overall complexity of , which is better than both Naive Method? and the Rabin-Karp?.
Views:
maintainer:rakesh.mawa@hsc.com
Categories: PatternMatching Software
<< Main.PatternMatching-NaiveMethod | PatternMatching-Trail | Main.PatternMatching-Knuth-Morris-Prat >>
Comments