HSCTechnicalWiki


view edit history print Talk subscribe
SearchWiki

Views: 724

Full site statistics

Authors:

edit SideBar

Main » Pattern matching using Boyer-Moore algorithm

PageList

Papers

Tutorials

(redirected from Main.PatternMatchingBoyer-Moore)

HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement. Description:

All "traditional" pattern matching techniques such as Rabin-Karp or Knuth-Morris Prat algorithms are basically improvisations of the fundamental Naive Technique of searching for the P at each n positions of the text T. These techniques improvise over the Naive method by reusing information that is available at each position related to the matching "so far". This is, typically, the largest prefix that is also the suffix in the already matched characters. This means the extent till which the Pattern can "slide" right and be aligned with the currently largest suffix. So this saves some shifts.

Boyer Moore does something very interesting! It starts the matching from the rightmost position of the pattern P and the new shifts are decided by the presence of the mismatching character in rest of the Pattern.

Consider a Pattern P(m) = "anpanman" (see the wiki article on Boyer Moore Algorithm) and the text T(n) = "bharacter assassination", if we use the traditional method of shift-match Naive Technique, we'd start matching a with b, a with h, a with a and so on...so we'd make at least n-m comparisions. However, if we start matching last character of P with the mth character "e" of T and then if the character ("e") doesn't match and is not present in P, then we can conclude that irrespecitive of the shift positions between 1 to m of T, the character "e" is going to occur in each shifted m characters in T, since the character doesn't exist in P, the shift can jump by m letters and instead start at "r" succeeding the bad character "e". Of course finding that the character "e" doesn't exist in P takes m comparisions, so in the best case only n/m comparitions are needed to find the Pattern, but we have compared the worst case of Naive (n-m+1) with the best of BM (n/m) which is not a fair comparision, we'd have an bit more elaborate analysis of the performance of the BM later in the article.

Bad Character Heuristics:

Having found a "bad character" in the Text which doesn't exist at all in the P, it allows us to move the shift position by the length of the P as all m starting positions for matching will anyway contain the bad character in T and can be skipped. This is illustrated in the bottom two tables above, the character "s" is encountered in the text and does not occur in P, hence the entire P is shifted such that the new character matching starts at eighth character after the mismatched character "s". In effect it is "sliding" the entire P past the mismatching character and set S to the poistion next to it and start the matching procedure again.

What happens if such as mismatched character occus somewhere in the P? The new shift position is aligned with the first occurance of that mismatched character in P. For instance, if the mismatched character was p then the shift position will be 5. This is illustrated in the tables above. Observant reader will note that the we haven't made use of the knowledge gained after matching all previous suffix characters before a mismatch occurs (or a bad character occurs). Looking at the third table above, the knowledge that the

 pattern "an" has been matched before the mismatched
 character "s" is encountered is immediately discarded as the
 matching starts at last character of P again (as seen in the
 last table above). Could the matching start at someother
 character?

Good Suffix Heuristics:

 

References:

  1. Wiki article on Boyer Moore Algorithm at wikipedia

Views:

maintainer: rakesh.mawa@hsc.com

Categories: PatternMatching Software

<< Main.PatternMatching-Rabin-Karp | PatternMatching-Trail | >>

Comments

Add Comment 
Email address(will be kept hidden) 
Enter code:

Page last modified on September 17, 2009, at 07:56 AM