HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Rabin-Karp Algorithm:
Algorithm Analysis:
- The algorithm treats each character as a decimal digit and computes a hash value for a block of numbers of size m (of pattern to be searched: P[m]) at each n starting positions of the text of T[n]. So if the size of the pattern to be searched is known beforehand and the text is also known, the hash values of each n-m blocks of text can be pre-processed.
- The hash values of two different blocks of text can be same and while matching the hash value of the pattern being searched, one can hit a matching hash pattern where the actual pattern does not match. Hence an additional check is always required, though a good hash function reduces chances of these spurious hits.
- If the size of the pattern to be searched is not known in advance, then the function "findHash" at line 22 can be called n-m+1 times and each call to "findHash" takes O(m) time, hence the worst case time complexity of the Rabin-Karp algorithm is O((n-m+1)m), which is same as the Naive method. Similarly, the worst case time complexity is observed at O(m^2) with m = n/2.
- If the size of the pattern P is known in advance then the hash computation of the m*(n-m+1) blocks of text can be done in constant time and the worst case time complexity is O(n/m)
- A unique feature of Rabin-Karp algorithm is searching for multiple patterns (same length) in a given Text. Assuming a worst case of k sub-patterns being searched in the Text (and including the pre-processing), the time complexity is k + (n-m+1) * m + km. Since K comparisons, k hash computations and (n-m+1) hash computations can be done in Preprocessing, the k pattern searches require O(k(n-m)) comparisons, hence on an average of less than O(n) time for each pattern match.
Role of Hashing in Rabin-Karp:
The key idea in Rabin-Karp algorithm is to develop a classification of patterns into possible matches and not possible matches. It is based the nature of the hashing mechanism. At simplest level a hash function is a modulo function. which is used to map a number into another smaller set of numbers. A modulo function (not "%" of C-language, which is actually a remainder operation). is defined as: a mod b = a - b* floor((float)a/(float)b where a and b are integers. We use the definition of modulo as implemented in Microsoft Excel (which use INT() instead of floor function), note that modulo is negative numbers is undefined, though MS-Excel won't tell you that! For serious fans of Modulo function do check out the debate at dr. math.
The Rabin-Karp algorithm uses the hash function to generation a hash value for each m letters in the text and then compares these one by one with the hash value generated from the Pattern. A match of hash values indicates a "potential" match and then a complete match is attempted. The spurious matches are expected and depend on the nature of the hash function. Thumb rule is to select a prime number much larger than m such that the chances of spurious matches is reduced.
An improvement on the traditional mod function for a large Text is to reuse the hash value obtained of 1..m letters in 2..m+1 letters. Thereby one reduces the computational complexity of finding mod of large numbers.
So two techniques are combined to obtain modulo of large numbers using Horner's method and rolling Hash i.e. obtaining hash value of 2..m+1 digits from 1..m digits is a series of digits/letters.
maintainer: rakesh.mawa@hsc.com
Categories: PatternMatching Software
<< Main.PatternMatching-Knuth-Morris-Prat | PatternMatching-Trail | Main.PatternMatching-Boyer-Moore >>
Comments