HSC welcomes all external visitors to this site, especially students and members of the academic community. Please use the comments box at the bottom of each page to record any comments or suggestions for improvement.
Introduction:
Recognizing patterns in long sequences has application in multiple areas such as genome sequencing, vehicular traffic analysis, user mobility in radio networks, consumer behaviour and Knowledge Discovery/Artificial Intelligence. For example finding if a technical paper is plagiarized is typically done by building a 'Plagiarism Index", which is actually a function of the number of times the text in the paper matches the text in any of the papers that are mentioned in the references of the paper. In this particular case, a match of multiple fixed size strings is sought in the references papers.
There are other areas such as user mobility, where, one may not be interested in exact mobility pattern matches, but patterns that occur between a source and a destination, such pattern matches have two challenges: Recognizing the depth i.e. finding a pair of source and destinations and recognitions of patterns that are routes between these source-destination pairs. This is: in essence, a variable length pattern recognition and approximate pattern matching in a time series.
Another interesting problem in pattern matching is the problem of scale, patterns occur at the scales one is looking at. In a scalar series one can take assume variable size of the patterns and seek to match these from the series. Recognizing a valid pattern is itself dependent on of what use is that pattern itself. As an example a pattern has a property that the underlying process that has generated the pattern data is not purely random, but has some biases or constraints such that a pattern occurs.
Extending the pattern searches in sequences to pattern searches in n-dimensions is an active subject of research.
Fixed-size pattern matching in text:
The simplest of the Pattern Matching problems is the search of a fixed length pattern in a stored text. The solution for a computing system is to find the algorithm that finds a Pattern P of length m in a text T of length n using minimum space and time complexity. An assumption that some of the algorithms make is that the pattern match is not a single transaction, but many transactions using the same pattern P. Hence these algorithms allow for a "Pre-Processing" which is assumed to be constant time and is not factored as part of the time complexity of the actual pattern match. However, the assumption is not true for transaction processing systems where the length as well as the actual pattern to be searched may change every transaction.
Terms, notation and Definitions:
- Text : T[1..n] or T[n] A text or a string is a scalar series of length n letters from an alphabet
- Pattern : P[1..m] or P[m] A pattern of length m letters is a scalar series from the aphabet .
- |X| is the length of the text X.
- s is the shift required for aligning a substring P[m] in T[n] such that all letters of P match in T.
- X is a substring of Y iif X is contained entirely in Y.
- Prefix notation: X is a prefix of string Y.
- Suffix notation: X is a suffix of string Y.
Prefix/Suffix definitions: Consider a string Z and
substrings X and Y of Z.
- (Y is a suffix of Z) if Y[x] = Z[x] for x = n , n - 1, m-2...n-m in Z and x = m, m-1, m-2...1 in Y.
- (Y is a prefix of Z) if Y[x] = Z[x] for x = 1, 2,...n in Z and x = 1, 2, 3...m in Y.
Pattern Matching Algorithms:
The simplest technique of pattern searching in a stored text, also called the Naive Method, is to compare the pattern P[1..m] with text T [s..s+1] at shift s = 1 to n-m.
Some algorithms such as Rabin-Karp speed up the matching of Pattern with the Text at the current shift position by pre-computing hash values of the block of text of the same size as of the Pattern.
Yet another variation of the preprocessed Pattern based search is the String matching with Finite Automata which treats the Pattern as a state machine and read text as the triggers. Any time the final state is reached, the Pattern is said to be matched in the Text. The important property the state machine has is: that it need not rematch the substrings of the Pattern being searched in the Text afresh, if such substrings exist in the Pattern.
Another variation of the String matching with Finite Automata is the Knuth-Morris Prat Algorithm, which, instead of preprocessing the Pattern to be matched computes a largest matching substring in the String being matched, at the run-time. This largest matched substring is used to avoid smaller shifts when a mismatch occurs. This is quite useful when a Pattern has a number of repeated substrings.
Boyer-Moore (in progress) is a significant improvement over other pattern matching algorithms and the only one which can match a Pattern in a text in under linear time. Its best case time is O(n/m). For large Patterns made up of letters from a small alphabet it is the most suitable substring search algorithm. Its key idea is usage of currently matched pattern in selecting the shifts for matching and matching the right most letter of the pattern to the Text, such that if a chracter is found that is not in the entire text being matched, a best case shift of length of the Pattern can be exercised. A newer variant of Boyer-Moore(in progress) has been published in December 2007, we present a modified version in Variant of Boyer-Moore
Future
We would like to understand distributed pattern matching using pipelines, design issues are sandboxed at Pipeline
References:
- "String Searching over Small Alphabets", J.S.Moore, Matyas A. Sustik in Technical Report Tr-07062 Univ. of Texas at Austin.
Maintainer: rakesh.mawa@hsc.com
Categories: PatternMatching Software
<< | PatternMatching-Trail | Main.PatternMatching-NaiveMethod >>
Comments