Knuth-Morris-Pratt also called KMP is a pattern matching algorithm.

There are various other algorithms which are used for pattern matching. For example, Rabin Karp algorithm, Boyer Moor etc. KMP is one of them.

First, we need to understand, what is pattern matching algorithm?

Let’s say we have a large text T containing words [1…..n] and we need to find the occurrence of pattern P [1..m] in this text T. To accomplish this task, we need some algorithms. Those algorithms are called pattern matching algorithms and KMP is one of them. But here two questions arise:

- Why do we need a separate algorithm for this task?
- What is the application of pattern matching? Where do we need pattern matching?

Firstly, we’ll see the role of pattern matching.

In this section, we’ll talk about the applications of pattern matching. In daily life, whenever we search a particular string in notepad or word file, this search utility is the result of pattern searching algorithms. And there are unlimited areas where we are using the concept of pattern matching, we have listed some of them in the following figure.

As we saw in above figure, the importance of pattern matching. There is also an important detail about pattern matching which we should know. It is not necessary that every time people will search only text. They may want to search different types of data like audio, video, image etc. So tackle with different data types, we need better algorithms. After finding the location of matching pattern, it can also be used to substitute some component of that pattern with another string sequence that is first search then replace.

Let’s first study the naïve method for doing this. The idea is to take each word and compare it with the pattern.

Pseudocodefor this method Function match(text[], pattern[]) { // n is size of text // m is size of pattern for(i = 0; i < n; i++) { for(j = 0; j < m && i + j < n; j++) if(text[i + j] != pattern[j]) break; //doesn’t match, break the inner loop if(j == m) // match found } }

As we know, the problem of pattern matching is like a needle in haystack and If the length of the text is n and that of pattern is m. Then, it will take O(n*m) iterations to complete the task. In that case, the time complexity for worst case is O(mn). That is why, this method is impractical.

This algorithm was developed by Donald Knuth, Vaughan Pratt and James H. Morris in 1977. The idea was to search occurrences of a word within a text by using the result of previously matched characters. It uses the observation through which if a mismatch occurs, the word itself will provide sufficient information to determine where we’ll find the next match. At high level, there is a similarity between KMP and the naïve algorithm. In both the cases, the scanning is done in the same order that is from 1 to n-m, where n is the length of the text and m is the length of the pattern. But in KMP algorithm, it uses the information obtained by the previous occurrences of the pattern in the whole text. By using this information, it is able to skip those parts of the text which cannot contain the pattern we are searching. If we can calculate that how much we can shift in advance, then we can reduce the number of iterations. We have explained this in the following example:

Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Text (t): a b a b a a b b a b a b b a Pattern (p): a b a b b a b a b b a b a b b a b a b b

Start from the very first position, here the pattern is matched up to position 4th. At position 5 mismatch occurred, so shifting the pattern one place will not do any good. We also need to match the pattern at position 3 with the text at position 5, so as to make sure we don’t miss any matches. We’ll continue matching from position 5. Then, we encounter another mismatch at position 6. This shifting does not work. This is because, we did not consider the fact that ab is repeating itself in the pattern, thus only two places shift resulted in mismatch. Since the condition pattern at position 1<> pattern at position 4 is satisfied. Thus, we take a shift of three places and resume matching at position 6 and here we find mismatch at position 8. Then again, we shifted the pattern three places. And finally, we find the pattern at position 9 of the text.

The naïve algorithm would have taken 23 iterations, but this algorithm will give result in 15 iterations.

Now, we’ll mathematically deduce the above explained shift:

Let A[1….m] be the array which gives the information that how much shift do we need. This array is called Prefix array, it stores the information of previous matches in such a way that will exactly tell us the position where we have to jump for finding the match. The length of this array is equal to the length of the pattern. For example, If this array at any column contains the value 1. That means the sequence up to which the text matched contain the same prefix and suffix of length 1. The length of the longest proper prefix in the pattern that matches a proper suffix in the same pattern. So, we can shift accordingly. In other words, we do not need to backtrack on the text.

A[1]=0

A[i] = max (s| (s=0) or (p[1… s-1] = p[i-s+1 .. i-1] and p[s]<>p[i])), the maximisation of s is needed that means minimised shift. So, that, no occurrence of pattern will be missed.

1. i=1; j=1; 2. while (i<=m) and (j<=n) do begin 3. while (i>=0) and (p[i]<>[j]) 4. do i=h[i]; 5. i=i+1; j=j+1 6. end 7. if i <= m then print(“pattern is not found in the text”) 8. else print(“found at”, j-i+1)

Let the function f(i) be defined by

f(1) = 0 f(i) = max{ s | (1 <= s < i) and (p[1 .. s-1] = p[i-s+1 .. i-1]) }

Both the definition of A[i] and f(i) are almost same. But in the case of A[i], we need to satisfy the condition of p[s]<>p[i] which is not needed in the case of f(i). So, in case of the prefix array, the pattern matching of p is on pattern itself. We’ll use this function f(i) in next section.

In this section, we’ll write algorithm for computing prefix array that is A[i].

1. t=0; A[1]:=0; 2. for i=2 to m do begin 3. while (t>0) and (p[i-1]<>p[t]) //t=f(i-1) 4. do t=A[t]; 5. t:=t+1; 6. if p[i]<>p[t] //t=f(i) 7. then A[i]:=t else A[i]:=A[t] 8. end

**Explanation of the above algorithm:**

We are proceeding from left to right. In algorithm, we mentioned that t=f(i-1) in line 3. Then, in that case, we can extend the matched portion by 1 (line 5).

If p[t]<>p[i-1], we slide the pattern over itself until we find p[t]=p[i-1] and we can then extend the matching portion.

In line 7, the condition satisfied, so A[i] is set to t.

Pattern p = a b a b b i = 2, at line 7 t=1, p[1]<>p[2], f(2) = 1, A[2]=1 i a b a b b a b a b b t i = 3, t=1 at line 4, p[1]<>p[2], t=A[1]=0, at line 7, t=1, f(3)=1, p[1]=p[3], A[3]=0 a b a b b a b a b b i = 4 t=1 at line 4. p[3]=p[1], t:=t+1, t=2. At line 7, p[4]=p[2], A[4]:=A[2]=1 i = 5 t=2, f(4)=2. At line 4, p[4]=p[2], t=t+1=3. f(5)=3. At line 7, p[5]<>p[3], A[5]=t=3 Finally we have i | 1 2 3 4 5 ------------------------------ pat | a b a b b f | 0 1 1 2 3 A | 0 1 0 1 3

Creating prefix array is a pre-processing task and if the length of pattern is m. Then to execute that algorithm, the running time will be O(m), the space complexity will also be O(m) and in second algorithm, the running time will be O(n) where n is the string length in the text.

So, the total time complexity for this algorithm is O(m+n).

Basic Subject

Computer Science

- Programming Assignment Help
- Database Help
- Data Structure Assignment Help
- Operating Systems Assignment Help
- Computer Network Assignment Help
- UML Diagram Assignment Help
- IT Assignment Help
- Game Programming
- Computer Science Assignment Help
- Medical Science Assignment Help
- Social Science Assignment Help
- Information Systems

Engineering

- Biochemical and Biotechnology Help
- Chemical Engineering Assignment
- Statistics Assignment Help
- Civil Engineering Assignment Help
- Electrical, Electronics Help
- Mathematics, Computing Assignment Help
- Mechanical and Industrial Engg. Help
- Petroleum Engg. Assignment Help
- Biochemistry Assignment Help
- Cell Biology Assignment Help
- Arts and Architecture Help
- Silverlight Assignment Help