Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt algorithm (often abbreviated as the KMP algorithm) is a substring search algorithm that finds the occurrences of a string P in a text S. Its distinctive feature is a preprocessing of the string, which provides enough information to determine where to resume the search after a mismatch. This lets the algorithm avoid re-examining characters that have already been checked, and so limits the number of comparisons needed.

The algorithm was conceived by Knuth and Pratt and, independently, by J. H. Morris; the three authors published it jointly in 1977.

Principle of operation

Naive approach

To better understand the logic of the Knuth-Morris-Pratt algorithm, it is instructive to look first at the naive approach to string searching.

The string B can be found in the text A with the following algorithm:

  1. Set i = 1;
  2. While there remain positions to check:
  3. * Compare, letter by letter, the string B and the text A starting from position i;
  4. * If the string matches, stop and return i as the position of the start of the occurrence;
  5. * Otherwise, set i = i + 1;
  6. Stop: no occurrence was found.

This procedure can be improved by stopping the comparison of the third step as soon as a mismatching character is detected.
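The naive procedure, including the early stop on the first mismatching character, can be sketched in C as follows. This is only an illustration: the function name naive_search and the return value -1 for "not found" are assumptions of this sketch, and positions are counted from 0 rather than from 1 as in the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive search (illustrative name, not from the article).
 * Returns the 0-based position of the first occurrence of pattern
 * in text, or -1 if there is none. */
long naive_search(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);

    size_t n = strlen(pattern);
    size_t l = strlen(text);

    if (n > l)
        return -1;

    /* Try every starting position i at which the pattern still fits. */
    for (size_t i = 0; i + n <= l; ++i) {
        size_t k = 0;
        /* Stop comparing as soon as a character differs. */
        while (k < n && text[i + k] == pattern[k])
            ++k;
        if (k == n)
            return (long)i;
    }
    return -1;
}
```

After a failed attempt at position i, the next attempt restarts at i + 1 and compares from the first character of the pattern again, which is exactly the redundancy discussed next.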

This approach has a drawback: after an unsuccessful comparison, the next comparison starts at position i + 1 and takes no account of the comparisons already made during the previous iteration, at position i. The Knuth-Morris-Pratt algorithm first examines the string B and deduces from it information that makes it possible to avoid comparing each character more than once.

Phases

  1. The first phase of the algorithm builds a table that gives, for each position, a “shift”, i.e. the next position at which a potential occurrence of the string can start.
  2. The second phase performs the search proper, comparing the characters of the string with those of the text. On a mismatch, it uses the table to find the shift to apply in order to continue the search without backtracking.

Example

To present the principle of operation of the algorithm, a particular example is considered: the string P is ABCDABD and the text S is ABC ABCDAB ABCDABCDABDE.

Notation: to represent character strings, this article uses arrays whose indices start at zero. Thus, the C of the string P is written P[2]. m denotes the position in the text S at which the string P is currently being checked, and i the position of the character currently being checked in P.

                 1         2
       01234567890123456789012
    m: v
    S: ABC ABCDAB ABCDABCDABDE
    P: ABCDABD
    i: ^
       0123456

The algorithm starts by testing the characters one after another. At the fourth step, m = 0 and i = 3: S[3] is a space and P[3] = 'D', so no match is possible.

                 1         2
       01234567890123456789012
    m: v
    S: ABC ABCDAB ABCDABCDABDE
    P: ABCDABD
    i:    ^
       0123456

Rather than starting again with m = 1, the algorithm notes that no A occurs in P between positions 0 and 3, except at position 0. Having tested all the preceding characters, it therefore knows that there is no chance of finding the start of a match by checking them again. So the algorithm advances to the next character, setting m = 4 and i = 0.

                 1         2
       01234567890123456789012
    m:     v
    S: ABC ABCDAB ABCDABCDABDE
    P:     ABCDABD
    i:     ^
           0123456

An almost complete match is quickly obtained, until the check fails at i = 6.

                 1         2
       01234567890123456789012
    m:     v
    S: ABC ABCDAB ABCDABCDABDE
    P:     ABCDABD
    i:           ^
           0123456

However, just before the end of this partial match, the algorithm passed over the pattern AB, which could be the start of another match. This information must therefore be taken into account. Since the algorithm already knows that these first two characters match the two characters preceding the current position, there is no need to check them again. The algorithm therefore resumes at the current character, with m = 8 and i = 2.

                 1         2
       01234567890123456789012
    m:         v
    S: ABC ABCDAB ABCDABCDABDE
    P:         ABCDABD
    i:           ^
               0123456

This check fails immediately (C does not match the space in S[10]). Since the string contains no space (as in the first step), the algorithm continues the search with m = 11, resetting i = 0.

                 1         2
       01234567890123456789012
    m:            v
    S: ABC ABCDAB ABCDABCDABDE
    P:            ABCDABD
    i:            ^
                  0123456

Again, the algorithm finds a partial match ABCDAB, but the next character, C, does not match the final character D of the string.

                 1         2
       01234567890123456789012
    m:            v
    S: ABC ABCDAB ABCDABCDABDE
    P:            ABCDABD
    i:                  ^
                  0123456

By the same reasoning as before, the algorithm resumes with m = 15, so as to start from the two-character string AB leading up to the current position, and sets i = 2, the new current position.

                 1         2
       01234567890123456789012
    m:                v
    S: ABC ABCDAB ABCDABCDABDE
    P:                ABCDABD
    i:                  ^
                      0123456

This time the match is complete, and the algorithm returns position 15 (i.e. m) as its origin.

                 1         2
       01234567890123456789012
    m:                v
    S: ABC ABCDAB ABCDABCDABDE
    P:                ABCDABD
    i:                      ^
                      0123456

The search algorithm

The preceding example illustrates the principle of the algorithm. It assumes the existence of a table of “partial matches” (described below) that indicates where to look for the potential start of the next occurrence when the check of the current potential occurrence fails. For the moment this table, denoted T, can be treated as a black box with the following property: if there is a partial match from S[m] to S[m + i - 1] that fails when S[m + i] is compared with P[i], then the next potential occurrence starts at position m + i - T[i - 1]. In particular, T[-1] exists and is defined to be -1. Given this table, the algorithm is relatively simple:
  1. Set i = m = 0. Suppose that P has a length of n characters and S a length of l characters;
  2. If m + i = l, stop: no match was found. Otherwise, compare P[i] and S[m + i];
  3. * If they are equal, set i = i + 1. If i = n, the match is complete: stop and return m as the position of the start of the match;
  4. * If they differ, set e = T[i - 1]. Set m = m + i - e and, if i > 0, set i = e;
  5. Go back to step 2.

This description implements the algorithm applied in the preceding example. At each failed check, the table is consulted to find the start of the next potential occurrence, and the counters are updated accordingly. The check therefore never moves backwards in the text. In particular, each character of the text is matched successfully at most once, although it may be rejected several times after failed matches (see below for the analysis of the algorithm's efficiency).

Example code for the search algorithm

The piece of C code that follows is an implementation of this algorithm. To work around the intrinsic limitations of arrays in C, the indices are shifted by one, i.e. T[i] in the code is equivalent to T[i - 1] in the description above.

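    int kmp_recherche(char *P, char *S)
    {
        extern int T[];   /* table of partial matches, shifted by one */
        int m = 0;        /* start of the current potential match in S */
        int i = 0;        /* position of the character being checked in P */

        while (S[m + i] != '\0' && P[i] != '\0') {
            if (S[m + i] == P[i]) {
                ++i;
            } else {
                /* mismatch: move the start forward using the table */
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }
        if (P[i] == '\0') {
            return m;       /* complete match starting at m */
        } else {
            return m + i;   /* no match: this is the length of S */
        }
    }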

Efficiency of the search algorithm

Assuming that the table T already exists, the search phase of the Knuth-Morris-Pratt algorithm has complexity O(l), where l denotes the length of S. Apart from the fixed overhead of entering and leaving the function, all the work is done in the main loop. To bound the number of iterations, a first observation about the nature of T is needed (T is indexed here as in the C code). By definition, it is built so that if a partial match beginning at S[m] fails when S[m + i] is compared with P[i], then the next potential match does not begin before S[m + (i - T[i])]. In particular, the next potential match must be at a position greater than m, so that T[i] < i.

On the basis of this fact, one can show that the loop runs at most 2l times. At each iteration, it executes one of the two branches of the if statement.

  • The first branch always increases i and does not modify m, so the index m + i of the character currently being checked in S increases.
  • The second branch adds i - T[i] to m. As we saw, i - T[i] is always positive. Thus the position m of the start of the current potential match increases.
The loop ends if S[m + i] = '\0', which means, given the C convention that a NUL character marks the end of a string, that m + i = l. Consequently, each branch of the if statement can be taken at most l times, since the branches increase m + i and m respectively, and m ≤ m + i: if m = l, then m + i ≥ l, and since each iteration increases one of these quantities by at least one, m + i = l must already have held earlier.

Thus the loop runs at most 2l times, which establishes an algorithmic complexity of O(l).
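The bound can also be checked empirically. The sketch below counts the iterations of the search loop for a given string and text. It is an illustration under stated assumptions: the names build_table, count_loop and kmp_count_iterations are hypothetical, the table is passed as a parameter rather than declared extern, and the comparison with P[-1] is replaced by the test j >= 0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds the shifted table: T[0] = -1 and, for k >= 1, T[k] is the length
 * of the longest proper prefix of P that is also a suffix of P[0..k-1].
 * T must have room for strlen(P) + 1 entries. */
static void build_table(const char *P, int *T)
{
    int i = 0, j = -1;
    T[0] = -1;
    while (P[i] != '\0') {
        if (j >= 0 && P[i] == P[j]) { T[i + 1] = j + 1; ++j; ++i; }
        else if (j > 0)             { j = T[j]; }
        else                        { T[i + 1] = 0; ++i; j = 0; }
    }
}

/* Same loop as the search phase, instrumented to count its iterations. */
static int count_loop(const char *P, const char *S, const int *T)
{
    int m = 0, i = 0, iterations = 0;
    while (S[m + i] != '\0' && P[i] != '\0') {
        ++iterations;
        if (S[m + i] == P[i]) {
            ++i;                 /* first branch: m + i increases */
        } else {
            m += i - T[i];       /* second branch: m increases */
            if (i > 0)
                i = T[i];
        }
    }
    return iterations;
}

/* Hypothetical helper: number of loop iterations needed to search P in S. */
int kmp_count_iterations(const char *P, const char *S)
{
    int *T = malloc((strlen(P) + 1) * sizeof *T);
    assert(T != NULL);
    build_table(P, T);
    int iterations = count_loop(P, S, T);
    free(T);
    return iterations;
}
```

For any input, the value returned by kmp_count_iterations(P, S) should not exceed 2 * strlen(S), in line with the argument above.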

The table of “partial matches”

The purpose of this table is to allow the algorithm to avoid testing each character of the text more than once. The key observation about the linear nature of the search, which is what makes the algorithm work, is that once part of the text has been matched against a “first portion” of the string, it is possible to determine at which positions the following possible occurrences can begin while still matching up to the current position in the text. In other words, the patterns (sub-parts of the string) are “pre-searched” within the string, and a list is drawn up of all the positions at which the search may resume so as to skip as many useless characters as possible, without sacrificing any potential occurrence.

For each position in the string, one must determine the length of the longest prefix of the string that ends at the current position but does not give a complete match (and which has therefore most probably just failed). Thus T[i] is exactly the length of the longest such proper prefix ending at P[i]. By convention, the empty string has length zero. Since a failure at the very beginning of the string is a special case (there is no possibility of backtracking), one sets T[-1] = -1, as discussed previously.
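The definition of T[i] can be made concrete with a direct, deliberately inefficient computation that tries every candidate length. This sketch only illustrates the definition; it is not the construction algorithm described later, and the name longest_border is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns T[i] in the article's notation (for i >= 0): the length of the
 * longest proper prefix of P that is also a suffix of P[0..i].
 * The name longest_border is illustrative only. */
int longest_border(const char *P, size_t i)
{
    assert(P != NULL && i < strlen(P));

    /* Candidate lengths go from i (the longest proper one) down to 1. */
    for (size_t len = i; len > 0; --len) {
        /* Compare the prefix P[0..len-1] with the suffix P[i-len+1..i]. */
        if (memcmp(P, P + i + 1 - len, len) == 0)
            return (int)len;
    }
    return 0; /* only the empty string qualifies */
}
```

For the string ABCDABD, this gives 1 at position 4 (the second A) and 2 at position 5 (the second B), and 0 everywhere else.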

Returning to the ABCDABD example presented earlier, one can verify that it works on this principle and that it benefits from the same efficiency for this reason. Set T[-1] = -1.

            -1  0  1  2  3  4  5  6
    i:       v
    P:          A  B  C  D  A  B  D
    T:      -1

Since P[0] appears only at the end of the complete prefix, one also sets T[0] = 0.

            -1  0  1  2  3  4  5  6
    i:          v
    P:          A  B  C  D  A  B  D
    T:      -1  0

To determine T[1], the algorithm must find a suffix of AB that is also a prefix of the string. But the only possible proper suffix of AB is B, which is not a prefix of the string. So T[1] = 0.

            -1  0  1  2  3  4  5  6
    i:             v
    P:          A  B  C  D  A  B  D
    T:      -1  0  0

Continuing with C, note that there is a shortcut for checking all the suffixes. Suppose the algorithm had found a suffix two characters long, ending on the C, that is also a prefix of the string; then its first character would be a prefix of a prefix of the string, and hence a prefix itself. Moreover, it would end on the B, for which we know that no match is possible. So there is no need to consider suffixes two characters long and, as in the previous case, the only suffix of length one does not match. Thus T[2] = 0.

In the same way, for D one obtains T[3] = 0.

            -1  0  1  2  3  4  5  6
    i:                   v
    P:          A  B  C  D  A  B  D
    T:      -1  0  0  0  0

For the next A, the preceding principle shows that the longest prefix to consider contains 1 character, and in this case A matches. Thus T[4] = 1.

            -1  0  1  2  3  4  5  6
    i:                      v
    P:          A  B  C  D  A  B  D
    T:      -1  0  0  0  0  1

    P:          A  B  C  D  A  B  D
    P:                      A  B  C  D  A  B  D

The same logic applies to B. If the algorithm had found a pattern starting before the preceding A and continuing with the B currently under consideration, then it would itself contain a valid prefix ending on A yet beginning before A, which contradicts the fact that the algorithm already found that A is the first occurrence of a pattern ending there. Consequently, there is no need to look before the A to find a pattern for B. In fact, on checking, the algorithm finds that the pattern continues with B and that B is the second letter of the prefix whose first letter is A. The entry for B in T is therefore one greater than that of A, i.e. T[5] = 2.

            -1  0  1  2  3  4  5  6
    i:                         v
    P:          A  B  C  D  A  B  D
    T:      -1  0  0  0  0  1  2

    P:          A  B  C  D  A  B  D
    P:                      A  B  C  D  A  B  D

Finally, the pattern does not continue from B to D. The preceding reasoning shows that if a pattern longer than 1 were found ending on D, it would have to contain a pattern ending on B. Since the current pattern does not match, it must be shorter. But the current pattern is a prefix of the string ending at the second position. This new potential pattern would therefore also have to end at the second position, and we have already seen that there is none. Since D is not itself a prefix, T[6] = 0.

            -1  0  1  2  3  4  5  6
    i:                            v
    P:          A  B  C  D  A  B  D
    T:      -1  0  0  0  0  1  2  0

Hence the following table:

    i      -1   0   1   2   3   4   5   6
    P[i]        A   B   C   D   A   B   D
    T[i]   -1   0   0   0   0   1   2   0

Table construction algorithm

The preceding example illustrates the general technique for producing the table with as little effort as possible. The principle is the same as for the search itself: most of the work has already been done on arriving at a new position, and only a little remains to move on to the next one. The description of the algorithm follows. To eliminate special cases, the following convention is applied: P[-1] exists and its value is different from every possible character of P.
  1. Set T[-1] = -1. Suppose that P contains n characters;
  2. Set i = 0 and j = T[i - 1];
  3. If i = n, stop. Otherwise, compare P[i] and P[j];
  4. * If they are equal, set T[i] = j + 1, j = j + 1 and i = i + 1;
  5. * Otherwise, if j > 0, set j = T[j - 1];
  6. * Otherwise, set T[i] = 0, i = i + 1 and j = 0.
  7. Go back to step 3.

Example code for the table construction algorithm

The piece of C code that follows is an implementation of this algorithm. As with the search algorithm, the indices of T are increased by 1 to make the C code more natural. The additional variable c makes it possible to simulate the existence of P[-1]. It is assumed that this function, like the search function, is called from a higher-level function that suitably manages the memory allocation for the table T.

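    void kmp_tableau(char *P)
    {
        extern int T[];   /* table of partial matches, shifted by one */
        int i = 0;
        int j = -1;
        char c = '\0';    /* holds P[j]; simulates P[-1] at the start */

        T[0] = j;
        while (P[i] != '\0') {
            if (P[i] == c) {
                /* the current prefix extends by one character */
                T[i + 1] = j + 1;
                ++j;
                ++i;
            } else if (j > 0) {
                /* fall back to the next shorter candidate prefix */
                j = T[j];
            } else {
                /* no prefix ends at this position */
                T[i + 1] = 0;
                ++i;
                j = 0;
            }
            c = P[j];
        }
    }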
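The higher-level function mentioned above is not given in the article. One possible shape for it is sketched below, under stated assumptions: the names fill_table, search_with_table and kmp_find are hypothetical, the table is passed as a parameter instead of being declared extern, the variable c is replaced by the test j >= 0, and -1 is returned if the allocation fails.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Table construction with a caller-supplied array of strlen(P) + 1 ints. */
static void fill_table(const char *P, int *T)
{
    int i = 0, j = -1;
    T[0] = -1;
    while (P[i] != '\0') {
        if (j >= 0 && P[i] == P[j]) { T[i + 1] = j + 1; ++j; ++i; }
        else if (j > 0)             { j = T[j]; }
        else                        { T[i + 1] = 0; ++i; j = 0; }
    }
}

/* Search loop reading the caller-supplied table. */
static int search_with_table(const char *P, const char *S, const int *T)
{
    int m = 0, i = 0;
    while (S[m + i] != '\0' && P[i] != '\0') {
        if (S[m + i] == P[i]) {
            ++i;
        } else {
            m += i - T[i];
            if (i > 0)
                i = T[i];
        }
    }
    return (P[i] == '\0') ? m : m + i;
}

/* Hypothetical wrapper: allocates the table, runs both phases, frees it.
 * Returns the position of the first occurrence of P in S, the length of S
 * if there is none, or -1 if memory could not be allocated. */
int kmp_find(const char *P, const char *S)
{
    assert(P != NULL && S != NULL);

    int *T = malloc((strlen(P) + 1) * sizeof *T);
    if (T == NULL)
        return -1;

    fill_table(P, T);
    int position = search_with_table(P, S, T);
    free(T);
    return position;
}
```

With the example of this article, kmp_find("ABCDABD", "ABC ABCDAB ABCDABCDABDE") returns 15. Allocating the table per call keeps the sketch simple; a caller that searches for the same string in many texts would more likely build the table once and reuse it.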

Efficiency of the table construction algorithm

The complexity of the table construction algorithm is O(n), where n denotes the length of P. Apart from the initializations, all the work is done in step 3. It is therefore enough to show that this step runs in O(n), which is done below by examining the quantities i and i - j simultaneously.
  • In the first branch, i - j is preserved, because i and j are increased together. The quantity i, for its part, increases.
  • In the second branch, j is replaced by T[j - 1], which is always strictly less than j (see above), so i - j increases.
  • In the third branch, i is increased but j is not, so i and i - j both increase.
Since i ≥ i - j, this means that at each step either i or a quantity no greater than i increases. Consequently, since the algorithm stops when i = n, it stops after at most 2n iterations of step 3, because i - j starts at 1. Thus the complexity of this algorithm is O(n).

Efficiency of the Knuth-Morris-Pratt algorithm

Since the two parts of the algorithm have complexities of O(l) and O(n) respectively, the complexity of the algorithm as a whole is O(n + l).
