DSA: Strings Pattern Matching

What is Pattern Matching?

Pattern matching is a fundamental problem in computer science that is employed in a wide range of applications. Pattern matching in the context of strings refers to detecting instances of a given pattern (substring) inside a bigger string.

Algorithm for Pattern Matching:

There are several popular algorithms and techniques for string pattern matching. Here are some commonly used ones:

Naive Pattern Matching:

The Naive Pattern Matching method is the most basic and easiest technique. It compares all characters in the main string to the pattern. It has a time complexity of O(N*M), where N is the text length and M is the pattern length.

class NaivePatternMatching {
    static void search(char[] pattern, char[] text) {
        int m = pattern.length;
        int n = text.length;

        for (int i = 0; i <= n - m; i++) {
            int j;

            for (j = 0; j < m; j++) {
                if (text[i + j] != pattern[j])
                    break;
            }

            if (j == m)
                System.out.println("Pattern found at index " + i);
        }
    }

    public static void main(String[] args) {
        char[] text = "AABAACAADAABAAABAA".toCharArray();
        char[] pattern = "AABA".toCharArray();

        search(pattern, text);
    }
}

Explanation:

The search method takes two parameters: the pattern (the pattern to search for) and the text (the text in which to search for the pattern).
It initializes variables m and n with the lengths of the pattern and text, respectively.
The outer for loop iterates from 0 to n - m (inclusive). This loop controls the starting position of the pattern in the text.
Inside the outer loop, there is an inner for loop that iterates over the characters of the pattern. It compares each character of the pattern with the corresponding character in the text starting from the current position.
If a mismatch is found (i.e., the characters at the current positions don’t match), the inner loop breaks, and the search continues with the next starting position.
If the inner loop completes without a break, it means that the entire pattern has been matched at the current starting position. In this case, it prints the message “Pattern found at index ” followed by the current starting index.
The main method is where the pattern search is invoked. It initializes the text and pattern arrays with the provided strings, and then calls the search method, passing the pattern and text arrays as arguments.
When you run the program, it searches for the pattern “AABA” within the text “AABAACAADAABAAABAA”. The output shows the indices where the pattern is found: 0, 9, and 13.

Knuth-Morris-Pratt (KMP) Algorithm:

The KMP method enhances pattern-matching performance by utilizing a preprocessed table known as the Longest Proper Prefix, which is also a Suffix (LPS) table. It prevents redundant character comparisons based on the LPS table information. The KMP algorithm has a time complexity of O(N+M).

class KMPAlgorithm {
    static void KMPAlgorithm(String pattern, String text) {
        int m = pattern.length();
        int n = text.length();
        int[] lps = computeLPSArray(pattern, m);
        int i = 0; // index for text[]
        int j = 0; // index for pattern[]

        while (i < n) {
            if (pattern.charAt(j) == text.charAt(i)) {
                i++;
                j++;
            }

            if (j == m) {
                System.out.println("Pattern found at index " + (i - j));
                j = lps[j - 1];
            } else if (i < n && pattern.charAt(j) != text.charAt(i)) {
                if (j != 0) {
                    j = lps[j - 1];
                } else {
                    i++;
                }
            }
        }
    }

    static int[] computeLPSArray(String pattern, int m) {
        int[] lps = new int[m];
        int len = 0;
        lps[0] = 0;
        int i = 1;

        while (i < m) {
            if (pattern.charAt(i) == pattern.charAt(len)) {
                len++;
                lps[i] = len;
                i++;
            } else {
                if (len != 0) {
                    len = lps[len - 1];
                } else {
                    lps[i] = 0;
                    i++;
                }
            }
        }
        return lps;
    }

    public static void main(String[] args) {
        String text = "AABAACAADAABAAABAA";
        String pattern = "AABA";

        KMPAlgorithm(pattern, text);
    }
}

Explanation:

The KMPAlgorithm function takes the pattern and text strings as input.
It initializes variables, including the length of the pattern (m), length of the text (n), and the Longest Proper Prefix which is also a Suffix array (lps), which is computed using the computeLPSArray function.
The algorithm iterates through the text while comparing characters of the pattern with the corresponding characters in the text.
If a match is found, it increments both the pattern index (j) and text index (i).
If the entire pattern is matched (j == m), it prints the index at which the pattern is found and updates the pattern index using the lps array.
If there is a mismatch and j is not at the beginning of the pattern, it updates the pattern index using the lps array.
If there is a mismatch and j is at the beginning of the pattern, it increments the text index.
The computeLPSArray function computes the Longest Proper Prefix which is also a Suffix array for the given pattern using the two-pointer technique.
The main function demonstrates an example usage of the algorithm by calling KMPAlgorithm with a sample text and pattern.

Rabin-Karp Algorithm:

The Rabin-Karp algorithm searches for patterns in text using hashing. It compares the hash value of the pattern to the hash values of the text’s substrings. It has an average time complexity of O(N+M), where N is the text length and M is the pattern length.

class RabinKarpAlgorithm {
    static final int d = 256; // Number of characters in the input alphabet

    static void search(String pattern, String text, int q) {
        int m = pattern.length();
        int n = text.length();
        int p = 0; // Hash value for pattern
        int t = 0; // Hash value for text
        int h = 1;

        // Calculate the hash value of the pattern and the first window of text
        for (int i = 0; i < m - 1; i++)
            h = (h * d) % q;

        for (int i = 0; i < m; i++) {
            p = (d * p + pattern.charAt(i)) % q;
            t = (d * t + text.charAt(i)) % q;
        }

        // Slide the pattern over the text one by one
        for (int i = 0; i <= n - m; i++) {
            // If the hash values of the current window in text and pattern match, check for exact match
            if (p == t) {
                boolean found = true;
                for (int j = 0; j < m; j++) {
                    if (text.charAt(i + j) != pattern.charAt(j)) {
                        found = false;
                        break;
                    }
                }
                if (found)
                    System.out.println("Pattern found at index " + i);
            }

            // Calculate the hash value for the next window of text
            if (i < n - m) {
              t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) %q;
                if (t < 0)
                    t = t + q;
            }
        }
    }

    public static void main(String[] args) {
        String text = "AABAACAADAABAAABAA";
        String pattern = "AABA";
        int q = 101; // A prime number

        search(pattern, text, q);
    }
}

Explanation:

The search function takes the pattern, text, and a prime number q as input.
It initializes variables, including the lengths of the pattern (m) and text (n), hash values for pattern (p) and text (t), and a value h for the highest significant digit.
It calculates the hash value of the pattern and the first window of the text.
It iterates through the text, comparing the hash values of the current window in the text and pattern.
If the hash values match, it performs an additional check for an exact match by comparing characters.
If a match is found, it prints the index at which the pattern is found.
It calculates the hash value for the next window of the text using a rolling hash technique.
The main function demonstrates an example usage of the algorithm by calling search with a sample text, pattern, and a prime number.

Finite Automaton Algorithm:

The Finite Automaton Algorithm generates a finite automaton based on the pattern and utilizes it to do pattern matching. It prepares the pattern by generating a state transition table. The time complexity of the algorithm is O(N+M), where N is the length of the text and M is the length of the pattern.

import java.util.Arrays;

class FiniteAutomatonAlgorithm {
    static final int NO_OF_CHARS = 256;

    static int[][] computeTransitionFunction(String pattern, int m) {
        int[][] TF = new int[m + 1][NO_OF_CHARS];
        int lps = 0;

        // Fill the entries in TF[][] based on pattern[]
        for (int state = 0; state <= m; state++) {
            for (int x = 0; x < NO_OF_CHARS; x++)
                TF[state][x] = getNextState(pattern, m, state, x);
        }

        return TF;
    }

    static int getNextState(String pattern, int m, int state, int x) {
        // If the character at the current state matches the input character, move to the next state
        if (state < m && x == pattern.charAt(state))
            return state + 1;

        int nextState, i;

        // Find the longest prefix suffix
        for (nextState = state; nextState > 0; nextState--) {
            if (pattern.charAt(nextState - 1) == x) {
                for (i = 0; i < nextState - 1; i++) {
                    if (pattern.charAt(i) != pattern.charAt(state - nextState + 1 + i))
                        break;
                }
                if (i == nextState - 1)
                    return nextState;
            }
        }

        return 0;
    }

    static void search(String pattern, String text) {
        int m = pattern.length();
        int n = text.length();

        int[][] TF = computeTransitionFunction(pattern, m);

        // Process text over the finite automaton to find occurrences of the pattern
        int state = 0;
        for (int i = 0; i < n; i++) {
            state = TF[state][text.charAt(i)];
            if (state == m)
                System.out.println("Pattern found at index " + (i - m + 1));
        }
    }

    public static void main(String[] args) {
        String text = "AABAACAADAABAAABAA";
        String pattern = "AABA";

        search(pattern, text);
    }
}

Explanation:

The computeTransitionFunction function calculates the transition function (TF) for the given pattern.
It initializes a 2D array TF with dimensions (m + 1) x NO_OF_CHARS, where m is the length of the pattern and NO_OF_CHARS is the number of characters in the input alphabet (e.g., ASCII).
It fills the entries in TF by iterating through each state and character, using the getNextState function to determine the next state based on the current state and input character.
The getNextState function calculates the next state based on the current state and the input character.
It checks if the character at the current state matches the input character. If so, it moves to the next state.
If there is no match, it finds the longest prefix suffix and updates the next state accordingly.
The search function performs the string search using the finite automaton.
It initializes the state as 0 and processes the text character by character.
It updates the state using the transition function TF[state][text.charAt(i)].
If the state becomes equal to the length of the pattern, a match is found, and it prints the index where the pattern occurs.
The main function demonstrates an example usage of the algorithm by calling search with a sample text and pattern.

The time and space complexity of the pattern-matching algorithms discussed:

Algorithm	Time Complexity	Space Complexity
Naive Pattern Searching	O((n – m + 1) * m)	O(1)
KMP Algorithm	O(n + m)	O(m)
Rabin-Karp Algorithm	O((n – m + 1) * m)	O(1)
Finite Automaton Algorithm	O(n)	O(m * NO_OF_CHARS)

where,

n refers to the length of the text.
m refers to the length of the pattern.
NO_OF_CHARS refers to the number of characters in the input alphabet.

Note: also read about DSA: Binary Search

Follow Me

If you like my post please follow me to read my latest post on programming and technology.

https://www.instagram.com/coderz.py/

https://www.facebook.com/coderz.py

Rabecca Fatima

Staying up to the mark is what defines me. Hi all! I’m Rabecca Fatima a keen learner, great enthusiast, ready to take new challenges as stepping stones towards flying colors.

NextProblems based on Pattern Matching »

Previous « DSA: Binary Search

coderz.py