String Matching, or needle in haystack (KMP Algorithm)
Given a string
target, find the index of its first occurrence in stringsource. -1 if such index does not exist.
There is a good youtube video that describes KMP table generation pretty well and I hope you could check it out.
Another good resource is the Wikipedia page of KMP, which explains the purpose of KMP with great examples. But it becomes less straightforward when comes to the pseudo code part.
Okay, here comes the code, we divide the code into two parts, the first one is generating KMP table, which determines how should we move when there is a mismatch, and the second part is the matching function that does the matching according to our KMP table.
Several notes:
- The
failure_table(ortablein this case) is for the target string (needle), not the source string. - Using
posandoffsetcan be more convience than traditionali,jwhich denotes the position in each string accordingly because this eliminates the calulation of the index of first occurrence.
The above code is also the solution of strStr() question on LeetCode and LintCode.
Longest Common Substring
Find the longest common substring of two strings.
Suffix Tree solution of LCS with two strings
First let’s see an examle from Wikipedia
1 | ABABC |
Apparently the longset common substring is ABC, and ABC can be obtained by the following procedures:
- For all suffixes of
ABABCandBABCA, find the longest common prefix of their suffixes - For all prefixes of
ABABCandBABCA, find the longest common suffix of their prefixes
The defintion and implementation of Suffix Tree or Suffix Trie can be found in another article.
Here is the naive solution of it, and the time complexity is O(n^2 + m^2), where n and m are the length of each string.
The main steps of this solution is:
- we insert all suffixes of string
Aand stringBinto a suffix trie. <- this costsO(n^2 + m^2)due to the naive implementation - we search the suffix trie we get and find the deepest ancestor of suffixes of both
AandB. <- this costsO(m+n)time
An improved implementation is as follow.
TODO: gist of an improved version of suffix tree
Here is a youtube video that explains how (Ukkonen’s algorithm)[https://www.youtube.com/watch?v=aPRqocoBsFQ] works
Dynamic Programming solution of LCS with two strings
Find the longset common substring of
mstrings.
Suffix Tree solution of LCS with m strings
Longest Common Subsequence
Reference:
Knuth–Morris–Pratt algorithm
KMP Table Tutorial
Longest common substring problem
Longest common subsequence problem