Random Notes



Edit Distance

Posted on 2015-09-21

The sub-problem of edit distance is the edit distance between string A[0:i] and string B[0:j].

The boundary cases are:

  1. A.length() == 0, then the edit distance is B.length()
  2. B.length() == 0, then the edit distance is A.length()
  3. otherwise, there are three cases (take the minimum of them):
    1. if we align A and B at i and j respectively, then the edit distance is M[i-1][j-1] + (A[i] == B[j] ? 0 : 1)
    2. if we align A[0:i-1] with B[0:j], then we need to perform a delete/add operation, giving M[i-1][j] + 1
    3. if we align A[0:i] with B[0:j-1], then we need to perform a delete/add operation, giving M[i][j-1] + 1

By analyzing the above cases, we can see that the only state we need to keep is two rows, namely M[i-1][0:j] and M[i][0:j].

Therefore, the code is:
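Based on the recurrence and the two-row optimization above, here is a minimal sketch (the helper name editDistance is ours, not from the original post; prev plays the role of M[i-1][0:j] and curr of M[i][0:j]):

```cpp
#include <algorithm>
#include <string>
#include <vector>
using namespace std;

int editDistance(const string& a, const string& b) {
    if (a.empty()) return b.size();  // boundary case 1
    if (b.empty()) return a.size();  // boundary case 2
    vector<int> prev(b.size() + 1), curr(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // row for an empty A
    for (size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // column for an empty B
        for (size_t j = 1; j <= b.size(); ++j) {
            int align = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            int del   = prev[j] + 1;      // align A[0:i-1] with B[0:j]
            int ins   = curr[j - 1] + 1;  // align A[0:i] with B[0:j-1]
            curr[j] = min({align, del, ins});
        }
        swap(prev, curr);  // keep only two rows of state
    }
    return prev[b.size()];
}
```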

Priority Queue, Heap Implementation

Posted on 2015-09-11

A priority queue can be implemented by a binary heap, and a heap is usually implemented in a vector so that we do not need to keep multiple pointers to each node's parent and two children.

If we use a vector v to store a heap, it satisfies the following rules:

  1. the root of this heap is stored at v[1], and v[0] is never used
  2. for any sub heap in this vector, if v[pos] is the root, v[2*pos] and v[2*pos + 1] are its children.
  3. for any node (except the root) in the heap, v[pos]'s parent is v[pos/2].

The main operations of a heap are insert and pop. When the heap is a min heap, pop removes the smallest value in the heap, and when it is a max heap, it removes the largest one. Since we need to keep the root as the smallest or largest element, every time we modify the heap we need to reorder it so that the root stays the smallest or largest.

Therefore, we need to define two algorithms to reorder the heap, one for insert and one for pop.

Insert

Insert is done by attaching an element to the rightmost node and then sifting it up until the heap property holds again.
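A minimal min-heap sketch following the vector layout above (MinHeap is an illustrative name; pop's sift-down is included for completeness, matching the two reorder algorithms mentioned earlier):

```cpp
#include <utility>
#include <vector>
using namespace std;

struct MinHeap {
    vector<int> v{0};  // v[0] is a placeholder and never used

    void insert(int x) {
        v.push_back(x);  // attach to the rightmost node
        size_t pos = v.size() - 1;
        while (pos > 1 && v[pos] < v[pos / 2]) {  // sift up while smaller than parent
            swap(v[pos], v[pos / 2]);
            pos /= 2;
        }
    }

    int pop() {
        int top = v[1];      // the root is the minimum
        v[1] = v.back();     // move the rightmost node to the root
        v.pop_back();
        size_t pos = 1;
        while (2 * pos < v.size()) {  // sift down: swap with the smaller child
            size_t child = 2 * pos;
            if (child + 1 < v.size() && v[child + 1] < v[child]) ++child;
            if (v[pos] <= v[child]) break;
            swap(v[pos], v[child]);
            pos = child;
        }
        return top;
    }
};
```

In practice std::priority_queue provides these operations out of the box; the sketch only shows the sift-up/sift-down mechanics.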

Trie, Prefix Tree, and Suffix Tree

Posted on 2015-09-06

A Trie is a tree with k different types of children, where the number of children of each type is 0 or 1.

A Prefix Tree is a Trie that contains all the prefixes of a string or strings.

A Suffix Tree is a Trie that contains all the suffixes of a string or strings.

Reference:

Suffix Trees

String Problems

Posted on 2015-09-06   |   In code interview

String Matching, or needle in haystack (KMP Algorithm)

Given a string target, find the index of its first occurrence in string source, or -1 if no such index exists.

There is a good YouTube video that describes KMP table generation pretty well, and I hope you will check it out.

Another good resource is the Wikipedia page on KMP, which explains the purpose of KMP with great examples, but it becomes less straightforward when it comes to the pseudo code part.

Okay, here comes the code. We divide it into two parts: the first one generates the KMP table, which determines how we should move when there is a mismatch, and the second part is the matching function that does the matching according to our KMP table.

Several notes:

  1. The failure_table (or table in this case) is for the target string (needle), not the source string.
  2. Using pos and offset can be more convenient than the traditional i, j, which denote the position in each string, because this eliminates the calculation of the index of the first occurrence.
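A hedged reconstruction of the two parts, using the standard prefix-function formulation (note it tracks a matched-length counter k rather than the pos/offset pair mentioned above, so it is a sketch of the technique rather than the post's exact code):

```cpp
#include <string>
#include <vector>
using namespace std;

// Failure table for the needle: table[i] is the length of the longest proper
// prefix of needle[0:i] that is also a suffix of it.
vector<int> buildFailureTable(const string& needle) {
    vector<int> table(needle.size(), 0);
    size_t k = 0;  // length of the currently matched prefix
    for (size_t i = 1; i < needle.size(); ++i) {
        while (k > 0 && needle[i] != needle[k]) k = table[k - 1];  // fall back
        if (needle[i] == needle[k]) ++k;
        table[i] = k;
    }
    return table;
}

int strStr(const string& haystack, const string& needle) {
    if (needle.empty()) return 0;
    vector<int> table = buildFailureTable(needle);
    size_t k = 0;  // characters of needle matched so far
    for (size_t i = 0; i < haystack.size(); ++i) {
        while (k > 0 && haystack[i] != needle[k]) k = table[k - 1];  // mismatch: shift
        if (haystack[i] == needle[k]) ++k;
        if (k == needle.size()) return i - k + 1;  // index of first occurrence
    }
    return -1;
}
```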

The above code is also the solution of strStr() question on LeetCode and LintCode.

Longest Common Substring

Find the longest common substring of two strings.

Suffix Tree solution of LCS with two strings

First let’s see an example adapted from Wikipedia:

ABABC
 ||||
 BABCA

Here the longest common substring is BABC, and it can be obtained by either of the following procedures:

  1. For all suffixes of ABABC and BABCA, find the longest common prefix among those suffixes
  2. For all prefixes of ABABC and BABCA, find the longest common suffix among those prefixes

The definition and implementation of the Suffix Tree or Suffix Trie can be found in another article.

Here is a naive solution; its time complexity is O(n^2 + m^2), where n and m are the lengths of the two strings.

The main steps of this solution are:

  1. we insert all suffixes of string A and string B into a suffix trie. <- this costs O(n^2 + m^2) due to the naive implementation
  2. we search the suffix trie and find the deepest node that lies on suffixes of both A and B. <- this costs O(m+n) time
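The two steps above can be sketched as follows (Node, insertSuffix, and longestCommonSubstring are illustrative names; trie nodes are leaked for brevity):

```cpp
#include <map>
#include <string>
using namespace std;

// Each trie node records which of the two input strings pass through it.
struct Node {
    map<char, Node*> children;
    bool inA = false, inB = false;
};

// Insert one suffix, marking every node on the path; O(n) per suffix,
// hence O(n^2) for all suffixes of a string of length n.
void insertSuffix(Node* root, const string& s, size_t start, bool isA) {
    Node* cur = root;
    for (size_t i = start; i < s.size(); ++i) {
        if (!cur->children.count(s[i])) cur->children[s[i]] = new Node();
        cur = cur->children[s[i]];
        if (isA) cur->inA = true; else cur->inB = true;
    }
}

// DFS for the deepest path whose nodes belong to suffixes of both strings.
void dfs(const Node* node, string& path, string& best) {
    if (path.size() > best.size()) best = path;
    for (const auto& kv : node->children) {
        if (kv.second->inA && kv.second->inB) {
            path.push_back(kv.first);
            dfs(kv.second, path, best);
            path.pop_back();
        }
    }
}

string longestCommonSubstring(const string& a, const string& b) {
    Node root;
    for (size_t i = 0; i < a.size(); ++i) insertSuffix(&root, a, i, true);
    for (size_t i = 0; i < b.size(); ++i) insertSuffix(&root, b, i, false);
    string path, best;
    dfs(&root, path, best);
    return best;
}
```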

An improved implementation is as follows.

TODO: gist of an improved version of suffix tree

Here is a youtube video that explains how [Ukkonen’s algorithm](https://www.youtube.com/watch?v=aPRqocoBsFQ) works.

Dynamic Programming solution of LCS with two strings
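The classic DP keeps dp[i][j], the length of the longest common suffix of A[0:i] and B[0:j]; the longest common substring is the maximum over all cells, in O(nm) time. A minimal sketch (the function name is illustrative):

```cpp
#include <string>
#include <vector>
using namespace std;

string longestCommonSubstringDP(const string& a, const string& b) {
    // dp[i][j] = length of the longest common suffix of a[0:i] and b[0:j]
    vector<vector<size_t>> dp(a.size() + 1, vector<size_t>(b.size() + 1, 0));
    size_t best = 0, end = 0;  // best length and its end position in a
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;  // extend the common suffix
                if (dp[i][j] > best) { best = dp[i][j]; end = i; }
            }
        }
    }
    return a.substr(end - best, best);
}
```

Like the two-row trick in the edit distance post, dp only ever reads the previous row, so the memory can be reduced to O(m).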

Find the longest common substring of m strings.

Suffix Tree solution of LCS with m strings

Longest Common Subsequence

Reference:

Knuth–Morris–Pratt algorithm
KMP Table Tutorial
Longest common substring problem
Longest common subsequence problem

Graph Traversal -- DFS, BFS, and BFS From Both Sides

Posted on 2015-08-27   |   In code interview

Depth First Search

The key point of DFS is to keep track of all visited nodes and to check the end condition correctly. Usually the end condition is reaching a certain length or a certain node.
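A minimal recursive sketch of this idea, assuming an adjacency-list graph (the representation and names are ours):

```cpp
#include <vector>
using namespace std;

// End condition here: reaching targetNode. visited prevents revisiting nodes.
void dfs(int node, int targetNode, const vector<vector<int>>& adj,
         vector<bool>& visited, bool& found) {
    if (node == targetNode) { found = true; return; }  // end condition
    for (int next : adj[node]) {
        if (!visited[next] && !found) {
            visited[next] = true;  // mark before recursing
            dfs(next, targetNode, adj, visited, found);
        }
    }
}
```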

Breadth First Search

The key point of BFS is also keeping track of visited nodes. Unlike DFS, which uses recursion, BFS uses a queue to do layer-by-layer traversal.

In this implementation, we do not keep the level, or depth, of the nodes we visit. A simple way to do that is to insert an indicator that marks the end of a layer:

deque<Node*> frontier;
vector<bool> visited(GRAPH_SIZE, false);
frontier.push_back(startNode);
visited[startNode->label] = true;
int level = 0; // start node is at level 0
while (!frontier.empty()) {
    // push a nullptr into the queue as a marker; all nodes before it belong to the same level
    frontier.push_back(nullptr);
    while (frontier.front() != nullptr) {
        /* do normal work here, push_back each unvisited neighbor and mark it visited */
        frontier.pop_front(); // remove the current node
    }
    frontier.pop_front(); // we reached the end of the level, add level by 1
    ++level;
}

Breadth First Search with Two Frontiers

The following code is for the Word Ladder problem. I used BFS with two frontiers to solve this problem. The details can be seen in the code comments.
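A hedged sketch of the two-frontier idea (not necessarily the post's exact code): expand the smaller frontier each round and stop as soon as the two frontiers meet.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
using namespace std;

// Returns the number of words on the shortest ladder, including begin, or 0.
int ladderLength(const string& begin, const string& end,
                 unordered_set<string> dict) {
    if (!dict.count(end)) return 0;
    unordered_set<string> front{begin}, back{end};
    int steps = 1;  // counts words on the path, including begin
    while (!front.empty() && !back.empty()) {
        if (front.size() > back.size()) swap(front, back);  // expand the smaller side
        unordered_set<string> next;
        ++steps;
        for (const string& word : front) {
            string cur = word;
            for (size_t i = 0; i < cur.size(); ++i) {
                char saved = cur[i];
                for (char c = 'a'; c <= 'z'; ++c) {
                    if (c == saved) continue;
                    cur[i] = c;
                    if (back.count(cur)) return steps;  // the two frontiers met
                    if (dict.count(cur)) {
                        next.insert(cur);
                        dict.erase(cur);  // never visit a word twice
                    }
                }
                cur[i] = saved;
            }
        }
        front = move(next);
    }
    return 0;  // no ladder exists
}
```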

Hide One or More Legends in ggplot2

Posted on 2015-08-25

This is actually pretty simple thanks to this StackOverflow post.

To disable the legend of a scale, just set guides(SCALE_NAME=FALSE); the following example hides the legend for color.

ggplot(mpg, aes(x=cty,y=hwy,color=manufacturer,shape=class)) + geom_point() + guides(color=FALSE)

Ugly Number

Posted on 2015-08-19   |   In code interview

Ugly Number and Find $n^{th}$ Ugly Number

The question description can be found at the following site:

Leetcode: (263) Ugly Number

The definition of an ugly number is a number whose prime factors only contain 2, 3, and 5 (1 is also an ugly number, as an exception). Therefore, a list of ugly numbers is

index:  1  2  3  4  5  6  7  8  9  10 ...
number: 1  2  3  4  5  6  8  9  10 12 ...

The first problem, 263 Ugly Number gives an integer as input and the output is whether this is an ugly number or not. In order to solve this problem, we need to know how the ugly numbers are constructed.

Namely, any ugly number can be written as $2^n \times 3^m \times 5^k$. Therefore, to validate an ugly number, all we need to do is repeatedly divide the number by 2, 3, and 5 and check whether the final quotient is 1.
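This divide-out check can be sketched as follows (isUgly is an assumed name matching the LeetCode signature):

```cpp
bool isUgly(int num) {
    if (num <= 0) return false;      // 0 and negatives are not ugly
    while (num % 2 == 0) num /= 2;   // divide out every factor of 2
    while (num % 3 == 0) num /= 3;   // ... every factor of 3
    while (num % 5 == 0) num /= 5;   // ... every factor of 5
    return num == 1;                 // nothing else may remain
}
```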

Find $n^{th}$ Ugly Number

The question description can be found at the following site:

Leetcode: (264) Ugly Number II

This is a follow-up of the aforementioned ugly number problem. It is easy to validate an ugly number, but generating the $n^{th}$ ugly number takes some effort.

Since we know any ugly number can be represented by $2^n \times 3^m \times 5^k$, we denote this as (n,m,k). Therefore, the previous ugly number list becomes

index:    1       2       3       4       5       6       7       8       9       10      ...
number:   1       2       3       4       5       6       8       9       10      12      ...
(n,m,k):  (0,0,0) (1,0,0) (0,1,0) (2,0,0) (0,0,1) (1,1,0) (3,0,0) (0,2,0) (1,0,1) (2,1,0) ...

If we maintain three ordered queues, where the first queue multiplies every new element pushed into it by 2, and the second and third multiply by 3 and 5 respectively, we can get the result by merging these three queues. The problem is, how should we construct the queues?

An intuitive idea is that an ugly number (a,b,c) can construct three future ugly numbers (a+1,b,c), (a,b+1,c), and (a,b,c+1). If a=1, b=0, c=0, the three ugly numbers are (2,0,0), (1,1,0), and (1,0,1). Unfortunately, although we know (2,0,0) < (1,1,0) < (1,0,1), we do not know how many other ugly numbers lie between them based on the information we have so far. The good news is, for (a',b',c') > (a,b,c), we know that (a'+1,b',c') > (a+1,b,c), (a',b'+1,c') > (a,b+1,c), and (a',b',c'+1) > (a,b,c+1). That means if we keep multiplying each new ugly number by 2, 3, and 5 and push the products into the corresponding queues, the numbers in each queue are always in order.

Let’s do an example to illustrate this:

At the beginning, each queue contains only the element 1

nth: 0
val: unknown

queue2: 1
queue3: 1
queue5: 1

We check the front element of each queue and remove the smallest one, which is 1, and then we construct three new numbers and put them into three queues.

Note that we removed all 1s from the three queues.

Bracketed numbers have already been removed

nth: 1
val: 1

queue2: [1] 2
queue3: [1] 3
queue5: [1] 5

Then we check the fronts again and remove 2, then we construct three more ugly numbers using 2.

nth: 2
val: 2

queue2: [1] [2] 4
queue3: [1] 3 6
queue5: [1] 5 10

and so on.

nth: 3
val: 3

queue2: [1] [2] 4 6
queue3: [1] [3] 6 9
queue5: [1] 5 10 15
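The procedure illustrated above can be sketched as follows (nthUglyNumber is an assumed name matching the LeetCode signature): pop the smallest front, remove that value from every queue, then push the value times 2, 3, and 5 into the respective queues.

```cpp
#include <algorithm>
#include <deque>
#include <initializer_list>
using namespace std;

long nthUglyNumber(int n) {
    deque<long> q2{1}, q3{1}, q5{1};
    long val = 1;
    for (int i = 0; i < n; ++i) {
        val = min({q2.front(), q3.front(), q5.front()});
        // remove val from every queue so each ugly number is produced once
        if (q2.front() == val) q2.pop_front();
        if (q3.front() == val) q3.pop_front();
        if (q5.front() == val) q5.pop_front();
        q2.push_back(val * 2);
        q3.push_back(val * 3);
        q5.push_back(val * 5);
    }
    return val;
}
```

The usual space optimization keeps a single output array with three read pointers instead of three real queues, but the version above mirrors the walkthrough.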

Algorithm Problems: Binary Search

Posted on 2015-08-18   |   In code interview

Binary Search Basics

If you know what binary search is, I encourage you to skip the first section and go to the questions.

Before we dive into real problems, I want to briefly discuss binary search. Binary search is a technique that can check for the existence of a value in a sorted array in O(logn) time, where n is the size of that array.

An example array v, which is sorted, is shown below:

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39

Say we want to determine whether 10 is in the array or not. An intuitive solution is to walk through this array and check v[i] == 10. The time complexity of such an algorithm is O(n). Note that this is usually the best solution if the array is unsorted, because binary search cannot be applied to an unsorted array (a rotated, sorted array is an exception and we will talk about it later).

To reduce the time complexity of searching for a value in sorted array from O(n) to O(logn), we will use binary search next.

The C++ snippet is as follows:
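A sketch consistent with the walkthrough that follows — it stops when l == r - 1 and then checks v[l] and v[r] (binarySearch is an assumed name):

```cpp
#include <vector>
using namespace std;

// Returns an index of target in the sorted vector v, or -1 if absent.
int binarySearch(const vector<int>& v, int target) {
    if (v.empty()) return -1;            // invalid input detection
    int l = 0, r = (int)v.size() - 1;
    while (l < r - 1) {                  // stop when l and r are adjacent
        int m = (l + r) / 2;
        if (v[m] < target) l = m;        // target can only be in v[m:r]
        else r = m;                      // target can only be in v[l:m]
    }
    if (v[l] == target) return l;        // boundary check
    if (v[r] == target) return r;
    return -1;
}
```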

Let’s walk through the code using our example:

At the very beginning, l = 0 and r = 10. v[l:r], where l:r means [l,l+1,...,r-1,r], denotes the search range; in other words, we know that if the target value exists in v, it must lie in v[l:r].

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
       ^                             ^
label: l                             r

Then we compute the median v[m] of v[l:r], which is the value right in the middle of v[l:r]. Note that we are not necessarily computing the exact median when there is an even number of elements in the range. That is to say, we are calculating $v[ \lfloor \frac{l+r}{2} \rfloor ]$.

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
       ^              ^              ^
label: l              m              r

As we can see, v[m] = 9 and v[m] < 10. Since v[m] = max(v[0:m]), we can safely say that if 10 is in v, it must lie in v[m:r].

With that in mind, we narrow our search space to v[5:10].

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
                      ^              ^
label:                l              r

After we reduce the search range, we calculate m by (l+r)/2. Note that l, r, and 2 are integers, therefore the fractional part of (l+r)/2 is discarded.

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
                      ^     ^        ^
label:                l     m        r

This time v[m] is larger than 10. Because v[m] = min(v[m:(v.size()-1)]), we can safely conclude that 10 won’t appear in v[m:10]. So we change our search space to v[l:m] by setting r = m.

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
                      ^  ^  ^
label:                l  m  r

Now v[m] = 11, which is still larger than our target 10, therefore we set r = m. Note that in our C++ snippet we stop when l == r - 1, that is, when l and r are adjacent.

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  4  5  7  9  11 13 20 21 39
                      ^  ^
label:                l  r

Then we test v[l] and v[r] to see if either equals 10. Unfortunately neither of them is 10, therefore we say 10 is not in the array v. One may wonder why we compare v[r], since r was m in the previous step and we already compared v[m] then. Things may get tricky when one tries to save these two comparisons. An example: what if l == m? In that case we do not need to compare v[l], but we still need v[r].

Binary Search Questions

There are five main parts of a binary search algorithm:

  1. Invalid input detection. One needs to filter out empty input, because an empty array makes r == -1.
  2. Iteration stop condition. Usually I like to use l < r - 1 because it is easy to implement and avoids corner cases. But you can always use l < r - k, where k >= 0.
  3. Pruning condition. The classic binary search uses v[m] < target or v[m] > target, but this depends on your problem.
  4. Boundary condition. By using l < r - 1, one does not need extra detection but simply checks the values of v[l] and v[r].
  5. Boundary check. Check the boundary values v[l] and v[r] and deal with the case when target is not found.

Problems on Sorted Arrays

Binary Search

The question description can be found at the following site:

Lintcode: (14) Binary Search

This is the basic binary search problem; the code can be found above.

Search insert position

The question description can be found at the following sites:

Lintcode: (60) Search insert position
Leetcode: (35) Search insert position

The answer to this question is a twisted version of classic binary search. Namely, we just need to change the invalid input detection and the boundary check.

The invalid input detection should output 0 instead of -1, because when the array is empty, target would be inserted at index 0.

As for the boundary check, we return l or r if v[l] == target or v[r] == target. If neither equals target, we need to determine where to insert target: the answer is the index of the smallest value in v that is larger than target (or v.size() if no such value exists).

Wood Cut

Find Bad Version

sqrt(x): find the largest integer i such that pow(i,2) is equal to or less than x

The question description can be found at the following sites:

Lintcode: (141) sqrt(x)
Leetcode: (69) sqrt(x)

This problem is also a variant of the binary search problem. One thing we need to mention is how to calculate sqrt(2). The question does not require you to return 1.414 but 1 (the largest integer i that satisfies i^2 <= x). Therefore, we need to change the pruning condition to comparing m^2 and x, where m is the midpoint.
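A sketch using the same l < r - 1 pattern, with the pruning condition on m*m instead of an array lookup (mySqrt is an assumed name matching the LeetCode signature):

```cpp
int mySqrt(int x) {
    if (x < 2) return x;          // isqrt(0) = 0, isqrt(1) = 1
    long l = 1, r = x;            // the answer lies in [1, x]
    while (l < r - 1) {
        long m = (l + r) / 2;
        if (m * m <= x) l = m;    // prune on m^2 vs x; long avoids overflow
        else r = m;
    }
    return static_cast<int>(r * r <= x ? r : l);  // boundary check
}
```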

Find value range in sorted array containing duplicate values

The question description can be found at the following sites:

Lintcode: (61) Search for a Range
Leetcode: (34) Search for a Range

If we keep using the old algorithm, we can find v[index] = target, but we cannot guarantee which index is returned. An example is:

index: 0  1  2  3  4  5  6  7  8  9  10
value: 1  3  3  3  4  9  11 13 20 21 39
             ^
label:       m

Here we return index = 2, but we can not return the correct range of 3, which is [1,3].

A simple solution is to search the left and right parts of index one by one until we find an element not equal to target or reach the boundary. This is doable, but in the worst case it costs O(n) time, which is not what we want.

To solve this problem in O(logn) time as usual, we need to twist the traditional binary search algorithm. Remember, we keep reducing the search space by comparing v[mid] to target, and we stop when v[mid] == target. What if we change the comparison? I do not know how to find the whole range in one binary search, so I break the result into two parts, namely finding the leftmost target and the rightmost target.

To find the left most target in this array, we should change the comparison from

if (v[m] > target) {
    r = m;
} else {
    l = m;
}

to

if (v[m] >= target) {
    r = m;
} else {
    l = m;
}

This way, if we find v[m] == target, we will not stop. Instead we keep searching the left part while keeping v[r] == target. When l == r - 1, we just check whether v[l] == target, which is unlikely in this case, or v[r] == target, and return the index of the leftmost one that equals target. By doing this, we find the leftmost target.

The way of finding the rightmost target is similar and is shown in the following code.
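Combining both directions, a hedged sketch (searchRange is an assumed name matching the LeetCode signature; the rightmost search simply mirrors the comparison):

```cpp
#include <vector>
using namespace std;

vector<int> searchRange(const vector<int>& v, int target) {
    vector<int> res{-1, -1};
    if (v.empty()) return res;
    // leftmost: v[m] >= target keeps v[r] == target while searching left
    int l = 0, r = (int)v.size() - 1;
    while (l < r - 1) {
        int m = (l + r) / 2;
        if (v[m] >= target) r = m;
        else l = m;
    }
    if (v[l] == target) res[0] = l;
    else if (v[r] == target) res[0] = r;
    if (res[0] == -1) return res;  // target not present at all
    // rightmost: v[m] <= target keeps v[l] == target while searching right
    l = 0; r = (int)v.size() - 1;
    while (l < r - 1) {
        int m = (l + r) / 2;
        if (v[m] <= target) l = m;
        else r = m;
    }
    if (v[r] == target) res[1] = r;
    else res[1] = l;
    return res;
}
```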

Problems on Rotated, Sorted Arrays

Search in Rotated Sorted Array

Search in Rotated Sorted Array II

Find Minimum in Rotated Sorted Array

Find Minimum in Rotated Sorted Array II

Find Peak Element

Other Binary Search Problems

Search A 2D Matrix

© 2015 - 2016 Baoxu(Dash) Shi
Powered by Hexo