Longest Increasing Digital Subsequence

A variation on longest increasing subsequence problem.

There are a lot of discussions on the famous longest increasing subsequence (LIS) problem, and plentiful of efficient algorithms with different methods prevail. Recently I found a variation of the LIS problem, which is called longest increasing digital subsequence (LIDS) problem. It comes from an algorithm course of UIUC, and I think it is quite interesting to solve this problem. First thing first, the definition of LIDS is following:

A digital subsequence of D is a sequence of positive integers composed in the usual way from disjoint substrings of D. For example, '34, 5, 9, 26, 358, 9' is a digital subsequence of sequence '3, 4, 5, 9, 2, 6, 3, 5, 8, 9'. The numbers in the digital subsequence can be a combination of any form of the digits of the given sequence, as long as there is no repetition or wrong order. An increasing digital subsequence means the numbers in the digital subsequence need to be strictly increasing, and the previous example is not. A positive example is '3, 4, 5, 9, 26, 35, 89, 93, 238, 462' of length 10, generated from sequence '3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5 ,8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6'. The problem is, given D, generate its longest digital subsequence.

Since LIDS is known to be a variation of LIS, dynamic programming seems to be a reasonable try. The horizontal axis is now settled as the given sequence. It is important to figure out what type of values to store and reuse for dynamic programming. The difficulty here is that numbers of variant lengths can be used for the sequence. Well, this seems to be a choice for vertical axis: each row should represent the digital subsequence made from a fixed number of digits, the first row being the original sequence, and the last row being a huge number made by all the digits. Another thought comes to mind is that a subsequence is an extension of a shorter subsequence. This is the key for dynamic programming problems.

So this is the construction of our algorithm. An Excel sheet for illustration:

Initialization

Diagonal of the matrix represents the first number of length n, and being itself means the longest subsequence is 1. Now start iterating over each row, just as how normal LIS DP method does. Remember that for each row, the iteration starts from the indices that are initialized with 1's. Note how this matrix becomes triangular. If you are not familiar with that, here's a short recap:

Given an array A[0 ... n - 1], initialize a storage array D[0 ... n - 1] of 1's, which represents length 1 of each entry itself. Starting from A[1], for each element A[i], compare it with the values A[j] in A[0 ... i - 1]. If A[i] is greater than A[j], and adding this 1 extra length (D[j] + 1) is greater than its original D[i], update D[i]. The length of the LIS is max(D).

But wait, for numbers with only one digit, this is perfectly fine; what about numbers with more than one digit, there are problems of corner cases and reusing digits! Well, those can be tackled individually. For the corner cases where indices run out of margin (before the diagonal), just use the number on the front line and add one to it, since a number with more digits is definitely bigger than that with digit deficit. The second problem requires a little bit thinking. First, the notation will be different here: A[1 ... n] instead of A[0 ... n - 1]. There is also an important assumption made during iteration: each entry of the row stores the maximum length of LIDS ending at its index, instead of starting from; that is, for the entry below A[6] "9", it stores the maximum length of LIDS of sequence "3, 1, 4, 1, 5, 9". The main difference between calculating LIDS of numbers with length one and those with greater length is that the latter allow a smaller length of numbers in their subsequences. "3, 1, 4, 1, 5, 9" only allows subsequences such as "3, 4, 5, 9" for row one, but there can be "3, 4, 15" or "3, 14, 15" for row two. This requires an additional trick for updating the length values. In order to take the described situation into account, the next number needs to perform a check on the gain between treating the previous number as a whole or as a disjoint part. For example, in the case of 2-digit numbers, for number 59 in "3, 1, 4, 1, 5, 9", 59 is greater than 41, so the value of matrix M[2, 6] can add one at the base of M[2, 4] (which means length of LIDS ending at the second "1"). However, this value might be smaller than adding one to the solved subsequence "3, 1, 4, 1" with 1-digit numbers (in this example those two values are the same, so choose either one). Similar checks apply to all different lengths: 3-digit numbers should check the solved subsequence with 2-digit numbers, etc.

With above in mind, the final matrix is generated as follows:

matrix

where each value M[i, j] represents the maximum length of increasing digital subsequence ending at A[j] containing the i-digit number ending with A[j], with at least one another i-digit number in the subsequence. For example, the value of M[2, 9] is 4, meaning the subsequence generated from "3, 1, 4, 1, 5, 9, 2, 6, 5" ending with 65 and containing at least one another 2-digit number has maximum length of 4, which could be "3, 4, 15, 65" or "3, 14, 15, 65". Below is the code with both calculation and backtracking: the backtrace is quite simple as long as the trajectory of the updates is recorded. Note that the result is not a complete set of all possible LIDS', just a subset. This algorithm runs in time complexity of pseudo O(n³), and space complexity of O(n²), without any optimization.

def longest_increasing_digital_subsequence(D):
 """
 Let D[1..n] be an array of digits, each an integer between 0 and 9. 
 A digital subsequence of D is an sequence of positive integers composed in 
 the usual way from disjoint substrings of D. 
 For example, 3, 4, 5, 9, 26, 35, 89, 93, 238, 462 is an increasing digital 
 subsequence with length 10 of the first several digits of π:
 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6
 This function computes the longest increasing digital subsequence of D.
 """
     A = [0] + D
     n = len(A)
     w = np.ones((n, n), dtype='int')
     # trackback matrix
     t = []
     for i in range(n):
         t.append([])
     for j in range(n):
          t[i].append((0, 0))
     # Mark as (-1, -1) for iteration base case
     t[0][0] = (-1, -1)

     # This is basically a upper-triangular matrix
     w[0, :] = 0
     w[:, 0] = 0

     # No need to account for the last one
     for sublen in range(1, n):
         for end_idx in range(sublen, n):

             start_idx = end_idx + 1 - sublen
             curr_val = digit2int(A[start_idx:end_idx + 1])

             prev_max = max(w[:sublen, start_idx - 1])
             prev_max_idx = (np.argmax(w[:sublen, start_idx - 1]), start_idx - 1)
 
             for jmp_idx in range(1, start_idx):
 
                 if jmp_idx - sublen < 0:
                     w[sublen, end_idx] = max(w[:sublen, jmp_idx]) + 1
                     t[sublen][end_idx] = (np.argmax(w[:sublen, jmp_idx]), jmp_idx)
                     continue;

                 whole_val = digit2int(A[jmp_idx - sublen + 1: jmp_idx + 1])
 
                 if curr_val > whole_val:

                     together = max(w[sublen, jmp_idx] + 1, w[sublen, end_idx])
                     separate = w[sublen - 1, end_idx - sublen] + 1

                     if together > separate: 
                         w[sublen, end_idx] = together

                         if w[sublen, jmp_idx] + 1 >= w[sublen, end_idx]:
                             t[sublen][end_idx] = (sublen, jmp_idx)
                         else:
                             w[sublen, end_idx] = separate
                             t[sublen][end_idx] = (sublen - 1, end_idx - sublen)
     subseq_lst = []
     max_res_idx_set = np.argwhere(w == np.amax(w)).tolist()

     for max_res_idx in max_res_idx_set:
         num = []
         while max_res_idx != (-1, -1):
             sublen, col = max_res_idx[0], max_res_idx[1]
             max_res_idx = t[max_res_idx[0]][max_res_idx[1]]
             num.append(digit2int(A[col - sublen + 1: col + 1]))
             num.reverse()
         subseq_lst.append(num[1:])

 return np.amax(w), subseq_lst

Display of some interesting results:

input: 3,1,4,1,5,9,2,6,5,3,5,8,9,7,9,3,2,3,8,4,6,2,6,4,3,3,8,3,2,7,9,5,2,0,8,8,4,1,9

output:

14,

[1, 2, 3, 5, 7, 9, 32, 38, 62, 64, 83, 279, 520, 884],

[1, 2, 3, 5, 7, 9, 32, 38, 62, 64, 83, 95, 208, 841],

[1, 2, 3, 5, 7, 9, 32, 38, 62, 64, 83, 95, 208, 419],

[1, 2, 3, 5, 7, 9, 32, 38, 62, 64, 83, 279, 520, 8841],

[1, 2, 3, 5, 7, 9, 32, 38, 62, 64, 83, 95, 208, 8419],

[1, 4, 5, 9, 26, 35, 89, 93, 238, 462, 6433, 8327, 9520, 88419]