0

I have to find max(s.length * s.count) for any substring s of a given string t, where s.length is the length of the substring and s.count is the number of times s occurs within t. Substrings may overlap within t.

Example: For the string aaaaaa, the substring aaa has the max (occurrences * length), substrings and occurrences are:

a: 6
aa: 5 
aaa: 4
aaaa : 3 
aaaaa: 2 
aaaaaa: 1

So aaa is our winner with 3 occurrences * length 4 is 12. Yes, aaaa also has a score of 12, but aaa comes first.

I have tried the only means I know or can figure out, but I have an input string of 100,000 length, and just finding all the substrings is O(n^2), and this hangs my program:

var theSet = new HashSet<string>();
for (int i = 1; i < source.Length; i++)
{
    for (int start = 0; start <= source.Length - i; start++)
    {
        var sub = source.Substring(start, i);
        if (!theSet.Contains(sub))
        {
            theSet.Add(sub);
        }
    }
}
...
// Some not-noteworthy benchmark related code
...
int maxVal = 0;
foreach (var sub in subs)
{
    var count = 0;
    for (var i = 0; i < source.Length - sub.Length + 1; i++)
    {
        if (source.Substring(i, sub.Length).Equals(sub)) count++;
    }

    if (sub.Length * count > maxVal)
    {
        maxVal = sub.Length * count;
    }
}

I know I am looking for a relatively unknown algorithm and or data structure with this, as google yields no results that closely match the problem. In fact, Google is where I basically only found the costly algorithms I have attempted to use in the above code.

3
  • Take a look at the KMP algorithm. It may be a good starting point to develop some efficient solution to your problem. Commented Dec 29, 2021 at 16:29
  • 1
    You solution is actually O(n^4), not O(n^2). There are O(n^2) subs. For each you run an O(n^2) calculation. Commented Dec 29, 2021 at 16:35
  • en.m.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm Commented Dec 29, 2021 at 18:02

2 Answers 2

2

Edit: Just realized that the problem has a solution on GFG: https://www.geeksforgeeks.org/substring-highest-frequency-length-product/

This can be solved in O(n) time by applying three well-known algorithms: Suffix Array, LCP Array and Largest Rectangular Area in a Histogram.

I will not provide any code as implementations of these algorithms can easily be found on the Internet. I will assume the input string is "banana" and try to explain the steps and how they work.

1. Run Suffix Array - O(n)

The Suffix Array algorithm sorts the suffixes of the string alphabetically. For the input "banana", the output is going to be the array [5, 3, 1, 0, 4, 2], where 5 corresponds to the suffix starting at position 5 ("a"), 3 corresponds to the suffix starting at position 3 ("ana"), 1 corresponds to the suffix starting at position 1 ("anana"), etc. After we compute this array, it becomes much easier to count the occurrences of a substring because the equal substrings are placed consecutively:

  1. a
  2. ana
  3. anana
  4. banana
  5. na
  6. nana

For example, we can immediately see that the substring "ana" occurs twice by looking at the 2nd and the 3rd suffixes in the above list. Similarly, we can say the substring "n" also occurs twice by looking at the 5th and the 6th.

2. Run LCP Array - O(n)

The LCP algorithm computes the length of the longest common prefix between every consecutive pair of suffixes in the suffix array. The output is going to be [1, 3, 0, 0, 2] for "banana":

  1. a
  2. ana // "a" and "ana" share the prefix "a", which is of length 1
  3. anana // "ana" and "anana" share the prefix "ana", which is of length 3
  4. banana // "anana" and "banana" share no prefix, so 0
  5. na // "banana" and "na" share no prefix, so 0
  6. nana // "na" and "nana" share the prefix "na", which is of length 2

Now if we plot the output of the LCP algorithm as an histogram:

  x
  x  x
 xx  x
 -----
 01234
 -----
aaabnn
 nnaaa
 aan n
  na a 
  an
   a

Now, here is the main observation: every rectangle in the histogram that touches the y axis corresponds to a substring and its occurences: the rectangle's width is equal to s.count - 1 and its height equals to s.length

For example consider this rectangle in the lower left corner, that corresponds to the substring "a".

xx
--
01

The rectangle is of height 1, which is "a".length and of width 2, which is "a".count - 1. And the value we need (s.count * s.length) is almost the area of the rectangle.

3. Find the largest rectangle in the histogram - O(n)

Now all we need to do is to find the largest rectangle in the histogram to find the answer to the problem, with the simple nuance that while calculating the area of the rectangle we need to add 1 to its width. This can be done by simply adding a + 1 in the area calculation logic in the algorithm.

For the "banana" example, the largest rectangle is the following (considering we added +1 to every rectangle's width):

x
x
x
-
1

We add one to its width and calculate its area as 2 * 3 = 6, which equals to how many times the substring "ana" occurs times its length.

Each of the 3 steps take O(n) time, totalling to an overall time complexity of O(n).

Sign up to request clarification or add additional context in comments.

Comments

0

this does the trick despite not being very efficient O(n) complexity. I can't imagine more efficient way though...

 static void TestRegexes()
        {
            var n = CountSubs("aaaaaa", "a");
            var nn = CountSubs("aaaaaa", "aa");
            var nnn = CountSubs("aaaaaa", "aaa");
            var nnnn = CountSubs("aaaaaa", "aaaa");
            var nnnnn = CountSubs("aaaaaa", "aaaaa");
            var nnnnnn = CountSubs("aaaaaa", "aaaaaa");
            ;

        }

        private static int CountSubs( string content, string needle)
        {
            

            int l = content.Length;
            int i = 0;
            int count = 0;
            while (content.Length >= needle.Length)
            {
                if (content.StartsWith(needle))
                {
                    count++;
                }

                content = content.Substring(1);
                i++;
            }

            return count;
        }

1 Comment

That counts the subs alright, but how do you get the subs? You've just hard-coded subs of the example aaaaaa, while as I did say, I have to find all subs of a 100,000 char string, where O(n^2) is prohibitively costly. There is a way, the one that I am looking for, to optimally search subs based on the likelihood of them occurring n many times vs other subs

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.