8

I am looking for an algorithm that will find the number of repeating substrings in a single string.

For this, I was looking for some dynamic programming algorithms but didn't find any that would help me. I just want some tutorial on how to do this.

Let's say I have a string ABCDABCDABCD. The expected output for this would be 3, because there is ABCD 3 times.

For input AAAA, output would be 4, since A is repeated 4 times.

For input ASDF, output would be 1, since every individual character is repeated 1 time only.

I hope that someone can point me in the right direction. Thank you.

1
  • They must be consecutive? What about ABCZXABXYZAB? There can be multiple? Eg.: FOO BAR BAZ FOO should find (space), FOO, BA. Commented Mar 31, 2017 at 20:07

3 Answers 3

12

I am taking the following assumptions:

  • The repeating substrings must be consecutive. That is, in case of ABCDABC, ABC would not count as a repeating substring, but it would in case of ABCABC.
  • The repeating substrings must be non-overalpping. That is, in case of ABCABC, ABC would not count as a repeating substring.
  • In case of multiple possible answers, we want the one with the maximum value. That is, in the case of AAAA, the answer should be 4 (a is the substring) rather than 2 (aa is the substring).

Under these assumptions, the algorithm is as follows:

  • Let the input string be denoted as inputString.
  • Calculate the KMP failure function array for the input string. Let this array be denoted as failure[]. This operation if of linear time complexity with respect to the length of the string. So, by definition, failure[i] denotes the length of the longest proper-prefix of the substring inputString[0....i] that is also a proper-suffix of the same substring.
  • Let len = inputString.length - failure.lastIndexValue. At this point, we know that if there is any repeating string at all, then it has to be of this length len. But we'll need to check for that; First, just check if len perfectly divides inputString.length (that is, inputString.length % len == 0). If yes, then check if every consecutive (non-overlapping) substring of len characters is the same or not; this operation is again of linear time complexity with respect to the length of the input string.
  • If it turns out that every consecutive non-overlapping substring is the same, then the answer would be = inputString.length/ len. Otherwise, the answer is simply inputString.length, as there is no such repeating substring present.

The overall time complexity would be O(n), where n is the number of characters in the input string.

A sample code for calculating the KMP failure array is given here.


For example,

Let the input string be abcaabcaabca.

Its KMP failure array would be - [0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8].

So, our len = (12 - 8) = 4.

And every consecutive non-overlapping substring of length 4 is the same (abca).
Therefore the answer is 12/4 = 3. That is, abca is repeated 3 times repeatedly.

Sign up to request clarification or add additional context in comments.

3 Comments

i found one string, where this is not working, it's "ABCDABCDA", the expected output is 1, but with this method it gives us 2.
my bad, all i had to do is put 1 condition before the output, if len modulo our substraction is not 0, then print 1
@mereth Thanks for pointing that out. I'll edit the algorithm to incorporate this change.
1

The solution for this with C# is:

 class Program
 {
     public static string CountOfRepeatedSubstring(string str)
     {
         if (str.Length < 2)
         {
             return "-1";
         }

         StringBuilder substr = new StringBuilder();

         // Length of the substring cannot be greater than half of the actual string
         for (int i = 0; i < str.Length / 2; i++)
         {
             // We will iterate through half of the actual string and
             // create a new string by appending the current character to the previous character
             substr.Append(str[i]);

             String clearedOfNewSubstrings = str.Replace(substr.ToString(), "");

             // We will remove the newly created substring from the actual string and 
             // check if the length of the actual string, cleared of the newly created substring, is 0.
             // If 0 it tells us that it is only made of its substring
             if (clearedOfNewSubstrings.Length == 0)
             {
                 // Next we will return the count of the newly created substring in the actual string.
                 var countOccurences = Regex.Matches(str, substr.ToString()).Count;

                 return countOccurences.ToString();
             }
         }

         return "-1";
     }       

     static void Main(string[] args)
     {
         // Input: {"abcdaabcdaabcda"}
         // Output: 3

         // Input: { "abcdaabcdaabcda" }
         // Output: -1

         // Input: {"barrybarrybarry"}
         // Output: 3            

         var s = "asdf"; // Output will be -1

         Console.WriteLine(CountOfRepeatedSubstring(s));
     }
 }

1 Comment

1. Minor thing: OP wanted 1 for no repeats, not -1. 2. Why return a string type?
-1

How do you want to specify the "repeating string"? Is it simply the first group of characters up until either a) the first character is found again, b) the pattern begins to repeat, or c) some other criteria?

So, if your string is "ABBAABBA", is that a 2 because "ABBA" repeats twice or is it 1 because you have "ABB" followed by "AAB"? What about "ABCDABCE" -- does "ABC" count (despite the "D" in between repetitions?) In "ABCDABCABCDABC", is the repeating string "ABCD" (1) or "ABCDABC" (2)?

What about "AAABBAAABB" -- is that 3 ("AAA") or 2 ("AAABB")?

If the end of the repeating string is another instance of the first letter, it's pretty simple:

Work your way through the string character by character, putting each character into another variable as you go, until the next character matches the first one. Then, given the length of the substring in your second variable, check the next bit of your string to see if it matches. Continue until it doesn't match or you hit the end of the string.

If you just want to find any length pattern that repeats regardless of whether the first character is repeated within the pattern, it gets more complicated (but, fortunately, it's the sort of thing computers are good at).

You'll need to go character by character building a pattern in another variable as above, but you'll also have to watch for the first character to reappear and start building a second substring as you go, to see if it matches the first. This should probably go in an array as you might encounter a third (or more) instance of the first character which would trigger the need to track yet another possible match.

It's not difficult but there is a lot to keep track of and it's a rather annoying problem. Is there a particular reason you're doing this?

2 Comments

sorry i didn't specify my problem completely, my bad, but i got the answer i needed, thanks for help anyways to answer your question: the reason im doing this is for school project
No worries, I'm glad you got the answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.