Finding the largest repeating substring

Question

Here is a function I wrote that will take a very long text file. Such as a text file containing an entire textbook. It will find any repeating substrings and output the largest string. Right now it doesn't work however, it just outputs the string I put in

For example, if there was a typo in which an entire sentence was repeated. It would output that sentence; given it is the largest in the whole file. If there was a typo where an entire paragraph was typed twice, it would output the paragraph.

This algorithm takes the first character, finds any matches, and if it does and if the length is the largest, store the substring. Then it takes the first 2 characters and repeats. Then the first 3 characters. etc.. Then it will start over except starting at the 2nd character instead of the 1st. Then all the way through and back, starting at the 3rd character.

def largest_substring(string):

  length = 0
  x,y=0,0

  for y in range(len(string)):        #start at string[0, ]
    for x in range(len(string)):      #start at string[ ,0]
     substring = string[y:x]          #substring is [0,0] first, then [0,1], then [0.2]... then [1,1] then [1,2] then [1,3]... then [2,2] then [2,3]... etc.
     if substring in string:          #if substring found and length is longest so far, save the substring and proceed.
      if len(substring) > length:
       match = substring
       length = len(substring)

(Wouldn't you stumble upon repeated substrings building, say, a suffix array?) Sentences or paragraphs are less likely to be typed twice than to be copied (and not removed from the original position in an attempted move). — greybeard
– greybeard, Commented Sep 19, 2014 at 8:30

Vikram Bhat · Accepted Answer · 2014-09-19 04:52:20Z

4

I think your logic is flawed here because it will always return the entire string as it checks whether a substring is in whole string which is always true so the statement if substring in string will be always true. Instead you need to find if the substring occurs more than once in the entire string and then update the count.

Here is example of brute force algorithm that solves it :-

import re

def largest_substring(string):

    length = 0
    x=0
    y=0

    for y in range(len(string)):       
        for x in range(len(string)):     
            substring = string[y:x]                   
            if len(list(re.finditer(substring,string))) > 1  and len(substring) > length:
                match = substring
                length = len(substring)
    return match


print largest_substring("this is repeated is repeated is repeated")

answered Sep 19, 2014 at 4:52

Vikram Bhat

6,3063 gold badges22 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Andrew Tsay Over a year ago

Sorry, 1 more note. It seems to crash when it encounters special symbols such as '(' or '*'

Collectives™ on Stack Overflow

Finding the largest repeating substring

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related