1

The problem statement is:

Given a list of strings, find the shortest unique substring of each string such that the substring does not occur in any of the other strings.

For example:

Input: ["cheapair", "cheapoair", "peloton", "pelican"]
Output:
"cheapair": "pa"  // every other 1-2 length substring overlaps with cheapoair
"cheapoair": "po" // "oa" would also be acceptable
"pelican": "ca"   // "li", "ic", or "an" would also be acceptable
"peloton": "t"    // this single letter doesn't occur in any other string

I think this is solved with dynamic programming, but honestly, I have no idea how to do this other than brute force: Storing all substrings for each word and checking that they don't exist in any of the other ones, which is a terrible idea.

6
  • 1
    Is there any issue with this approach? stackoverflow.com/a/18479878 Commented May 24, 2022 at 7:36
  • I think this is a typical problem for the sliding window approach. Commented May 24, 2022 at 8:08
  • @AbhinavMathur Came across this problem for interview prep. Suffix arrays, suffix trees, I've never heard of (I come from a math background, not CS), and I've been doing leetcode problems for some time now. I was hoping someone would suggest a solution using more common data structures. Commented May 24, 2022 at 8:14
  • What are correct results for Riccardo's example input ["abab", "ba"]? Commented May 24, 2022 at 17:45
  • And since it talks about "the" substring (with the property) of each string, and there are no instructions for what to do if a string doesn't have one, are we guaranteed that each string does have at least one? Commented May 24, 2022 at 17:53

1 Answer 1

4

Building all substrings to find those that appear in two (or more) strings. Then for each string, print a shortest substring that doesn't appear in two (or more). Maybe not the best (depends on your data and what you consider "best"), but works and is simple.

strings = ["cheapair", "cheapoair", "peloton", "pelican"]

def substrings(s):
    n = len(s)
    return {s[i:i+k]
            for k in range(1, n+1)
            for i in range(n-k+1)}

one = set()
two = set()
for s in strings:
    subs = substrings(s)
    two |= one & subs
    one |= subs

for s in strings:
    subs = substrings(s)
    uniq = subs - two
    print(s, min(uniq, key=len))

Output (Try it online!):

cheapair pa
cheapoair oa
peloton t
pelican an

Preliminary benchmark with a modified version that proceeds by length like Nick's does. Take this with some bags of salt: 1) The question has unclear issues (I asked under the question now). 2) The solutions yield all valid substrings instead of just one. 3) I modified Nick's a little to make it work here (nothing that should influence its speed). 4) It's a worst case input, where each string needs the maximum length, i.e., has no unique substring other than itself.

0.19 s  Kelly2
1.34 s  Nick
0.19 s  Kelly2
1.35 s  Nick
0.19 s  Kelly2
1.34 s  Nick
Sign up to request clarification or add additional context in comments.

16 Comments

What is the time complexity here?
Also, what's the space complexity?
I'm sure you guys can figure it out if you want. 10:30 am, I'm finally going to sleep. Didn't write this for efficiency. Easily good enough for the example input. In an interview, this could be a good initial solution. And here, it could be good for testing better solutions for correctness on moderately large inputs.
Sorry for asking. But you simply answered with a solution described in the question itself (i.e., Storing all substrings for each word and checking that they don't exist in any of the other ones). And that solution is referred as a terrible idea in the question.
Also, I think the solution you proposed is wrong. Indeed, the question asks for the shortest unique substring, not just the shortest one.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.