Given a list of strings, find the shortest unique substring of each string that doesn't occur in the other strings

Question

The problem statement is:

Given a list of strings, find the shortest unique substring of each string such that the substring does not occur in any of the other strings.

For example:

Input: ["cheapair", "cheapoair", "peloton", "pelican"]
Output:
"cheapair": "pa"  // every other 1-2 length substring overlaps with cheapoair
"cheapoair": "po" // "oa" would also be acceptable
"pelican": "ca"   // "li", "ic", or "an" would also be acceptable
"peloton": "t"    // this single letter doesn't occur in any other string

I think this is solved with dynamic programming, but honestly, I have no idea how to do this other than brute force: Storing all substrings for each word and checking that they don't exist in any of the other ones, which is a terrible idea.

Is there any issue with this approach? stackoverflow.com/a/18479878 — Abhinav Mathur
– Abhinav Mathur, Commented May 24, 2022 at 7:36
I think this is a typical problem for the sliding window approach. — Stef
– Stef, Commented May 24, 2022 at 8:08
@AbhinavMathur Came across this problem for interview prep. Suffix arrays, suffix trees, I've never heard of (I come from a math background, not CS), and I've been doing leetcode problems for some time now. I was hoping someone would suggest a solution using more common data structures. — TheRealFakeNews
– TheRealFakeNews, Commented May 24, 2022 at 8:14
What are correct results for Riccardo's example input ["abab", "ba"]? — Kelly Bundy
– Kelly Bundy, Commented May 24, 2022 at 17:45
And since it talks about "the" substring (with the property) of each string, and there are no instructions for what to do if a string doesn't have one, are we guaranteed that each string does have at least one? — Kelly Bundy
– Kelly Bundy, Commented May 24, 2022 at 17:53

Kelly Bundy · Accepted Answer · 2022-05-25 00:32:23Z

4

Building all substrings to find those that appear in two (or more) strings. Then for each string, print a shortest substring that doesn't appear in two (or more). Maybe not the best (depends on your data and what you consider "best"), but works and is simple.

strings = ["cheapair", "cheapoair", "peloton", "pelican"]

def substrings(s):
    n = len(s)
    return {s[i:i+k]
            for k in range(1, n+1)
            for i in range(n-k+1)}

one = set()
two = set()
for s in strings:
    subs = substrings(s)
    two |= one & subs
    one |= subs

for s in strings:
    subs = substrings(s)
    uniq = subs - two
    print(s, min(uniq, key=len))

Output (Try it online!):

cheapair pa
cheapoair oa
peloton t
pelican an

Preliminary benchmark with a modified version that proceeds by length like Nick's does. Take this with some bags of salt: 1) The question has unclear issues (I asked under the question now). 2) The solutions yield all valid substrings instead of just one. 3) I modified Nick's a little to make it work here (nothing that should influence its speed). 4) It's a worst case input, where each string needs the maximum length, i.e., has no unique substring other than itself.

0.19 s  Kelly2
1.34 s  Nick
0.19 s  Kelly2
1.35 s  Nick
0.19 s  Kelly2
1.34 s  Nick

edited May 25, 2022 at 0:32

answered May 24, 2022 at 8:17

Kelly Bundy

28k8 gold badges34 silver badges76 bronze badges

Sign up to request clarification or add additional context in comments.

16 Comments

Abhinav Mathur Over a year ago

What is the time complexity here?

Riccardo Bucco Over a year ago

Also, what's the space complexity?

Kelly Bundy Over a year ago

I'm sure you guys can figure it out if you want. 10:30 am, I'm finally going to sleep. Didn't write this for efficiency. Easily good enough for the example input. In an interview, this could be a good initial solution. And here, it could be good for testing better solutions for correctness on moderately large inputs.

Riccardo Bucco Over a year ago

Sorry for asking. But you simply answered with a solution described in the question itself (i.e., Storing all substrings for each word and checking that they don't exist in any of the other ones). And that solution is referred as a terrible idea in the question.

Riccardo Bucco Over a year ago

Also, I think the solution you proposed is wrong. Indeed, the question asks for the shortest unique substring, not just the shortest one.

|

Collectives™ on Stack Overflow

Given a list of strings, find the shortest unique substring of each string that doesn't occur in the other strings

1 Answer 1

16 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

16 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related