The asymptotic time complexity of your implementation is:
$$O\big(\,(s-t) \cdot t \log t\,\big)$$

Because you sort and compare \$s-t\$ substrings of length \$t\$. (Python's built-in sort, Timsort, is \$O(n \log n)\$.)

Here is a solution with \$O\big(t \cdot (s-t+\log t)\big)\$ time complexity.
It uses a sorted sliding window of length \$t\$. As the window slides through `s`, the outgoing character is removed and the incoming one is inserted at exactly the position that keeps the window sorted.
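A minimal sketch of a single slide step, using the standard-library `bisect` module (the variable names here are illustrative, not from the implementation below):

```python
from bisect import bisect_left, insort

# Sorted window over the first three characters of some string "cab...".
window = sorted("cab")                 # ['a', 'b', 'c']

# Slide by one: remove the outgoing 'c', insert the incoming 'd' in order.
window.pop(bisect_left(window, 'c'))   # ['a', 'b']
insort(window, 'd')                    # ['a', 'b', 'd']
print(window)
```

`bisect_left` finds the character to remove in \$O(\log t)\$; the `pop` and `insort` each cost \$O(t)\$ because list elements must shift.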
- Initial sorting of string `t` is \$O(t \log t)\$
- Initial sorting of the sliding window is \$O(t \log t)\$
- Initial comparison of `sorted(t)` with the window is \$O(t)\$
- For each character in `s[len(t):]`, a total of \$s-t\$ items, there is:
  - one bisection of the window for the character to be removed: \$O(\log t)\$
  - one removal of a character: \$O(t)\$
  - one bisection of the window for the character to be inserted: \$O(\log t)\$
  - one insertion of a character: \$O(t)\$
  - one comparison of `sorted(t)` with the window: \$O(t)\$
So the asymptotic complexity is \$O\big(t \log t + (s-t) \cdot t\big)\$, or simply \$O\big(t \cdot (s-t+\log t)\big)\$.
```python
from bisect import bisect_left, bisect_right

def question1_fast(t, s):
    if len(s) < len(t):
        return False
    t = sorted(t)
    window = sorted(s[:len(t)])
    if window == t:
        return True
    for i, c_new in enumerate(s[len(t):]):
        c_old = s[i]
        if c_old < c_new:
            # c_old sits left of c_new's insertion point, so deleting
            # it first shifts that point one position to the left.
            idx_old = bisect_right(window, c_old) - 1
            idx_new = bisect_left(window, c_new) - 1
        elif c_new < c_old:
            idx_new = bisect_right(window, c_new)
            idx_old = bisect_left(window, c_old)
        else:
            continue  # same character: window contents unchanged
        del window[idx_old]
        window.insert(idx_new, c_new)
        if window == t:
            return True
    return False
```
A little comparison of efficiency on large `s` and `t`:
```python
>>> from timeit import timeit
>>> from random import randint, shuffle
>>>
>>> # `s` is a string of 10,000 random characters
>>> s = ''.join(chr(randint(32, 127)) for _ in range(10000))
>>>
>>> # `t` is a random permutation of `s[5000:6000]`
>>> t_list = list(s[5000:6000])
>>> shuffle(t_list)
>>> t = ''.join(t_list)
>>>
>>> timeit('Question1_original(t,s)', globals=locals(), number=10)
25.469552749997092
>>> timeit('Question1_Caridorc(t,s)', globals=locals(), number=10)
12.726046560999748
>>> timeit('question1_fast(t,s)', globals=locals(), number=10)
0.10736406500291196
```
With small strings, the differences are not so significant:
```python
>>> timeit('Question1_original("ad", "udacity")', globals=locals())
4.730070723002427
>>> timeit('Question1_Caridorc("ad", "udacity")', globals=locals())
6.0496803390014975
>>> timeit('question1_fast("ad", "udacity")', globals=locals())
3.89388234800208
```
**EDIT**
Scratch that. I just realized there's a more efficient way to do it, one that is \$O(s+t)\$ in time complexity.

My original idea was to have a window which is a sorted list, and to keep it sorted the whole time, all for the sake of the comparison `window == t`. If `window` and `t` weren't both sorted, the comparison obviously wouldn't work as needed.

But to find out whether one string is an anagram of another, we don't have to sort them; we just need to know whether they contain the same characters. Sorting and comparing is one way. A more efficient way is to count the occurrences of each distinct character, producing a dictionary that maps each character to its number of occurrences. Then you compare the dictionary of `str1` with the dictionary of `str2`, and if they are equal, the strings are anagrams of each other.
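For instance, a count-based anagram test can be written with `collections.Counter` from the standard library (`is_anagram` is just an illustrative name, not part of the solution below):

```python
from collections import Counter

def is_anagram(a, b):
    # Equal character multisets <=> anagrams; no sorting needed.
    return Counter(a) == Counter(b)

print(is_anagram("listen", "silent"))  # True
print(is_anagram("aab", "abb"))        # False
```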
The comparison of dictionaries is not much more efficient than comparing sorted lists; if anything, I suspect it might be even slower. The difference lies in the insertions and deletions, which are amortized \$O(1)\$ for a dictionary but \$O(t)\$ for a sorted list.

Since neither `t` nor the window has to be sorted, let alone kept sorted, the time complexity drops drastically:
- Creation of the dictionary for `t`: \$O(t)\$
- Creation of the dictionary for the window: \$O(t)\$
- For each character in `s[len(t):]` (repeated \$s-t\$ times):
  - one decrement of a counter: \$O(1)\$
  - one increment of a counter: \$O(1)\$
  - one comparison of the dictionaries: \$O(1)\$, since they hold at most one key per distinct character of the (fixed-size) alphabet

That adds up to \$O\big(t+(s-t)\big)\$, or just \$O(s+t)\$.
```python
def make_counter_dict(keys, t):
    # Map every character of `keys` to its number of occurrences in `t`.
    counter_dict = dict.fromkeys(keys, 0)
    for c in t:
        counter_dict[c] += 1
    return counter_dict

def question1_even_faster(t, s):
    if len(s) < len(t):
        return False
    # Build both dictionaries over the same key set (all characters of
    # `s` and `t`), so that they compare equal exactly when the counts
    # match, and counting `t` cannot raise a KeyError.
    t_dict = make_counter_dict(s + t, t)
    window = make_counter_dict(s + t, s[:len(t)])
    for c_old, c_new in zip(s, s[len(t):]):
        if window == t_dict:
            return True
        window[c_new] += 1
        window[c_old] -= 1
    return window == t_dict
```
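The same counting idea can also be written with `collections.Counter`; this `contains_anagram` sketch is my naming, not part of the answer's code, and it deletes keys that drop to zero so the equality test stays reliable on every Python version:

```python
from collections import Counter

def contains_anagram(t, s):
    # Does any substring of `s` of length len(t) use exactly the same
    # characters as `t`?  Same sliding-window counting as above.
    if len(s) < len(t):
        return False
    t_count = Counter(t)
    window = Counter(s[:len(t)])
    for c_old, c_new in zip(s, s[len(t):]):
        if window == t_count:
            return True
        window[c_new] += 1
        window[c_old] -= 1
        if window[c_old] == 0:
            del window[c_old]  # drop zero counts so == compares cleanly
    return window == t_count
```

On CPython this may run a bit slower than the plain-dict version above because of `Counter`'s method-call overhead, but it reads more clearly and handles characters of `t` that never occur in `s` without any special-casing.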
Performance (measured with another random `s` and `t` with the same parameters as before):
```python
>>> timeit('question1_fast(t,s)', globals=locals(), number=1000)
10.559728353000537
>>> timeit('question1_even_faster(t,s)', globals=locals(), number=1000)
2.1034470079976018
```
This version is slower on small strings though:
```python
>>> timeit('question1_fast("ad", "udacity")', globals=locals())
3.886769873002777
>>> timeit('question1_even_faster("ad", "udacity")', globals=locals())
4.691746060998412
```