
I was asked to find the total number of substring occurrences (case-insensitive, with or without punctuation) in a given string. Some examples:

count_occurrences("Text with", "This is an example text with more than +100 lines") # Should return 1
count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") # Should return 1
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") # Should return 1
count_occurrences("clock", "its 3o'clock in the morning") # Should return 0

I chose regex over .count() as I needed an exact match, and ended up with:

import re

def count_occurrences(word, text):
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

and it finds every match, but my code takes 0.10 secs while the expected time is 0.025 secs. Am I missing something? Is there a better (performance-optimised) way to do this?
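For reference, the examples above can be turned into a runnable harness around the function as written (this is just the asker's code plus assertions, nothing new):

```python
import re

def count_occurrences(word, text):
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

# The four examples from the question:
assert count_occurrences("Text with", "This is an example text with more than +100 lines") == 1
assert count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") == 1
assert count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") == 1
assert count_occurrences("clock", "its 3o'clock in the morning") == 0
```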

  • What extra matching do you need? Only case insensitivity? Commented Dec 23, 2020 at 7:46
  • I already get every match; the question is whether there is a better way to do it, one that can bring the execution time down to the expected 0.025 secs. Commented Dec 23, 2020 at 7:51
  • Regex is usually much more than what one needs. If you choose regex only for case insensitive matching, text.lower().count(word.lower()) is much faster. Do you need another regex? Or, you could find messy but more specifically optimized code. Commented Dec 23, 2020 at 7:59
  • See my examples above; it's mixed (case, punctuation, brackets) etc. If I go with .count, then with, say, txt = "texts texts texts", .count will return 3 if I search for text, and I don't want that (it needs to match only the exact word). Commented Dec 23, 2020 at 8:06
  •
    Yes sure, as far as I get my target performance.. Commented Dec 28, 2020 at 7:17

5 Answers


OK, I struggled to make this work without regexes, as we all know that regexes are slow. Here is what I came up with:

def count_occurrences(word, text):
    # Characters that may legally precede the word; the most common ones
    # come first so that any() short-circuits as early as possible.
    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    # Characters that may legally follow the word.
    endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
    # Split the lowercased text on the lowercased word; each match sits
    # between two neighbouring pieces of the split.
    s = text.lower().split(word.lower())
    l = len(s)
    # A pair (s[i], s[i+1]) counts as an occurrence when the piece before
    # the match ends with a valid delimiter (or the match is at the very
    # start of the text) and the piece after it starts with one (or the
    # match is at the very end of the text).
    return sum((
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))

The whole test file runs on ideone:

Ran 1 test in 0.025s

OK

Which is what the question is asking for.

The logic is pretty simple. Split the text by the word, both lowercased. Now look at each pair of neighbouring pieces: if, for example, the piece at index 0 ends with a valid delimiter and the piece at index 1 starts with one, count that as an occurrence. Do that up to the last pair of the split.
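A quick sketch of that split-and-check idea on the third example from the question (delimiter lists shortened here for brevity):

```python
spaces = [' ', '\n', '(']           # shortened "may precede" list for the sketch
endings = spaces + ['?', '.', ')']  # shortened "may follow" list

text = "This is an example 'text' with (more than) +100 lines"
parts = text.lower().split("more than")

# parts[0] ends with '(' and parts[1] starts with ')', so this pair of
# neighbours counts as one occurrence of "more than".
left_ok = any(parts[0].endswith(d) for d in spaces)
right_ok = any(parts[1].startswith(d) for d in endings)
```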

Since performance is important here, we have to pay attention to the order of spaces and endings. We are essentially looking for the first element in the list that fulfils the condition, so it is important to put the more common delimiters first. For example, if I declare:

spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

instead of what I have in my solution, I get a run of 0.036 seconds.

If for example I declare one array:

spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

which holds all the delimiters, and use only that, I get 0.053 seconds, more than double my solution's time.

There may well be an even better ordering of the delimiters.
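One way to sanity-check the short-circuiting effect yourself; `piece` here is a hypothetical split fragment, and the absolute timings will vary by machine:

```python
import timeit

common_first = [' ', '\n', '(', '«', ':']
rare_first = ['(', '«', ':', '\n', ' ']   # same delimiters, the common one last

piece = "this is an example 'text' with "  # a typical split piece, ends with a space

# With ' ' first, any() stops after a single endswith() call;
# with ' ' last, every check runs five calls before succeeding.
t_common = timeit.timeit(lambda: any(piece.endswith(d) for d in common_first), number=100_000)
t_rare = timeit.timeit(lambda: any(piece.endswith(d) for d in rare_first), number=100_000)
```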


2 Comments

This looks like some effort and performs well. Let's see if it satisfies the OP! Plus one from me. However, I don't generally agree that regexes are slow :) It depends.
Wow, it's fast! I thought regex was faster; you proved me wrong :). I'll accept this as the answer as it fulfils my performance requirement. Thank you!!

If the set of words you are searching for is known and finite, regex pre-compilation through re.compile can help make things faster. Something like:

import re

search_words = [
  'foo',
  'bar',
  'baz',
]

words_to_re = {w: re.compile(f"(?<![a-z])((?<!')|(?<='')){w}(?![a-z])((?!')|(?=''))") for w in search_words}

def count_occurrences(word, text):
    regex = words_to_re[word]
    return len(regex.findall(text))

4 Comments

I tried re.compile(f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))", re.IGNORECASE), but the performance is bad; it took over 1 sec.
You have to precompile the search patterns once and then reuse them in your function.
No, my text is not predefined; it comes in through the function parameters, count_occurences("different word", "different text to compare +100 line"), every time.
Precompiling used to matter more some time ago. Last I looked, the re module keeps an LRU cache of recently used patterns, so compilation happens only once anyway.

You can lowercase both strings up front with str.lower() instead of using re.IGNORECASE. Check this, maybe it will help you:

import re

def count_occurrences2(word, text):
    text = text.lower()
    word = word.lower()
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text))

I have inspected the execution time using the timeit library:

import timeit

def checkTime(word, text, function):
    now = timeit.default_timer()
    function(word, text)
    return timeit.default_timer() - now

text = "This is an example 'text' with (more than) +100 lines " * 1000
word = "more than"
time_0 = checkTime(word, text, count_occurrences)
time_1 = checkTime(word, text, count_occurrences2)
print(time_0)
print(time_1)
print(time_1 < time_0)  # True

Edit:

This is another way:

def count_occurences_in_text(word, text):
    pattern = r"(?<![a-z])((?<!')|(?<=''))" + str(word.lower()) + "(?![a-z])((?!')|(?=''))"
    line_now = text.lower()
    count = 0
    search = re.search(pattern, line_now)
    while search:
        count += 1
        line_now = line_now[search.span()[1]:]
        search = re.search(pattern, line_now)
    return count
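The repeated re.search plus slicing above copies the remaining text after every hit; re.finditer walks the string once instead. A sketch of that variant (the added re.escape is my assumption, so that any regex metacharacters in `word` are treated literally; the function name is likewise made up for the sketch):

```python
import re

def count_occurrences_finditer(word, text):
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = (r"(?<![a-z])((?<!')|(?<=''))"
               + re.escape(word.lower())
               + r"(?![a-z])((?!')|(?=''))")
    # finditer scans the text once, without re-slicing it after each match.
    return sum(1 for _ in re.finditer(pattern, text.lower()))
```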

Edit 2:

This function will pass all the assertions in your code (considering execution time):

def count_occurences_in_text(word, text):
    text = text.lower()
    word = word.lower()
    word_len = len(word)
    text_len = len(text)
    # Strip one leading and one trailing non-letter character (e.g. quotes).
    if not ('a' <= word[0] <= 'z'):
        word = word[1:]
        if not ('a' <= word[-1] <= 'z'):
            word = word[:-1]
        word_len = len(word)  # re-measure after stripping
    count = 0
    index = 0
    have = [' ', ",","!","?",".","\n",":"]
    haveP = [' ',':']
    if word_len > text_len:
        return 0
    while index < text_len-word_len+1:
        if text[index:index+word_len] == word:
            if index != 0:
                prev_word = text[index-1]
                # if (prev_word >= 'a' and prev_word <= 'z') or prev_word == "'":
                if prev_word not in haveP:
                    if index > 1 and text[index-1] =="'" and text[index-2]=="'":
                        count+=1
                        index += word_len+1
                        continue
                    if index > 1 and text[index-1] =="_" and text[index-2]=="_":
                        count+=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            if index + word_len <= text_len-1:
                last_word = text[index+word_len]
                # if (last_word >= 'a' and last_word <= 'z') or last_word == "'":
                if last_word not in have:
                    if index+word_len <= text_len-2 and last_word =="'" and text[index+word_len+1]=="'":
                        count +=1
                        index += word_len+1
                        continue
                    if index+word_len <= text_len-2 and last_word =="_" and text[index+word_len+1]=="_":
                        count +=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            count += 1
            index += word_len+1
        index += 1
    return count

4 Comments

It helps, but not entirely; have you seen the example file (link) in my question?
Yes. Here, follow this link; I have tried another way.
Your first method performs better than the second, but my execution time is 0.1728, still not under the expected time though.
I have tried another approach that successfully passes all the assertions, but the execution time does not meet your expectation. Still, you can check it: follow this link.
0

Use a regular expression split:

import re

def count_occurrences(search_word, text):
    alist = re.split(r'\s+', text)
    matches = [word for word in alist if word == search_word]
    return len(matches)

count_occurrences("clock", "its 3o'clock in the morning")

output

0
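Note that this compares whole whitespace-delimited tokens only, so it is case-sensitive, punctuation stays attached to tokens, and a multi-word phrase can never match a single token. A quick check against the question's third example:

```python
import re

def count_occurrences(search_word, text):
    alist = re.split(r'\s+', text)
    matches = [word for word in alist if word == search_word]
    return len(matches)

# "more than" spans two tokens, so it never equals any single token:
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines")  # 0, but 1 expected
```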

Comments

-1

The first error was using the words "Python" and "Performance" in the same sentence. Python is heavily oriented in the direction of "cheap--develop code quickly" with "good--runs as expected" as doable. Fast is right out. Any advice here is strictly implementation-dependent.

  1. You can clean up your regular expressions. I recommend using the word-boundary anchor \b. In all cases, I strongly recommend playing with your regex interactively on regex101 or an equivalent.

  2. You can write your own search function in Python. It will be slower by running in Python and faster by skipping storage of matches and other generality.

  3. You can compare your speeds against simple string .count(). You will need to use lower() and decide if the word "this" matches "sthisany" or not.

  4. You can modify your test function to actually have a hundred lines, e.g., text = (text + '\n')*100.

  5. You can use PyPy, which will generally speed up your execution by trading away some start-up time and some metaprogramming flexibility.

  6. You can write a small snippet of C code and learn to call it from Python. Fun on its own.
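For point 1, a minimal sketch of the \b approach (the function name is made up for the sketch; note that \b treats an apostrophe as a word boundary, so unlike the asker's rules it counts "clock" inside "3o'clock"):

```python
import re

def count_word(word, text):
    # \b is a zero-width word-boundary anchor; re.escape keeps `word` literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))

count_word("clock", "its 3o'clock in the morning")  # 1 -- the asker expects 0 here
```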

I recommend you keep notes and comparisons and turn those in with your homework, not just the final product.

Comments
