
I was asked to find the total number of substring occurrences (case-insensitive, with or without punctuation) in a given string. Some examples:

count_occurrences("Text with", "This is an example text with more than +100 lines") # Should return 1
count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") # Should return 1
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") # Should return 1
count_occurrences("clock", "its 3o'clock in the morning") # Should return 0

I chose regex over .count() as I needed an exact match, and ended up with:

import re

def count_occurrences(word, text):
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

and it finds every match, but my code takes 0.10 secs while the expected time is 0.025 secs. Am I missing something? Is there a better (performance-optimised) way to do this?
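For reference, the examples above can be turned into a runnable harness around the function as written (this is just the asker's code plus assertions, nothing new):

```python
import re

def count_occurrences(word, text):
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

# The four examples from the question:
assert count_occurrences("Text with", "This is an example text with more than +100 lines") == 1
assert count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") == 1
assert count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") == 1
assert count_occurrences("clock", "its 3o'clock in the morning") == 0
```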

  • What extra matching do you need? Only case insensitivity? Commented Dec 23, 2020 at 7:46
  • I already get every match; the question is whether there is a better way to do it, one that can bring the execution time down to the expected 0.025 secs. Commented Dec 23, 2020 at 7:51
  • Regex is usually much more than what one needs. If you choose regex only for case insensitive matching, text.lower().count(word.lower()) is much faster. Do you need another regex? Or, you could find messy but more specifically optimized code. Commented Dec 23, 2020 at 7:59
  • See my examples above; it's mixed (case, punctuation, brackets) etc. If I go with .count, then with, say, txt = "texts texts texts", .count will return 3 if I search for text, and I don't want that (it needs to match only the exact word). Commented Dec 23, 2020 at 8:06
  •
    Yes sure, as far as I get my target performance.. Commented Dec 28, 2020 at 7:17

5 Answers


OK, I struggled to make this work without regexes, as we all know that regexes are slow. Here is what I came up with:

def count_occurrences(word, text):
    # Characters that may legally precede the word; the most common ones
    # come first so that any() short-circuits as early as possible.
    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    # Characters that may legally follow the word.
    endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
    # Split the lowercased text on the lowercased word; each match sits
    # between two neighbouring pieces of the split.
    s = text.lower().split(word.lower())
    l = len(s)
    # A pair (s[i], s[i+1]) counts as an occurrence when the piece before
    # the match ends with a valid delimiter (or the match is at the very
    # start of the text) and the piece after it starts with one (or the
    # match is at the very end of the text).
    return sum((
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))

The whole test file runs on ideone:

Ran 1 test in 0.025s

OK

Which is what the question is asking for.

The logic is pretty simple. Split the text by the word, both lowercased. Now look at each pair of neighbouring pieces: if, for example, the piece at index 0 ends with a valid delimiter and the piece at index 1 starts with one, count that as an occurrence. Do that up to the last pair of the split.
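A quick sketch of that split-and-check idea on the third example from the question (delimiter lists shortened here for brevity):

```python
spaces = [' ', '\n', '(']           # shortened "may precede" list for the sketch
endings = spaces + ['?', '.', ')']  # shortened "may follow" list

text = "This is an example 'text' with (more than) +100 lines"
parts = text.lower().split("more than")

# parts[0] ends with '(' and parts[1] starts with ')', so this pair of
# neighbours counts as one occurrence of "more than".
left_ok = any(parts[0].endswith(d) for d in spaces)
right_ok = any(parts[1].startswith(d) for d in endings)
```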

Since performance is important here, we have to pay attention to the order of spaces and endings. We are essentially looking for the first element in the list that fulfils the condition, so it is important to put the more common delimiters first. For example, if I declare:

spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

instead of what I have in my solution, I get a run of 0.036 seconds.

If for example I declare one array:

spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

which holds all the delimiters, and use only that, I get 0.053 seconds, more than double my solution's time.

There may well be an even better ordering of the delimiters.
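One way to sanity-check the short-circuiting effect yourself; `piece` here is a hypothetical split fragment, and the absolute timings will vary by machine:

```python
import timeit

common_first = [' ', '\n', '(', '«', ':']
rare_first = ['(', '«', ':', '\n', ' ']   # same delimiters, the common one last

piece = "this is an example 'text' with "  # a typical split piece, ends with a space

# With ' ' first, any() stops after a single endswith() call;
# with ' ' last, every check runs five calls before succeeding.
t_common = timeit.timeit(lambda: any(piece.endswith(d) for d in common_first), number=100_000)
t_rare = timeit.timeit(lambda: any(piece.endswith(d) for d in rare_first), number=100_000)
```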


2 Comments

This looks like some effort and performs well. Let's see if it satisfies the OP! Plus one from me. However, I don't generally agree that regexes are slow :) It depends.
Wow, it's fast! I thought regex was faster; you proved me wrong :). I'll accept this as the answer as it fulfils my performance requirement. Thank you!!

If the set of words you are searching for is known and finite, regex pre-compilation through re.compile can help make things faster. Something like:

import re

search_words = [
  'foo',
  'bar',
  'baz',
]

words_to_re = {w: re.compile(f"(?<![a-z])((?<!')|(?<='')){w}(?![a-z])((?!')|(?=''))") for w in search_words}

def count_occurrences(word, text):
    regex = words_to_re[word]
    return len(regex.findall(text))

4 Comments

I tried re.compile(f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))", re.IGNORECASE), but the performance is bad; it took over 1 sec.
You have to precompile the search patterns once and then reuse them in your function.
No, my text is not predefined; it comes in through the function parameters, count_occurences("different word", "different text to compare +100 line"), every time.
Precompiling used to matter more some time ago. Last I looked, the re module keeps an LRU cache of recently used patterns, so compilation happens only once anyway.

You can lowercase both strings up front with str.lower() instead of using re.IGNORECASE. Check this, maybe it will help you:

import re

def count_occurrences2(word, text):
    text = text.lower()
    word = word.lower()
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text))

I have inspected the execution time using the timeit library:

import timeit

def checkTime(word, text, function):
    now = timeit.default_timer()
    function(word, text)
    return timeit.default_timer() - now

text = "This is an example 'text' with (more than) +100 lines " * 1000
word = "more than"
time_0 = checkTime(word, text, count_occurrences)
time_1 = checkTime(word, text, count_occurrences2)
print(time_0)
print(time_1)
print(time_1 < time_0)  # True

Edit:

This is another way:

def count_occurences_in_text(word, text):
    pattern = r"(?<![a-z])((?<!')|(?<=''))" + str(word.lower()) + "(?![a-z])((?!')|(?=''))"
    line_now = text.lower()
    count = 0
    search = re.search(pattern, line_now)
    while search:
        count += 1
        line_now = line_now[search.span()[1]:]
        search = re.search(pattern, line_now)
    return count
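The repeated re.search plus slicing above copies the remaining text after every hit; re.finditer walks the string once instead. A sketch of that variant (the added re.escape is my assumption, so that any regex metacharacters in `word` are treated literally; the function name is likewise made up for the sketch):

```python
import re

def count_occurrences_finditer(word, text):
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = (r"(?<![a-z])((?<!')|(?<=''))"
               + re.escape(word.lower())
               + r"(?![a-z])((?!')|(?=''))")
    # finditer scans the text once, without re-slicing it after each match.
    return sum(1 for _ in re.finditer(pattern, text.lower()))
```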

Edit 2:

This function will pass all the assertions in your code (considering execution time):

def count_occurences_in_text(word, text):
    text = text.lower()
    word = word.lower()
    word_len = len(word)
    text_len = len(text)
    # Strip one leading and one trailing non-letter character (e.g. quotes).
    if not ('a' <= word[0] <= 'z'):
        word = word[1:]
        if not ('a' <= word[-1] <= 'z'):
            word = word[:-1]
        word_len = len(word)  # re-measure after stripping
    count = 0
    index = 0
    have = [' ', ",","!","?",".","\n",":"]
    haveP = [' ',':']
    if word_len > text_len:
        return 0
    while index < text_len-word_len+1:
        if text[index:index+word_len] == word:
            if index != 0:
                prev_word = text[index-1]
                # if (prev_word >= 'a' and prev_word <= 'z') or prev_word == "'":
                if prev_word not in haveP:
                    if index > 1 and text[index-1] =="'" and text[index-2]=="'":
                        count+=1
                        index += word_len+1
                        continue
                    if index > 1 and text[index-1] =="_" and text[index-2]=="_":
                        count+=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            if index + word_len <= text_len-1:
                last_word = text[index+word_len]
                # if (last_word >= 'a' and last_word <= 'z') or last_word == "'":
                if last_word not in have:
                    if index+word_len <= text_len-2 and last_word =="'" and text[index+word_len+1]=="'":
                        count +=1
                        index += word_len+1
                        continue
                    if index+word_len <= text_len-2 and last_word =="_" and text[index+word_len+1]=="_":
                        count +=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            count += 1
            index += word_len+1
        index += 1
    return count

4 Comments

It helps, but not entirely; have you seen the example file (link) in my question?
Yes. Here, follow this link; I have tried another way.
Your first method performs better than the second, but my execution time is 0.1728, still not under the expected time though.
I have tried another approach that successfully passes all the assertions, but the execution time does not meet your expectation. Still, you can check it: follow this link.
0

Use a regular expression split:

import re

def count_occurrences(search_word, text):
    alist = re.split(r'\s+', text)
    matches = [word for word in alist if word == search_word]
    return len(matches)

count_occurrences("clock", "its 3o'clock in the morning")

output

0
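Note that this compares whole whitespace-delimited tokens only, so it is case-sensitive, punctuation stays attached to tokens, and a multi-word phrase can never match a single token. A quick check against the question's third example:

```python
import re

def count_occurrences(search_word, text):
    alist = re.split(r'\s+', text)
    matches = [word for word in alist if word == search_word]
    return len(matches)

# "more than" spans two tokens, so it never equals any single token:
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines")  # 0, but 1 expected
```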

Comments

-1

The first error was using the words "Python" and "Performance" in the same sentence. Python is heavily oriented in the direction of "cheap--develop code quickly" with "good--runs as expected" as doable. Fast is right out. Any advice here is strictly implementation-dependent.

  1. You can clean up your regular expressions. I recommend using the word-boundary anchor \b. In all cases, I strongly recommend playing with your regex interactively on regex101 or an equivalent.

  2. You can write your own search function in Python. It will be slower by running in Python and faster by skipping storage of matches and other generality.

  3. You can compare your speeds against simple string .count(). You will need to use lower() and decide if the word "this" matches "sthisany" or not.

  4. You can modify your test function to actually have a hundred lines, e.g., text = (text + '\n')*100.

  5. You can use PyPy, which will generally speed up your execution by trading away some start-up time and some metaprogramming flexibility.

  6. You can write a small snippet of C code and learn to call it from Python. Fun on its own.
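For point 1, a minimal sketch of the \b approach (the function name is made up for the sketch; note that \b treats an apostrophe as a word boundary, so unlike the asker's rules it counts "clock" inside "3o'clock"):

```python
import re

def count_word(word, text):
    # \b is a zero-width word-boundary anchor; re.escape keeps `word` literal.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, re.IGNORECASE))

count_word("clock", "its 3o'clock in the morning")  # 1 -- the asker expects 0 here
```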

I recommend you keep notes and comparisons and turn those in with your homework, not just the final product.

Comments
