9

I'm using regex to find occurrences of string patterns in a body of text. Once I find that the string pattern occurs, I want to get x words before and after the string as well (x could be as small as 4, but preferably ~10 if still as efficient).

I am currently using regex to find all instances, but occasionally it will hang. Is there a more efficient way to solve this problem?

This is the solution I currently have:

sub = r'(\w*)\W*(\w*)\W*(\w*)\W*(\w*)\W*(%s)\W*(\w*)\W*(\w*)\W*(\w*)\W*(\w*)' % result_string #refind string and get surrounding += 4 words
surrounding_text = re.findall(sub, text)
for found_text in surrounding_text:
  result_found.append(" ".join(map(str,found_text)))
8
  • 2
    You don't need to use a regex to get the window. Just find the position of the match and go from there. Though I'm not sure why it'd be slow, maybe the *s, and I'm not sure why you're using those btw. Commented Aug 8, 2015 at 18:58
  • 1
    How strict is your definition of a "word". e.g. is 'foo-bar-baz' 1 word or 3? Given your regex, it's 3, but if it can be 1, then your problem might simplify a bit ... Commented Aug 8, 2015 at 19:02
  • 1
    definition of a word is very loose. How does that simplify things? Commented Aug 8, 2015 at 19:04
  • Why do you need to individually capture surrounding words? And what input causes it to hang? Commented Aug 8, 2015 at 19:17
  • 1
    I guess I could capture characters if that's any easier, I just need to have some notion of what surrounds the matches. Commented Aug 8, 2015 at 19:29

4 Answers 4

10

I'm not sure if this is what you're looking for:

>>> text = "Hello, world. Regular expressions are not always the answer."
>>> words = text.partition("Regular expressions")
>>> words
('Hello, world. ', 'Regular expressions', ' are not always the answer.')
>>> words_before = words[0]
>>> words_before
'Hello, world. '
>>> separator = words[1]
>>> separator
'Regular expressions'
>>> words_after = words[2]
>>> words_after
' are not always the answer.'

Basically, str.partition() splits the string into a 3-element tuple. In this example, the first element is all of the words before the specific "separator", the second element is the separator, and the third element is all of the words after the separator.

Sign up to request clarification or add additional context in comments.

1 Comment

I agree with the sentiment! Regular expressions are clearly not the answer here. But can you make something like this work for when text contains multiple matches?
2

The main problem with your pattern is that it begins with optional things that causes a lot of tries for each positions in the string until a match is found. The number of tries increases with the text size and with the value of n (the number of words before and after). This is why only few lines of text suffice to crash your code.

A way consists to begin the pattern with the target word and to use lookarounds to capture the text (or the words) before and after:

keyword (?= words after ) (?<= words before - keyword)

Starting a pattern with the searched word (a literal string) makes it very fast (because a fast algorithm is used to find this literal string positions in the string and then the pattern is tested only at these positions), and words around are then quickly found from this position in the string. Unfortunately the re module has some limitations and doesn't allow variable length lookbehinds (as many other regex flavors).

The new regex module supports variable length lookbehinds and other useful features like the ability to store the matches of a repeated capture group (handy to get the separated words in one shot).

import regex

text = '''In strange contrast to the hardly tolerable constraint and nameless
invisible domineerings of the captain's table, was the entire care-free
license and ease, the almost frantic democracy of those inferior fellows
the harpooneers. While their masters, the mates, seemed afraid of the
sound of the hinges of their own jaws, the harpooneers chewed their food
with such a relish that there was a report to it.'''

word = 'harpooneers'
n = 4

pattern = r'''
 %s  # target word
(?<= # content before
    (?<before> (?: (?<wdb>\w+) \W+ ){0,%d} )
    \m (?<target> %s ) \M
)
(?=  # content after
    (?<after>  (?: \W+ (?<wda>\w+) ){0,%d} )
)
''' % (word, n, word, n)

rgx = regex.compile(pattern, regex.VERBOSE | regex.IGNORECASE)

class Result(object):
    def __init__(self, m):
        self.target_span = m.span()
        self.excerpt_span = (m.starts('before')[0], m.ends('after')[0])
        self.excerpt = m.expandf('{before}{target}{after}')
        self.words_before = m.captures('wdb')[::-1]
        self.words_after = m.captures('wda')


results = [Result(m) for m in rgx.finditer(text)]

print(results[0].excerpt)
print(results[0].excerpt_span)
print(results[0].words_before)
print(results[0].words_after)
print(results[1].excerpt)

3 Comments

AWESOME! would you mind make it work for multi-word search? I tried modify it by enclosing %s between quotes but no luck
@scavenger: What do you mean by "multi-word search" ? Note that the pattern uses the VERBOSE flag, so, If you want to figure several words separated by blank characters, you have to escape them several\ words\ separated\ by\ spaces or to put the searched expression between \Q and \E to quote them for the pattern, like that: \Qseveral words separated by spaces\E, otherwise spaces are ignored.
@scavenger: note also that the way (derived from C formatted strings) I inserted the searched word and the number of repetitions is probably a bit "old school" today since there's now many ways to write a formatted string in python.
0

Making a regex (well, anything, for that matter) with "as much repetitions as you will ever possibly need" is an extremely bad idea. That's because you

  • do an excessive amount of needless work every time
  • cannot really know for sure how much you will ever possibly need, thus introducing an arbitrary limitation

The bottom line for the below solutions: the 1st solution is the most effective one for large data; the 2nd one is the closest to your current, but scales much worse.


  1. strip your entities to exactly what you are interested in at each moment:

    • find the substring (e.g. str.index. For whole words only, re.find with e.g. r'\b%s\b'%re.escape(word) is more suitable)
    • go N words back.
      • Since you mentioned a "text", your strings are likely to be very large, so you want to avoid copying potentially unlimited chunks of them.
      • E.g. re.finditer over a substring-reverse-iterator-in-place according to slices to immutable strings by reference and not copy and Best way to loop over a python string backwards. This will only become better than slicing when the latter is expensive in terms of CPU and/or memory - test on some realistic examples to find out. Doesn't work. re works directly with the memory buffer. Thus it's impossible to reverse a string for it without copying the data.
      • There's no function to find a character from a class in Python, nor an "xsplit". So the fastest way appears to be (i for i,c in enumerate(reversed(buffer(text,0,substring_index)) if c.isspace()) (timeit gives ~100ms on P3 933MHz for a full pass through a 100k string).

Alternatively:

  1. Fix your regex to not be subject to catastrophic backtracking and eliminate code duplication (DRY principle).
    The 2nd measure will eliminate the 2nd issue: we'll make the number of repetitions explicit (Python Zen, koan 2) and thus highly visible and manageable.
    As for the 1st issue, if you really only need "up to known, same N" items in each case, you won't actually be doing "excessive work" by finding them together with your string.

    • The "fix" part here is \w*\W* -> \w+\W+. This eliminates major ambiguity (see the above link) from the fact that each x* can be a blank match.
    • Matching up to N words before the string effectively is harder:
      • with (\w+\W+){,10} or equivalent, the matcher will be finding every 10 words before discovering that your string doesn't follow them, then trying 9,8, etc. To ease it up on the matcher somewhat, \b before the pattern will make it only perform all this work at the beginning of each word
      • lookbehind is not allowed here: as the linked article explains, the regex engine must know how many characters to step back before trying the contained regex. And even if it was - a lookbehind is tried before every character - i.e. it's even more of a CPU hog
      • As you can see, regexes aren't quite cut to match things backwards
    • To eliminate code duplication, either
      • use the aforementioned {,10}. This will not save individual words but should be noticeably faster for large text (see the above on how the matching works here). We can always parse the retrieved chunk of text in more details (with the regex in the next item) once we have it. Or
      • autogenerate the repetitive part
        • note that (\w+\W+)? repeated mindlessly is subject to the same ambiguity as above. To be unambiguous, the expression must be like this (w=(\w+\W+) here for brevity): (w(w...(ww?)?...)?)? (and all the groups need to be non-capturing).

Comments

0

I personally think that using text.partition() is the best option, as it eliminates the messy regular expressions, and automatically leaves output in an easy-to-access tuple.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.