python regex exclude text containing word

Question

I'm trying to filter texts using regex in Python. The goal is: check whether the text has the word W not preceded by X and not followed by Y. So let's say:

W="day", X="awful", Y="light"

"what a beautiful day it is" => should pass
"nice day"          => should pass    
"awful day"         => should fail
"such an awful day" => should fail
"the day light"     => should fail
"awful day light"   => should fail
"day light"         => should fail

I've tried several things like:

r".*\b(?!awful\b)day\b.*"
r"\W*\b(?!awful\b)day\b.*"  => to be able to include \n and \r, since '.' doesn't match them

r".*\b(day)\b(?!light\b).*"
r"\W*\b(day)\b(?!light\b)\W*"  => to be able to include \n and \r, since '.' doesn't match them

So a complete example would be (this one should fail):

if re.search(r".*\b(?!awful\b)day\b.*", "such an awful day", re.UNICODE):
    print("Found awful day! no good!")

I'm still wondering how to do that. Any ideas?

Does it have to use regex? What if the string is just daylight? How about today? How about this day is awful?
– NullUserException, Commented Jan 27, 2014 at 22:03
I get your point, but this is targeted only at some particular words, like people's names, etc. Maybe I didn't pick the best words for the example. I thought regex would be cool, but I'm starting to think it might be better to do it in a couple more lines of code without regex.
– Sebastian, Commented Jan 27, 2014 at 22:11
Regex might be more power than you need. Split on whitespace, then do your own inspection.
– user557597, Commented Jan 27, 2014 at 22:15
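
The split-and-inspect idea from the comment above could look roughly like this. It is only a sketch: `passes` is a made-up name, tokens are compared exactly (no punctuation stripping or case folding), and a text passes only if W occurs and no occurrence of it is preceded by X or followed by Y, which is one reading of the examples in the question.

```python
def passes(text, w="day", x="awful", y="light"):
    """Hypothetical helper: True if w occurs and is never next to x (before) or y (after)."""
    tokens = text.split()  # split on any run of whitespace, including \n and \r
    found = False
    for i, token in enumerate(tokens):
        if token != w:
            continue
        found = True
        preceded_by_x = i > 0 and tokens[i - 1] == x
        followed_by_y = i + 1 < len(tokens) and tokens[i + 1] == y
        if preceded_by_x or followed_by_y:
            return False
    return found


for sample in ["nice day", "such an awful day", "the day light"]:
    print(sample, "=>", "pass" if passes(sample) else "fail")
```

Because the text is split on whitespace first, strings such as daylight or today never count as the word day, which addresses the first comment's concern without any word-boundary assertions.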

score 2 · Accepted Answer · 2014-01-28 16:37:41Z

Something like this?

 # (?s)^((?!X).)*W((?!Y).)*$

 (?s)
 ^ 
 (
      (?! X )
      . 
 )*
 W 
 (
      (?! Y )
      . 
 )*
 $

or, with word boundaries

 # (?s)^((?!\bX\b).)*\bW\b((?!\bY\b).)*$

 (?s)
 ^ 
 (
      (?! \b X \b )
      . 
 )*
 \b W \b 
 (
      (?! \b Y \b )
      . 
 )*
 $

edit - It was unclear whether you meant that X<->W<->Y are separated by whitespace
or by any number of characters. This expanded, commented example shows both ways.
Good luck!
Note - the (?add-remove) construct is a modifier group. Typically it's a way to
embed options like s (Dot-All), i (Ignore case), etc., within the regex.
Here (?s) means add the Dot-All modifier, and (?si) is the same but with ignore case as well.

 #  (?s)^(?!.*(?:\bX\b\s+\bW\b|\bW\b\s+\bY\b))(?:.*\b(W)\b.*|.*)$

 # This regex validates that W is not preceded by X
 # nor followed by Y.
 # It also optionally finds W.
 # It only fails if the text is invalid.
 # If it passes, you can check whether W is present by
 # examining capture group 1.

 (?s)                      # Modifier group, with s = DOT_ALL (must come first in Python 3.11+)
 ^                         # Beginning of string
 (?!                       # Negative lookahead assertion
      .*                        # 0 or more any character (dot-all is set, so we match newlines too)
      (?:
           \b X \b \s+ \b W \b       # Trying to match X, 1 or more whitespaces, then W
        |  \b W \b \s+ \b Y \b       # Or, Trying to match W, 1 or more whitespaces, then Y

           # Substitute this to find any interval between X<->W<->Y
           #    \b X \b .* \b W \b       <- Trying to match X, 0 or more any char, then W
           # |  \b W \b .* \b Y \b       <- Or, Trying to match W, 0 or more any char, then Y
      )
 )

 # Still at start of line. 
 # If here, we didn't find any X<->W, nor W<->Y.
 # Optionally finds W in group 1.
 (?:
      .* \b 
      ( W )                     # (1), W
      \b .* 
   |  
      .* 
 )
 $                         # End of string
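
For anyone wanting to try the expanded pattern from Python, here is one possible way to wire it up. Treat it as a sketch under a few assumptions: `build_validator` is a hypothetical helper name, the words are passed through `re.escape` so they are matched literally, and Dot-All is supplied as the `re.DOTALL` flag rather than inline, because Python 3.11 and later raise an error for a global inline flag such as `(?s)` that is not at the very start of the pattern.

```python
import re


def build_validator(w, x, y):
    """Hypothetical helper: compile the validation pattern for literal words w, x, y."""
    w, x, y = (re.escape(s) for s in (w, x, y))
    pattern = r"""
        ^
        (?! .* (?: \b{x}\b \s+ \b{w}\b      # X, whitespace, then W
                 | \b{w}\b \s+ \b{y}\b ) )  # or W, whitespace, then Y
        (?: .* \b({w})\b .*                 # optionally capture W in group 1
          | .* )
        $
    """.format(w=w, x=x, y=y)
    return re.compile(pattern, re.DOTALL | re.VERBOSE)


validator = build_validator("day", "awful", "light")

for text in ["nice day", "such an awful day", "the day light", "no keyword here"]:
    m = validator.match(text)
    if m is None:
        print(repr(text), "-> invalid")
    elif m.group(1) is None:
        print(repr(text), "-> valid, but W is absent")
    else:
        print(repr(text), "-> valid, W found")
```

Checking `m.group(1)` separates "the text is fine and contains W" from "the text is fine because W never appears", which matters if the filter should only accept texts that actually mention W.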

edited Jan 28, 2014 at 16:37

answered Jan 27, 2014 at 22:05

user557597

3 Comments

Sebastian Over a year ago

Wow, thanks, that seems to be working. I'm going to run more tests, but so far it worked!

Sebastian Over a year ago

What does '?s' mean? I'm googling but couldn't find it yet.

user557597 Over a year ago

@sebastian - Sorry, didn't see this till now. (?s) is the Dot-All embedded modifier. See my latest edit. Good luck!

dawg · Accepted Answer · 2014-01-27 22:47:04Z

2

You are almost there. Try:

(?<!\bawful\b )\bday\b(?!\s+\blight\b)

Demo:

st='''\
"what a beautiful day it is" => should pass
"nice day"          => should pass    
"awful day"         => should fail
"such an awful day" => should fail
"the day light"     => should fail
"awful day light"   => should fail
"day light"         => should fail'''

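import re

W, X, Y = 'day', 'awful', 'light'
# Lookbehind must be fixed-width in Python's re, so it only covers a single space after X.
pat = r'(?<!\b{}\b )\b{}\b(?!\s+\b{}\b)'.format(X, W, Y)

# Print only the lines where W is neither preceded by X nor followed by Y.
for line in st.splitlines():
    m = re.search(pat, line)
    if m:
        print(line)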

answered Jan 27, 2014 at 22:47

dawg

1 Comment

user557597 Over a year ago

Seems to match "awful  day" when there is extra whitespace between the two words. I don't know if that can be resolved with the lookbehind assertion there.
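
The whitespace issue raised in this comment can be reproduced, and one possible workaround is to collapse whitespace runs before matching, since the standard `re` module only accepts fixed-width lookbehinds. This is a sketch rather than part of the answer: `has_clean_w` is a hypothetical helper, and normalising the text this way assumes the original spacing does not need to be preserved.

```python
import re

W, X, Y = "day", "awful", "light"
PAT = re.compile(
    r"(?<!\b{x}\b )\b{w}\b(?!\s+\b{y}\b)".format(
        w=re.escape(W), x=re.escape(X), y=re.escape(Y)
    )
)


def has_clean_w(text):
    """Hypothetical helper: collapse whitespace runs, then apply the lookbehind pattern."""
    normalized = re.sub(r"\s+", " ", text)  # tabs, newlines, repeated spaces -> one space
    return PAT.search(normalized) is not None


# Two spaces between the words slip past the single-space lookbehind.
print(PAT.search("awful  day") is not None)
# After normalisation the lookbehind sees "awful " directly before "day".
print(has_clean_w("awful  day"))
```

Note that this pattern accepts a text as soon as one occurrence of W is clean, so a text containing both "awful day" and "nice day" would still match.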
