I'm trying to filter texts using regex in Python. The goal is to check whether the text contains the word W where W is neither preceded by X nor followed by Y. So let's say:

W="day", X="awful", Y="light"

"what a beautiful day it is" => should pass
"nice day"          => should pass    
"awful day"         => should fail
"such an awful day" => should fail
"the day light"     => should fail
"awful day light"   => should fail
"day light"         => should fail

I've tried several things like:

r".*\b(?!awful\b)day\b.*"
r"\W*\b(?!awful\b)day\b.*"  => to also cover \n and \r, since '.' doesn't match them

r".*\b(day)\b(?!light\b).*"
r"\W*\b(day)\b(?!light\b)\W*"  => to also cover \n and \r, since '.' doesn't match them

So a complete example would be (this one should fail):

if re.search(r".*\b(?!awful\b)day\b.*", "such an awful day", re.UNICODE):
    print("Found awful day! no good!")

I'm still wondering how to do that. Any ideas?

  • Does it have to use regex? What if the string is just "daylight"? How about "today"? How about "this day is awful"? Commented Jan 27, 2014 at 22:03
  • I get your point, but this is targeted only at some particular words, like people's names, etc. Maybe I didn't pick the best words for the example. I thought regex would be cool, but I'm starting to think it might be better to do it in a couple more lines of code without regex. Commented Jan 27, 2014 at 22:11
  • Regex might be more power than you need. Split on whitespace, then do your own inspection. Commented Jan 27, 2014 at 22:15
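The split-and-inspect idea from the comments could look roughly like the sketch below. The function name `passes_by_tokens` is made up for illustration, and two things are assumptions rather than part of the question: a text without W at all is treated as not passing, and punctuation attached to a word (such as "day,") is not handled.

```python
def passes_by_tokens(text, w, x, y):
    """Return True if w occurs as a whole word and no occurrence
    has x directly before it or y directly after it."""
    words = text.split()  # split on any run of whitespace, including \n and \r
    found = False
    for i, word in enumerate(words):
        if word != w:
            continue
        found = True
        # Reject as soon as one occurrence has a forbidden neighbour
        if i > 0 and words[i - 1] == x:
            return False
        if i + 1 < len(words) and words[i + 1] == y:
            return False
    return found


for sample in ["nice day", "awful day", "the day light", "daylight"]:
    print(sample, "->", passes_by_tokens(sample, "day", "awful", "light"))
```

Because the comparison is on whole tokens, "daylight" and "today" never count as W, which covers the cases raised in the first comment.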

2 Answers


Something like this?

 # ^(?s)((?!X).)*W((?!Y).)*$

 ^ 
 (?s)
 (
      (?! X )
      . 
 )*
 W 
 (
      (?! Y )
      . 
 )*
 $

or, with word boundaries

 # ^(?s)((?!\bX\b).)*\bW\b((?!\bY\b).)*$

 ^ 
 (?s)
 (
      (?! \b X \b )
      . 
 )*
 \b W \b 
 (
      (?! \b Y \b )
      . 
 )*
 $

Edit: It was unclear whether you meant that X<->W<->Y are separated by whitespace
or by any number of characters. This expanded, commented example shows both ways.
Good luck!
Note: the (?add-remove) construct is a modifier group. Typically it's a way to
embed options like s (Dot-All), i (Ignore case), etc., within the regex.
(?s) adds the Dot-All modifier, and (?si) does the same with ignore case as well.

 #  ^(?s)(?!.*(?:\bX\b\s+\bW\b|\bW\b\s+\bY\b))(?:.*\b(W)\b.*|.*)$

 # This regex validates that W is not preceded by X
 # nor followed by Y.
 # It also optionally finds W.
 # It only fails if the input is invalid.
 # If it passes, you can check whether W is present by
 # examining capture group 1.

 ^                         # Beginning of string
 (?s)                      # Modifier group, with s = DOT_ALL
 (?!                       # Negative lookahead assertion
      .*                        # 0 or more any character (dot-all is set, so we match newlines too)
      (?:
           \b X \b \s+ \b W \b       # Trying to match X, 1 or more whitespaces, then W
        |  \b W \b \s+ \b Y \b       # Or, Trying to match W, 1 or more whitespaces, then Y

           # Substitute this to find any interval between X<->W<->Y
           #    \b X \b .* \b W \b       <- Trying to match X, 0 or more any char, then W
           # |  \b W \b .* \b Y \b       <- Or, Trying to match W, 0 or more any char, then Y
      )
 )

 # Still at start of line. 
 # If here, we didn't find any X<->W, nor W<->Y.
 # Optionally finds W in group 1.
 (?:
      .* \b 
      ( W )                     # (1), W
      \b .* 
   |  
      .* 
 )
 $                         # End of string
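As a rough Python rendering of the commented regex above, the sketch below builds the pattern from W, X and Y and compiles it in verbose mode. The helper name `build_pattern` is hypothetical. The Dot-All option is passed as `re.DOTALL` instead of writing `(?s)` after `^`, because recent Python versions (3.11 and later, as far as I know) reject inline global flags that are not at the very start of the pattern.

```python
import re


def build_pattern(w, x, y):
    """Compile a pattern that fails when x directly precedes w
    or y directly follows w (separated by whitespace)."""
    w, x, y = (re.escape(s) for s in (w, x, y))
    return re.compile(
        rf"""
        ^
        (?!                              # reject the whole string if ...
            .*
            (?: \b{x}\b \s+ \b{w}\b      # ... x, whitespace, w appears
              | \b{w}\b \s+ \b{y}\b      # ... or w, whitespace, y appears
            )
        )
        (?: .* \b({w})\b .* | .* )       # group 1 is set only when w is present
        $
        """,
        re.VERBOSE | re.DOTALL,
    )


pat = build_pattern("day", "awful", "light")
for sample in ["nice day", "awful day", "the day light", "no match here"]:
    m = pat.search(sample)
    print(repr(sample), "->", "fail" if m is None else "pass, group 1 = %r" % m.group(1))
```

A `None` result means the text is invalid; a match with an empty group 1 means the text is valid but does not contain W at all.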

3 Comments

Wow, thanks, that seems to be working. I'm going to run more tests, but so far it works!
What does '?s' mean? I'm googling but couldn't find it yet.
@sebastian - Sorry, didn't see this till now. (?s) is the Dot-All embedded modifier. See my latest edit. Good luck!
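To make the effect of the Dot-All modifier concrete, here is a small sketch using Python's `re` module. The sample text and variable names are made up for illustration. It compares a plain pattern, the embedded `(?s)` form, and the equivalent `re.DOTALL` argument.

```python
import re

text = "first line\nsecond line"

# Without Dot-All, '.' does not match '\n', so the pattern cannot span lines
plain = re.search(r"first.*second", text)

# The embedded modifier (?s) at the start of the pattern turns Dot-All on
inline = re.search(r"(?s)first.*second", text)

# Passing re.DOTALL as a flag has the same effect
flagged = re.search(r"first.*second", text, re.DOTALL)

# (?si) combines Dot-All with ignore case
combined = re.search(r"(?si)FIRST.*SECOND", text)

print(plain is None, inline is not None, flagged is not None, combined is not None)
```

In Python the embedded form is safest at the very beginning of the pattern; the flag argument avoids the question of placement entirely.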

You are almost there. Try:

(?<!\bawful\b )\bday\b(?!\s+\blight\b)

Demo:

st='''\
"what a beautiful day it is" => should pass
"nice day"          => should pass    
"awful day"         => should fail
"such an awful day" => should fail
"the day light"     => should fail
"awful day light"   => should fail
"day light"         => should fail'''

W, X, Y = 'day', 'awful', 'light'
pat=r'(?<!\b{}\b )\b{}\b(?!\s+\b{}\b)'.format(X, W, Y)

import re

# Print only the lines where W appears without X before it or Y after it
for line in st.splitlines():
    m = re.search(pat, line)
    if m:
        print(line)

1 Comment

This seems to match "awful day" when there is extra whitespace between the two words. I don't know if that can be resolved while keeping the lookbehind assertion.
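One possible way around the whitespace issue raised in this comment: Python's `re` module only accepts fixed-width lookbehinds, so something like `(?<!\bawful\s+)` is not available there. The sketch below drops the lookbehind and uses two searches instead. The name `passes_two_step` is hypothetical, and note one difference from the pattern above: this version rejects the text if any occurrence of W has a forbidden neighbour, whereas the single pattern accepts the text if at least one occurrence is acceptable.

```python
import re


def passes_two_step(text, w="day", x="awful", y="light"):
    """Return True if w occurs as a whole word and is never directly
    preceded by x or followed by y, with any amount of whitespace between."""
    w, x, y = (re.escape(s) for s in (w, x, y))
    # Step 1: any forbidden pair, with one or more whitespace characters between
    bad = re.compile(rf"\b{x}\s+{w}\b|\b{w}\s+{y}\b")
    # Step 2: w must actually be present as a whole word
    good = re.compile(rf"\b{w}\b")
    return bad.search(text) is None and good.search(text) is not None


for sample in ["nice day", "awful  day", "day \n light", "daylight"]:
    print(repr(sample), "->", passes_two_step(sample))
```

Since `\s+` also matches newlines and tabs, the pair is caught regardless of how the two words are separated.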
