Confusing: python regex does not capture a working regex pattern

Question

I am using a regex to capture a string from a Word file (and many such Word files). Oddly, a seemingly good pattern that works on regex101.com does not match in Python.

Just in case it has something to do with the Word file, I am attaching a drive link here for your reference.

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')

text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

nob = text.split('BID OPENING DATE')
del nob[0]

txt = nob[0]

engineers_estimate = re.search('ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
if not (engineers_estimate is None):
    engineers_estimate = engineers_estimate.group(1)
else:
    engineers_estimate = 'Not captured'

amount_under_over = re.search('(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:\,\d{3})*(?:\.\d\d)?))\b', txt)
if not (amount_under_over is None):
    amount_under_over1 = amount_under_over.group(2)
else:
    amount_under_over1 = 'Not captured'

The code successfully captures the engineers_estimate variable but cannot capture anything for amount_under_over.

print(amount_under_over) prints None.

According to this regex101 template, the code should capture the relevant amount under over string. Thank you so much!

Edit: Removing \b from the pattern worked! I'm not sure why it worked though.

Regex patterns in Python should usually be raw strings. Please try defining your regexes with the r prefix first. It would also help if you posted a short sample of the text instead of a drive link to a file type that is awkward for programmers to open.
– STerliakov, Commented Jan 31, 2023 at 23:11
Thank you so much, and noted! I just wasn't sure whether it was the text or the file, but I will make sure to add a sample text regardless. Thanks!
– Pepa, Commented Jan 31, 2023 at 23:14

Aleksa Majkic · Accepted Answer · 2023-01-31 23:12:20Z

I think the problem is the escape sequences that Python processes in ordinary string literals. Put r before the opening quote to mark the pattern as a raw string, for example:

engineers_estimate = re.search(r'ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)

The same applies to the amount_under_over pattern.

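Removing \b fixed your problem because, in a non-raw string literal, \b is the escape sequence for the backspace character (\x08), so the regex engine was looking for a literal backspace after the amount. \s and \d are not recognized string escapes, so Python leaves them unchanged, which is why the first pattern worked even without the r prefix.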
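A minimal sketch of the difference, assuming a made-up sample line (the real text from the .doc file is not shown in the question, so `sample` here is hypothetical). It compares the string Python builds from the question's non-raw literal with the raw-string version of the same pattern:

```python
import re

# Hypothetical sample line; the real text comes from the asker's .doc file.
sample = "AMOUNT UNDER     1,234,567.89\n"

# In an ordinary literal "\b" is one character, the backspace (0x08).
# In a raw literal it stays as backslash + "b", which re reads as a word boundary.
assert "\b" == "\x08" and len("\b") == 1
assert len(r"\b") == 2

# The string Python builds from the question's non-raw literal.
# Backslashes are doubled here (except in the final "\b") only to avoid
# invalid-escape warnings; the resulting string is the same.
plain = "(AMOUNT (?:OVER|UNDER))\\s+((?:\\d{1,3}(?:\\,\\d{3})*(?:\\.\\d\\d)?))\b"

# The same pattern written as a raw string.
raw = r"(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:\,\d{3})*(?:\.\d\d)?))\b"

plain_match = re.search(plain, sample)  # needs a literal backspace after the amount
raw_match = re.search(raw, sample)      # needs a word boundary after the amount

print(plain_match)         # None
print(raw_match.group(2))  # 1,234,567.89
```

With the raw string, group(2) holds the amount, so the rest of the question's code can stay as it is.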

3 Comments

STerliakov Over a year ago

\b certainly has nothing to do with a backspace. It is a word-boundary match.

Aleksa Majkic Over a year ago

@SUTerliakov Yes, \b is a word boundary in regex syntax, but an ordinary string literal processes escape sequences first, so \b is interpreted as the backspace character before the regex engine ever sees it.

STerliakov Over a year ago

Augh, sorry, I misinterpreted this statement as still applying to raw strings, you're right.
