I am using a regex to capture a string from a Word file (and many such Word files). Oddly, a seemingly good pattern that works on regex101.com does not work in Python.

Just in case it has something to do with the word file, I am attaching a drive link here for your reference.

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

# setting directory
os.chdir('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc-test')

text = textract.process('/Users/aartimalik/Documents/GitHub/revenue_procurement/pdfs/bidsummaries-doc/081204R0.doc_133.doc')
text = text.decode("utf-8")

nob = text.split('BID OPENING DATE')
del nob[0]

txt = nob[0]

engineers_estimate = re.search('ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)
if not (engineers_estimate is None):
    engineers_estimate = engineers_estimate.group(1)
else:
    engineers_estimate = 'Not captured'

amount_under_over = re.search('(AMOUNT (?:OVER|UNDER))\s+((?:\d{1,3}(?:\,\d{3})*(?:\.\d\d)?))\b', txt)
if not (amount_under_over is None):
    amount_under_over1 = amount_under_over.group(2)
else:    
    amount_under_over1 = 'Not captured'

The code successfully captures the engineers_estimate variable but cannot capture anything for amount_under_over.

print(amount_under_over) prints None.

According to this regex101 template, the code should capture the relevant amount under over string. Thank you so much!

Edit: Removing \b from the pattern worked! I'm not sure why it worked though.

  • Usually regex strings in Python should be raw strings. Please try to define your regexes with the r prefix first. Also, it would be great if you could post just a sample text instead of a drive link to some not-very-programmer-friendly file type. Commented Jan 31, 2023 at 23:11
  • Thank you so much, and noted! I just wasn't sure if it was the text or the file - but I would make sure to add a sample text regardless. Thanks! Commented Jan 31, 2023 at 23:14

1 Answer

I think the problem is escape sequences, which Python processes in ordinary string literals by default. You can put r before your string to make it a raw string, for example: engineers_estimate = re.search(r'ENGINEERS EST\s+(?:^|\s)(?=.)((?:0|(?:[1-9](?:\d*|\d{0,2}(?:,\d{3})*)))?(?:\.\d*[0-9])?)(?!\S)', txt)

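Removing \b fixed your problem because, in a non-raw string, Python converts \b into the backspace character (\x08) before the regex engine ever sees it. Your pattern was therefore requiring a literal backspace after the number instead of a word boundary. The other escapes in your patterns (\s, \d, \.) are not recognized Python string escapes, so they reach the regex engine unchanged, which is why engineers_estimate was still captured.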
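To see the difference for yourself, here is a minimal sketch. The sample text is made up (I don't have the contents of your .doc file), and the names body, broken and fixed are only for illustration. The pattern body is written as a raw string, and only the trailing \b is appended in both non-raw and raw form:

```python
import re

# Hypothetical sample line; the real document text is not shown in the question.
sample = "AMOUNT UNDER   1,234,567.89 "

# In a normal string literal, '\b' is ONE character: backspace (U+0008).
# In a raw string literal, r'\b' is TWO characters: a backslash and 'b'.
assert '\b' == '\x08' and len('\b') == 1
assert len(r'\b') == 2

# Pattern body as a raw string, so the regex engine receives \s, \d and \. as written.
body = r'(AMOUNT (?:OVER|UNDER))\s+(\d{1,3}(?:,\d{3})*(?:\.\d\d)?)'

broken = body + '\b'    # ends with a literal backspace character
fixed = body + r'\b'    # ends with the regex word-boundary assertion

print(re.search(broken, sample))          # None: the sample contains no backspace
print(re.search(fixed, sample).group(2))  # 1,234,567.89
```

With the r prefix on the whole pattern you can keep the \b and still get the match, rather than having to remove it.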

3 Comments

\b certainly has nothing to do with a backspace. It is word boundary match.
@SUTerliakov Yes, \b is a word boundary in regex syntax, but a normal (non-raw) Python string processes escape sequences first, so \b is interpreted as the backspace character.
Augh, sorry, I misinterpreted this statement as still applying to raw strings, you're right.
