writing flexible regex expressions

Question

I am trying to write a program that reads text from screenshots and identifies various kinds of PII in it. I use pytesseract to read the text, and I am writing regexes for URLs, email IDs, etc. Here is an example of a function that takes a string and returns True for email IDs and False otherwise:

import re

def email_regex(text):
    pattern = re.compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))

This function works well for email IDs in a proper format ([email protected]), but since the input comes from pytesseract, the text is not guaranteed to be properly formatted. For example, my function returns False for abc@xyzdd. I'm running into the same issue with my URL regex, domain name regex, etc. Is there a way to make my regexes more robust to OCR errors from pytesseract?

I have tried the accepted solution from this answer, but with it the regex functions return True for random words as well. Any help resolving this would be greatly appreciated.

EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these would be very useful.

def domain_regex(text):  # header missing in the original post; the name is assumed
    pattern = re.compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
    return pattern.match(text)


def url_regex(text):
    pattern = re.compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)

Where do you want to draw the line? Isn't something like .+@[^@]+ sufficient, matching everything that has exactly one @ symbol in the middle? If your OCR fails to recognize the @ symbol correctly, there's little hope that the result is easily recognizable as an email address.
– Thomas, Commented May 25, 2020 at 12:22
Something like abc@xyzcom should be acceptable. A missing @ is going to be difficult to manage, as you pointed out. Similarly, httyis://www.facebook.com should be acceptable to the url_regex...
– WitchKingofAngmar, Commented May 25, 2020 at 12:25
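
To make the suggestion from the comments concrete, here is a minimal sketch of how the pattern .+@[^@]+ behaves under re.fullmatch. The names `loose`, `one_at`, and `has_single_at` are hypothetical and not from the thread. Because the leading .+ can itself consume an @, the stricter variant below excludes @ (and whitespace) on both sides; it is one possible reading of "exactly one @", not a full email validator.

```python
import re

# Pattern from the comment: the greedy ".+" may itself contain "@".
loose = re.compile(r".+@[^@]+")

# Hypothetical stricter variant: exactly one "@", with non-space text on both sides.
one_at = re.compile(r"[^@\s]+@[^@\s]+")


def has_single_at(token):
    """Return True if token contains exactly one '@' with text on both sides."""
    return one_at.fullmatch(token) is not None


for sample in ["abc@xyzcom", "abc@xyzdd", "hello", "a@b@c"]:
    print(sample, bool(loose.fullmatch(sample)), has_single_at(sample))
```

Both patterns accept abc@xyzcom and abc@xyzdd and reject plain words; they differ only on tokens such as a@b@c, which the original pattern accepts.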

Gustav Rasmussen · Accepted Answer · 2020-05-25 12:49:31Z

0

Perhaps add some flags, such as re.IGNORECASE for case-insensitive matching and re.DOTALL so that . also matches newlines:

import re

# Match email ID (re.compile takes a single flags argument, so combine flags with |):
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$", re.I | re.S)
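
As a quick check of what this pattern accepts, the sketch below wraps it in a helper and runs it on a few strings, including the abc@xyzdd case from the question. The names `EMAIL_LIKE` and `email_like` are hypothetical. Note that re.compile accepts one flags argument, so the two flags are combined with |.

```python
import re

# The answer's pattern, with both flags combined into one argument.
EMAIL_LIKE = re.compile(
    r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$",
    re.I | re.S,
)


def email_like(text):
    """Return True if the string matches the lenient email pattern."""
    return EMAIL_LIKE.match(text) is not None


for sample in ["abc@xyz.com", "abc@xyzdd", "ABC@XYZ.COM", "randomword"]:
    print(sample, email_like(sample))
```

Since the dot before the final part is optional, strings with a dropped dot such as abc@xyzdd are accepted, while a word without @ is still rejected.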

Match URLs:

https://gist.github.com/gruber/8891611
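
For OCR-garbled schemes such as the httyis://www.facebook.com example from the comments, a strict http/https prefix will not match. One possible approach, which is not part of the original answer, is to match only the rough shape of a URL with a regex and then compare the scheme to known ones with difflib from the standard library. The names `URL_SHAPE` and `looks_like_url` and the 0.7 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Loose shape: a short alphabetic "scheme", then "://", then a dotted host name.
URL_SHAPE = re.compile(r"([a-z]{3,7})://([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.I)


def looks_like_url(text, threshold=0.7):
    """Return True if text starts with something resembling an http(s) URL.

    The threshold is an assumed value; tune it against real OCR output.
    """
    m = URL_SHAPE.match(text)
    if m is None:
        return False
    scheme = m.group(1).lower()
    # Accept the scheme if it is similar enough to a known one.
    return any(
        SequenceMatcher(None, scheme, known).ratio() >= threshold
        for known in ("http", "https")
    )


for sample in ["httyis://www.facebook.com", "https://example.com", "hello world"]:
    print(sample, looks_like_url(sample))
```

A lower threshold tolerates more OCR damage but also lets more unrelated tokens through, which is the same trade-off the question describes for overly permissive regexes.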

edited May 25, 2020 at 12:49

answered May 25, 2020 at 12:23


6 Comments

WitchKingofAngmar Over a year ago

Can you help me with an example where adding a flag takes care of the case I've outlined in the body of the question?

Gustav Rasmussen Over a year ago

Yes: the character sets currently match only lowercase letters (a-z), which won't find emails like [email protected], right?

Gustav Rasmussen Over a year ago

So adding the re.I flag will give more robustness
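
A small sketch of the effect described here, using a simplified stand-in for the question's email pattern (the `PATTERN` string and variable names below are assumptions, not the asker's exact regex). Both compiled patterns use lowercase-only character classes; only the flag differs.

```python
import re

# Simplified stand-in for the question's pattern: lowercase classes only.
PATTERN = r"[a-z0-9._-]+@[a-z0-9-]+\.[a-z]{2,}"

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)

text = "ABC@XYZ.COM"
print(bool(case_sensitive.fullmatch(text)))    # uppercase input, no flag
print(bool(case_insensitive.fullmatch(text)))  # uppercase input, with re.IGNORECASE

# The flag only affects letter case; a missing dot is still rejected.
print(bool(case_insensitive.fullmatch("abc@xyzdd")))
```

The flag helps with OCR output that changes letter case, but it does not address structural damage such as a dropped dot.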

WitchKingofAngmar Over a year ago

That's true. Thanks for the tip. Is there anything I can do to make a regex like the one I posted in the question more robust against the edge cases I described?

WitchKingofAngmar Over a year ago

This was helpful. Can you edit your answer so that I can select it as the accepted answer? Also, if you could look at the updated question body and the other functions as well, I'd be grateful.
