
I am trying to write a program that reads text from screenshots and identifies various kinds of PII in it. I use pytesseract to read the text and am writing regexes for URLs, email IDs, etc. Here is an example of a function that takes a string and returns True for email IDs and False otherwise:

import re

def email_regex(text):
    pattern = re.compile(r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
    return bool(pattern.match(text))

This function works well for email IDs in a proper format ([email protected]), but because the input is text read by pytesseract, it is not guaranteed to be properly formatted. For example, my function returns False for abc@xyzdd. I'm running into the same issue with my URL regex, domain name regex, etc. Is there a way to make my regular expressions more robust to OCR errors from pytesseract?

I have tried following the accepted answer to this question, but that leads to the regex functions returning True for random words as well. Any help resolving this would be greatly appreciated.

EDIT: Here are my URL and domain regexes, where I'm running into the same problem as with my email regex. Any help with these would be very useful.

def domain_regex(text):  # function name assumed; the def line is missing from the post
    pattern = re.compile(r'\b(((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,86}[a-zA-Z0-9]))\.(([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,73}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25})))|((([a-zA-Z0-9])|([a-zA-Z0-9][a-zA-Z0-9\-]{0,162}[a-zA-Z0-9]))\.(([a-zA-Z0-9]{2,12}\.[a-zA-Z0-9]{2,12})|([a-zA-Z0-9]{2,25}))))\b', re.IGNORECASE)
    return pattern.match(text)


def url_regex(text):
    pattern = re.compile(r'(http|https://)?:(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F])+)', re.IGNORECASE)
    return pattern.match(text)
  • Where do you want to draw the line? Isn't something like .+@[^@]+ sufficient, matching everything that has exactly one @ symbol in the middle? If your OCR fails to recognize the @ symbol correctly, there's little hope that the result is easily recognizable as an email address. Commented May 25, 2020 at 12:22
  • Something like abc@xyzcom should be acceptable. A missing @ is going to be difficult to handle, as you pointed out. Similarly, httyis://www.facebook.com should be acceptable to url_regex. Commented May 25, 2020 at 12:25

1 Answer


Perhaps try adding some flags, such as re.IGNORECASE and re.DOTALL (for newlines):

# Match email ID (flags must be combined with |, since re.compile takes one flags argument):
import re
my_pattern = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]?\w{2,3}$", re.I | re.S)
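
As a quick check, here is a minimal sketch that runs this pattern against a few OCR-style strings like the ones in the question. The name lenient_email and the sample strings are assumptions for illustration only. Because the dot before the TLD is optional, inputs such as abc@xyzdd pass, while a word with no @ does not.

```python
import re

# Hypothetical name; same pattern as above, with the two flags OR-ed together
# because re.compile accepts a single flags argument.
lenient_email = re.compile(
    r"^[a-z0-9]+[._]?[a-z0-9]+@\w+[.]?\w{2,3}$",
    re.I | re.S,
)

def looks_like_email(text):
    """Return True if the whole (stripped) string resembles an email ID."""
    return bool(lenient_email.match(text.strip()))

# Illustrative inputs: two OCR-damaged addresses, one uppercase address,
# and one plain word that should be rejected.
for sample in ["abc@xyzdd", "abc@xyzcom", "ABC.DEF@XYZ.COM", "randomword"]:
    print(sample, looks_like_email(sample))
```

A pattern this loose will still accept some non-addresses that happen to contain an @, so it is probably best treated as a first-pass filter.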

Match URLs:

https://gist.github.com/gruber/8891611
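
For URLs whose scheme was misread by OCR (the comments mention httyis://www.facebook.com), one possible approach is to match the overall shape loosely and then compare the scheme to http and https with difflib from the standard library. This is not taken from the gist above. The helper name looks_like_url, the candidate pattern, and the 0.7 cutoff are all assumptions for illustration and would need tuning on real pytesseract output.

```python
import re
from difflib import get_close_matches

# Schemes we are willing to accept (assumption: only web URLs matter here).
SCHEMES = ["http", "https"]

# Loose shape: a short run of letters, "://", a host with at least one dot,
# and an optional path/query/fragment.
CANDIDATE = re.compile(
    r"^([a-z]{3,7})://([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?:[/?#]\S*)?$",
    re.I,
)

def looks_like_url(text, cutoff=0.7):
    """Return True if text has a URL-like shape and a scheme close to http(s)."""
    m = CANDIDATE.match(text.strip())
    if m is None:
        return False
    scheme = m.group(1).lower()
    # get_close_matches returns the entries of SCHEMES whose similarity
    # ratio to `scheme` is at least `cutoff`; an empty list means no match.
    return bool(get_close_matches(scheme, SCHEMES, n=1, cutoff=cutoff))

for sample in ["https://example.com", "httyis://www.facebook.com", "hello world"]:
    print(sample, looks_like_url(sample))
```

Lowering the cutoff accepts more badly garbled schemes but also lets more unrelated strings through, so the value is a trade-off rather than a fixed recommendation.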


6 Comments

can you help me with an example here, where adding a flag takes care of the case I've outlined in the body of the question?
Yes: the character sets currently match only lowercase letters (a-z), which won't find emails like [email protected], right?
So adding the re.I flag makes the regex more robust.
That's true. Thanks for the tip. Is there anything I can do to make the regex I've posted in the question for example more robust against edge cases like I've described in the question?
this was helpful. Can you edit your answer so that I can select it as the accepted answer? Also, if you could look at the updated question body and have a look at the other functions as well, I'd be grateful