EDIT: I have no access to a "replace" function, to any code, or to the REGEX matches. All I can do is provide a regex string to the API; it strips out whatever was matched (everything that is not part of an email) and leaves the rest (only the emails).


I am working with an API that reads data from an OCR document. I have no control over the API, but I do have access to a function in it that strips out whatever is matched by a provided REGEX. I am trying to strip out whatever is NOT an email address, leaving only the email addresses behind, separated by spaces if there is more than one. I know REGEX isn't the best tool for matching emails, but I have no other choice here.

Because of the OCR, there are often characters that should not be present in an email, e.g. the text could be (simple example) User Email:[email protected]*required field and I would like to end up with just [email protected] by stripping out the rest.

  1. I can't define or use regex replace or any other functions. All I can do is define a regex for what to strip off (basically I need to invert an email match).
  2. I certainly don't expect this to work for all RFC-compliant email addresses, just for most reasonable use cases.
  3. In case it matters, I happen to know the API is built in C#.
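
To make the constraint concrete, here is a minimal Python sketch of the behaviour described above. The function names `strip_matches` and `expected_result`, the address `jane@example.com`, and the sample text are all made up for illustration; the real API is a black box running on .NET, so this only models "delete whatever the regex matches".

```python
import re

# Simplified email pattern from the question (not RFC-complete).
EMAIL = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

def strip_matches(pattern: str, text: str) -> str:
    """Hypothetical stand-in for the API: delete every match of `pattern`."""
    return re.sub(pattern, "", text, flags=re.IGNORECASE)

def expected_result(text: str) -> str:
    """What should remain: the email addresses, separated by spaces."""
    return " ".join(m.group(0) for m in re.finditer(EMAIL, text, re.IGNORECASE))

ocr_text = "User Email:jane@example.com*required field"  # made-up sample

# Passing the email pattern itself strips the address: the opposite of the goal.
print(strip_matches(EMAIL, ocr_text))
# The output that an "inverted" pattern would have to leave behind.
print(expected_result(ocr_text))
```

The question is therefore which single pattern, handed to something like `strip_matches`, leaves behind what `expected_result` returns.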

Here is what I tried in order to invert the email match; it does not work and doesn't match anything.

^(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}(?!.)

I also searched SO and found this link but it was inconclusive.

  • The problem is that you will either get concatenated (glued to one another) email addresses or, if you use some separator, a lot of these separators. It is done by replacing ([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})|. with either $1 (or \1), or $1\n (\1\n). Commented Nov 3, 2021 at 17:45
  • Yes, email addresses separated by spaces are totally fine for this purpose, if there is more than one email address. The REGEX I posted just doesn't seem to match anything though. What do you think is wrong with it, please? Commented Nov 3, 2021 at 17:50
  • The main thing is that the regex is invalid. And you wanted to use a non-consuming pattern to replace text, which cannot work. To replace text, we MUST consume it. Lookarounds only check for the presence or absence of text without putting it into the match value. Commented Nov 3, 2021 at 17:52
  • @sln Thanks for your help, but sadly this stripped out all the data and I just ended up with an empty string. May need to contact the company behind the API. Commented Nov 3, 2021 at 20:41
  • I think I have found a way if it is .NET, but the pattern will look rather monstrous and not readable/maintainable. Commented Nov 3, 2021 at 21:24
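
The points made in the comments above can be checked with a short Python sketch. Python's `re` is used here only as a stand-in for the .NET engine, and the sample text and address are made up. It shows that the pattern as posted does not compile, that the balanced version matches only an empty string (so stripping it removes nothing), and that the capture-and-replace idea works but needs a replacement string, which the API in the question does not offer.

```python
import re

EMAIL = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"
text = "User Email:jane@example.com*required field"  # made-up sample

# The pattern exactly as posted leaves its first "(?!" group unclosed.
posted = r"^(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}(?!.)"
try:
    re.compile(posted)
    compiled_ok = True
except re.error:
    compiled_ok = False

# With the missing ")" added it is an anchor plus a lookahead: nothing is consumed.
balanced = posted + ")"
match = re.search(balanced, text, re.IGNORECASE)
stripped = re.sub(balanced, "", text, flags=re.IGNORECASE)

# Capture-and-replace from the first comment: keep group 1, drop any other char.
kept = re.sub(rf"({EMAIL})|.", r"\1", text, flags=re.IGNORECASE | re.DOTALL)

print(compiled_ok, match.span(), stripped == text, kept)
```

Because the only match is zero-width, removing it leaves the text unchanged, which fits the observation that the pattern "doesn't match anything".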

2 Answers

This works in C# and uses a variable-length lookbehind.

(?i)(?:(?<=([A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2,3}(?![A-Z])|[A-Z]{4})(?!\.[A-Z]{2})))|^)((?:(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})[\S\s])+)

RegexStormSample

I left a couple of capture groups in so the parts can be viewed in full.
I made a few tweaks in the lookbehind because it looks like, in C#, quantified ranges inside a lookbehind are treated with a non-greedy bias.
They have to be controlled with extra sub-assertions to make the lookbehind grab the whole domain.

 (?i)
 (?:
    (?<=
       (                             # (1 start)
          [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \.
          (?:
             [A-Z]{2,3} 
             (?! [A-Z] )
           | [A-Z]{4} 
          )
          (?! \. [A-Z]{2} )
       )                             # (1 end)
    )
  | ^
 )
 (                             # (2 start)
    (?:
       (?! [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \. [A-Z]{2,4} )
       [\S\s]
    )+
 )                             # (2 end)
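
Why the lookbehind is needed can be illustrated with a small Python sketch. Python's `re` module does not support variable-length lookbehind, so this does not reproduce the pattern above; it only runs the second half (the tempered "not the start of an email" loop) on its own, with a made-up sample text, to show what goes wrong without the `(?<=email)|^` anchor.

```python
import re

EMAIL = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Group 2 of the answer only: consume chars at which no email starts.
NOT_EMAIL_RUN = rf"(?:(?!{EMAIL})[\s\S])+"

text = "User Email:jane@example.com*required field"  # made-up sample

removed = re.findall(NOT_EMAIL_RUN, text, re.IGNORECASE)
left = re.sub(NOT_EMAIL_RUN, "", text, flags=re.IGNORECASE)

print(removed)  # the runs that would be stripped
print(left)     # what survives the stripping
```

Without the anchor, a match can begin in the middle of an address: no email starts at the `@`, so the loop resumes there and strips the domain, leaving only the local part. Requiring each run to start either at the beginning of the text or directly after a complete email is what the lookbehind alternative in the .NET pattern is for.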

2 Comments

That regex fails validation when not used in a C# context.
@AndyUK It was specifically written to target C#

You can also use a negative lookbehind pattern like

(?s)(?<![\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])|[\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})|(?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3})).

See the .NET regex demo.

Details:

  • (?s) - now, . matches line feed chars
  • (?<! - start of a negative lookbehind, the following patterns - if matched - will fail the match:
    • [\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])| - one or more word, ., %, + or - chars, @, one or more word, . or - chars, ., zero to three letters that are followed with a letter, or
    • [\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})| - one or more word, ., %, + or - chars, @, zero or more word, . or - chars followed with zero or more word, . or - chars, ., two or three letters, or
    • (?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3}) - a position immediately followed with zero or more word, ., %, + or - chars, @, zero or more word, . or - chars, ., two or three letters
  • ) - end of the negative lookbehind
  • . - any 1 char.
