Remove everything except email addresses from text using REGEX only

Question

EDIT: I have no access to a "replace" function, to any code, or to the REGEX matches. All I can do is provide a regex string to the API, which strips out whatever the regex matches (everything that is not part of an email) and leaves the rest, so that only the emails remain.

I am working with an API that reads data from an OCR document. I have no control over the API; however, I do have access to a function in it that strips out whatever a provided REGEX matches. I am trying to strip out everything that is NOT an email address, leaving only the email addresses behind, separated by spaces if there is more than one. I know REGEX isn't the best tool for matching emails, but I have no other choice here.

Because the text comes from OCR, it often contains characters that should not be present in an email, e.g. the text could be (simple example) User Email:[email protected]*required field and I would like to end up with just [email protected] by stripping out the rest.

I can't define or use regex replace or any other functions. All I can do is define a regex for what to strip off (basically I need to invert an email match).
I certainly don't expect this to work for all RFC-compliant email addresses, just reasonably most use-cases.
In case it matters, I happen to know that the API is written in C#.

Here is the (non-working) pattern I tried in order to invert the email match; it doesn't match anything.

^(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}(?!.)

I also searched SO and found this link but it was inconclusive.

The problem is that you will either get concatenated (glued to one another) email addresses or, if you use some separator, a lot of those separators. It is done with either ([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})|. => $1 (or \1), or $1\n (\1\n).
– Wiktor Stribiżew, Commented Nov 3, 2021 at 17:45
Yes, email addresses separated by spaces are totally fine for this purpose, if there is more than one email address. The REGEX I posted just doesn't seem to match anything, though. What do you think is wrong with it, please?
– Cogicero, Commented Nov 3, 2021 at 17:50
The main thing is that the regex is invalid. Also, you wanted to use a non-consuming pattern to replace text, which cannot work: to replace text, we MUST consume it. Lookarounds only check for the presence or absence of a pattern without putting the matched text into the match value.
– Wiktor Stribiżew, Commented Nov 3, 2021 at 17:52
@sln Thanks for your help, but sadly this stripped out all the data and I just ended up with an empty string. I may need to contact the company behind the API.
– Cogicero, Commented Nov 3, 2021 at 20:41
I think I have found a way if it is .NET, but the pattern will look rather monstrous and will not be readable or maintainable.
– Wiktor Stribiżew, Commented Nov 3, 2021 at 21:24

sln · Accepted Answer · 2021-11-03 22:17:40Z

2

This works in C#; it uses a variable-length lookbehind.

(?i)(?:(?<=([A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2,3}(?![A-Z])|[A-Z]{4})(?!\.[A-Z]{2})))|^)((?:(?![A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})[\S\s])+)

RegexStormSample

I left a couple of capture groups in so the parts can be viewed separately.
I made a few tweaks in the lookbehind because it looks like, in C#, quantified ranges inside a lookbehind are treated with a non-greedy bias.
They have to be controlled with extra sub-assertions to make the lookbehind grab the whole domain, including the subdomains.

 (?i)
 (?:
    (?<=
       (                             # (1 start)
          [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \.
          (?:
             [A-Z]{2,3} 
             (?! [A-Z] )
           | [A-Z]{4} 
          )
          (?! \. [A-Z]{2} )
       )                             # (1 end)
    )
  | ^
 )
 (                             # (2 start)
    (?:
       (?! [A-Z0-9._%+-]+ @ [A-Z0-9.-]+ \. [A-Z]{2,4} )
       [\S\s]
    )+
 )                             # (2 end)
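
To see what stripping the matches of this pattern leaves behind without a .NET host, here is a rough TypeScript sketch. It is not the API from the question: `stripMatches` is a hypothetical stand-in that assumes the API simply deletes every match, `user@example.com` is a made-up address, and the inline `(?i)` is replaced by the `i` flag because JavaScript regular expressions do not accept that inline form. JavaScript engines with lookbehind support (ES2018 and later) allow variable-length lookbehind, but they are not guaranteed to agree with .NET on every input.

```typescript
// Hypothetical stand-in for the API's "strip whatever the regex matches" step.
function stripMatches(text: string, pattern: RegExp): string {
  return text.replace(pattern, "");
}

// Shared prefix of the email sub-pattern: local part, "@", domain, final dot.
const EMAIL_PREFIX = "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.";

// Same pattern as above, assembled from strings; "i" stands in for (?i),
// "g" makes replace() remove every match rather than only the first.
const nonEmailRuns = new RegExp(
  "(?:(?<=(" + EMAIL_PREFIX + "(?:[A-Z]{2,3}(?![A-Z])|[A-Z]{4})(?!\\.[A-Z]{2})))|^)" +
  "((?:(?!" + EMAIL_PREFIX + "[A-Z]{2,4})[\\S\\s])+)",
  "gi"
);

const cleanedRuns = stripMatches(
  "User Email:user@example.com*required field",
  nonEmailRuns
);
// cleanedRuns === "user@example.com"
```

The first match is the run starting at `^` and stopping where an address begins; the second starts right after a complete address (checked by the lookbehind) and runs to the end of the text.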

edited Nov 3, 2021 at 22:17

answered Nov 3, 2021 at 21:54

sln


2 Comments

AndyUK Over a year ago

That regex fails validation when not used in a C# context.

Cogicero Over a year ago

@AndyUK It was specifically written to target C#

Wiktor Stribiżew · Accepted Answer · 2021-11-03 22:15:46Z

You can also use a negative lookbehind pattern like

(?s)(?<![\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])|[\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})|(?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3})).

See the .NET regex demo.

Details:

(?s) - now, . matches line feed chars
(?<! - start of a negative lookbehind, the following patterns - if matched - will fail the match:
- [\w.%+-]+@[\w.-]+\.[A-Za-z]{0,3}(?=[A-Za-z])| - one or more word, ., %, + or - chars, @, one or more word, . or - chars, ., zero to three letters that are followed with a letter, or
- [\w.%+-]+@[\w.-]*(?=[\w.-]*\.[A-Za-z]{2,3})| - one or more word, ., %, + or - chars, @, zero or more word, . or - chars followed with zero or more word, . or - chars, ., two or three letters, or
- (?=[\w.%+-]*@[\w.-]*\.[A-Za-z]{2,3}) - a position immediately followed with zero or more word, ., %, + or - chars, @, zero or more word, . or - chars, ., two or three letters
) - end of the negative lookbehind
. - any 1 char.
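
The breakdown above can be tried out with a small TypeScript sketch. This is only an approximation of the setup in the question: `stripEveryMatch` is a hypothetical stand-in that assumes the API deletes every match, the sample addresses are made up, and the inline `(?s)` becomes the `s` flag because JavaScript does not accept that inline form. The three lookbehind alternatives are kept as separate, illustratively named strings so they line up with the three bullets.

```typescript
// Hypothetical stand-in for an API that deletes every regex match from the text.
function stripEveryMatch(text: string, pattern: RegExp): string {
  return text.replace(pattern, "");
}

// Alternative 1: the current char is a letter of the top-level domain.
const insideTld = "[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{0,3}(?=[A-Za-z])";
// Alternative 2: the current char is in the domain, up to the final dot.
const insideDomain = "[\\w.%+-]+@[\\w.-]*(?=[\\w.-]*\\.[A-Za-z]{2,3})";
// Alternative 3: the current char is in the local part or is the "@".
const insideLocalPart = "(?=[\\w.%+-]*@[\\w.-]*\\.[A-Za-z]{2,3})";

// "s" replaces the inline (?s); "g" makes replace() test every character.
const nonEmailChar = new RegExp(
  "(?<!" + insideTld + "|" + insideDomain + "|" + insideLocalPart + ").",
  "gs"
);

const single = stripEveryMatch(
  "User Email:user@example.com*required field",
  nonEmailChar
);
// single === "user@example.com"

const pair = stripEveryMatch("a@b.com c@d.org", nonEmailChar);
// pair === "a@b.comc@d.org" (the separating space is stripped as well)
```

Because every character outside an address is removed, whitespace included, several addresses come out concatenated, which is the caveat raised in the comments on the question.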
