3

I want to extract e-mail addresses from the text in a column. I have tried the SQL below, but with no luck. See SqlFiddle. Removing ^ and $ from the regexp does not work either.

WITH TEST_DATA AS (
  SELECT '[email protected]' AS EMAIL FROM DUAL UNION ALL
  SELECT 'mail [email protected]' FROM DUAL UNION ALL
  SELECT 'mail [email protected] sent' FROM DUAL UNION ALL
  SELECT '[email protected] sent count 23' FROM DUAL UNION ALL
  SELECT 'mail already sent to [email protected] and [email protected]' FROM DUAL UNION ALL
  SELECT '[email protected] sent count 23' FROM DUAL
)
SELECT REGEXP_SUBSTR(EMAIL,'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$') MAIL
  FROM TEST_DATA;

Expected output for this dataset:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected], [email protected]
[email protected]

Any help appreciated.

3
  • You definitely want to remove the anchors for start of text and end of text (^ and $), because with them you'll completely miss emails like [email protected]. Also, I'm guessing here, but I think your regex will not pick up emails like [email protected]. Email recognition with regex is incredibly hard and prone to errors. I wish you luck. Commented Jan 22, 2014 at 15:06
  • Thanks @MauriceReeves. It retrieves both your samples. Commented Jan 22, 2014 at 15:09
  • Don't forget that [email protected] is also a valid email. Commented Jan 22, 2014 at 15:33

2 Answers

5

If you want to extract multiple mail ids into a single column, you can use the REGEXP_REPLACE function.

Assuming all the ids in your data are valid ones,

REGEXP_REPLACE (EMAIL, '(\w+@\w+\.\w+ ?)|(.)', '\1')

This removes all text other than the mail ids, provided the ids are separated from each other by at least a space.

You can then trim any trailing space and replace the remaining spaces with a comma to separate multiple ids.

REPLACE (TRIM (REGEXP_REPLACE (EMAIL, '(\w+@\w+\.\w+ ?)|(.)', '\1')),
            ' ',
            ', ')

Example:

WITH TEST_DATA
     AS (SELECT '[email protected]' AS EMAIL FROM DUAL
         UNION ALL
         SELECT 'mail [email protected]' FROM DUAL
         UNION ALL
         SELECT 'mail [email protected] sent to [email protected] and [email protected]' FROM DUAL
         UNION ALL
         SELECT '[email protected] sent count 23 and [email protected]' FROM DUAL
         UNION ALL
         SELECT 'mail already sent to [email protected] and [email protected]' FROM DUAL
         UNION ALL
         SELECT '[email protected] sent count 23' FROM DUAL)
SELECT REPLACE (TRIM (REGEXP_REPLACE (EMAIL, '(\w+@\w+\.\w+ ?)|(.)', '\1')),
                ' ',
                ', ')
          MAIL
  FROM TEST_DATA;

MAIL
-----------------------------
[email protected]
[email protected]
[email protected], [email protected], [email protected]
[email protected], [email protected]
[email protected], [email protected]
[email protected]
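
One caveat with the pattern above: \w matches only letters, digits, and underscores, so an address with dots or hyphens in the local part, or with a multi-label domain, would be cut short. A possible variation is sketched below. It swaps in the character classes from the question's own regex and runs on invented example.com addresses; both choices are assumptions and not part of the original answer.

```sql
-- Hypothetical sample rows; the addresses are invented for illustration.
WITH TEST_DATA AS (
  SELECT 'mail first.last@mail.example.com sent' AS EMAIL FROM DUAL UNION ALL
  SELECT 'sent to alice@example.com and bob-smith@example.org' FROM DUAL
)
SELECT REPLACE(
         TRIM(
           -- Alternative 1 keeps an address plus one optional trailing space.
           -- Alternative 2 matches any other single character, which is
           -- dropped because \1 is empty for that alternative.
           REGEXP_REPLACE(EMAIL,
                          '([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4} ?)|(.)',
                          '\1')),
         ' ',
         ', ') AS MAIL
  FROM TEST_DATA;
```

The same limitation as before applies: addresses that are not separated by a space would run together, and the {2,4} bound truncates longer top-level domains.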

3

You are close! Try this:

SELECT REGEXP_SUBSTR(EMAIL,'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}') MAIL

Edit:

Maybe this helps:

WITH TEST_DATA AS (
  SELECT '[email protected]' AS EMAIL FROM DUAL UNION ALL 
  SELECT 'mail [email protected]' FROM DUAL UNION ALL           
  SELECT 'mail [email protected] sent' FROM DUAL UNION ALL                
  SELECT '[email protected] sent count 23' FROM DUAL UNION ALL          
  SELECT 'mail already sent to [email protected] and [email protected]' FROM DUAL UNION ALL 
  SELECT '[email protected] sent count 23' FROM DUAL             
)SELECT REGEXP_SUBSTR(EMAIL,'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}') MAIL,
        REGEXP_SUBSTR(EMAIL,'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}',1,2) MAIL2
 FROM TEST_DATA
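
For readers unfamiliar with the extra arguments: in Oracle's REGEXP_SUBSTR the third argument is the starting position and the fourth is the occurrence to return, so 1,2 asks for the second match when scanning from the first character. A small sketch, using invented example.com addresses rather than the question's data:

```sql
-- Hypothetical input; the addresses are made up for illustration.
WITH T AS (
  SELECT 'sent to alice@example.com and bob@example.org' AS EMAIL FROM DUAL
)
SELECT REGEXP_SUBSTR(EMAIL, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', 1, 1) AS FIRST_MAIL,   -- first match
       REGEXP_SUBSTR(EMAIL, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', 1, 2) AS SECOND_MAIL,  -- second match
       REGEXP_SUBSTR(EMAIL, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}', 1, 3) AS THIRD_MAIL    -- NULL, there is no third match
  FROM T;
```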

I don't see a way to report an arbitrary number ('n') of matches, and I don't see how to insert a comma and output everything in one column. I would bet that, if it is possible, the query will become quite complex, with multiple inner selects/finds/replaces occurring. A better solution may be to return the original result to another language for parsing, or to perform the parsing in PL/SQL.

Another edit:

Here is what I meant regarding the inner selects. It is an exact solution to the question as asked :-)

select CASE WHEN MAIL2 is not null THEN mail||', '||mail2 ELSE mail END as mail
from (
    WITH TEST_DATA AS (
      SELECT '[email protected]' AS EMAIL FROM DUAL UNION ALL 
      SELECT 'mail [email protected]' FROM DUAL UNION ALL           
      SELECT 'mail [email protected] sent' FROM DUAL UNION ALL                
      SELECT '[email protected] sent count 23' FROM DUAL UNION ALL          
      SELECT 'mail already sent to [email protected] and [email protected]' FROM DUAL UNION ALL 
      SELECT '[email protected] sent count 23' FROM DUAL             
    )SELECT REGEXP_SUBSTR(EMAIL,'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}') MAIL,
            REGEXP_SUBSTR(EMAIL,'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}',1,2) MAIL2
     FROM TEST_DATA
)
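
The nested query above covers at most two addresses per row. As a possible generalization, here is a sketch that gathers every match into one comma-separated column. It assumes Oracle 11g Release 2 or later (for LISTAGG and REGEXP_COUNT), reuses the regex from this answer, and runs on invented example.com addresses rather than the question's data.

```sql
-- Hypothetical sample rows; the addresses are made up for illustration.
WITH TEST_DATA AS (
  SELECT 'alice@example.com sent count 23' AS EMAIL FROM DUAL UNION ALL
  SELECT 'sent to alice@example.com, bob@example.org and carol@example.net' FROM DUAL UNION ALL
  SELECT 'no address in this row' FROM DUAL
)
SELECT (SELECT LISTAGG(
                 -- LEVEL selects the 1st, 2nd, ... occurrence in turn.
                 REGEXP_SUBSTR(T.EMAIL,
                               '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}',
                               1, LEVEL),
                 ', ') WITHIN GROUP (ORDER BY LEVEL)
          FROM DUAL
        -- Generate one row per match found in the current outer row.
        CONNECT BY LEVEL <= REGEXP_COUNT(T.EMAIL,
                               '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}')
       ) AS MAIL
  FROM TEST_DATA T;
```

A row without any address yields NULL: CONNECT BY always returns its root row, and REGEXP_SUBSTR returns NULL when the requested occurrence does not exist.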

I also stumbled upon this Oracle article, which discusses e-mail matching in point 8. It might be worth a peek. http://www.orafaq.com/node/2404

6 Comments

It's not generating the expected output. All rows return [email protected]. Is it possible to get [email protected] too?
Not an elegant solution, @Michael Ford, but it's very hard to get all matches using just SQL, so this is the best we can do :) Thank you.
Just be careful, because this regex will also match [email protected] and [email protected]. It's a really good regex, but you may still run into some junk data. The only thing I'd probably add is a \b at each end to specify word boundaries, because it will also match something like [email protected] but will only grab up to .comp, which isn't the intent of your extraction.
@MauriceReeves Does Oracle allow word boundary tokens?
@hkutluay This regex: [A-Z0-9._%+-]+@([A-Z0-9-]+\.)+[A-Z]{2,6} should avoid the consecutive dot issue mentioned above