8

I'm trying to extract email addresses from an existing comments field and put them into their own column. The string may be something like "this is an example comment with an email address of [email protected]" or just the email itself, "[email protected]".

I figure the best approach is to find the index of the '@' symbol and search in both directions until I hit either a space or the end of the string. Can anyone help me out with this implementation?

1
  • 1
    I would use PATINDEX to find the start position of the email address. Search online for email address patterns: they range from the simplest to the most complex, some of which SQL Server may not even recognise. I would then use CHARINDEX to locate the next space, or fall back to the end of the string when CHARINDEX returns 0. Commented Apr 13, 2015 at 2:00

8 Answers

11

I know wewesthemenace already answered the question, but his/her solution seems overcomplicated. Why concatenate the left and right sides of the email address together? I'd rather just find the beginning and the end of the email address and then use SUBSTRING to return it, like so:

My Table

DECLARE @Table TABLE (comment NVARCHAR(50));
INSERT INTO @Table
VALUES ('blah [email protected]'),            --At the end
        ('blah [email protected] blah blah'), --In the middle
        ('[email protected] blah'),           --At the beginning
        ('no email');

Actual Query:

SELECT  comment,        
        CASE
            WHEN CHARINDEX('@',comment) = 0 THEN NULL
            ELSE SUBSTRING(comment,beginningOfEmail,endOfEmail-beginningOfEmail)
        END email
FROM @Table
CROSS APPLY (SELECT CHARINDEX(' ',comment + ' ',CHARINDEX('@',comment))) AS A(endOfEmail)
CROSS APPLY (SELECT DATALENGTH(comment)/2 - CHARINDEX(' ',REVERSE(' ' + comment),CHARINDEX('@',REVERSE(' ' + comment))) + 2) AS B(beginningOfEmail)

Results:

comment                                            email
-------------------------------------------------- --------------------------------------------------
blah [email protected]                     [email protected]
blah [email protected] blah blah           [email protected]
[email protected] blah                     [email protected]
no email                                           NULL

4 Comments

This seems to throw an "Invalid length parameter passed to the left or substring function" exception when I use it.
Datatype in my table is nvarchar so I changed DATALENGTH to LEN and good to go. Thanks.
LEN() ignores trailing white space. DATALENGTH() doesn't ignore white space, but it works a little differently: it returns the number of bytes. For VARCHAR (non-Unicode), the number of bytes equals the length of the string. For NVARCHAR (Unicode), you need to divide DATALENGTH() by 2.
Worked well for me, and a lot less verbose than the accepted solution, which also generated incorrect results for my data. Thanks too for pointing out that LEN doesn't count white space at the end of a string, I never knew that!
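
The LEN() versus DATALENGTH() behaviour discussed in the comments above can be checked with a small sketch. The variable names and sample value are made up for illustration, and the byte counts assume plain ASCII text in a single-byte VARCHAR collation:

```sql
-- Both literals end with three trailing spaces.
DECLARE @v VARCHAR(20)  = 'abc   ';
DECLARE @n NVARCHAR(20) = N'abc   ';

SELECT LEN(@v)            AS len_varchar,     -- 3: trailing spaces are not counted
       DATALENGTH(@v)     AS bytes_varchar,   -- 6: one byte per character here
       LEN(@n)            AS len_nvarchar,    -- 3
       DATALENGTH(@n)     AS bytes_nvarchar,  -- 12: two bytes per character here
       DATALENGTH(@n) / 2 AS chars_nvarchar;  -- 6
```

This is why the query divides DATALENGTH(comment) by 2 for an NVARCHAR column, and why swapping in LEN() only gives the same answer when the value has no trailing spaces.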
7

You can search for '@' in the string, then take the parts of the string to the LEFT and RIGHT of it. REVERSE the LEFT part, find the first occurrence of ' ', and take the SUBSTRING up to that point. Then REVERSE it again to restore the original order. The same principle applies to the RIGHT part, without the REVERSE.

Example string: 'some text someemail@domain.org some text'

  1. LEFT = 'some text someemail'
  2. RIGHT = '@domain.org some text'
  3. Reverse LEFT = 'liameemos txet emos'
  4. SUBSTRING of (3) up to the first space = 'liameemos'
  5. REVERSE (4) = 'someemail'
  6. SUBSTRING of (2) up to the first space = '@domain.org'
  7. Combine (5) and (6) = 'someemail@domain.org'

Your query would be:

;WITH CteEmail(email) AS(
    SELECT '[email protected]' UNION ALL
    SELECT 'some text [email protected] some text' UNION ALL
    SELECT 'no email'
)
,CteStrings AS(
    SELECT
        [Left] = LEFT(email, CHARINDEX('@', email, 0) - 1),
        Reverse_Left = REVERSE(LEFT(email, CHARINDEX('@', email, 0) - 1)),
        [Right] = RIGHT(email, CHARINDEX('@', email, 0) + 1)
    FROM CteEmail
    WHERE email LIKE '%@%'
)
SELECT *,
    REVERSE(
        SUBSTRING(Reverse_Left, 0, 
            CASE
                WHEN CHARINDEX(' ', Reverse_Left, 0) = 0 THEN LEN(Reverse_Left) + 1
                ELSE CHARINDEX(' ', Reverse_Left, 0)
            END
        )
    )
    +
    SUBSTRING([Right], 0,
        CASE
            WHEN CHARINDEX(' ', [Right], 0) = 0 THEN LEN([Right]) + 1
            ELSE CHARINDEX(' ', [Right], 0)
        END
    )
FROM CteStrings

Sample Data:

email
----------------------------------------
[email protected]
some text [email protected] some text
no email

Result

---------------------
[email protected]
[email protected]

2 Comments

Be sure to read up on the SUBSTRING, LEFT and RIGHT functions.
The [Right] column has an issue. It should include LEN: [Right] = RIGHT(email, LEN(email) - CHARINDEX('@', email, 0) + 1)
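
To see why the comment above asks for LEN in the [Right] expression, here is a minimal sketch comparing the two forms. The sample string and the address user@example.org are hypothetical, chosen so that the '@' is not in the middle of the string:

```sql
-- '@' is at position 15 and the string is 36 characters long.
DECLARE @s VARCHAR(100) = 'some text user@example.org some text';

SELECT RIGHT(@s, CHARINDEX('@', @s) + 1)           AS original_right,   -- 'le.org some text'
       RIGHT(@s, LEN(@s) - CHARINDEX('@', @s) + 1) AS corrected_right;  -- '@example.org some text'
```

RIGHT takes a character count from the end of the string, so the count has to be the string length minus the '@' position, plus one to keep the '@' itself.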
6

Stephan's answer is great when looking for a single email address in each row.

However, I was running into this error when trying to get multiple email addresses in each row:

Invalid length parameter passed to the LEFT or SUBSTRING function

I used this answer from DBA Stack Exchange to get the positions of every @ inside the string. It uses a table-valued function that returns one position for each occurrence of a given pattern inside the string. I also had to modify the CROSS APPLY clauses to handle multiple email addresses.

My Table:

DECLARE @Table TABLE (comment VARCHAR(500));
INSERT INTO @Table (comment)
VALUES ('blah blah [email protected] more blah [email protected] even more blah [email protected]'),
       ('blah [email protected] more'),
       ('no email')

Table-valued Function:

CREATE FUNCTION dbo.fnFindPatternLocation
(
    @string NVARCHAR(MAX),
    @term   NVARCHAR(255)
)
RETURNS TABLE
AS
    RETURN 
    (
        SELECT pos = Number - LEN(@term) 
        FROM (SELECT Number, Item = LTRIM(RTRIM(SUBSTRING(@string, Number, 
        CHARINDEX(@term, @string + @term, Number) - Number)))
        FROM (SELECT ROW_NUMBER() OVER (ORDER BY [object_id])
        FROM sys.all_objects) AS n(Number)
        WHERE Number > 1 AND Number <= CONVERT(INT, LEN(@string))
        AND SUBSTRING(@term + @string, Number, LEN(@term)) = @term
    ) AS y);
GO

Query:

SELECT comment, pos, SUBSTRING(comment,beginningOfEmail,endOfEmail-beginningOfEmail) AS email
FROM @Table
CROSS APPLY (SELECT pos FROM dbo.fnFindPatternLocation(comment, '@')) AS A(pos)
CROSS APPLY (SELECT CHARINDEX(' ',comment + ' ', pos)) AS B(endOfEmail)
-- Prepending a space lets a comment that starts with the address resolve to position 1.
CROSS APPLY (SELECT pos - CHARINDEX(' ', REVERSE(' ' + SUBSTRING(comment, 1, pos))) + 2) AS C(beginningOfEmail)

Results:

comment
---------------------------------------------------------------------------------------------------------
blah blah [email protected] more blah [email protected] even more blah [email protected]
blah blah [email protected] more blah [email protected] even more blah [email protected]
blah blah [email protected] more blah [email protected] even more blah [email protected]
blah [email protected] more

pos    email
---    ------------------------------
26     [email protected]
64     [email protected]
95     [email protected]
17     [email protected]

1 Comment

Very helpful. I added a few PATINDEX clauses to avoid things like "make sure to @mesomething" or other inaccuracies.
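
The comment above does not show its PATINDEX clauses, so the following is only a guess at the kind of filter it describes. The table variable, its column, and the sample tokens are hypothetical:

```sql
DECLARE @Candidates TABLE (token VARCHAR(100));
INSERT INTO @Candidates (token)
VALUES ('user@example.org'), ('@mesomething'), ('name@'), ('a@b');

SELECT token
FROM @Candidates
WHERE PATINDEX('%_@_%._%', token) > 0  -- a character before '@', then text, a dot, more text
  AND PATINDEX('%@%@%', token) = 0;    -- reject tokens containing a second '@'
-- Only 'user@example.org' passes both checks.
```

These patterns are a coarse sanity check rather than real email validation; they just weed out mentions and fragments that happen to contain an '@'.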
3
DECLARE @t TABLE (row_id INT, email VARCHAR(100))

INSERT @t (row_id, email)
VALUES (1, 'drgkls<[email protected]>, [email protected], @ dgh507-16-65@'),
        (2, '[email protected] [email protected] [email protected] u3483dhj@[email protected]'),
        (3, '[email protected] лдоврывплдоо isgfsi@ klsdfksdl@,dd.com')

DECLARE @pat VARCHAR(100) = '%[^a-z0-9@._ ]%';

WITH f AS (
         SELECT    row_id,
                 CAST(' ' + email + ' ' AS VARCHAR(102)) email,
                 SUBSTRING(email, PATINDEX(@pat, email), 1) bad,
                 PATINDEX(@pat, email) pat
         FROM    @t
         UNION ALL
         SELECT    row_id,
                 CAST(REPLACE(email, bad, ' ') AS VARCHAR(102)),
                 SUBSTRING(REPLACE(email, bad, ' '), PATINDEX(@pat, REPLACE(email, bad, ' ')), 1) bad,
                 PATINDEX(@pat, REPLACE(email, bad, ' '))
         FROM    f
         WHERE    PATINDEX(@pat, email) > 0
     ),
     s AS 
     (
         SELECT    row_id,
                 email, PATINDEX('%@%', email) pos 
         FROM    f 
         WHERE    pat = 0
                 AND    PATINDEX('%@%', email) > 0
         UNION ALL
         SELECT    row_id,
                 SUBSTRING(email, pos + 1, 102), 
                 PATINDEX('%@%', SUBSTRING(email, pos + 1, 102))
         FROM    s
         WHERE    PATINDEX('%@%', SUBSTRING(email, pos + 1, 102)) > 0
     )

SELECT  row_id, o1 + pp
FROM    s   
        CROSS APPLY (SELECT    REVERSE(LEFT(email, pos -1)) s1) x
        CROSS APPLY (SELECT    CHARINDEX(' ', s1) i1) y
        CROSS APPLY (SELECT    REVERSE(LEFT(s1, i1 -1)) o1 WHERE i1 > 0) z
        CROSS APPLY (SELECT    CHARINDEX(' ', email, pos) i2) e
        CROSS APPLY (SELECT    SUBSTRING(email, pos, i2 -pos) pp WHERE    i2 > pos + 1) q
WHERE    LEN(o1) > 1
        AND CHARINDEX('.', pp) > 0
        AND PATINDEX('%@%@%', pp) = 0
        AND PATINDEX('%@.%', pp) = 0
        AND PATINDEX('%.', pp) = 0

1 Comment

WHERE LEN(o1) > 1 is invalid (last WHERE); should be WHERE LEN(o1) > 0; otherwise email addresses with 1 character before @ sign will be omitted.
1

If you need it in a function then this works for me...

CREATE FUNCTION [dbo].[extractEmail]
(
    @input nvarchar(500)
)
RETURNS nvarchar(100)
AS
BEGIN
    DECLARE @atPosition int
    DECLARE @firstRelevantSpace int
    DECLARE @name nvarchar(100)
    DECLARE @secondRelelvantSpace int
    DECLARE @everythingAfterAt nvarchar(500)
    DECLARE @domain nvarchar(100)
    DECLARE @email nvarchar(100) = ''
    IF CHARINDEX('@', @input,0) > 0
    BEGIN
        SET @input = ' ' + @input
        SET @atPosition = CHARINDEX('@', @input, 0)
        SET @firstRelevantSpace = CHARINDEX(' ',REVERSE(LEFT(@input, CHARINDEX('@', @input, 0) - 1)))
        SET @name = REVERSE(LEFT(REVERSE(LEFT(@input, @atPosition - 1)),@firstRelevantSpace-1))
        SET @everythingAfterAt = SUBSTRING(@input, @atPosition,len(@input)-@atPosition+1)
        SET @secondRelelvantSpace = CHARINDEX(' ',@everythingAfterAt)
        IF @secondRelelvantSpace = 0
            SET @domain = @everythingAfterAt
        ELSE
            SET @domain = LEFT(@everythingAfterAt, @secondRelelvantSpace - 1) -- stop just before the space
        SET @email = @name + @domain
    END
    RETURN @email
END


0

This one line would also work (a bit long line though lol):

--declare @a varchar(100) 
--set @a = 'a asfd saasd [email protected] wqe z zx cxzc '
select substring(substring(@a,0,charindex('@',@a)),len(substring(@a,0,charindex('@',@a)))-charindex(' ',reverse(substring(@a,0,charindex('@',@a))))+2,len(substring(@a,0,charindex('@',@a)))) + substring(substring(@a,charindex('@',@a),len(@a)),0,charindex(' ',substring(@a,charindex('@',@a),len(@a))))


0

For strings that contain new line characters I modified Felix's answer using PATINDEX to search for the first control character rather than white space.

I also had to modify the Right field to subtract the correct amount of text.

    WITH CteEmail(email) AS(
        SELECT 'example string with new lines

    Email: [email protected]
(first email address - should be returned)

    Email: [email protected]
(other email addresses should be ignored

more example text' UNION ALL
        SELECT 'Email: [email protected]' UNION ALL
        SELECT '[email protected]' UNION ALL
        SELECT 'some text [email protected] some text' UNION ALL
        SELECT 'no email'
    )
    ,CteStrings AS(
        SELECT
            [Left] = LEFT(email, CHARINDEX('@', email, 0) - 1),
            Reverse_Left = REVERSE(LEFT(email, CHARINDEX('@', email, 0) - 1)),
            [Right] = RIGHT(email, LEN(email) - CHARINDEX('@', email, 0) + 1 )
        FROM CteEmail
        WHERE email LIKE '%@%'
    )
    SELECT *,
        REVERSE(
            SUBSTRING(Reverse_Left, 0, 
                CASE
                    WHEN PATINDEX('%[' + CHAR(0)+'- ]%', Reverse_Left) = 0 THEN LEN(Reverse_Left) + 1
                    ELSE PATINDEX('%[' + CHAR(0)+'- ]%', Reverse_Left)
                END
            )
        )
        +
        SUBSTRING([Right], 0,
            CASE
                WHEN PATINDEX('%[' + CHAR(0)+'- ]%', [Right]) = 0 THEN LEN([Right]) + 1
                ELSE PATINDEX('%[' + CHAR(0)+'- ]%', [Right])
            END
        )
    FROM CteStrings


0

Using Cymorg's function: I ran into an issue where my data included CR/LF, which prevented the function from working 100%. It was tough to figure out because, when using the function in a SELECT statement, it would occasionally return incorrect results. If I copied the offending text from my query results and invoked the function using PRINT with the text in quotes, it would work fine. Inconceivable!

After much trial and error, I used REPLACE to replace the CR/LF with spaces and huzzah! I am an excellent guesser.

select dbo.extractEmail(replace(replace(MyColumn,CHAR(10),' '),CHAR(13),' ')) as AsYouWish from FacilityContacts
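
A likely reason the REPLACE helps is that the function only looks for literal spaces, and a line break is not a space. This sketch, with a made-up sample value, shows the difference without needing the function itself:

```sql
-- Hypothetical sample: an address on its own line, delimited only by CR/LF.
DECLARE @raw NVARCHAR(200) =
    N'Contact:' + CHAR(13) + CHAR(10) + N'user@example.org' + CHAR(13) + CHAR(10) + N'Thanks';
DECLARE @clean NVARCHAR(200) =
    REPLACE(REPLACE(@raw, CHAR(13), N' '), CHAR(10), N' ');

SELECT CHARINDEX(N' ', @raw)   AS first_space_raw,    -- 0: no literal space to stop at
       CHARINDEX(N' ', @clean) AS first_space_clean;  -- 9: the CR became a space
```

Text pasted back in from the results grid may have had its line breaks turned into spaces, which would explain why the same value behaved when tested by hand.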
