1

I have data from a flat file (sent by a client, so I can't edit it) that contains some duplicate email addresses. Our software requires a unique email address, so it fails when it encounters a duplicate. Our developers are working to correct this, but in the meantime I want to set the duplicate emails to null. Here is an example:

Client ID |  Client Name    | Email address
 1234     |   Mike Smith    |  [email protected]
 5678     |   Mike's Motors |  [email protected]

In the above example, I want both rows to go into the DB, but with the email address set to null on one of them, not both.

1
  • To confirm: duplicate emails are bad, but multiple nulls are OK? Commented Aug 11, 2020 at 17:33

3 Answers

1

You can use the ROW_NUMBER function to identify the duplicates and set them to null.

Here is one way to do it:

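-- Sample rows standing in for the flat-file data
;WITH mycte
AS (
    SELECT 1234 AS ClientID
        ,'Mike Smith' AS ClientName
        ,'[email protected]' AS Emailaddress

    UNION ALL

    SELECT 5678
        ,'Mikes Motors'
        ,'[email protected]'
    )
SELECT ClientID
    ,ClientName
    -- Keep the email on the first row per address (lowest ClientID); null the rest.
    -- Ordering by ClientID makes the choice of surviving row deterministic.
    ,CASE
        WHEN ROW_NUMBER() OVER (PARTITION BY Emailaddress ORDER BY ClientID) > 1
            THEN NULL
        ELSE Emailaddress
        END AS Emailaddress
FROM mycte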
2 Comments

This assumes the data is loaded to a staging table.
@KeithL SSIS is tagged. There is no reason why this cannot be done! Even without SSIS, you can do this in one query without any staging tables.
0

There is no native component in an SSIS data flow that can accomplish this. The problem is that the data flow engine is an amazingly fast processor of data, but it generally only knows about the current row. Not the one before it, not the one after, just the current row (and it has many minions running at once that only know of their own row).

The Aggregate transformation and a cached Lookup might be hacked into doing this, but you will have to process the file twice. The priming data flow will be Source -> Aggregate -> Cache Transform. In the Aggregate, you group by the email address and take the min or max of the client ID. As I type that, a niggling part of my brain says there is a silly limit with the Aggregate and string fields. Maybe it's just that you can't min/max them but grouping is allowed. I am assuming that ClientID and email address are unique. If ClientId 123 has both [email protected] and [email protected], this approach will work but you'll need a better mechanism for determining data survivorship.

Once the priming data flow has run, you have a cache filled with the unique email addresses and, for each one, the client ID that should retain it.

In the existing data flow, we're going to ignore the email address from the source. You can either unmap it so it never enters the row buffers (preferable), or just remember that we want the email address from the lookup instead. Add a Lookup transformation between the source and destination. Configure it to use a Cache Connection Manager, and use the CCM we just created and filled in the priming step. Indicate that in the event of no match, the failure should be ignored. Map the Client ID in the data flow buffer to the client ID column in the CCM. Check the EmailAddress column from the CCM so it will be available in the data flow buffers. Assume we call it EmailAddress_LKP.

In your destination, map the EmailAddress column to the value generated from the lookup, EmailAddress_LKP.
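The two-pass idea above can be modeled outside SSIS to check the logic. The following is only a rough Python sketch, not SSIS code: the sample rows, the example.com addresses, and the function names `build_cache` and `apply_lookup` are all hypothetical, and it assumes each client ID retains at most one email address.

```python
# Hypothetical sample rows standing in for the flat file.
rows = [
    {"ClientID": 1234, "ClientName": "Mike Smith", "EmailAddress": "shared@example.com"},
    {"ClientID": 5678, "ClientName": "Mike's Motors", "EmailAddress": "shared@example.com"},
    {"ClientID": 9012, "ClientName": "Jane Doe", "EmailAddress": "jane@example.com"},
]


def build_cache(rows):
    """Priming pass (Aggregate -> cache): group by email, keep the MIN ClientID."""
    keeper = {}  # email -> ClientID that gets to keep the email
    for row in rows:
        email = row["EmailAddress"]
        if email is None:
            continue
        if email not in keeper or row["ClientID"] < keeper[email]:
            keeper[email] = row["ClientID"]
    # Re-key by ClientID because the lookup joins on ClientID.
    # Assumption: one surviving email per ClientID, otherwise entries collide.
    return {client_id: email for email, client_id in keeper.items()}


def apply_lookup(rows, cache):
    """Main pass (Lookup, no-match ignored): email comes from the cache or is None."""
    return [
        {
            "ClientID": r["ClientID"],
            "ClientName": r["ClientName"],
            "EmailAddress": cache.get(r["ClientID"]),  # plays the role of EmailAddress_LKP
        }
        for r in rows
    ]


result = apply_lookup(rows, build_cache(rows))
for r in result:
    print(r)
```

In this sketch the source email is never copied to the output; only the cached value is, which mirrors unmapping the source column and using the lookup column in the destination.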

The other approach would be to write an asynchronous Script Component (async is the only way to access more than the current buffer, at the price of memory and speed). There you would likely build a map of seen email addresses and, when an address has already been seen, set the IsNull property of the output buffer's email column to true.
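In SSIS a Script Component would be written in C# or VB.NET, so the following Python is only a sketch of the "seen addresses" logic, not something you can paste into a package. The function name `null_duplicate_emails`, the sample rows, and the case-insensitive comparison are assumptions made for illustration.

```python
def null_duplicate_emails(rows):
    """Yield rows in order, nulling any email that was already seen earlier."""
    seen = set()
    for row in rows:
        out = dict(row)  # copy so the input row is left untouched
        email = row.get("EmailAddress")
        if email is not None:
            # Assumption: addresses differing only by case or padding are duplicates.
            key = email.strip().lower()
            if key in seen:
                out["EmailAddress"] = None  # the equivalent of setting IsNull to true
            else:
                seen.add(key)
        yield out


# Hypothetical sample rows standing in for the flat file.
sample = [
    {"ClientID": 1234, "ClientName": "Mike Smith", "EmailAddress": "shared@example.com"},
    {"ClientID": 5678, "ClientName": "Mike's Motors", "EmailAddress": "Shared@Example.com"},
    {"ClientID": 9012, "ClientName": "No Email Ltd", "EmailAddress": None},
]

cleaned = list(null_duplicate_emails(sample))
for r in cleaned:
    print(r)
```

With this approach the first occurrence in file order keeps the address, so the result depends on the order of the rows in the source file.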

0

So, I found a "low-tech" solution. I used a Multicast followed by a Sort. In the Sort, I sorted by the email field and set it to remove duplicate records, and I unchecked all pass-through columns except the email field and the join key. I then re-joined that branch to the data flow with a left join, taking all fields except the email field from the left side of the join and only the email field from the right.
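To make the data movement concrete, here is a rough Python model of that Multicast, Sort, and left join. It is a sketch only: the sample rows and variable names are hypothetical, ClientID is assumed to be the join key, and the sketch keeps the first row in file order for each address, whereas the row SSIS keeps when removing duplicates may differ.

```python
# Hypothetical sample rows standing in for the flat file.
rows = [
    {"ClientID": 1234, "ClientName": "Mike Smith", "EmailAddress": "shared@example.com"},
    {"ClientID": 5678, "ClientName": "Mike's Motors", "EmailAddress": "shared@example.com"},
    {"ClientID": 9012, "ClientName": "Jane Doe", "EmailAddress": "jane@example.com"},
]

# Branch 1 (Multicast -> Sort): sort by email, drop duplicate emails,
# and pass through only the email and the join key.
with_email = [r for r in rows if r["EmailAddress"] is not None]
sorted_rows = sorted(with_email, key=lambda r: r["EmailAddress"])  # stable sort
first_owner = {}  # email -> ClientID of the row that survives de-duplication
for r in sorted_rows:
    first_owner.setdefault(r["EmailAddress"], r["ClientID"])
email_by_client = {client_id: email for email, client_id in first_owner.items()}

# Branch 2 (left join): every original row survives; the email comes only
# from the de-duplicated branch, so rows without a match get None.
joined = [
    {
        "ClientID": r["ClientID"],
        "ClientName": r["ClientName"],
        "EmailAddress": email_by_client.get(r["ClientID"]),
    }
    for r in rows
]

for r in joined:
    print(r)
```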
