SSIS - Set duplicate columns to null

Question

I have data from a flat file (client sent to me, can't edit), that has some duplicate email addresses that I would like to set to null. Our software requires a unique email address, so when it encounters a duplicate, it fails. Our developers are working to correct this, but in the meantime, I want to set the duplicate emails to null. Here is an example:

Client ID |  Client Name    | Email address
 1234     |   Mike Smith    |  [email protected]
 5678     |   Mike's Motors |  [email protected]

In the example above, I want both rows to go into the DB, but with the email address set to null on one of them (not both).

To confirm: duplicate emails are bad, but multiple nulls are OK?
– KeithL, commented Aug 11, 2020 at 17:33

Harry (edited by Eric Brandt) · Accepted Answer · 2020-08-11 13:17:30Z

1

You can use the ROW_NUMBER() function to identify the duplicates and set them to NULL.

Here is one way to do it:

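-- Sample rows standing in for the source data.
;WITH mycte
AS (
    SELECT 1234 AS ClientID
        ,'Mike Smith' AS ClientName
        ,'[email protected]' AS Emailaddress
    
    UNION ALL
    
    SELECT 5678
        ,'Mike''s Motors'
        ,'[email protected]'
    )
SELECT ClientID
    ,ClientName
    -- Number the rows within each email address; ordering by ClientID makes
    -- the surviving row deterministic (the lowest ClientID keeps its email).
    ,CASE 
        WHEN ROW_NUMBER() OVER (PARTITION BY Emailaddress ORDER BY ClientID) > 1
            THEN NULL
        ELSE Emailaddress
        END AS Emailaddress
FROM mycte;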

answered Aug 10, 2020 at 21:02 by Harry; edited Aug 11, 2020 at 13:17 by Eric Brandt


2 Comments

KeithL Over a year ago

this assumes data is loaded to a staging table

Harry Over a year ago

@KeithL SSIS is tagged, so there is no reason why this cannot be done! Even without SSIS, you can do this in one query without any staging tables.

billinkc · Accepted Answer · 2020-08-12 16:36:51Z

There is no native component in an SSIS data flow that can accomplish this. The problem being, the data flow engine is an amazingly fast processor of data but it generally only knows about this row. Not the one before it, not the row after - just current row (and it has many minions running at once that only know of their row).

The Aggregate operator and Cached lookup might be able to be hacked to do this but you're going to have to double process the file. The priming data flow will be source -> Aggregate component -> Cache Destination. You group by the email address and then min or max the client id in the aggregate component. And as I type that, a niggling part of my brain says there's a silly limit with the aggregate and string fields. Maybe it's just that you can't min/max them but grouping is allowed. I am assuming that ClientID and email address are unique. If ClientId 123 has both [email protected] and [email protected], this approach will work but you'll need a better mechanism for determining data survivorship.

So priming data flow is run and you have a cache filled with unique email addresses and the client ID you will want to retain the email address for.

In the existing data flow, we're going to ignore the email address from the source. You can either unmap it so it never enters the row buffers (preferable), or just remember that we want the email address from the lookup. Add a Lookup transformation between the source and destination. Configure it to use a Cache Connection Manager, and use the CCM we just created and filled in the priming step. Indicate that in the event of no match, the failure is ignored. Map the Client ID in the data flow buffer to the client ID column in the CCM. Check the EmailAddress from the CCM so it will be available in the data flow buffers. Assume we call it EmailAddress_LKP.

In your destination, map the EmailAddress column to the value generated from the lookup, EmailAddress_LKP.
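The two passes described above can be sketched outside SSIS to make the logic easier to follow. This is only an illustration in Python, not SSIS code: the sample rows, the `example.com` addresses, and the function names `build_cache` and `apply_lookup` are all hypothetical, and it assumes each client ID appears once in the file.

```python
# Hypothetical sample rows standing in for the flat file: (client_id, name, email).
rows = [
    (1234, "Mike Smith", "mike@example.com"),
    (5678, "Mike's Motors", "mike@example.com"),
    (9012, "Jane Doe", "jane@example.com"),
]


def build_cache(rows):
    """Priming pass: group by email and keep the minimum client ID per email.

    Plays the role of source -> Aggregate -> Cache Destination. The returned
    dict is keyed by client ID, because the Lookup matches on client ID.
    """
    winner_by_email = {}
    for client_id, _name, email in rows:
        if email is None:
            continue  # nothing to retain for rows without an email
        if email not in winner_by_email or client_id < winner_by_email[email]:
            winner_by_email[email] = client_id
    # Assumption: one email per client ID. If a client ID won several emails,
    # this inversion would keep only one of them (the survivorship caveat).
    return {client_id: email for email, client_id in winner_by_email.items()}


def apply_lookup(rows, cache):
    """Main pass: ignore the source email and take it from the cache.

    A client ID with no match gets None, like a Lookup set to ignore failures.
    """
    return [(client_id, name, cache.get(client_id)) for client_id, name, _ in rows]


result = apply_lookup(rows, build_cache(rows))
```

Under these assumptions, the row with the lowest client ID for each address keeps its email and every other row sharing that address ends up with `None`.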

The other approach would be to write an asynchronous Script Component (async is the only way you can access more than the current buffer, but at the price of memory and speed). There you'd likely build a map of seen email addresses and, in the event you have a match, set the output buffer column's IsNull property to true.
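The "map of seen email addresses" idea can be sketched as a single pass over the rows. A real Script Component would be written in C# or VB.NET against the generated buffer classes; the Python below only shows the logic. The sample data and the name `null_repeat_emails` are hypothetical, and trimming and lower-casing the address before comparing is an assumption about what should count as a duplicate.

```python
# Hypothetical sample rows: (client_id, name, email).
rows = [
    (1234, "Mike Smith", "mike@example.com"),
    (5678, "Mike's Motors", "Mike@Example.com "),
    (9012, "Jane Doe", None),
    (3456, "Jane's Bakery", None),
]


def null_repeat_emails(rows):
    """Yield each row, replacing an email that was already seen with None."""
    seen = set()
    for client_id, name, email in rows:
        if email is not None:
            # Assumption: addresses differing only in case or surrounding
            # whitespace are treated as the same address.
            key = email.strip().lower()
            if key in seen:
                email = None  # duplicate: blank it out on this row
            else:
                seen.add(key)
        yield client_id, name, email


cleaned = list(null_repeat_emails(rows))
```

The first row with a given address keeps it, and rows that arrive without an email pass through unchanged, so multiple nulls are allowed. The set grows with the number of distinct addresses, which is the memory cost mentioned above.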

goodeyebrian · Accepted Answer · 2020-09-01 21:56:08Z

0

So, I found a "low-tech" solution. I used a multicast then a sort. I then sorted by the email field and set it to delete duplicate records. I unchecked all columns in the sort's passthrough except for the email field and the join key. I then re-joined it to the dataflow with a left join, taking all fields except for the email field on the left side of the join and only the email field on the right.
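For readers who want to check the shape of this pipeline, here is a rough Python analogue of the multicast, sort, and left join steps. It is a sketch, not SSIS code: the sample rows and the names `dedupe_branch` and `left_join_email` are hypothetical, and which duplicate the Sort keeps is modelled here as "first in file order", which the real component may not guarantee. Note also that in SSIS the Merge Join transformation needs both inputs sorted on the join key, so the deduplicated branch would have to be re-sorted by that key before the join.

```python
from itertools import groupby

# Hypothetical sample rows: (client_id, name, email).
rows = [
    (5678, "Mike's Motors", "mike@example.com"),
    (1234, "Mike Smith", "mike@example.com"),
    (9012, "Jane Doe", "jane@example.com"),
]


def dedupe_branch(rows):
    """Sort branch: pass through only (email, join key), sort by email,
    and keep a single row per email address."""
    pairs = [(email, client_id) for client_id, _name, email in rows if email is not None]
    pairs.sort(key=lambda pair: pair[0])  # stable sort: file order kept within an email
    kept = [next(group) for _email, group in groupby(pairs, key=lambda pair: pair[0])]
    # Key the surviving rows by the join key for the join step.
    return {client_id: email for email, client_id in kept}


def left_join_email(rows, right):
    """Left join: every field except email from the left side, email from the right."""
    return [(client_id, name, right.get(client_id)) for client_id, name, _ in rows]


joined = left_join_email(rows, dedupe_branch(rows))
```

Rows whose key did not survive the deduplication find no partner in the left join, which is what leaves their email empty.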

