PowerShell Help: How can I remove duplicates (using multiple columns simultaneously, not sequentially)?

Question

I have tried several variations based on other Stack Overflow answers. Below are a sample of my input, a sample of the output I want, and some cobbled-together code; I'm hoping for some direction from the community:

C:\Scripts\contacts.csv:

id,first_name,last_name,email
1,john,smith,[email protected]
1,jane,smith,[email protected]
2,jane,smith,[email protected]
2,john,smith,[email protected]
3,sam,jones,[email protected]
3,sandy,jones,[email protected]

I need to turn this into a file where each combination of "id" and "email" appears only once. In other words, an email address may be repeated, but only under a different id.

Desired output C:\Scripts\contacts-trimmed.csv:

id,first_name,last_name,email
1,john,smith,[email protected]
2,john,smith,[email protected]
3,sam,jones,[email protected]
3,sandy,jones,[email protected]

I have tried this with a few different variations:

Import-Csv C:\Scripts\contacts.csv | sort first_name | Sort-Object -Property id,email -Unique | Export-Csv C:\Scripts\contacts-trim.csv -NoTypeInformation

Any help or direction would be most appreciated.

What are the rules for discarding duplicates? I.e., why isn't the 2nd row of the desired output 2,jane,smith,[email protected]?
– zett42, Commented Feb 8, 2021 at 20:35
The email address is the same even though the name is different. Basically, there can be multiple ids and multiple emails, but no duplicate emails for any one id. So each combination of id and email must be unique.
– Andy, Commented Feb 8, 2021 at 21:43
When going through the records one by one, I understand that you keep the first record and discard the second, because they have the same ID and the same email. The 3rd record has a new ID, so shouldn't the 3rd record be kept and the 4th one discarded?
– zett42, Commented Feb 8, 2021 at 21:49
Without getting too in depth: there can only be one user per email address, and the id is a student id. Many of our parents have multiple students, and we can only have one parent per email, but in many cases both parents use the same email. We have to eliminate one or the other and can't keep both, so I have to sort by first name so that, when duplicates are eliminated, one parent is removed completely and the other is kept, even if they are assigned to multiple students. I hope this makes sense.
– Andy, Commented Feb 9, 2021 at 13:05

Mathias R. Jessen · Accepted Answer · 2021-02-08 19:21:50Z


You'll want to use the Group-Object cmdlet, to, well, group together records with similar values:

$records = @'
id,first_name,last_name,email
1,john,smith,[email protected]
1,jane,smith,[email protected]
2,jane,smith,[email protected]
2,john,smith,[email protected]
3,sam,jones,[email protected]
3,sandy,jones,[email protected]
'@ |ConvertFrom-Csv

# group records based on id and email column
$records |Group-Object id,email |ForEach-Object {
  # grab only the first record from each group
  $_.Group |Select-Object -First 1
} |Export-Csv .\no_duplicates.csv -NoTypeInformation
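To see what grouping on two columns at once does before anything is discarded, a minimal sketch along these lines may help. The email addresses are hypothetical placeholders (the real ones are obscured in the question), and the variable names `$groups` and `$duplicateGroups` are only illustrative.

```powershell
# Hypothetical sample data: the addresses are placeholders, not the asker's.
$records = @'
id,first_name,last_name,email
1,john,smith,smith@example.com
1,jane,smith,smith@example.com
2,jane,smith,smith@example.com
2,john,smith,smith@example.com
3,sam,jones,sam@example.com
3,sandy,jones,sandy@example.com
'@ | ConvertFrom-Csv

# Group on both columns at once: one group per distinct id/email pair.
$groups = $records | Group-Object -Property id, email

# Show how many records fell into each composite-key group.
$groups | Select-Object Name, Count

# Groups holding more than one record are the ones that contain duplicates.
$duplicateGroups = @($groups | Where-Object { $_.Count -gt 1 })
```

With this placeholder data there are four distinct id/email pairs, two of which hold more than one record. Taking one record from each group is what removes the duplicates.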



7 Comments

zett42 Over a year ago

Btw, this produces the same output as $records | Sort-Object id, email -Unique. It doesn't match the OP's "desired" output, though...

Andy Over a year ago

Unless I am missing something, the above answer seems to work. I am checking through the output now.

zett42 Over a year ago

It produces 2,jane... as the 2nd line, while in the OP's "desired" output it is 2,john.... I think your output is correct though, and the OP's "desired" output isn't (unless I'm missing something ;-)).

Mathias R. Jessen Over a year ago

@Andy if you need to explicitly sort the individual groups, you can always change the inner pipeline to $_.Group |Sort-Object first_name -Descending |Select-Object -First 1 for example

Andy Over a year ago

@MathiasR.Jessen Thank you for your help. Your code pointed me in the right direction. I'm new to PowerShell and was stumped :)
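The per-group sort suggested in the comments can be sketched roughly as follows. This assumes the sort key is the first_name column and that a descending sort is what picks the record to keep; the email addresses are hypothetical placeholders, and `$trimmed` is an illustrative name.

```powershell
# Hypothetical sample data: the addresses are placeholders, not the asker's.
$records = @'
id,first_name,last_name,email
1,john,smith,smith@example.com
1,jane,smith,smith@example.com
2,jane,smith,smith@example.com
2,john,smith,smith@example.com
3,sam,jones,sam@example.com
3,sandy,jones,sandy@example.com
'@ | ConvertFrom-Csv

$trimmed = $records | Group-Object -Property id, email | ForEach-Object {
    # Sort inside each group so the surviving record is chosen deliberately,
    # not by whatever order the rows happened to have in the file.
    $_.Group | Sort-Object -Property first_name -Descending | Select-Object -First 1
}

# Print the result as CSV text; pipe $trimmed to Export-Csv to write a file instead.
$trimmed | ConvertTo-Csv -NoTypeInformation
```

Because the same sort rule is applied in every group, the same parent wins for each student who shares that email. With this placeholder data, john is kept for both id 1 and id 2, and both id 3 records survive because their addresses differ.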
