drop duplicates and keep first in a csv file in unix

Question

Hi, I have a CSV file whose content looks like this:

NAME,AGE
abc,12
def,13
NAME,AGE  ## duplicate here (these are the column names again)
sdd,34
krgj,656

I tried to remove the duplicates with sort:

sort -u file.csv -o file.csv

The duplicate rows were dropped, but it seems to have kept the last one. I need to keep the first one, so that my column header stays in place.

Please help with this.

glenn jackman · Accepted Answer · 2016-11-22 04:04:24Z

The idiomatic awk program for this task is:

awk '!seen[$0]++' file

For each line ($0) in the file, we increment the number of times we've seen that line. Since we're using the post-increment operator, the first time a line is encountered, the value of seen[$0]++ is zero. For all other instances of that line, the value is non-zero. So we negate the value to get a true value for the first time seen. The default action is to print the line.
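As a rough illustration of the explanation above, the one-liner can be wrapped in a shell function and run on the sample data from the question. The function name dedup_keep_first is made up for this sketch; only the awk program itself comes from the answer.

```shell
# dedup_keep_first is a hypothetical wrapper name, not a standard tool.
# It prints each distinct line once, in order of first appearance.
dedup_keep_first() {
    # seen[$0]++ evaluates to 0 (false) the first time a line appears,
    # so !seen[$0]++ is true only for first occurrences.
    awk '!seen[$0]++' "$@"
}

# Sample data from the question, fed on standard input.
printf '%s\n' 'NAME,AGE' 'abc,12' 'def,13' 'NAME,AGE' 'sdd,34' 'krgj,656' |
    dedup_keep_first
# prints:
#   NAME,AGE
#   abc,12
#   def,13
#   sdd,34
#   krgj,656
```

Because the lines are never sorted, the header stays on the first line and the data rows keep their original order.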

3 Comments

user7079832 Over a year ago

Is there any way to redirect those seen values to a file? That would help me a lot.

glenn jackman Over a year ago

Just use regular redirection like you've seen a million times: awk ... file > otherfile

Hans Ginzel Over a year ago

Almost the same in Perl: perl -ne 'print if not $seen{$_}++' file. See the -n switch.
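The redirection suggested in the comments above writes the de-duplicated lines to a second file. If the repeated lines themselves are also wanted (one possible reading of the question in the comments), a sketch along these lines could write both sets to separate files. The helper name split_first_and_repeats and the file names are assumptions made for this example.

```shell
# split_first_and_repeats is a hypothetical helper name for this sketch.
# $1 = input file, $2 = file for first occurrences, $3 = file for repeats.
# The two output files must be different from the input file.
split_first_and_repeats() {
    : > "$2"; : > "$3"   # make sure both output files exist, even if empty
    awk -v keep="$2" -v dups="$3" '
        !seen[$0]++ { print > keep; next }   # first time this line is seen
                    { print > dups }         # every later copy
    ' "$1"
}

# Example with throwaway files created by mktemp.
dir=$(mktemp -d)
printf '%s\n' 'NAME,AGE' 'abc,12' 'NAME,AGE' 'sdd,34' > "$dir/in.csv"
split_first_and_repeats "$dir/in.csv" "$dir/unique.csv" "$dir/repeats.csv"
cat "$dir/unique.csv"    # NAME,AGE, abc,12, sdd,34 (one per line)
cat "$dir/repeats.csv"   # NAME,AGE
```

Writing to new files and then moving the result over the original is safer than redirecting onto the input file, since the shell would truncate the input before awk reads it.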

Tim · Accepted Answer · 2016-11-20 19:36:10Z

This isn't the most elegant solution, but it works.

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv

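It works by writing (>) the first line to output.csv, then removing every line that matches the header using grep -v and appending (>>) the result to output.csv. Note that grep treats the header as a regular expression and matches it anywhere in a line, so a data row that merely contains the header text would also be dropped; grep -vxF restricts the match to whole lines identical to the header.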

Example:

root@merlin:/tmp# cat source.csv 
NAME,AGE
abc,12
def,13
NAME,AGE
sdd,34
krgj,656
root@merlin:/tmp# head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv
root@merlin:/tmp# cat output.csv 
NAME,AGE
abc,12
def,13
sdd,34
krgj,656
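A slightly stricter variant of the command above, as a sketch: grep's -x and -F options make it match the header only as a whole, literal line, so regex characters in the header or data rows that merely contain the header text are not affected. The function name strip_repeated_header is invented for this example.

```shell
# strip_repeated_header is a hypothetical helper name for this sketch.
# $1 = CSV file whose first line is the header.
strip_repeated_header() {
    header=$(head -n 1 "$1")
    printf '%s\n' "$header"          # keep the first header line
    # -v: invert match, -x: whole line only, -F: literal string, not a regex
    tail -n +2 "$1" | grep -vxF -- "$header"
}

# Example with a throwaway copy of the sample data.
dir=$(mktemp -d)
printf '%s\n' 'NAME,AGE' 'abc,12' 'def,13' 'NAME,AGE' 'sdd,34' 'krgj,656' > "$dir/source.csv"
strip_repeated_header "$dir/source.csv" > "$dir/output.csv"
cat "$dir/output.csv"
```

Like the original command, this removes only repeated header lines; duplicate data rows are left alone.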

If you also need to remove duplicate data rows (note that sort -u reorders them):

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv |sort -u >> output.csv
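To make the effect of sort -u visible, here is a small sketch of the same idea as a function. It is not the answer's exact command: it uses grep -vxF (whole-line, literal match) and pins the collation with LC_ALL=C so the row order is reproducible. The name header_then_sorted_unique is an assumption for this example.

```shell
# header_then_sorted_unique is a hypothetical helper name for this sketch.
# $1 = CSV file whose first line is the header.
header_then_sorted_unique() {
    header=$(head -n 1 "$1")
    printf '%s\n' "$header"                       # header stays first
    # Drop repeated header lines, then sort and de-duplicate the data rows.
    tail -n +2 "$1" | grep -vxF -- "$header" | LC_ALL=C sort -u
}

# Example: the data rows come out sorted, not in their original order.
dir=$(mktemp -d)
printf '%s\n' 'NAME,AGE' 'sdd,34' 'abc,12' 'NAME,AGE' 'abc,12' > "$dir/source.csv"
header_then_sorted_unique "$dir/source.csv"
# prints:
#   NAME,AGE
#   abc,12
#   sdd,34
```

If the original row order matters, an order-preserving filter such as awk '!seen[$0]++' is the better fit.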

agc · Accepted Answer · 2016-11-21 07:24:12Z

Using datamash's non-sorting deduplication line filter "rmdup" (requires datamash v1.0.7 or newer):

datamash rmdup 1 < source.csv

Output:

NAME,AGE
abc,12
def,13
sdd,34
krgj,656

2 Comments

user7079832 Over a year ago

@agc: I installed datamash from an rpm and then ran the command from your answer, but got the error "datamash: invalid operation 'rmdup'". The command was: datamash rmdup 1 < test.csv

agc Over a year ago

@LoneRanger, rmdup requires datamash v1.0.7 (from 2015) or newer. Online I only see v1.0.6 rpms. Perhaps I've missed it. Then there's the option of installing the hard way, using the upstream source for v1.1.0.
