
Hi, I have a CSV file whose content looks like this:

NAME,AGE
abc,12
def,13
NAME,AGE  ## duplicate here, though these are the column names
sdd,34
krgj,656

I tried to remove the duplicates with a sort command:

sort -u file.csv -o file.csv

This removed the duplicate rows, but it kept the last occurrence rather than the first. I need to keep the first one so that my column header stays in place.

Please help in this regard.

3 Answers

1

The idiomatic awk program for this task is:

awk '!seen[$0]++' file

For each line ($0) in the file, we increment the number of times we've seen that line. Since we're using the post-increment operator, the value of seen[$0]++ is zero the first time a line is encountered and non-zero for every later instance of that line. Negating that value therefore gives a true result only for the first occurrence. No action is specified, so the default action applies, which is to print the line.
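
To make the logic above easier to follow, here is a long-hand sketch of the same idea. It should behave the same as the one-liner; the temporary file and the variable names tmp and result are only there for illustration and are not part of the original answer.

```shell
# Hypothetical sample file reproducing the data from the question.
tmp=$(mktemp)
printf '%s\n' 'NAME,AGE' 'abc,12' 'def,13' 'NAME,AGE' 'sdd,34' 'krgj,656' > "$tmp"

# Long-hand equivalent of the one-liner: print a line only on its first occurrence.
result=$(awk '{
    if (seen[$0] == 0) print   # count is still zero, so this is the first occurrence
    seen[$0]++                 # record that this line has now been seen
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because nothing is sorted, the header stays on the first line and the remaining rows keep their original order.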


3 Comments

Is there any way to redirect those seen values to a file? That would help me a lot.
Just use regular redirection like you've seen a million times: awk ... file > otherfile
Almost same in perl: perl -ne 'print if not $seen{$_}++' file. See -n.
0

This isn't the most elegant solution but it works.

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv

It works by writing (>) the first line to output.csv, then using grep -v to remove every line that matches that first line, and appending (>>) the result to output.csv.

Example:

root@merlin:/tmp# cat source.csv 
NAME,AGE
abc,12
def,13
NAME,AGE
sdd,34
krgj,656
root@merlin:/tmp# head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv >> output.csv
root@merlin:/tmp# cat output.csv 
NAME,AGE
abc,12
def,13
sdd,34
krgj,656

If you also need to remove duplicate data rows (note that sort -u reorders them as well):

head -n1 source.csv > output.csv; grep -v "$(head -n1 source.csv)" source.csv |sort -u >> output.csv
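
One caveat with the commands above: grep treats the header as a regular expression and matches it anywhere in a line, so a data row that merely contains the header text would be dropped too. A slightly more defensive sketch, assuming your grep supports the POSIX options -x (whole-line match) and -F (fixed string), could look like this. The temporary directory and the variable names are only for illustration.

```shell
# Hypothetical sample file reproducing the data from the question.
tmp=$(mktemp -d)
printf '%s\n' 'NAME,AGE' 'abc,12' 'def,13' 'NAME,AGE' 'sdd,34' 'krgj,656' > "$tmp/source.csv"

# Read the header once, then drop only lines that are exactly equal to it.
header=$(head -n 1 "$tmp/source.csv")
{
    printf '%s\n' "$header"                    # keep a single copy of the header
    grep -vxF -e "$header" "$tmp/source.csv"   # -x: whole line, -F: literal string
} > "$tmp/output.csv"

output=$(cat "$tmp/output.csv")
printf '%s\n' "$output"
rm -rf "$tmp"
```

This only removes repeated header lines; duplicate data rows are left alone and the row order is preserved.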


0

Using datamash's non-sorting deduplication line filter "rmdup" (requires datamash v1.0.7 or newer):

datamash rmdup 1 < source.csv

Output:

NAME,AGE
abc,12
def,13
sdd,34
krgj,656

2 Comments

@agc, I installed datamash from an rpm and then ran the command from your answer, but got the error "datamash: invalid operation 'rmdup'". The command was: datamash rmdup 1 < test.csv
@LoneRanger, rmdup requires datamash v1.0.7 (from 2015) or newer. Online I only see v1.0.6 rpms. Perhaps I've missed it. Then there's the option of installing the hard way, using the upstream source for v1.1.0.
