
I have the following CSV file:

1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en

I want to remove the duplicates based on the value of the first column. If there is more than one record with the same value, I want to keep only one in the new file.

I started with the following, which finds the duplicates, but I want to create a new file instead of just printing:

sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
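For the printing part alone, shell redirection is enough to send the report to a file instead of the terminal (a sketch of the same command; `duplicates.txt` is just an assumed name):

```shell
# Same pipeline, but the "X is duplicated" report goes to a file.
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 {print $1 " is duplicated"} {p=$1}' FS="," > duplicates.txt
```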

2 Answers


Not 100% sure what you want, but this will keep only the last record when there are duplicates:

awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
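One caveat: `for (i in a)` iterates in an unspecified order in awk, which is why the output above is shuffled. If the original line order matters, a sketch that decorates each kept record with its line number, sorts numerically on it, and strips it again:

```shell
# Keep the last record per key, but restore the original file order:
# store NR alongside each kept record, sort on it, then cut it away.
awk -F, '{a[$1]=NR","$0} END {for (i in a) print a[i]}' file \
  | sort -t, -k1,1n \
  | cut -d, -f2- > newfile
```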

If it's not important which record you keep, as long as field 1 is unique, this will keep the first hit when there are several equal keys:

awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh

To get the duplicated keys into a new file:

awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
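If you want the full duplicated records rather than just the keys, the same counter with the default print action does it (a small variant of the one-liner above, assuming every repeat after the first occurrence should be reported):

```shell
# Print every whole record whose key has already been seen.
awk -F, '++a[$1] > 1' file > newfile
```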

3 Comments

I had a header on the CSV file, but after I created the new file based on your first block of code I lost the header. How do I keep the header?
@Null-Hypothesis Add this at the start of the code: NR==1 {print;next}, e.g.: awk -F, 'NR==1 {print;next} !a[$1]++' file > newfile. This way line 1 always goes to newfile.
This will use quite a bit of memory on a large file, and it has to traverse the file and then the array. If you print the first time you see a key, you don't have to save the record.

This will show only the first entry for a given first column value:

awk -F, '!(seen[$1]++)' file > newfile
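To sanity-check the result, `uniq -d` on the sorted first column prints nothing when every key is unique (a quick verification sketch, not part of the original answer):

```shell
# Empty output means no first-column value appears more than once.
cut -d, -f1 newfile | sort | uniq -d
```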
