
I have the following CSV file:

1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en

I want to remove the duplicates based on the value of the first column. If there is more than one record with the same value, I want to keep only one in the new file.

I started with the following, which finds the duplicates, but I want to create a new file instead of just printing:

sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
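For the printing part alone, shell redirection is enough to send the report to a file instead of the terminal (a sketch of the same command; `duplicates.txt` is just an assumed name):

```shell
# Same pipeline, but the "X is duplicated" report goes to a file.
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 {print $1 " is duplicated"} {p=$1}' FS="," > duplicates.txt
```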

2 Answers


Not 100% sure what you want, but this will keep only the last record when there are duplicates:

awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
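One caveat: `for (i in a)` iterates in an unspecified order in awk, which is why the output above is shuffled. If the original line order matters, a sketch that decorates each kept record with its line number, sorts numerically on it, and strips it again:

```shell
# Keep the last record per key, but restore the original file order:
# store NR alongside each kept record, sort on it, then cut it away.
awk -F, '{a[$1]=NR","$0} END {for (i in a) print a[i]}' file \
  | sort -t, -k1,1n \
  | cut -d, -f2- > newfile
```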

If it's not important which record you keep, as long as field 1 is unique, this will keep the first hit when there are several equal keys:

awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh

To get the duplicated keys into a new file:

awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
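If you want the full duplicated records rather than just the keys, the same counter with the default print action does it (a small variant of the one-liner above, assuming every repeat after the first occurrence should be reported):

```shell
# Print every whole record whose key has already been seen.
awk -F, '++a[$1] > 1' file > newfile
```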

3 Comments

I had a header on the CSV file, but after I created the new file based on your first block of code I lost the header. How do I keep the header?
@Null-Hypothesis Add this at the start of the code: NR==1 {print;next}, e.g.: awk -F, 'NR==1 {print;next} !a[$1]++' file > newfile. This way line 1 always goes to newfile.
This will use quite a bit of memory on a large file, and it has to traverse the file and then the array. If you print the first time you see a key, you don't have to save the record.

This will show only the first entry for a given first column value:

awk -F, '!(seen[$1]++)' file > newfile
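To sanity-check the result, `uniq -d` on the sorted first column prints nothing when every key is unique (a quick verification sketch, not part of the original answer):

```shell
# Empty output means no first-column value appears more than once.
cut -d, -f1 newfile | sort | uniq -d
```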
