
example1.csv

id1, value1
id2, value2
id3, value3
id1, value4

example2.csv

"06e04,0428","405872,8637110"
"06e04,0428","405872,8637111"
"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"

I need to remove all rows that have duplicate values in column 1, producing the output below.

Required output:

example1_out.csv

id2, value2
id3, value3

example2_out.csv

"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"

There are solutions for removing duplicate records that retain one of the duplicates, as in this SO question. Here, however, all rows with duplicate values in column 1 need to be excluded from the output.

  • You've tagged your question shell. Do you mean bash? Commented Mar 3, 2017 at 18:02
  • @Grzesiek - could you pls add the nodejs answer ? Commented Mar 3, 2017 at 18:05
  • @user3206440 I am sorry, the selected solution using awk is shorter. Commented Mar 3, 2017 at 18:08
  • @ghoti - yes I mean bash. Commented Mar 5, 2017 at 16:40

4 Answers


This awk can do that in a single command:

awk -F, '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv

id2, value2
id3, value3
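
For readability, here is the same logic expanded (an equivalent sketch of the one-liner above):

awk -F, '
    { arr[$1] = $0 }                      # remember the latest line seen for this key
    seen[$1]++ { delete arr[$1] }         # key seen before: drop it from the output set
    END { for (i in arr) print arr[i] }   # print the survivors (array order, not input order)
' file.csv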

For your edited question, where the fields are quoted and contain embedded commas, use the literal "," sequence between fields as the separator:

awk -F'","' '{arr[$1]=$0} seen[$1]++{delete arr[$1]} END{for (i in arr) print arr[i]}' file.csv

"06e04,0429","405872,8637110"
"06e04,0430","405872,8637110"
"06e04,0431","405872,8637111"

3 Comments

Note that this does not preserve line order; the output can come out as id3, value3 followed by id2, value2.
This doesn't work well when column 1 has values that contain commas; see the updated question.
I've modified my answer based on your edited question. Check the updated answer, which gives the expected output.
cut -f1 -d, somecsv.csv | sort | uniq -u | grep -Ff- somecsv.csv

The first command extracts the first column from the input. The second command sorts the ids so that uniq -u can print only the ids that occur exactly once. The final grep takes those unique ids and prints the matching lines from the input file.
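
For example, with example1.csv from the question, the first three stages yield the ids that occur exactly once:

$ cut -f1 -d, example1.csv | sort | uniq -u
id2
id3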

2 Comments

Hm... this will also match lines where the text in col1 is contained in the other columns... e.g. id1, valid2
So change grep -Ff- to sed 's/^/^/' | grep -f-. Again, it can break if the first column contains regex special characters; see the combined pipeline below.
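
Putting that suggestion together, the anchored variant of the pipeline would be (a sketch; it still assumes the ids contain no regex metacharacters):

cut -f1 -d, somecsv.csv | sort | uniq -u | sed 's/^/^/' | grep -f- somecsv.csv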

Here's a shorter awk option.

awk -F, 'NR==FNR{a[$1]++;next} a[$1]<2' file.csv file.csv

This reads the file twice -- once to populate an array of counters of the first field, and the second time to print lines whose count is less than 2.
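
The same one-liner expanded with comments (an equivalent sketch):

awk -F, '
    NR==FNR { a[$1]++; next }   # first pass: count occurrences of field 1
    a[$1] < 2                   # second pass: bare pattern prints lines whose key occurred once
' file.csv file.csv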

If you'd prefer to do this in pure shell rather than awk, and your shell is bash, you could get similar functionality with something like this:

$ declare -A a=()
$ while IFS=, read -r f _; do ((a[$f]++)); done < file.csv
$ declare -p a
declare -A a=([id1]="2" [id3]="1" [id2]="1" )
$ while IFS=, read -r f1 f2; do [ "${a[$f1]}" -lt 2 ] && printf '%s,%s\n' "$f1" "$f2"; done < file.csv
id2, value2
id3, value3

Again, this is two steps -- the first to populate an array of counters, the second to step through the file and print appropriate lines.
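
Packaged as a small script (a sketch; dedupe.sh is just an illustrative name, and it assumes non-empty input lines):

#!/usr/bin/env bash
# dedupe.sh -- print only the lines whose first CSV field is unique
declare -A count=()
while IFS=, read -r key _; do      # pass 1: count first-field occurrences
    [[ -n $key ]] || continue      # skip blank lines
    ((count[$key]++))
done < "$1"
while IFS= read -r line; do        # pass 2: print lines whose key occurred once
    key=${line%%,*}                # first field = everything before the first comma
    [[ -n $key ]] || continue
    (( count[$key] < 2 )) && printf '%s\n' "$line"
done < "$1"

Run it as ./dedupe.sh example1.csv.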



You can use awk:

awk -F, '{
    if (!($1 in seen)) {
        arr[$1] = $0        # first time we see this key: remember the line
        order[n++] = $1     # and record its position
    } else {
        delete arr[$1]      # key seen again: remove it from the output set
    }
    seen[$1]++              # track the key so a third occurrence is not re-added
}
END {
    for (i = 0; i < n; i++) {
        if (order[i] in arr) {
            print arr[order[i]]
        }
    }
}' somecsv.csv

It stores each line in an array keyed by column 1 and deletes the entry when the same key is seen again; the seen array ensures a key appearing three or more times is never re-added. The original order is preserved with a separate order array, and keys deleted as duplicates are skipped when printing.
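
With example1.csv this prints the surviving rows in their original order:

id2, value2
id3, value3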

