I am hoping for a line or two of code for a bash script to find and print repeated items in a column of a 2.5 GB CSV file, except for one item that I know is commonly repeated.

The data file has a header, but the header value is never duplicated, so I'm not worried about code that handles the header specially.

Here is an illustration of what the data look like:

header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,@r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,@r,88,u|

I am seeking the output:

7f
@r

since both of these are duplicated in column two. As you can see, 2r is also duplicated, but it is the common value I already know about, so I just want to ignore it.

To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.

I read here that I can do something like

awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file

However, I cannot figure out how to skip '2r', nor do I understand what ++A means.

I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.

Additionally,

uniq -d 

looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.

Thank you in advance for your help.

  • Yes, two or more is what I mean by duplicate. I will edit above. Commented May 25, 2018 at 22:03

2 Answers


how to skip '2r':

$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
@r

++a[$2] uses the second-column value as a key into an associative array and increments its count before the comparison, i.e. it counts how many occurrences of each value in the second column have been seen so far. The == 2 test is true exactly once per duplicated value (on its second occurrence), so each duplicate is printed only once, and $2 != "2r" skips the common value.
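To see that a value repeated three or more times is still printed only once, here is a minimal sketch using inline sample data modeled on the question (the rows are illustrative, not from a real file):

```shell
# Column two contains 2r five times, 7f three times, and @r twice.
printf '%s\n' \
  '1f,2r,7g' 'xc,7f,66' 'hy,7f,&&' 'rr,2r,89' \
  'v6,2r,^&' '92,@r,hd' '2r,2r,2r' '7f,7f,7f' \
  '7f,2r,7f' '76,@r,88' |
awk -F, '++a[$2] == 2 && $2 != "2r" { print $2 }'
# Prints:
# 7f
# @r
```

Each value appears in the output at the position of its second occurrence, so the result arrives in a single streaming pass with no sort needed.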


1 Comment

This works great, and I truly appreciate your easy-to-understand explanation of ++a!
  1. Get only the second column using cut -d, -f2
  2. sort
  3. uniq -d to get repeated lines
  4. grep -Fv 2r to exclude a value, or grep -Fv -e foo -e bar … to exclude multiple values

In other words something like this:

cut -d, -f2 input.csv | sort | uniq -d | grep -Fv 2r

Depending on the data it might be faster if you move grep earlier in the pipeline, but you should verify that with some benchmarking.
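For instance, moving the filter ahead of the sort means sort sees less data when the excluded value is frequent (a sketch; -x makes grep match the whole line, so '2r' won't accidentally drop values like 'a2r'):

```shell
# Filter out the common value first, then find duplicates among the rest.
cut -d, -f2 input.csv | grep -Fxv '2r' | sort | uniq -d
# Prints the remaining duplicated values, one per line (e.g. 7f and @r).
```

Whether this is actually faster depends on how common the excluded value is, so benchmark both orderings on your real file.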

1 Comment

So something like infile=cut -d, -f2; infile=sort $infile; infile=uniq -d $infile; grep -v 2r $infile ? Please excuse my newness to this syntax.
