I am hoping for a line or two of code for a bash script to find and print repeated items in a column of a 2.5 GB CSV file, except for one item that I know is commonly repeated.

The data file has a header, but the header value is never duplicated, so I'm not worried about code that handles the header specially.

Here is an illustration of what the data look like:

header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,@r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,@r,88,u|

I am seeking the output:

7f
@r

since both of these are duplicated in column two. As you can see, 2r is also duplicated, but it is the common value I already know about, so I just want to ignore it.

To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.

I read here that I can do something like

awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file

However, I cannot figure out how to skip '2r', nor do I understand what ++A means.

I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.

Additionally,

uniq -d 

looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.

Thank you in advance for your help.

  • Yes, two or more is what I mean by duplicate. I will edit above. Commented May 25, 2018 at 22:03

2 Answers


how to skip '2r':

$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
@r

++a[$2] uses the second-column value as a key into an associative array and increments its count before the comparison, i.e. it counts how many occurrences of each value in the second column have been seen so far. The == 2 test is true exactly once per duplicated value (on its second occurrence), so each duplicate is printed only once, and $2 != "2r" skips the common value.
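To see that a value repeated three or more times is still printed only once, here is a minimal sketch using inline sample data modeled on the question (the rows are illustrative, not from a real file):

```shell
# Column two contains 2r five times, 7f three times, and @r twice.
printf '%s\n' \
  '1f,2r,7g' 'xc,7f,66' 'hy,7f,&&' 'rr,2r,89' \
  'v6,2r,^&' '92,@r,hd' '2r,2r,2r' '7f,7f,7f' \
  '7f,2r,7f' '76,@r,88' |
awk -F, '++a[$2] == 2 && $2 != "2r" { print $2 }'
# Prints:
# 7f
# @r
```

Each value appears in the output at the position of its second occurrence, so the result arrives in a single streaming pass with no sort needed.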


1 Comment

This works great, and I truly appreciate your easy-to-understand explanation of ++a!
  1. Get only the second column using cut -d, -f2
  2. sort
  3. uniq -d to get repeated lines
  4. grep -Fv 2r to exclude a value, or grep -Fv -e foo -e bar … to exclude multiple values

In other words something like this:

cut -d, -f2 input.csv | sort | uniq -d | grep -Fv 2r

Depending on the data it might be faster if you move grep earlier in the pipeline, but you should verify that with some benchmarking.
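For instance, moving the filter ahead of the sort means sort sees less data when the excluded value is frequent (a sketch; -x makes grep match the whole line, so '2r' won't accidentally drop values like 'a2r'):

```shell
# Filter out the common value first, then find duplicates among the rest.
cut -d, -f2 input.csv | grep -Fxv '2r' | sort | uniq -d
# Prints the remaining duplicated values, one per line (e.g. 7f and @r).
```

Whether this is actually faster depends on how common the excluded value is, so benchmark both orderings on your real file.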

1 Comment

So something like infile=cut -d, -f2; infile=sort $infile; infile=uniq -d $infile; grep -v 2r $infile ? Please excuse my newness to this syntax.
