I am currently working on a script which processes csv files, and one of the things it does is remove duplicate lines from the files while keeping note of them. My current method is to run uniq -d once to display all the duplicates, then run uniq again without any options to actually remove them. Having said that, I was wondering whether it would be possible to perform the same function in one action instead of having to run uniq twice. I've found a bunch of different examples of using awk to remove duplicates, but as far as I can tell none of them both display the duplicates and remove them at the same time. If anyone could offer advice or help with this I would really appreciate it, thanks!
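Roughly, the two passes are along these lines (the file names are just placeholders):

uniq -d sortedfile.csv > duplicates.txt      # pass 1: list each duplicated line
uniq sortedfile.csv > deduplicated.csv       # pass 2: write the file with duplicates removed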
Stock answer to all text manipulation questions: yes, it is trivial in awk. Now - what is it you want to do? Post some small sample input, the expected output after running the desired tool on that input, and an explanation of why that would be the output. – Ed Morton, Nov 29, 2012 at 19:26
3 Answers
In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice other than to rely on sort and uniq, because these tools support external operations (they can handle data that does not fit in memory).
That said, here's the AWK way:
If your input is sorted, you can keep track of duplicate items in AWK easily by comparing line i to line i-1 with O(1) state: if line i equals line i-1, you have a duplicate.

If your input is not sorted, you have to keep track of all lines, requiring O(c) state, where c is the number of unique lines. You can use a hash table in AWK for this purpose.
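As a rough sketch of the unsorted-input, hash-table approach (not from the original answer; the file names and the count array are placeholders): a single awk pass keeps the first occurrence of each line on stdout and writes each duplicated line once to a separate file.

awk '{
    count[$0]++                            # hash table keyed on the whole line
    if (count[$0] == 1)
        print                              # first time we see this line: keep it
    else if (count[$0] == 2)
        print > "duplicates_only.txt"      # record each duplicated line once
}' input.csv > unique.txt

Unlike uniq, this does not require the input to be sorted; the trade-off is that it holds one hash entry per distinct line in memory.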
This solution does not use awk, but it does produce the result you need. In the command below, replace sortedfile.txt with your csv file.
cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt
tee copies the output of the cat command to the process substitution running uniq -d (which writes the duplicated lines to duplicates_only.txt) while also passing it down the pipe to the plain uniq, which writes the deduplicated lines to unique.txt.
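If the csv file is not already sorted, a variant of the same idea (again with placeholder file names) is to sort it on the fly; note that the >( ... ) process substitution is a bash/zsh/ksh feature, not plain POSIX sh:

sort file.csv | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt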