
I am currently working on a script which processes CSV files, and one of the things it does is remove duplicate lines while keeping a note of them. My current method is to run uniq -d first to display all duplicates, then run uniq again without any options to actually remove them. I was wondering whether it is possible to do this in one pass instead of having to run uniq twice. I've found plenty of examples of using awk to remove duplicates, but as far as I can tell none of them both display the duplicates and remove them at the same time. If anyone could offer advice or help with this I would really appreciate it, thanks!
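For reference, a minimal sketch of that two-pass approach (the file names are placeholders, and uniq only detects adjacent duplicates, so the input is assumed to be sorted already):

uniq -d input.csv > duplicates.txt
uniq input.csv > deduped.csv

The first pass records one copy of each repeated line; the second writes the file with the repeats removed.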

  • Stock answer to all text manipulation questions: yes, it is trivial in awk. Now - what is it you want to do? Post some small sample input, the expected output after running the desired tool on that input, and an explanation of why that would be the output. Commented Nov 29, 2012 at 19:26

3 Answers

6

Here's something to get you started:

awk 'seen[$0]++{print|"cat>&2";next}1' file > tmp && mv tmp file

The above will print any duplicated lines to stderr at the same time as removing them from your input file. If you need more, tell us more....
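A quick illustration with a small, made-up file (the file name matches the command above); the first copy of each line survives, and the repeats show up on stderr:

$ cat file
a,1
b,2
a,1
c,3
b,2
$ awk 'seen[$0]++{print|"cat>&2";next}1' file > tmp && mv tmp file
a,1
b,2
$ cat file
a,1
b,2
c,3

The a,1 and b,2 printed after the awk command are the duplicates going to stderr; if you want them in a file instead, adding 2> duplicates.txt to the command should capture them.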



1

In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice but to rely on sort and uniq, because these tools support external (on-disk) sorting.
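For example, something along these lines (file names are placeholders) handles inputs larger than memory, because sort can spill to temporary files on disk:

sort file.csv | uniq -d > duplicates.txt
sort -u file.csv > unique.csv

Note that this sorts the file twice; if that matters, sort once to a temporary file and run uniq -d and uniq against it.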

That said, here's the AWK way:

  • If your input is sorted, you can detect duplicates in AWK with O(1) state by comparing each line to the previous one: if line i equals line i-1, line i is a duplicate (see the sketch after this list).

  • If your input is not sorted, you have to keep track of every line seen so far, requiring O(c) state, where c is the number of unique lines. AWK's associative arrays serve as the hash table for this purpose (that is what seen[$0]++ in the first answer does).
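A minimal sketch of the sorted-input variant, in the same style as the first answer above (sorted.csv and unique.csv are placeholders; duplicates go to stderr):

awk 'NR>1 && $0==prev{print|"cat>&2";next}{prev=$0}1' sorted.csv > unique.csv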


0

This solution does not use awk, but it produces the result you need. In the command below, replace sortedfile.txt with your CSV file; note that uniq only detects adjacent duplicates, so the file needs to be sorted first.

cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt

tee splits the stream: one copy goes to the process substitution running uniq -d, which writes only the duplicated lines to duplicates_only.txt, while the other copy continues down the pipeline to uniq, which writes the de-duplicated lines to unique.txt.

1 Comment

Lose the Useless Use of Cat, though.
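If you want to drop the cat as the comment suggests, the same pipeline can read the file via redirection instead (process substitution still requires bash, zsh, or ksh):

tee >(uniq -d > duplicates_only.txt) < sortedfile.txt | uniq > unique.txt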
