I am currently working on a script which processes csv files, and one of the things it does is remove duplicate lines from the files while keeping note of them. My current method is to run uniq -d once to display all the duplicates, then run uniq again without any options to actually remove them. Having said that, I was wondering whether it would be possible to perform the same function in one action instead of having to run uniq twice. I've found a bunch of different examples of using awk to remove duplicates, but as far as I can tell none of them both display the duplicates and remove them at the same time. If anyone could offer advice or help with this I would really appreciate it, thanks!
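Roughly, the two passes are along these lines (the file names are just placeholders):

uniq -d sortedfile.csv > duplicates.txt      # pass 1: list each duplicated line
uniq sortedfile.csv > deduplicated.csv       # pass 2: write the file with duplicates removed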
Stock answer to all text manipulation questions: yes, it is trivial in awk. Now - what is it you want to do? Post some small sample input, the expected output after running the desired tool on that input, and an explanation of why that would be the output. – Ed Morton, Nov 29, 2012 at 19:26
3 Answers
In general, the size of your input should be your guide. If you're processing GBs of data, you often have no choice other than to rely on sort and uniq, because these tools support external operations (they can handle data that does not fit in memory).
That said, here's the AWK way:
If your input is sorted, you can keep track of duplicate items in AWK easily by comparing line i to line i-1 with O(1) state: if line i equals line i-1, you have a duplicate.

If your input is not sorted, you have to keep track of all lines, requiring O(c) state, where c is the number of unique lines. You can use a hash table in AWK for this purpose.
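As a rough sketch of the unsorted-input, hash-table approach (not from the original answer; the file names and the count array are placeholders): a single awk pass keeps the first occurrence of each line on stdout and writes each duplicated line once to a separate file.

awk '{
    count[$0]++                            # hash table keyed on the whole line
    if (count[$0] == 1)
        print                              # first time we see this line: keep it
    else if (count[$0] == 2)
        print > "duplicates_only.txt"      # record each duplicated line once
}' input.csv > unique.txt

Unlike uniq, this does not require the input to be sorted; the trade-off is that it holds one hash entry per distinct line in memory.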
This solution does not use awk, but it does produce the result you need. In the command below, replace sortedfile.txt with your csv file.
cat sortedfile.txt | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt
tee copies the output of the cat command to the process substitution running uniq -d (which writes the duplicated lines to duplicates_only.txt) while also passing it down the pipe to the plain uniq, which writes the deduplicated lines to unique.txt.
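If the csv file is not already sorted, a variant of the same idea (again with placeholder file names) is to sort it on the fly; note that the >( ... ) process substitution is a bash/zsh/ksh feature, not plain POSIX sh:

sort file.csv | tee >(uniq -d > duplicates_only.txt) | uniq > unique.txt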