I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.
Some 920 individually uniq'd lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Since each list was only de-duplicated on its own, the merged list WILL contain duplicates across files.
I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni
awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt results in a file with a size of 4574766572 bytes.
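On a tiny made-up sample (not the real hash list), the awk one-liner behaves like this:

```shell
# awk '!seen[$0]++' prints the FIRST occurrence of every line and skips
# repeats -- i.e. it emits each distinct line once (a de-duplicated list),
# rather than extracting only the duplicated lines.
printf 'a\nb\na\nc\nb\na\n' > /tmp/sample.txt
awk '!seen[$0]++' /tmp/sample.txt
# prints:
# a
# b
# c
```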
I was told that a file that large is not possible and to try again.
sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt results in a file with a size of 1624577643 bytes. Significantly smaller.
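On the same toy sample, the sort | uniq -c pipeline looks like this:

```shell
# uniq -c prefixes every distinct line with its occurrence count; the grep
# then drops the count-1 (non-duplicated) lines. Note the surviving lines
# still carry the count column, which adds bytes to the output file.
printf 'a\nb\na\nc\nb\na\n' > /tmp/sample.txt
sort /tmp/sample.txt | uniq -c | grep -v '^ *1 '
# prints something like (count padding varies by platform):
#   3 a
#   2 b
```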
sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt results in a file with a size of 1416298458 bytes.
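And the uniq -d variant on the same sample:

```shell
# uniq -d prints each duplicated line exactly once, with no count column --
# one reason its output is smaller than the uniq -c version above.
printf 'a\nb\na\nc\nb\na\n' > /tmp/sample.txt
sort /tmp/sample.txt | uniq -d
# prints:
# a
# b
```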
I'm beginning to think I don't know what these commands do, since I expected the file sizes to be the same.
Again, the goal is to look through a giant list and save the hashes that appear more than once. Which (if any) of these results is correct? I thought they all did the same thing.
Have you compared _SORTEDC_duplicates.txt and _UNIQ_duplicates.txt by doing a diff or cmp? If it reports a difference, track the line that differs back to _uniq_combined.txt and the original files.
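A minimal sketch of that comparison, using hypothetical /tmp files rather than the real outputs:

```shell
# cmp reports the first differing byte; diff lists the differing lines.
# Both exit nonzero when the files differ, so || true keeps the sketch
# going under set -e.
printf 'a\nb\n' > /tmp/x.txt
printf 'a\nc\n' > /tmp/y.txt
cmp /tmp/x.txt /tmp/y.txt || true   # e.g. "differ: byte 3, line 2"
diff /tmp/x.txt /tmp/y.txt || true  # shows the "b" vs "c" line
```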