
I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.

Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.

I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni

awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt results in a file with a size of 4574766572 bytes.

I was told that a file that large is not possible and to try again.

sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt results in a file with a size of 1624577643 bytes. Significantly smaller.

sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt results in a file with a size of 1416298458 bytes.

I'm beginning to think I don't know what these commands do since the file sizes should be the same.

Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all did the same thing.
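Here is a minimal reproducer (toy data, made-up filename) showing that the three commands really do print different things:

```shell
# Toy input: "aaa" appears 3 times, "bbb" twice, "ccc" once
printf 'aaa\nbbb\naaa\nccc\naaa\nbbb\n' > sample.txt

# awk keeps the FIRST occurrence of each line -> the deduplicated list
awk '!seen[$0]++' sample.txt
# aaa
# bbb
# ccc

# uniq -c prefixes each line with a count (spacing varies by platform)
sort sample.txt | uniq -c | grep -v '^ *1 '
#    3 aaa
#    2 bbb

# uniq -d prints each repeated line once, with no count
sort sample.txt | uniq -d
# aaa
# bbb
```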

  • 1
Well, the last two should be different because the first contains the number of occurrences of each line and the other doesn't. Commented Aug 30, 2016 at 7:53
  • @MladenJablanović I'm not sure I'm following. They all sort or manipulate the original file - _uniq_combined.txt, no? Commented Aug 30, 2016 at 8:15
Did you try to get the difference between _SORTEDC_duplicates.txt and _UNIQ_duplicates.txt with diff or cmp? If it reports anything, track the differing lines back to the original files. Commented Aug 30, 2016 at 9:21
  • @oliv I'm not familiar with diff, what would that tell me? Commented Aug 30, 2016 at 9:33
@dsp_099 It would output the lines that differ between both files... Commented Aug 30, 2016 at 9:38

4 Answers


sort is designed to cope with huge files. You could do:

cat *.txt | sort >all_sorted 
uniq all_sorted >unique_sorted
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates

4 Comments

Not sure I understand that last line. Combine all, uniq the combined file, then find the difference between the two and pipe it out?
Also, all_duplicates comes out empty, which shouldn't be possible.
The -s means that lines common to the left and right argument are removed. Since the right argument is a subset of the left argument in sdiff, it means that the left column contains only the duplicate lines. The -l means that we only get the left column printed. Since each duplicate line may occur several times, the final uniq is used to compact it. If your final output is empty, you could investigate where this happens. For example, you could do a sdiff -d all_sorted unique_sorted | less -n first and see where the duplicates are.
sort *.txt > all_sorted would do. And I think you could use comm -2 for the last line.
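A comm-based variant of the same idea, sketched on made-up toy files (both inputs must be sorted; comm -23 keeps only the lines unique to the first file, i.e. the surplus copies of each duplicate):

```shell
# Toy sorted input with duplicates, plus its deduplicated counterpart
printf 'a\na\nb\nc\nc\nc\n' > all_sorted
uniq all_sorted > unique_sorted

# Lines only in all_sorted are the extra copies of each duplicated value;
# uniq collapses them to one line per duplicated hash
comm -23 all_sorted unique_sorted | uniq
# a
# c
```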

The sort command should work fine with a 12 GB file. And uniq will output just duplicated lines if you specify the -d or -D options. That is:

sort all_combined > all_sorted
uniq -d all_sorted > duplicates

or

uniq -D all_sorted > all_duplicates

The -d option displays one line for each duplicated element. So if "foo" occurs 12 times, it will display "foo" one time. -D prints all duplicates.

uniq --help will give you a bit more information.
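For example, on toy data (-D is in GNU uniq and newer BSD uniq, so check uniq --help on your system first):

```shell
printf 'foo\nfoo\nfoo\nbar\nbaz\nbaz\n' | sort > all_sorted

uniq -d all_sorted   # one line per duplicated value
# baz
# foo

uniq -D all_sorted   # every copy of every duplicated value
# baz
# baz
# foo
# foo
# foo
```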



Maybe you could split that big file into smaller files, sort each with sort --unique, and then merge them with sort --merge:

$ cat > test1
1
1
2
2
3
3
$ cat > test2
2
3
3
4
4
$ sort -m -u test1 test2
1
2
3
4

I would imagine merging already-sorted files would not need to happen entirely in memory?
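Applied to the actual goal (finding hashes shared between lists), a sketch of that idea might look like this; the file names are assumptions, and each per-list file is assumed to already be internally unique:

```shell
# Sort each list on its own (small memory footprint per run)
for f in list_*.txt; do sort -u "$f" > "$f.sorted"; done

# Merge the already-sorted files; sort -m only interleaves, so it
# streams instead of re-sorting 12 GB in memory. No -u here: we WANT
# the cross-list duplicates to survive the merge.
sort -m list_*.txt.sorted > merged_sorted

# Hashes that appear in more than one list
uniq -d merged_sorted > duplicates
```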

1 Comment

The massive file must contain all the smaller ones, because that's how the dupes make themselves known: lists 1, 2, 3, 99, and 132 might not share any dupes, but list 1 and list 920 might. Hence the merge. My question is why those commands produce vastly different results.

I think your awk script is incorrect: !seen[$0]++ prints each line the first time it is seen, so it deduplicates the input rather than extracting the duplicates. Your uniq -c command additionally includes the counts of occurrences, so sort _uniq_combined.txt | uniq -d is the correct one :) .

Note that you could have sorted directly with sort *.txt > sorted_hashes or sort *.txt -o sorted_hashes.

If you have just two files at hand, consider using comm (info coreutils to the rescue), which can give you columnar output of "lines just in the first file", "lines just in the second file", and "lines in both files". If you need just some of these columns, you can suppress the others with options to comm. Or use the generated output as a base and continue working on it using cut, like cut -f 1 my_three_column_file to get the first column.
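A quick illustration of comm's columns on two made-up sorted files:

```shell
printf 'a\nb\nc\n' > first
printf 'b\nc\nd\n' > second

comm first second      # col 1: only in first, col 2: only in second, col 3: both
comm -12 first second  # suppress cols 1 and 2 -> lines present in both files
# b
# c
```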

