
I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues.

Some 920 (uniq'd) lists were merged using cat *.txt > _uniq_combined.txt resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates.

I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni

awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt results in a file with a size of 4574766572 bytes.

I was told that a file that large is not possible and to try again.

sort _uniq_combined.txt | uniq -c | grep -v '^ *1 ' > _SORTEDC_duplicates.txt results in a file with a size of 1624577643 bytes. Significantly smaller.

sort _uniq_combined.txt | uniq -d > _UNIQ_duplicates.txt results in a file with a size of 1416298458 bytes.

I'm beginning to think I don't know what these commands do since the file sizes should be the same.

Again, the goal is to look through a giant list and save instances of hashes seen more than once. Which (if any) of these results is correct? I thought they all did the same thing.
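Here is a minimal reproducer (toy data, made-up filename) showing that the three commands really do print different things:

```shell
# Toy input: "aaa" appears 3 times, "bbb" twice, "ccc" once
printf 'aaa\nbbb\naaa\nccc\naaa\nbbb\n' > sample.txt

# awk keeps the FIRST occurrence of each line -> the deduplicated list
awk '!seen[$0]++' sample.txt
# aaa
# bbb
# ccc

# uniq -c prefixes each line with a count (spacing varies by platform)
sort sample.txt | uniq -c | grep -v '^ *1 '
#    3 aaa
#    2 bbb

# uniq -d prints each repeated line once, with no count
sort sample.txt | uniq -d
# aaa
# bbb
```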

  • 1
Well, the last two should be different because the first contains the number of occurrences of each line and the other doesn't. Commented Aug 30, 2016 at 7:53
  • @MladenJablanović I'm not sure I'm following. They all sort or manipulate the original file - _uniq_combined.txt, no? Commented Aug 30, 2016 at 8:15
Did you try to get the difference between _SORTEDC_duplicates.txt and _UNIQ_duplicates.txt with diff or cmp? If it reports anything, track the differing lines back to the original files. Commented Aug 30, 2016 at 9:21
  • @oliv I'm not familiar with diff, what would that tell me? Commented Aug 30, 2016 at 9:33
@dsp_099 It would output the lines that differ between both files... Commented Aug 30, 2016 at 9:38

4 Answers


sort is designed to cope with huge files. You could do:

cat *.txt | sort >all_sorted 
uniq all_sorted >unique_sorted
sdiff -sld all_sorted unique_sorted | uniq >all_duplicates

4 Comments

Not sure I understand that last line. Combine all, uniq the combined file, then find the difference between the two and pipe it out?
Also, all_duplicates comes out empty, which shouldn't be possible.
The -s means that lines common to the left and right argument are removed. Since the right argument is a subset of the left argument in sdiff, it means that the left column contains only the duplicate lines. The -l means that we only get the left column printed. Since each duplicate line may occur several times, the final uniq is used to compact it. If your final output is empty, you could investigate where this happens. For example, you could do a sdiff -d all_sorted unique_sorted | less -n first and see where the duplicates are.
sort *.txt > all_sorted would do. And I think you could use comm -2 for the last line.
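A comm-based variant of the same idea, sketched on made-up toy files (both inputs must be sorted; comm -23 keeps only the lines unique to the first file, i.e. the surplus copies of each duplicate):

```shell
# Toy sorted input with duplicates, plus its deduplicated counterpart
printf 'a\na\nb\nc\nc\nc\n' > all_sorted
uniq all_sorted > unique_sorted

# Lines only in all_sorted are the extra copies of each duplicated value;
# uniq collapses them to one line per duplicated hash
comm -23 all_sorted unique_sorted | uniq
# a
# c
```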

The sort command should work fine with a 12 GB file. And uniq will output just duplicated lines if you specify the -d or -D options. That is:

sort all_combined > all_sorted
uniq -d all_sorted > duplicates

or

uniq -D all_sorted > all_duplicates

The -d option displays one line for each duplicated element. So if "foo" occurs 12 times, it will display "foo" one time. -D prints all duplicates.

uniq --help will give you a bit more information.
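For example, on toy data (-D is in GNU uniq and newer BSD uniq, so check uniq --help on your system first):

```shell
printf 'foo\nfoo\nfoo\nbar\nbaz\nbaz\n' | sort > all_sorted

uniq -d all_sorted   # one line per duplicated value
# baz
# foo

uniq -D all_sorted   # every copy of every duplicated value
# baz
# baz
# foo
# foo
# foo
```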



Maybe you could split that big file into smaller files, sort each with sort --unique, and then merge them with sort --merge:

$ cat > test1
1
1
2
2
3
3
$ cat > test2
2
3
3
4
4
$ sort -m -u test1 test2
1
2
3
4

I would imagine merging already-sorted files would not need to happen entirely in memory?
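Applied to the actual goal (finding hashes shared between lists), a sketch of that idea might look like this; the file names are assumptions, and each per-list file is assumed to already be internally unique:

```shell
# Sort each list on its own (small memory footprint per run)
for f in list_*.txt; do sort -u "$f" > "$f.sorted"; done

# Merge the already-sorted files; sort -m only interleaves, so it
# streams instead of re-sorting 12 GB in memory. No -u here: we WANT
# the cross-list duplicates to survive the merge.
sort -m list_*.txt.sorted > merged_sorted

# Hashes that appear in more than one list
uniq -d merged_sorted > duplicates
```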

1 Comment

The massive file must contain all the smaller ones, because that's how the dupes make themselves known: lists 1, 2, 3, 99, and 132 might not share any dupes, but list 1 and list 920 might. Hence the merge. My question is why those commands produce vastly different results.

I think your awk script is incorrect: !seen[$0]++ prints each line the first time it is seen, so it deduplicates the input rather than extracting the duplicates. Your uniq -c command additionally includes the counts of occurrences, so sort _uniq_combined.txt | uniq -d is the correct one :) .

Note that you could have sorted directly with sort *.txt > sorted_hashes or sort *.txt -o sorted_hashes.

If you have just two files at hand, consider using comm (info coreutils to the rescue), which can give you columnar output of "lines just in the first file", "lines just in the second file", and "lines in both files". If you need just some of these columns, you can suppress the others with options to comm. Or use the generated output as a base and continue working on it using cut, like cut -f 1 my_three_column_file to get the first column.
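A quick illustration of comm's columns on two made-up sorted files:

```shell
printf 'a\nb\nc\n' > first
printf 'b\nc\nd\n' > second

comm first second      # col 1: only in first, col 2: only in second, col 3: both
comm -12 first second  # suppress cols 1 and 2 -> lines present in both files
# b
# c
```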

