
I have a 35GB file containing various strings, for example:

test1
test2
test1
test34!
test56
test56
test896&
test1
test4
etc
...

There are several billion lines.

I want to sort them and count occurrences, but after 2 days it still had not finished.

This is what I've used in bash:

cat file.txt | sort | uniq -c | sort -nr

Is there a more efficient way of doing it? Or is there a way to see progress, or would that just load my computer even more and make it slower?
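For reference, before changing the approach entirely, GNU sort itself has tuning flags that often make a large difference on big inputs. This is a sketch, assuming GNU coreutils sort; the buffer size, thread count, and temp directory below are illustrative values, not recommendations for this exact machine:

```shell
# Same pipeline, tuned for large inputs (assumes GNU sort).
# LC_ALL=C uses fast byte comparisons instead of locale-aware collation;
# -S sets the in-memory sort buffer, --parallel the number of sort threads,
# and -T points temporary spill files at a fast disk.
LC_ALL=C sort -S 4G --parallel=4 -T /tmp file.txt | uniq -c | sort -nr
```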

  • Do you have any estimate on the amount of duplicates in the file? You could maybe take a sample of the file with head -1000000 file | sort | uniq | wc -l If there are a lot of duplicates, just counting the lines with for example awk and not sorting them at first could be faster. Commented May 12, 2019 at 13:51
  • There are a lot of duplicates. How would I sort with awk? Would this be valid: awk ' {cnt[$1]++}END{for(k in cnt) print k,"- " cnt[k]}' file.txt | sort ? Commented May 12, 2019 at 14:02
  • How long does just the sort take? I assume this is gnu sort? Commented May 12, 2019 at 15:08
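The sampling idea from the first comment can be made concrete. A sketch, assuming the duplicate ratio in the first million lines is roughly representative of the whole file (which it may not be if the file is sorted or clustered):

```shell
# Estimate the duplicate ratio from a 1M-line sample.
# sort -u prints each distinct line once; a low unique count
# relative to the sample size means many duplicates.
total=1000000
unique=$(head -n "$total" file.txt | sort -u | wc -l)
echo "unique in sample: $unique of $total"
```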

1 Answer


If there are a lot of duplicates, i.e. if the unique lines fit in your available memory, you can count the lines and sort the counts using GNU awk:

$ awk '{
    a[$0]++                                # hash the lines and count
}
END {                                      # after counting the lines
    PROCINFO["sorted_in"]="@val_num_desc"  # used for traverse order 
    for(i in a)
        print a[i],i
}' file

Output for your sample data:

3 test1
2 test56
1 test34!
1 test2
1 test4
1 etc
1 test896&
1 ...

Related documentation: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
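Note that PROCINFO["sorted_in"] is gawk-specific. A roughly equivalent sketch for any POSIX awk is to count without sorting and let sort order the (much smaller) count output:

```shell
# Portable variant: count with plain awk, then sort only the
# distinct-line counts, which are far smaller than the input.
awk '{ a[$0]++ } END { for (i in a) print a[i], i }' file | sort -rn
```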

Update: Since the memory wasn't big enough (see comments), split the file on the first 0-2 characters of each line. The distribution will not be even:

$ awk '{
    ch=substr($0,match($0,/^.{0,2}/),RLENGTH)  # 0-2 first chars
    if(!(ch in a))                             # if not found in hash
        a[ch]=++i                              # hash it and give a unique number
    filename=a[ch]".txt"                       # which is used as filename
    print >> filename                          # append to filename
    close(filename)                            # close so you won't run out of fds
}' file

Output with your test data:

$ ls -l ?.txt
-rw-rw-r-- 1 james james 61 May 13 14:18 1.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 2.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 3.txt
$ cat 3.txt
...

300 MB and 1.5 M lines in 50 seconds. If I removed the close() it took only 5 seconds, but you risk running out of file descriptors. I guess you could raise the limit.
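To finish the job after splitting, each chunk can be counted independently and the results concatenated; since identical lines always land in the same chunk file, the per-chunk counts are final. A sketch (the numbered .txt names come from the split script above):

```shell
# Count each chunk separately (each should now fit in memory),
# then merge the counts into one descending list.
for f in [0-9]*.txt; do
    awk '{ a[$0]++ } END { for (i in a) print a[i], i }' "$f"
done | sort -rn > counts.txt
```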


11 Comments

I have 8GB RAM, would it then use swap (or virtual memory) in my HDD?
It will swap, sure. I'd love to hear how long it took with that solution.
It crashed after 12 hours. I am now instead doing a merge sort. First I want to split the file by 1000000 lines; I know to do `split -l 1000000 -a 3 file1 output`. I use -a 3 as it will create around 3000 files, so I need a longer suffix length. But how would I do it in alphabetical (including symbols) order, so it splits into smaller files by the first symbol or the first two symbols? I am thinking of using `grep "^$n" inputFile.txt > $n.txt`. How would I make it loop through each n?
That's too bad, sorry to hear. Did it run out of memory? The first two symbols might be a problem if they happen to be .. or /. or something. Let me think for a second.
This works perfectly so far. A very ingenious approach, and even less resource-hungry than running a grep per prefix. Thank you
