
I have a 35GB file containing various strings, for example:

test1
test2
test1
test34!
test56
test56
test896&
test1
test4
etc
...

There are several billion lines.

I want to sort them and count occurrences, but after 2 days it still had not finished.

This is what I've used in bash:

cat file.txt | sort | uniq -c | sort -nr

Is there a more efficient way of doing it? Or is there a way to see progress, or would that just load my computer even more and make it slower?
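For reference, before changing the approach entirely, GNU sort itself has tuning flags that often make a large difference on big inputs. This is a sketch, assuming GNU coreutils sort; the buffer size, thread count, and temp directory below are illustrative values, not recommendations for this exact machine:

```shell
# Same pipeline, tuned for large inputs (assumes GNU sort).
# LC_ALL=C uses fast byte comparisons instead of locale-aware collation;
# -S sets the in-memory sort buffer, --parallel the number of sort threads,
# and -T points temporary spill files at a fast disk.
LC_ALL=C sort -S 4G --parallel=4 -T /tmp file.txt | uniq -c | sort -nr
```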

  • Do you have any estimate on the amount of duplicates in the file? You could maybe take a sample of the file with head -1000000 file | sort | uniq | wc -l If there are a lot of duplicates, just counting the lines with for example awk and not sorting them at first could be faster. Commented May 12, 2019 at 13:51
  • There are a lot of duplicates. How would I sort with awk? Would this be valid: awk ' {cnt[$1]++}END{for(k in cnt) print k,"- " cnt[k]}' file.txt | sort ? Commented May 12, 2019 at 14:02
  • How long does just the sort take? I assume this is gnu sort? Commented May 12, 2019 at 15:08
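The sampling idea from the first comment can be made concrete. A sketch, assuming the duplicate ratio in the first million lines is roughly representative of the whole file (which it may not be if the file is sorted or clustered):

```shell
# Estimate the duplicate ratio from a 1M-line sample.
# sort -u prints each distinct line once; a low unique count
# relative to the sample size means many duplicates.
total=1000000
unique=$(head -n "$total" file.txt | sort -u | wc -l)
echo "unique in sample: $unique of $total"
```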

1 Answer


If there are a lot of duplicates, i.e. if the unique lines fit in your available memory, you can count the lines and sort the counts using GNU awk:

$ awk '{
    a[$0]++                                # hash the lines and count
}
END {                                      # after counting the lines
    PROCINFO["sorted_in"]="@val_num_desc"  # used for traverse order 
    for(i in a)
        print a[i],i
}' file

Output for your sample data:

3 test1
2 test56
1 test34!
1 test2
1 test4
1 etc
1 test896&
1 ...

Related documentation: https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
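Note that PROCINFO["sorted_in"] is gawk-specific. A roughly equivalent sketch for any POSIX awk is to count without sorting and let sort order the (much smaller) count output:

```shell
# Portable variant: count with plain awk, then sort only the
# distinct-line counts, which are far smaller than the input.
awk '{ a[$0]++ } END { for (i in a) print a[i], i }' file | sort -rn
```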

Update: Since the memory wasn't big enough (see comments), split the file on the first 0-2 characters of each line. The distribution will not be even:

$ awk '{
    ch=substr($0,match($0,/^.{0,2}/),RLENGTH)  # 0-2 first chars
    if(!(ch in a))                             # if not found in hash
        a[ch]=++i                              # hash it and give a unique number
    filename=a[ch]".txt"                       # which is used as filename
    print >> filename                          # append to filename
    close(filename)                            # close so you won't run out of fds
}' file

Output with your test data:

$ ls -l ?.txt
-rw-rw-r-- 1 james james 61 May 13 14:18 1.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 2.txt
-rw-rw-r-- 1 james james  4 May 13 14:18 3.txt
$ cat 3.txt
...

300 MB and 1.5 M lines in 50 seconds. If I removed the close() it took only 5 seconds, but you risk running out of file descriptors. I guess you could raise the limit.
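To finish the job after splitting, each chunk can be counted independently and the results concatenated; since identical lines always land in the same chunk file, the per-chunk counts are final. A sketch (the numbered .txt names come from the split script above):

```shell
# Count each chunk separately (each should now fit in memory),
# then merge the counts into one descending list.
for f in [0-9]*.txt; do
    awk '{ a[$0]++ } END { for (i in a) print a[i], i }' "$f"
done | sort -rn > counts.txt
```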


11 Comments

I have 8GB RAM, would it then use swap (or virtual memory) in my HDD?
It will swap, sure. I'd love to hear how long it took with that solution.
It crashed after 12 hours. I am now instead doing a merge sort. First I want to split the file by 1000000 lines; I know to do `split -l 1000000 -a 3 file1 output`. I use -a 3 as it will create around 3000 files, so I need a longer suffix length. But how would I do it in alphabetical (including symbols) order, so it splits into smaller files by the first symbol or the first two symbols? I am thinking of using `grep "^$n" inputFile.txt > $n.txt`. How would I make it loop through each n?
That's too bad, sorry to hear. Did it run out of memory? The first two symbols might be a problem if they happen to be .. or /. or something. Let me think for a second.
This works perfectly so far. A very ingenious approach, and even less resource-hungry than running a grep per prefix. Thank you
