
I have a large text file of ~8GB which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with SSD and 128GB RAM. I have tried

Method 1

awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted

Method 2

awk '...' myBigFile > myBigFile.tmp
sort --parallel=56 myBigFile.tmp > myBigFile.sorted

Surprisingly, method 1 takes 11.5 min while method 2 takes under 2 min (0.75 min for awk plus 1 min for sort). Why is sorting so much slower when reading from a pipe? Is it not parallelized?

EDIT

The awk step and myBigFile are not important; the experiment is repeatable simply by running seq 1 10000000 | sort --parallel=56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speedup with the un-piped version on my machine.
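For reference, a scaled-down version of both variants (reduced from 10 million to 1 million lines here; both produce identical output, only the speed differs):

```shell
# Piped: sort reads from stdin and cannot tell how big the input is.
seq 1 1000000 | sort --parallel=4 > piped.out

# File-based: sort can stat the input file before reading it.
seq 1 1000000 > tmp.txt
sort --parallel=4 tmp.txt > file.out

# Both variants must agree on the result.
cmp -s piped.out file.out && echo "outputs identical"
```

Prefixing each of the two sort invocations with `time` reproduces the speed gap at this smaller scale.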

  • Try adding -S8G to allocate an 8G buffer to the sort reading from the pipe and see if that helps. To measure consistently you might want to drop caches before each run (e.g. myBigFile should be cached after reading it the first time), although that will not explain such a big difference. Commented Apr 12, 2017 at 8:12
  • @JimD. This really helps! Now it is only slightly slower (134 seconds vs 105 seconds) than writing the intermediate file to SSD (but shouldn't it be faster...?) Commented Apr 12, 2017 at 9:18
  • When reading from a pipe, sort doesn't know how long the input is. When reading from a file it knows how big it is. Beyond this, the file is cached on your nice system with 128GB of memory, so the sort is not actually reading from disc, but from cache. So I think the difference is attributable to sort knowing the input is huge and effectively dividing the work efficiently, whereas the sort in the pipeline has to discover the input is huge (which it won't even do without the -S<huge>). Commented Apr 12, 2017 at 9:23
  • Also, sort can seek on the file, but not on the pipe. You'd have to take a look at the implementation to see if it takes advantage of that. Commented Apr 12, 2017 at 9:30
  • @JimD. Thanks for this detailed explanation. Could you kindly please make it an answer so I can accept it? Commented Apr 12, 2017 at 10:02

3 Answers


When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.

This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek in the file (neither is possible with a pipe). Further, since you have so much memory relative to these file sizes, the data for myBigFile.tmp need not be written to disc before awk exits, so sort can read the file from cache rather than disc. The principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot parallelize reading the input since it cannot seek in a pipe.
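A scaled-down sketch of the -S fix, assuming GNU sort (buffer size reduced from the suggested -S8G to -S512M to match the smaller input):

```shell
# Give the piped sort an explicit memory buffer so it can parallelize
# even though it cannot stat its stdin. The original suggestion was
# -S8G for the ~8GB file; -S512M is plenty for this demo.
seq 1 1000000 | sort -S512M --parallel=4 -n > sorted.txt
head -n 3 sorted.txt
```

The -n flag (numeric sort) is added here only so the demo's output order matches seq's; the buffer-size effect is the same either way.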


4 Comments

Do you need to specify both -S and --parallel or is sort smart enough to know to use the memory without the --parallel switch?
In my experience -S alone will do it; when reading from a pipe, --parallel alone wasn't enough. See, e.g., superuser.com/questions/938558/sort-parallel-isnt-parallelizing
I am a little confused by the statement "the data need not be written to disc before awk exits, and sort will be able to read the file from cache". Doesn't the file need to be written to disc for this to work?
@Cole No, Linux writes to the page cache and marks the pages as dirty. Dirty pages are eventually flushed to disc, but this is not necessary for a file read to hit a cached dirty page. See, e.g., oreilly.com/library/view/understanding-the-linux/0596005652/…

I think sort does not use threads when reading from a pipe.

  1. I have used this command for your first case. It shows that sort uses only 1 CPU even though it is told to use 4. atop also shows that there is only one thread in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 | sort --parallel 4  > bf.txt"
    
  2. I have used this command for your second case. It shows that sort uses 2 CPUs. atop also shows that there are four threads in sort:

    /usr/bin/time -v bash -c "seq 1 1000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
    

In your first scenario sort is an I/O-bound task: it does lots of read syscalls from stdin. In your second scenario sort uses mmap to read the file, which avoids being I/O bound.

Below are results for the first and second scenarios:

$ /usr/bin/time -v bash -c "seq 1 10000000 | sort --parallel 4  > bf.txt"
        Command being timed: "bash -c seq 1 10000000 | sort --parallel 4  > bf.txt"
        User time (seconds): 35.85
        System time (seconds): 0.84
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.43
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 9320
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2899
        Voluntary context switches: 1920
        Involuntary context switches: 1323
        Swaps: 0
        File system inputs: 0
        File system outputs: 459136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

$ /usr/bin/time -v bash -c "seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        Command being timed: "bash -c seq 1 10000000 > tmp.bf.txt && sort --parallel 4  tmp.bf.txt > bf.txt"
        User time (seconds): 43.03
        System time (seconds): 0.85
        Percent of CPU this job got: 175%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:24.97
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1018004
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2445
        Voluntary context switches: 299
        Involuntary context switches: 4387
        Swaps: 0
        File system inputs: 0
        File system outputs: 308160
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0



You have more system calls if you use the pipe.

seq 1000000 | strace sort --parallel=56 2>&1 >/dev/null | grep read | wc -l
2059

Without the pipe the file is mapped into memory.

seq 1000000 > input
strace sort --parallel=56 input 2>&1 >/dev/null | grep read | wc -l
33

Kernel calls are in most cases the bottleneck. That is why sendfile was invented.
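To see the whole syscall profile rather than just read calls, strace's -c summary mode can be used (a hypothetical extension of the experiment above):

```shell
seq 1000000 > input
# -c replaces the call-by-call trace with a per-syscall tally;
# -o sends that summary to a file so sort's own output can be discarded.
seq 1000000 | strace -c -o piped.sum sort --parallel=4 > /dev/null
strace -c -o file.sum sort --parallel=4 input > /dev/null
# Compare the two tallies: the piped run is dominated by read calls.
grep -wE 'read|mmap' piped.sum file.sum
```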

