I ran a shell pipeline under perf stat, using taskset 0x1 to pin the whole pipeline to a single CPU. I know taskset 0x1 had an effect, because it more than doubled the throughput of the pipeline. However, perf stat shows 0 context switches between the different processes of the pipeline.
So what exactly does perf stat mean by context switches?
I think I was interested in the number of context switches to/from the individual tasks in the pipeline. Is there a better way to measure that?
This was in the context of comparing dd bs=1M </dev/zero, to dd bs=1M </dev/zero | dd bs=1M >/dev/null. If I can measure context switches as desired, I assume that it would be useful in quantifying why the first version is several times more "efficient" than the second.
$ rpm -q perf
perf-4.15.0-300.fc27.x86_64
$ uname -r
4.15.17-300.fc27.x86_64
$ perf stat taskset 0x1 sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C18366+0 records in
18366+0 records out
19258146816 bytes (19 GB, 18 GiB) copied, 5.0566 s, 3.8 GB/s
Performance counter stats for 'taskset 0x1 sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':
5059.273255 task-clock:u (msec) # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
414 page-faults:u # 0.082 K/sec
36,915,934 cycles:u # 0.007 GHz
9,511,905 instructions:u # 0.26 insn per cycle
2,480,746 branches:u # 0.490 M/sec
188,295 branch-misses:u # 7.59% of all branches
5.061473119 seconds time elapsed
$ perf stat sh -c 'dd bs=1M </dev/zero | dd bs=1M >/dev/null'
^C6637+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.04907 s, 1.7 GB/s
6636+0 records in
6636+0 records out
6958350336 bytes (7.0 GB, 6.5 GiB) copied, 4.0492 s, 1.7 GB/s
sh: Interrupt
Performance counter stats for 'sh -c dd if=/dev/zero bs=1M | dd bs=1M of=/dev/null':
3560.269345 task-clock:u (msec) # 0.878 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
355 page-faults:u # 0.100 K/sec
32,302,387 cycles:u # 0.009 GHz
4,823,855 instructions:u # 0.15 insn per cycle
1,167,126 branches:u # 0.328 M/sec
88,982 branch-misses:u # 7.62% of all branches
4.052844128 seconds time elapsed