
What are the sources of latency in the process of sending/receiving TCP/UDP packets in Linux 2.6?

I want to know the latency sources in "ping-pong" latency tests.

There are some rather good papers on Ethernet latency, but they only cover latency sources in the wire and the switch (and rather cursorily, only for a specific switch).

What processing steps does a packet go through?

Papers with a deep latency analysis of ordinary ping (ICMP) would be useful too.

I'm relying on the community :)

2 Answers


Short answer: for exact latency measurements inside the kernel you should use perf probe and perf script.

More details: let's look at the following example. First we want to see which functions are used in a TCP ping-pong test (I'm using netperf).

On the receive flow: [flame graph of the receive-side call stack]

On the transmit flow: [flame graph of the transmit-side call stack]

So let's track some functions of the transmit flow (it is long, so I'll show only the major functions in the TCP flow). We can use perf probe to place a probe on the entry and return point of each function:

perf probe --add sock_sendmsg='sock_sendmsg'
perf probe --add sock_sendmsg_exit='sock_sendmsg%return'
perf probe --add inet_sendmsg='inet_sendmsg'
perf probe --add inet_sendmsg_exit='inet_sendmsg%return'
perf probe --add tcp_sendmsg_exit='tcp_sendmsg%return'
perf probe --add tcp_sendmsg='tcp_sendmsg'
perf probe --add tcp_sendmsg_locked='tcp_sendmsg_locked'
perf probe --add tcp_sendmsg_locked_exit='tcp_sendmsg_locked%return'
perf probe --add sk_stream_alloc_skb='sk_stream_alloc_skb'
perf probe --add sk_stream_alloc_skb_exit='sk_stream_alloc_skb%return'
perf probe --add tcp_push_exit='tcp_push%return'
perf probe --add tcp_push='tcp_push'
perf probe --add tcp_send_mss='tcp_send_mss'
perf probe --add tcp_send_mss_exit='tcp_send_mss%return'
perf probe --add __tcp_push_pending_frames='__tcp_push_pending_frames'
perf probe --add __tcp_push_pending_frames_exit='__tcp_push_pending_frames%return'
perf probe --add tcp_write_xmit_exit='tcp_write_xmit%return'
perf probe --add tcp_transmit_skb_exit='tcp_transmit_skb%return'
perf probe --add tcp_transmit_skb='tcp_transmit_skb'
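Before recording, it is worth checking that the probes were actually created; perf probe --list prints the dynamic probe events that are currently defined:

# Should list every probe added above (entry and %return variants)
perf probe --list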

Now we can record these:

perf record -e probe:* -aR taskset -c 7 netperf -t TCP_RR -l 5 -T 7,7

And run perf script for the latency report:

perf script -F time,event --ns

Output (1 iteration):

525987.403094082:                   probe:sock_sendmsg:
525987.403095586:                   probe:inet_sendmsg:
525987.403096192:                    probe:tcp_sendmsg:
525987.403098203:             probe:tcp_sendmsg_locked:
525987.403099296:                   probe:tcp_send_mss:
525987.403100002:              probe:tcp_send_mss_exit:
525987.403100740:            probe:sk_stream_alloc_skb:
525987.403101697:       probe:sk_stream_alloc_skb_exit:
525987.403103079:                       probe:tcp_push:
525987.403104284:      probe:__tcp_push_pending_frames:
525987.403105575:               probe:tcp_transmit_skb:
525987.403110178:               probe:tcp_transmit_skb:
525987.403111640:          probe:tcp_transmit_skb_exit:
525987.403112876:          probe:tcp_transmit_skb_exit:
525987.403114351:            probe:tcp_write_xmit_exit:
525987.403114768: probe:__tcp_push_pending_frames_exit:
525987.403115191:                  probe:tcp_push_exit:
525987.403115718:        probe:tcp_sendmsg_locked_exit:
525987.403117576:               probe:tcp_sendmsg_exit:
525987.403118082:              probe:inet_sendmsg_exit:
525987.403118568:              probe:sock_sendmsg_exit:

Now it is pretty easy to see where the latency is spent. For instance, between the sock_sendmsg() call and the inet_sendmsg() call there are 1504 nanoseconds (ns), or 1.504 microseconds (us), of latency. Also, we can see that sk_stream_alloc_skb() takes 957 ns. In total, the whole process (sock_sendmsg() entry to exit) takes ~24.5 us. Keep in mind that this is not the latency you would see in netperf, since the packet is physically transmitted somewhere in the middle of the flow.
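If you prefer not to compute the deltas by hand, a small pipeline over the same perf script output will print them (a rough sketch; the awk program is only an illustration and assumes the time/event field layout shown above):

# Print the time spent between consecutive probe events, in nanoseconds
perf script -F time,event --ns | awk '
{
    sub(/:$/, "", $1); split($1, ts, ".")   # "525987.403094082:" -> seconds, nanoseconds
    t = ts[1] * 1000000000 + ts[2]          # timestamp in ns
    if (prev) printf "%8d ns until %s\n", t - prev, $2
    prev = t
}'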

You can use the same method to track any piece of code in the kernel.
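For example, the same entry/%return pattern works on the receive path. The function names below are only an illustration (they vary between kernel versions), so first check with perf probe --funcs which symbols are probeable on your kernel:

# Verify the symbol exists and is probeable on this kernel
perf probe --funcs | grep -w tcp_v4_rcv

# Probe entry and return of the IPv4 TCP receive handler
perf probe --add tcp_v4_rcv='tcp_v4_rcv'
perf probe --add tcp_v4_rcv_exit='tcp_v4_rcv%return'

# Remove all dynamic probes when you are done
perf probe --del '*'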

Hope this helps.

P.S. This was done on kernel 4.14, not 2.6. perf itself only appeared in 2.6.31 (and perf probe a bit later), so on an older 2.6 kernel this may not work as well, or at all.


2 Comments

Thanks! It is OK to use modern kernels now. How did you get the flame graphs? The classic flame graph, as far as I know, uses a "sum": it is not a drawing of functions over an absolute timeline, but a drawing of time percentages over several executions, and it is used with perf in profiling mode. So, did you make a flame-graph-like drawing of the perf script output with an absolute timeline? And how will I see the latency of accessing the hardware in the graphs (the driver has to create a descriptor in the hw/dma queue and signal the new descriptor to the network controller over ports/PIO/MMIO)?
[1] These flame graphs were created using this (excellent) manual: brendangregg.com/flamegraphs.html. It is perf-based and requires recorded data in order to work. I always start a performance debug with them because they perfectly illustrate what is happening at runtime. The rest of the debug (perf probe, record and script) is done in the next iteration, not on the same data. [2] Regarding HW access, each driver has its own path. For the mlx5 driver, the TX path is here: elixir.free-electrons.com/linux/latest/source/drivers/net/…
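A minimal sketch of that perf-based flame graph workflow (assuming Brendan Gregg's FlameGraph scripts are cloned into ./FlameGraph; the duration and paths are just placeholders):

# Sample stack traces system-wide for 10 seconds, then fold them into a flame graph SVG
perf record -g -a -- sleep 10
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flame.svg

Note that, as discussed above, this aggregates samples into percentages; it does not show functions on an absolute timeline.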

Even though it's quite old, this guide details the Linux networking stack.

http://www.cs.unh.edu/cnrg/people/gherrin/linux-net.html

If you read through it, you should be able to spot the places where latency can be added by the kernel,

e.g. when a frame is buffered temporarily in the kernel.
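One such buffering point is the qdisc (traffic control) queue in front of the NIC. As a quick illustration (eth0 is a placeholder for your interface), the qdisc statistics show how much data is sitting in that kernel queue:

# "backlog" is the amount of data currently queued in the kernel for this device
tc -s qdisc show dev eth0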

