I'm running a multi-threaded Java server application that, among other things, receives UDP packets from 3 different multicast sources (ports) on 3 different threads.
It's running on a recent dual-socket Red Hat box (8 cores in total: 2 CPUs x 4 cores, no hyperthreading).
The "top" command shows CPU usage at 250-300%. Pressing Shift-H (per-thread view) shows 2 threads at around 99% usage and 1 at 70%. A quick jstack analysis shows that those threads correspond to my UDP handling threads.
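One way to do the thread match described above: with Shift-H, top lists threads by their native thread ID in decimal, while jstack prints the same ID in hexadecimal as nid=0x.... A minimal sketch of the conversion (the class name and the sample TID 12345 are hypothetical):

```java
/** Converts the decimal thread ID shown by "top -H" into the hex "nid" form printed by jstack. */
class TidToNid {
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // Hypothetical default TID; pass the real one from top as the first argument.
        long tid = args.length > 0 ? Long.parseLong(args[0]) : 12345L;
        System.out.println(toNid(tid)); // prints nid=0x3039 for 12345
    }
}
```

The printed string can then be searched for in the jstack dump to find the matching Java thread and its stack.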
I am a bit surprised by this level of CPU usage given the CPU speed and the UDP message rate (about 300 msg/s, with a payload of about 250 bytes), so I'm investigating it. Interestingly, the third thread (the one with lower CPU usage) also has a lower data rate (50-100 msg/s).
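To put the surprise in numbers, using only the figures quoted above: a thread at 99% of one core handling about 300 msg/s spends roughly 3.3 ms of CPU per message, about 800 times the ~4,000 ns handleTime reported later, while the payload rate is only about 75,000 bytes/s. A small sketch of that arithmetic (class and method names are illustrative):

```java
/** Back-of-the-envelope numbers for one UDP thread, using the rates quoted in the question. */
class CpuBudget {
    /** CPU nanoseconds consumed per message when a thread runs at the given fraction of a core. */
    static double cpuNanosPerMessage(double cpuFraction, double messagesPerSecond) {
        return cpuFraction * 1_000_000_000.0 / messagesPerSecond;
    }

    /** Payload throughput in bytes per second. */
    static double bytesPerSecond(double messagesPerSecond, int payloadBytes) {
        return messagesPerSecond * payloadBytes;
    }

    public static void main(String[] args) {
        double perMsg = cpuNanosPerMessage(0.99, 300); // about 3.3 ms of CPU per message
        double handleNanos = 4_000;                    // measured average handleTime
        System.out.printf("CPU per message: %.0f ns (%.0fx handleTime)%n", perMsg, perMsg / handleNanos);
        System.out.printf("Payload rate: %.0f bytes/s%n", bytesPerSecond(300, 250));
    }
}
```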
I've added some debug code to measure where most of the time is spent, and it appears to be in the receive() method of DatagramSocket:
    _running = true;
    _buf = new byte[300];
    _packet = new DatagramPacket(_buf, _buf.length);
    while (_running) {
        try {
            long t0 = System.nanoTime();
            _inSocket.receive(_packet);           // blocks until a datagram arrives
            long t1 = System.nanoTime();
            this.handle(_packet);
            long t2 = System.nanoTime();
            long waitingAndReceiveTime = t1 - t0; // time blocked + time receiving
            long handleTime = t2 - t1;            // time spent in handle()
            _logger.info("{} : {} : update : {} : {}", t1, _port, waitingAndReceiveTime, handleTime);
        } catch (Exception e) {
            _logger.error("Exception while receiving multicast packet", e);
        }
    }
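To experiment with this loop outside the server, here is a hypothetical stand-alone harness. Assumptions: it uses a unicast DatagramSocket on the loopback interface instead of the multicast sockets, a local sender thread emits 250-byte datagrams about every 3 ms (roughly the 300 msg/s of the busier sources), and timings go into an array instead of being logged per packet. It does not reproduce the NIC's interrupt behaviour, so it can only serve as a baseline.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

/** Hypothetical loopback version of the receive loop; not the original server code. */
class ReceiveLoopHarness {
    static int runOnce(int packetCount) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket in = new DatagramSocket(0, loopback);
             DatagramSocket out = new DatagramSocket()) {
            in.setSoTimeout(2_000); // avoid blocking forever if a datagram is lost
            int port = in.getLocalPort();

            // Sender thread: 250-byte payloads, one roughly every 3 ms.
            Thread sender = new Thread(() -> {
                byte[] payload = new byte[250];
                try {
                    for (int i = 0; i < packetCount; i++) {
                        out.send(new DatagramPacket(payload, payload.length, loopback, port));
                        Thread.sleep(3);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            sender.start();

            byte[] buf = new byte[300];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            long[] receiveNanos = new long[packetCount];
            int received = 0;
            try {
                while (received < packetCount) {
                    packet.setLength(buf.length); // allow a full-size datagram on each reuse
                    long t0 = System.nanoTime();
                    in.receive(packet);           // blocks until a datagram arrives
                    receiveNanos[received++] = System.nanoTime() - t0;
                }
            } catch (SocketTimeoutException e) {
                // stop early and report how many datagrams arrived
            }
            sender.join();

            long max = 0;
            for (int i = 0; i < received; i++) {
                max = Math.max(max, receiveNanos[i]);
            }
            System.out.printf("received %d packets, max waitingAndReceiveTime %d ns%n", received, max);
            return received;
        }
    }

    public static void main(String[] args) throws Exception {
        runOnce(50);
    }
}
```

Watching this process in top with Shift-H while it runs gives a rough reference for what the receive loop alone costs on the same box.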
handleTime averages about 4,000 ns, which is extremely fast and cannot be responsible for the CPU usage. waitingAndReceiveTime is much higher, ranging from around 30,000 ns to several ms. I understand that the method is blocking, so this time includes both the time spent blocking and the time spent receiving.
I have several questions:
- Am I right to suspect something is strange?
- Since receive() is blocking, I'd expect it not to "waste" CPU cycles, so the waiting part should not be responsible for the high CPU usage, right?
- Is there a way to separate the time spent blocking from the time spent actually receiving the datagram inside receive()?
- What could be responsible for this high CPU usage?
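For the second and third questions, one possible approach (an assumption, not something measured in the question) is to record the thread's CPU time next to System.nanoTime(). ThreadMXBean.getCurrentThreadCpuTime() returns the CPU time (user plus system) consumed by the calling thread, so time spent parked in a blocking receive() should add to the wall-clock delta but barely to the CPU delta. This sketch checks that on an idle loopback socket with a timeout; the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

/** Sketch: wall-clock time vs. the calling thread's CPU time around a blocking receive(). */
class BlockedVsCpuTime {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Returns {wallNanos, cpuNanos} for one receive() that times out after timeoutMillis. */
    static long[] measureIdleReceive(int timeoutMillis) throws Exception {
        if (!THREADS.isCurrentThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException("thread CPU time not supported by this JVM");
        }
        if (!THREADS.isThreadCpuTimeEnabled()) {
            THREADS.setThreadCpuTimeEnabled(true);
        }
        try (DatagramSocket socket = new DatagramSocket(0, InetAddress.getLoopbackAddress())) {
            socket.setSoTimeout(timeoutMillis);
            DatagramPacket packet = new DatagramPacket(new byte[300], 300);
            long wall0 = System.nanoTime();
            long cpu0 = THREADS.getCurrentThreadCpuTime();
            try {
                socket.receive(packet); // nobody sends, so this blocks until the timeout
            } catch (SocketTimeoutException expected) {
                // the timeout is the point of this measurement
            }
            long cpu1 = THREADS.getCurrentThreadCpuTime();
            long wall1 = System.nanoTime();
            return new long[] {wall1 - wall0, cpu1 - cpu0};
        }
    }

    public static void main(String[] args) throws Exception {
        long[] t = measureIdleReceive(200);
        System.out.printf("wall %d ns, cpu %d ns%n", t[0], t[1]);
    }
}
```

In the real loop, the same two getCurrentThreadCpuTime() calls could bracket _inSocket.receive(_packet); a CPU delta close to the wall-clock delta would suggest the thread is busy rather than blocked.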
EDIT: I experimented with the interrupt coalescing parameters, setting rx-usecs to 0 and rx-frames to 10. I now see the following:
- UDP messages do indeed arrive in groups of 10: in each group, the first message has a long waitingAndReceiveTime (>= 1 ms), and the following 9 have a much shorter one (~2,000 ns). handleTime is unchanged.
- CPU usage is reduced: it drops to about 55% for the first two threads.
I still have no idea how to solve this.
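The grouping above can be quantified from the existing log output. Here is a hypothetical helper that counts receive() calls whose waitingAndReceiveTime exceeds a threshold (assumed here to be 100 µs), treating each one as a thread wake-up. The sample array is made up to mimic the one-long-then-nine-short pattern described above; it is not real measurement data.

```java
/** Hypothetical helper: counts receive() calls that had to wait, i.e. thread wake-ups. */
class WakeupCounter {
    static final long BLOCKED_THRESHOLD_NANOS = 100_000; // assumed cut-off between "waited" and "already queued"

    static int countWakeups(long[] waitingAndReceiveNanos) {
        int wakeups = 0;
        for (long t : waitingAndReceiveNanos) {
            if (t >= BLOCKED_THRESHOLD_NANOS) {
                wakeups++;
            }
        }
        return wakeups;
    }

    public static void main(String[] args) {
        // Synthetic values: one long wait (>= 1 ms) followed by nine ~2 µs receives, twice.
        long[] sample = {1_200_000, 2_000, 2_100, 1_900, 2_000, 2_050, 1_950, 2_000, 2_000, 2_000,
                         1_500_000, 2_000, 2_000, 2_000, 2_000, 2_000, 2_000, 2_000, 2_000, 2_000};
        int wakeups = countWakeups(sample);
        // prints "20 packets, 2 wake-ups (10.0 packets per wake-up)"
        System.out.printf("%d packets, %d wake-ups (%.1f packets per wake-up)%n",
                sample.length, wakeups, (double) sample.length / wakeups);
    }
}
```

Comparing the packets-per-wake-up ratio before and after the coalescing change, alongside the CPU figures from top, would show whether fewer wake-ups track the lower CPU usage; on its own it does not prove the cause.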
What does this.handle do? receive() is blocking, so it shouldn't be possible for it to take the CPU up to 100%. I suspect that it's not Java code that's responsible for (all of) the usage, but native code (or some significant Java code that was left out of the question).