I have an application based on the Camunda engine embedded in WildFly (Camunda version 7.21, WildFly version 31). A BPMN process in it reads a lot of data using Camunda's HTTP connector, which internally uses Apache HTTP Client.

We are facing a problem where, after running for about a week, the pod gets restarted because of an OOM. After a lot of test runs and native memory tracking, we finally found that it is caused by direct memory usage in the Undertow byte buffer pool. This direct memory grows at a rate of about 300 MB per day, which eventually causes the OOM. Below is the native memory tracking diff after one day of running.
Native Memory Tracking:
(Omitting categories weighting less than 1MB)
Total: reserved=3383MB +221MB, committed=1330MB +669MB
- Java Heap (reserved=2152MB, committed=658MB +400MB)
(mmap: reserved=2152MB, committed=658MB +400MB)
- Class (reserved=5MB, committed=5MB)
(classes #28128 +393)
( instance classes #26459 +370, array classes #1669 +23)
(malloc=5MB #114630 +7518)
: ( Metadata)
( reserved=192MB, committed=182MB +4MB)
( used=179MB +4MB)
( waste=3MB =1.69% +1MB)
: ( Class space)
( reserved=208MB, committed=22MB)
( used=20MB)
( waste=2MB =10.09%)
- Thread (reserved=157MB -22MB, committed=13MB -9MB)
(thread #0)
(stack: reserved=157MB -22MB, committed=13MB -9MB)
- Code (reserved=249MB +1MB, committed=85MB +17MB)
(malloc=7MB +1MB #25344 +3034)
(mmap: reserved=242MB, committed=79MB +15MB)
- GC (reserved=124MB +3MB, committed=69MB +17MB)
(malloc=12MB +3MB #42350 +8980)
(mmap: reserved=112MB, committed=57MB +15MB)
- Compiler (reserved=1MB -1MB, committed=1MB -1MB)
(malloc=1MB -1MB #1944 -297)
- Internal (reserved=1MB, committed=1MB)
(malloc=1MB #33414 +3969)
- Other (reserved=246MB +241MB, committed=246MB +241MB)
(malloc=246MB +241MB #15742 +15318)
- Symbol (reserved=28MB, committed=28MB)
(malloc=25MB #706302 +10200)
(arena=3MB #1)
- Native Memory Tracking (reserved=16MB +1MB, committed=16MB +1MB)
(malloc=1MB #10560 +4262)
(tracking overhead=15MB +1MB)
- Arena Chunk (reserved=0MB -2MB, committed=0MB -2MB)
(malloc=0MB -2MB)
- Module (reserved=1MB, committed=1MB)
(malloc=1MB #6347 -24)
- Metaspace (reserved=194MB, committed=184MB +5MB)
(malloc=2MB #1544 +88)
(mmap: reserved=192MB, committed=182MB +4MB)
- Unknown (reserved=208MB, committed=22MB)
(mmap: reserved=208MB, committed=22MB)
You can see that the "Other" section has increased by 241 MB. Everything else looks fine; the heap is moving up and down as expected. I don't know if there is a memory leak in the Undertow byte buffer pool or something else.
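For completeness, a diff like the one above comes from jcmd when the JVM runs with -XX:NativeMemoryTracking=summary; the PID placeholder below is illustrative:

```shell
# Record a baseline, then print the delta against it later
jcmd <java-pid> VM.native_memory baseline
# ... after a day of load ...
jcmd <java-pid> VM.native_memory summary.diff
```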
The first solution we tried was configuring the following parameters for the Undertow byte buffer pool:
/subsystem=undertow/byte-buffer-pool=default:write-attribute(name=buffer-size,value=16384)
/subsystem=undertow/byte-buffer-pool=default:write-attribute(name=leak-detection-percent,value=70)
/subsystem=undertow/byte-buffer-pool=default:write-attribute(name=max-pool-size,value=20)
Nothing changed with the above settings.
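For what it's worth, the applied pool configuration can be read back with the JBoss CLI to rule out the settings silently not taking effect (standard WildFly CLI layout assumed):

```shell
# Read back the effective byte-buffer-pool configuration (server must be running)
$JBOSS_HOME/bin/jboss-cli.sh --connect \
  --command="/subsystem=undertow/byte-buffer-pool=default:read-resource"
```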
The second solution we tried was limiting direct memory usage with the JVM settings below, but then it starts failing with an inability to allocate direct memory:
-XX:MaxDirectMemorySize=256M -Djdk.nio.maxCachedBufferSize=16384
Caused by: java.lang.OutOfMemoryError: Cannot reserve 16384 bytes of direct buffer memory (allocated: 805298186, limit: 805306368)
at java.base/java.nio.Bits.reserveMemory(Unknown Source)
at java.base/java.nio.DirectByteBuffer.<init>(Unknown Source)
at java.base/java.nio.ByteBuffer.allocateDirect(Unknown Source)
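One way to see whether this growth really is in JDK-tracked direct buffers (the memory that -XX:MaxDirectMemorySize caps, which NMT reports under "Other") rather than some other native allocation is to poll the platform buffer pool MXBeans. A minimal sketch using the standard java.lang.management API; nothing here is specific to Undertow or Camunda:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectBufferProbe {
    public static void main(String[] args) {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            // "direct" tracks ByteBuffer.allocateDirect allocations (what
            // MaxDirectMemorySize limits); "mapped" tracks memory-mapped files.
            System.out.printf("%s pool: count=%d, used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

If the "direct" pool's used bytes climb in step with the "Other" NMT category, the leak is in code holding on to direct ByteBuffers; the count also shows whether it is many small leaked buffers or a few large ones.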
The third thing we tried was disabling direct memory usage in the Undertow byte buffer pool with the command below:
/subsystem=undertow/byte-buffer-pool=default:write-attribute(name=direct,value=false)
After disabling direct buffers, the resident memory (RSS) is now increasing continuously instead.
top - 05:46:29 up 101 days, 18:56, 0 users, load average: 1.53, 0.99, 0.96
Tasks: 15 total, 1 running, 14 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.6 us, 0.6 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.1 hi, 0.2 si, 0.2 st
MiB Mem : 64292.6 total, 3543.6 free, 11899.7 used, 48849.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 50339.2 avail Mem
  PID USER     PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1596 501      20   0 3367.7m   2.7g  33.7m S  0.7  4.2 491:38.37 java
    1 501      20   0   11.8m   3.0m   2.6m S  0.0  0.0   0:00.03 entrypoint.sh
   48 501      20   0   12.3m   3.6m   2.6m S  0.0  0.0   0:00.02 docker.wildfly.
 1392 501      20   0   12.3m   3.5m   2.6m S  0.0  0.0   0:00.00 service.sh
 1402 501      20   0   12.3m   3.5m   2.5m S  0.0  0.0   0:00.00 standalone.sh
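Note that the 48 GB buff/cache figure in top is kernel page cache for the whole host, not memory held by the java process; the per-process number to watch is RES (2.7g here). A small sketch for sampling just one process's RSS from /proc (shown against the shell's own PID; substitute the java PID in practice):

```shell
# Sample the resident set size of a process from /proc (value is in kB).
# /proc/$$/status is this shell's own entry; use the java PID instead.
rss_kb=$(awk '/^VmRSS/ {print $2}' /proc/$$/status)
echo "RSS: ${rss_kb} kB"
```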
Now I'm clueless about how to solve this problem.