Interpret ElasticSearch Out of Memory error

Question

First this is a two-node cluster each with "-Xms256m -Xmx1g -Xss256k" (which is really bad given the machine has 8G".

[2015-04-07 16:19:58,235][INFO ][monitor.jvm              ] [NODE1] [gc][ParNew][3246454][64605] duration [822ms], collections [1]/[4.3s], total [822ms]/[21m], memory [966.1mb]->[766.9mb]/[990.7mb], all_pools {[Code Cache] [13.1mb]->[13.1mb]/[48mb]}{[Par Eden Space] [266.2mb]->[75.6mb]/[266.2mb]}{[Par Survivor Space] [8.9mb]->[0b]/[33.2mb]}{[CMS Old Gen] [690.8mb]->[691.2mb]/[691.2mb]}{[CMS Perm Gen] [33.6mb]->[33.6mb]/[82mb]}
[2015-04-07 16:28:02,550][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0x03d14f1c, /10.0.6.100:36055 => /10.0.6.105:9300]]
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.elasticsearch.search.controller.ScoreDocQueue.<init>(ScoreDocQueue.java:32)
....
[2015-04-07 21:55:54,743][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0xeea0018c, /10.0.6.100:36059 => /10.0.6.105:9300]]
java.lang.OutOfMemoryError: Java heap space
[2015-04-07 21:59:26,774][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0x576557fa, /10.0.6.100:36054 => /10.0.6.105:9300]]
...
[2015-04-07 22:51:05,890][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0x67f11ffe, /10.0.6.100:36052 => /10.0.
6.105:9300]]
org.elasticsearch.common.netty.handler.codec.frame.TooLongFrameException: transport content length received [1.5gb] exceeded [891.6mb]
[2015-04-07 22:51:05,973][WARN ][cluster.action.shard     ] [NODE1] sending failed shard for [test_index][15], nod
e[xvpLmlJkRSmZNj-pa_xUNA], [P], s[STARTED], reason [engine failure, message [OutOfMemoryError[Java heap space]]]

Then after rejoin (I restarted node 105)

[2015-04-07 22:59:11,095][INFO ][cluster.service          ] [NODE1] removed {[NODE2][GMBDo5K7RMGSgiIwZE7H8w][inet[/10.0.6.105:9300]],}, reason: zen-disco-node_failed([NODE7][GMBDo5K7RMGSgiIwZE7H8w][inet[/10.0.6.105:9300]]), reason transport disconnected (with verified connect)
[2015-04-07 22:59:30,954][INFO ][cluster.service          ] [NODE1] added {[NODE2][mMWcFGhVQY-aBR2r9DO3_A][inet[/10.0.6.105:9300]],}, reason: zen-disco-receive(join from node[[NODE7][mMWcFGhVQY-aBR2r9DO3_A][inet[/10.0.6.105:9300]]])
[2015-04-07 23:11:39,717][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0x14a605ce, /10.0.6.100:36201 => /10.0.6.105:9300]]
java.lang.OutOfMemoryError: Java heap space
[2015-04-07 23:16:04,963][WARN ][transport.netty          ] [NODE1] exception caught on netty layer [[id: 0x5a6d934d, /10.0.6.100:36196 => /10.0.6.105:9300]]
java.lang.OutOfMemoryError: Java heap space

So I don't know how to interpret the ">" part. Who actually went out of memory? NODE 1 (10.0.6.100)? Why port 9300? My API originally talks to NODE1, so in this case did it mean NODE1 sending a bulk data request to NODE2? Here is what happened the next day

From NODE1 log:

[2015-04-08 09:02:46,410][INFO ][cluster.service          ] [NODE1] removed {[NODE2][mMWcFGhVQY-aBR2r9DO3_A][inet[/10.0.6.105:9300]],}, reason: zen-disco-node_failed([NODE2][mMWcFGhVQY-aBR2r9DO3_A][inet[/10.0.6.105:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-04-08 09:03:27,554][WARN ][search.action            ] [NODE1] Failed to send release search context
org.elasticsearch.transport.NodeDisconnectedException: [NODE2][inet[/10.0.6.105:9300]][search/freeContext] disconnected
....
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [NODE2][inet[/10.0.6.105:9300]] Node not connected

But on NODE2 log, only a few lines from 04-08, but something like this:

[2015-04-08 09:09:13,797][INFO ][discovery.zen            ] [NODE2] master_left [[NDOE1][xvpLmlJkRSmZNj-pa_xUNA][inet[/10.0.6.100:9300]]], reason [do not exists on master, act as master failure]

So who exactly failed? I am confused here :| sorry. Any help is appreciated. I know NODE1 has a very very long GC (MarkSweep was 3hours+ until a full restart of my two-node cluster last night).

@eliasah Hi I understand this is a huge mistake on our side, but the log is not really clear to me (like the arrow part). By the way, I appreciate your help! Will you be able to help us interpret these errors so I can better understand which one failed first. Thanks and sorry for asking too much. — CppLearner
– CppLearner, Commented Apr 10, 2015 at 15:29

eliasah · Accepted Answer · 2015-04-10 15:49:44Z

The first part of your log concerns Elasticsearch Garbage Collection logging format

[2015-04-07 16:19:58,235][INFO][monitor.jvm][NODE1]

garbage collection run
```
[gc] 
```
new parallel garbage collector
```
[ParNew]
```
GC took 822ms
```
duration [822ms], 
```
one collection run, with a total of 4.3 seconds
```
collections [1]/[4.3s]
```
usage number of pool 'memory', it was previously 966.1mb, now is 766.9mb, with a total pool size of 990.7mb
```
memory [966.1mb]->[766.9mb]/[990.7mb], 
```
usage numbers for pool 'code cache'
```
[Code Cache] [13.1mb]->[13.1mb]/[48mb]
```

usage numbers for pool 'Par Eden Space'

[Par Eden Space] [266.2mb]->[75.6mb]/[266.2mb]

usage numbers for pool 'Par Survivor Space'

[Par Survivor Space] [8.9mb]->[0b]/[33.2mb]

usage numbers for pool 'CMS Old Gen'

[CMS Old Gen] [690.8mb]->[691.2mb]/[691.2mb]

usage numbers for pool 'CMS Perm Gen'

[CMS Perm Gen] [33.6mb]->[33.6mb]/[82mb]

And if you have noticed that your memory pool is almost 1G. I hope that this gives you a hint!

Collectives™ on Stack Overflow

Interpret ElasticSearch Out of Memory error

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related