
I am running a Spark Streaming application on a cluster with 3 worker nodes. Once in a while, jobs fail with the following exception:

Job aborted due to stage failure: Task 0 in stage 4508517.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4508517.0 (TID 1376191, 172.31.47.126): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more  

I am submitting the job in client mode without any special parameters. Both the master and the workers have 15 GB of memory. The Spark version is 1.4.0.

Is this solvable by tuning configuration?
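Since the OOM comes from ByteBuffer.allocateDirect inside netty's pooled allocator, one configuration angle I could try is steering Spark's netty transport away from direct buffers and capping direct memory explicitly. The sketch below is my own illustration, not something I have verified against this failure: the config keys and the JVM flag exist, but the values (and the app name and batch interval) are placeholder assumptions.

    // Hedged sketch: prefer heap buffers in the netty transport and cap direct memory
    // on the executors. The keys exist in Spark 1.4; the values are assumptions to tune.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-app")                        // hypothetical app name
      .set("spark.shuffle.io.preferDirectBufs", "false")     // ask netty to prefer heap buffers
      .set("spark.executor.memory", "10g")                   // leave headroom out of the 15 GB for off-heap use
      .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g") // hard cap on direct buffers

    val ssc = new StreamingContext(conf, Seconds(10))        // batch interval is an assumption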

2 Comments
  • One thing worth pointing out is that we use a lot of DStream.cache in our code. Commented Oct 14, 2015 at 1:59
  • > Is this solvable by tuning configuration? You should have tried that already; see --executor-memory and --driver-memory. Don't forget to drop DStreams that are of no use to you anymore, with DStream.unpersist (a minimal sketch follows after these comments). Commented Oct 14, 2015 at 5:08

1 Answer


I'm facing the same problem and found out that it's probably caused by a memory leak in netty version 4.0.23.Final, which is used by Spark 1.4 (see https://github.com/netty/netty/issues/3837).

It is solved at least in Spark 1.5.0 (see https://issues.apache.org/jira/browse/SPARK-8101), which uses netty 4.0.29.Final.

So an upgrade to the latest Spark version should solve the problem. I will try it in the next few days.

Additionally, the current version of Spark Jobserver forces netty 4.0.23.Final, so it needs a fix too.
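Until Jobserver itself is fixed, one workaround to try is forcing the patched netty release from the build. The sketch below assumes an sbt build and that the offending artifact is io.netty:netty-all; treat it as an untested suggestion rather than a confirmed fix, especially given the edit below.

    // build.sbt sketch (untested assumption): override the transitively pulled-in
    // netty 4.0.23.Final with the release that contains the leak fix.
    dependencyOverrides += "io.netty" % "netty-all" % "4.0.29.Final"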

EDIT: I upgraded to Spark 1.6 with netty 4.0.29.Final, but am still getting a direct buffer OOM when using Spark Jobserver.


1 Comment

Using Spark 1.6.2, I still get this error. I am sure it's not about executor memory, but I have no idea how to fix it.
