Every once in a while, a TorchServe worker dies with the following message: io.netty.handler.codec.CorruptedFrameException: Message size exceed limit: 16. When I rerun the request in question, it completes without problems. Interestingly, this error only occurs on the similarity models, which have batching enabled (see config.properties below).

I have tried to solve the issue by increasing the response and request size limits in config.properties, but this hasn't helped:

async_logging=true
max_response_size=1000000000
max_request_size=1000000000

Additional information:

I am using the Docker image pytorch/torchserve:0.7.1-gpu to run a few BERT models on GPU.

Here is the full TS config.properties:

load_models=all
async_logging=true
max_response_size=1000000000
max_request_size=1000000000

models={\
  "text_similarity_v1": {\
    "1.0": {\
        "defaultVersion": true,\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 16,\
        "maxBatchDelay": 100\
    }\
  },\
  "text_similarity_v2": {\
    "1.0": {\
        "defaultVersion": true,\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 16,\
        "maxBatchDelay": 100\
    }\
  },\
  "price": {\
    "1.5": {\
        "defaultVersion": true,\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 1,\
        "maxBatchDelay": 100\
    }\
  },\
  "categorization": {\
    "1.5": {\
        "defaultVersion": true,\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 1,\
        "maxBatchDelay": 100\
    }\
  },\
  "text_quality": {\
    "1.4": {\
        "defaultVersion": true,\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 1,\
        "maxBatchDelay": 100\
    }\
  }\
}
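
For context on how batching is triggered: with batchSize 16 and maxBatchDelay 100, TorchServe aggregates up to 16 requests arriving within the 100 ms window into a single call to the handler. A minimal client sketch that exercises this (the endpoint name and payload are placeholders, not our real schema):

import concurrent.futures

import requests

# Placeholder endpoint; our real payload schema differs.
URL = "http://localhost:8080/predictions/text_similarity_v1"

def predict(payload):
    resp = requests.post(URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Fire 16 concurrent requests so the frontend can form one full batch.
payloads = [{"text_a": "example a", "text_b": "example b"}] * 16

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(predict, payloads))

print(len(results), "responses received")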

And here is the full TorchServe error message:

2023-04-27T07:58:56,360 [ERROR] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - Unknown exception
io.netty.handler.codec.CorruptedFrameException: Message size exceed limit: 16
Consider increasing the 'max_response_size' in 'config.properties' to fix.
at org.pytorch.serve.util.codec.CodecUtils.readLength(CodecUtils.java:24) ~[model-server.jar:?]
at org.pytorch.serve.util.codec.CodecUtils.readMap(CodecUtils.java:54) ~[model-server.jar:?]
at org.pytorch.serve.util.codec.ModelResponseDecoder.decode(ModelResponseDecoder.java:73) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[model-server.jar:?]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) ~[model-server.jar:?]
at io.netty.channel.epoll.EpollDomainSocketChannel$EpollDomainUnsafe.epollInReady(EpollDomainSocketChannel.java:138) ~[model-server.jar:?]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475) ~[model-server.jar:?]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[model-server.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[model-server.jar:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[model-server.jar:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[model-server.jar:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2023-04-27T07:58:56,361 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 9003 Worker disconnected. WORKER_MODEL_LOADED
2023-04-27T07:58:56,361 [ERROR] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - Unknown exception
io.netty.handler.codec.CorruptedFrameException: Message size exceed limit: 16
Consider increasing the 'max_response_size' in 'config.properties' to fix.
at org.pytorch.serve.util.codec.CodecUtils.readLength(CodecUtils.java:24) ~[model-server.jar:?]
at org.pytorch.serve.util.codec.CodecUtils.readMap(CodecUtils.java:54) ~[model-server.jar:?]
at org.pytorch.serve.util.codec.ModelResponseDecoder.decode(ModelResponseDecoder.java:73) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:404) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:371) ~[model-server.jar:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[model-server.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[model-server.jar:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[model-server.jar:?]
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:819) ~[model-server.jar:?]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[model-server.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[model-server.jar:?]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) ~[model-server.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[model-server.jar:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[model-server.jar:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[model-server.jar:?]
at java.lang.Thread.run(Thread.java:833) [?:?]

UPDATE: We narrowed the problem down to batching. We are currently running all predictions with batchSize: 1, and workers are no longer dying. I will keep you posted if we figure out how to fix the issue with batching enabled.
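
In case it matters for diagnosis: as we understand TorchServe's batch contract, the handler receives a list of up to batchSize requests and must return exactly one result per request, in the same order. A minimal sketch of that shape (the scoring call is a placeholder, not our actual handler):

# Sketch of a batch-aware custom handler; scoring is a placeholder.
from ts.torch_handler.base_handler import BaseHandler


class SimilarityHandler(BaseHandler):
    def preprocess(self, data):
        # `data` holds one dict per batched request.
        return [row.get("data") or row.get("body") for row in data]

    def inference(self, batch, *args, **kwargs):
        # Placeholder: a real handler would run the model over the whole batch.
        return [0.0 for _ in batch]

    def postprocess(self, outputs):
        # Batch contract: exactly one entry per request, in the original order.
        return [{"score": float(score)} for score in outputs]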
