I have this docker image:

# syntax = docker/dockerfile:1.2

FROM continuumio/miniconda3

# install os dependencies
RUN mkdir -p /usr/share/man/man1
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
    ca-certificates \
    curl \
    python3-pip \
    vim \
    sudo \
    default-jre \
    git \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# install python dependencies
RUN pip install openmim
RUN pip install torch
RUN mim install mmcv-full==1.7.0
RUN pip install mmpose==0.29.0
RUN pip install mmdet==2.27.0
RUN pip install torchserve

# prep torchserve
RUN mkdir -p /home/torchserve/model-store
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_detector.mar -P /home/torchserve/model-store/
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_pose_estimator.mar -P /home/torchserve/model-store/
COPY config.properties /home/torchserve/config.properties

# print the contents of /model-store
RUN ls /home/torchserve/model-store

# starting command
CMD /opt/conda/bin/torchserve --start --ts-config /home/torchserve/config.properties && sleep infinity

and in the same folder I have the following config.properties:

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000

It works perfectly fine locally, but when I push it to Google Cloud Run the following error occurs and the models do not run properly (even though /ping returns healthy). Here is the error:

org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.

at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:177)
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:339)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:829)

What is the issue?

Here are the startup logs. I'm not sure what to look for, and I hope this isn't too small a snippet to look through.

  • 1) The CLI ping does not test your app; it pings the Google Front End (GFE) only, and the GFE will always respond. 2) Cloud Run only supports one HTTP listening port (default is 8080). 3) Unless you adjust the CPU settings, Cloud Run does not provide CPU time to background threads. 4) Add the Cloud Run container startup logs to your post. Commented Jun 27, 2023 at 18:31
  • Please read this guide. For Google Cloud logs, export the logs as text and then post the text in your question. Also, it is up to you to review the logs and only post the parts that are relevant to your problem. Posting everything usually means few will take the time to look at them. Commented Jun 27, 2023 at 19:28
  • 8080 is correct since that is the inference endpoint; I am not interested in the other endpoints. The ping was to the Docker container, and it returned the following JSON (it returned healthy on my local machine): { "status": "Unhealthy" } Commented Jun 27, 2023 at 19:28
  • OK, I will post the correct logs; I did not know how to export them as text. Thanks. How do you export the logs as text? Commented Jun 27, 2023 at 19:29
  • Ping does not return JSON. Please be specific about how you are testing (see the curl sketch after these comments). Status unhealthy often means that your Cloud Run container does not have a process responding to HTTP requests on the configured TCP PORT. Commented Jun 27, 2023 at 19:30
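
One way to make the health check unambiguous is to hit TorchServe's own /ping endpoint on the inference port directly, both locally and against the deployed service. A minimal sketch (the Cloud Run URL is a placeholder, and this assumes port 8080 is the port Cloud Run exposes):

# locally, against the running container
curl http://localhost:8080/ping
# TorchServe reports {"status": "Healthy"} once its workers are up

# against the deployed Cloud Run service (URL is a placeholder)
curl https://YOUR-SERVICE-xxxxxx-uc.a.run.app/ping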

1 Answer

I observed the same issue on GCP Cloud Run: the container worked locally, but I got the same backend worker startup timeout error on Cloud Run.

I increased the memory for the Cloud Run instance, reduced the number of worker threads, and increased the startup timeout value:

default_startup_timeout=600
default_workers_per_model=2

These parameters can be set in config.properties: https://pytorch.org/serve/configuration.html#other-properties
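
For reference, a config.properties combining the question's original settings with these two properties might look like the sketch below; the specific numbers are illustrative and should be tuned to the model and instance size:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000
# give the slow-loading mmpose/mmdet workers more time to start (seconds)
default_startup_timeout=600
# fewer workers per model keeps memory usage within the instance limit
default_workers_per_model=2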

These changes fixed the issue, so I think it was memory related.
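
On the Cloud Run side, the memory (and CPU) increase can be applied at deploy time. A minimal sketch, assuming a service named torchserve-service and an already-pushed image (both names are placeholders):

gcloud run deploy torchserve-service \
  --image gcr.io/PROJECT_ID/torchserve-animated-drawings \
  --port 8080 \
  --memory 4Gi \
  --cpu 2 \
  --timeout 600 \
  --no-cpu-throttling

The --no-cpu-throttling flag corresponds to the "CPU always allocated" setting mentioned in the comments, which matters here because torchserve --start returns immediately and the model workers run as background processes.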
