
I have some Python 3 code that processes a JSON file and runs some neural networks and random forests. I put the code into a Docker container, but noticed that these ML tasks run faster without Docker than with it. In Docker, I'm using Flask to load the JSON file and run the code. Of course, I used identical versions of the Python modules locally and inside Docker; these are:

  • theano 0.8.2
  • keras 2.0.5
  • scikit-learn 0.19.0

Also, Flask is

  • 0.12

At first, I thought Theano might use different resources with vs. without Docker, but it runs on a single CPU and a single thread in both cases. It's also not using my GPU. I realized it probably isn't Theano when I noticed my random forest also runs slower in Docker. Here are a number of tests I performed (I ran each test several times; I'm reporting mean timings, as these were stable):

Without Docker, without Flask:

  • Task 1 (theano + keras code) : 1.0s
  • Task 2 (theano + keras code) : 0.7s
  • Task 3 (scikit-learn code) : 0.25s

Docker (cpus=1) + Flask (debug mode = True):

  • T1: 6.5s
  • T2: 2.2s
  • T3: 0.58s

Docker (cpus=2) + Flask (debug mode = True):

  • T1: 5.5s
  • T2: 1.4s
  • T3: 0.55s

Docker (cpus=2) + Flask (debug mode = False):

  • T1: 4.5s
  • T2: 1.2s
  • T3: 0.5s

Docker (cpus=2) (No Flask, just calling the json file as done locally):

  • T1: 2.8s
  • T2: 1.1s
  • T3: 0.5s

Flask (debug mode = True) (no Docker container):

  • T1: 2.8s
  • T2: 1.5s
  • T3: 0.2s

I guess cpus=1 vs. cpus=2 just allocates more of one CPU to the code, with the second CPU taking over some other work. Clearly there is some reduction in time when Flask or Docker is not used, but I still can't reach the speed I get without Docker and without Flask. Does anyone have a guess why this is happening?

This is a minimal chunk of code showing how we use Flask to run the app:

from flask import Flask, request, jsonify

api = Flask(__name__)
pipeline = Pipeline()  # private class calling multiple tasks

@api.route("/", methods=['POST'])
def entry():
    data = request.get_json(force=True)
    data = pipeline.process(data)
    # This calls the different tasks which are timed
    return jsonify(data)

if __name__ == "__main__":
    api.run(debug=True, host='0.0.0.0', threaded=False)
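To localize where the overhead comes from, it may help to time each task separately inside the handler rather than the request end-to-end. A minimal sketch (the `timed` helper and the `task3` stand-in are hypothetical; in the real app the timed calls would be the `Pipeline` tasks):

```python
import time

def timed(label, fn, *args):
    """Run fn once and print its wall-clock time, to localize overhead."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Hypothetical stand-in for one of the timed tasks; in the real app this
# would be a call made by pipeline.process.
def task3(data):
    return sorted(data)

result = timed("T3", task3, [3, 1, 2])
print(result)  # [1, 2, 3]
```

Comparing these per-task numbers across the Docker/Flask combinations would show whether the slowdown is in the ML code itself or in the request handling around it.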

PS: Pardon me if the question is lacking anything; this is my first Stack Overflow question.

  • 1. Docker always brings some overhead; most of the time apps will be at least 5-10% slower in Docker. 2. A 2-3x difference between dockerless Flask and dockerless pure Theano + Keras means you are doing something wrong with Flask, because there is no way Flask can add up to 1.8s of overhead. 3. You can see some correlation between the number of CPUs and algorithm time. It's possible that pure Theano + Keras uses all CPU cores (4?), while you limit Docker to 1 or 2 CPUs. 4. Knowing your hardware spec and OS is quite necessary to resolve performance problems. Commented May 23, 2018 at 8:56
  • Thanks for your answer. We tested on other machines with varying numbers of CPUs (up to 8), and it plateaus very quickly (8 is no better than 2). We also tried on an Amazon cluster, with the same outcome. Commented May 23, 2018 at 9:25

2 Answers


I had a very similar problem when doing inference on a CPU with

  • Gunicorn
  • Flask
  • Pytorch

Even though my setup is slightly different, I think this will help you.

I was setting workers=1 and threads=1 in the Gunicorn settings. Inference times got extremely bad when sending concurrent requests to the Flask endpoint.

It turned out that PyTorch spun up as many threads as it could get from Docker, and they blocked each other heavily. See also: https://opendatascience.com/model-performance-optimization-with-torchserve/

The solution for me was setting torch.set_num_threads(1).

Please check if you have this problem also.
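For stacks that size their thread pools through OpenMP/BLAS rather than a torch call (e.g. the Theano + scikit-learn setup in the question), an equivalent fix is to cap the pools via environment variables before the libraries are imported. A sketch, assuming one of the common BLAS backends; which variables matter depends on what is actually installed:

```python
import os

# These must be set before importing the numerical libraries,
# otherwise their thread pools are already initialized.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP (used by Theano and BLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS backend
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL backend

# Only now import the heavy libraries:
# import theano, sklearn, ...
```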


1 Comment

torch.set_num_threads(1) worked in my case too, where I was deploying a Torch model with Flask-only in Docker on Kubernetes. The container worked perfectly locally in Docker on a Mac (6 cores) and was around 10x slower on K8s (6-CPU limit). Thanks, Michael!

I struggled with a similar problem where a test container was started with docker-compose in a Kubernetes deployment. The pod had resource limits defined, but Docker (inside the pod) ignored those and used all node resources. PyTorch spawned as many workers as there were CPUs on the node; they blocked each other and the runtime increased 6x.

I fixed it in docker-compose by setting the environment variable OMP_NUM_THREADS=1:

version: '3.3'
services:
  my-service:
    image: myimage/test-my-service:latest
    container_name: test-my-service
    environment:
      - OMP_NUM_THREADS=1
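One reason libraries over-spawn inside a limited container: Python's os.cpu_count() reports the host's logical CPUs, not the cgroup CPU quota, and many thread pools size themselves from it. A quick sanity check one could run inside the container (a sketch):

```python
import os

# os.cpu_count() reflects the host/node, not a cgroup CPU quota,
# so a pod limited to 6 CPUs may still see the full node count here.
print("cpu_count:", os.cpu_count())
print("OMP_NUM_THREADS:", os.environ.get("OMP_NUM_THREADS", "unset"))
```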

