
Is there an elegant way to do inference in 2 Python processes with 1 GPU in TensorFlow?

Suppose I have 2 processes: the first is classifying cats/dogs, the 2nd is classifying birds/planes. Each process runs a different TensorFlow model on the GPU, and the models are continuously fed images from different cameras. Usually, TensorFlow will occupy all the memory of the entire GPU, so when you start another process it crashes with OUT OF MEMORY, a failed CUDA convolution, or something along those lines. Is there a tutorial/article/sample code that shows how to load 2 models in different processes and run both in parallel? This is also very useful when you are running model inference while doing heavy graphics work, e.g. playing games. I also want to know how running the model affects the game.

I've tried using Python Thread and it works, but each model predicts 2 times slower (and Python threads don't utilize multiple CPU cores). I want to use Python Process but it's not working. If you have a few sample lines of code that work, I would appreciate that very much.

I've also attached my current Thread code (originally posted as an image).
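
The gist of it is something like the following; the model paths are placeholders and the dummy frame stands in for the real camera capture:

import threading
import numpy as np
import tensorflow as tf

def run_model(model_path):
    # Both threads share one process, so they share one TensorFlow runtime and its GPU memory.
    model = tf.keras.models.load_model(model_path)  # placeholder path
    for _ in range(100):
        frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder camera frame
        model.predict(frame)

threads = [
    threading.Thread(target=run_model, args=('cats_dogs.h5',)),
    threading.Thread(target=run_model, args=('birds_planes.h5',)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()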

3 Answers

As summarized here, you can specify the proportion of GPU memory allocated per process.

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Using Keras, it may be simpler to allow 'memory growth', which expands the allocated memory on demand, as described here.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth must be enabled before the GPUs are initialized.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

The following should work for TensorFlow 2.0:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
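
If you want to combine this with separate processes, a rough sketch could look like the following; the model paths and dummy input are placeholders, and each child imports TensorFlow itself so it gets its own CUDA context:

import multiprocessing as mp
import numpy as np

def run_model(model_path):
    # Import TensorFlow inside the child so each process builds its own GPU context.
    import tensorflow as tf
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)
    model = tf.keras.models.load_model(model_path)  # placeholder path
    # Placeholder loop: in practice, read frames from the camera instead.
    for _ in range(100):
        frame = np.zeros((1, 224, 224, 3), dtype=np.float32)
        model.predict(frame)

if __name__ == '__main__':
    # 'spawn' gives each child a clean interpreter rather than a forked copy of the parent.
    ctx = mp.get_context('spawn')
    procs = [
        ctx.Process(target=run_model, args=('cats_dogs.h5',)),
        ctx.Process(target=run_model, args=('birds_planes.h5',)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()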

3 Comments

Limiting memory usage as in the 2nd block doesn't seem to be enough; I've tried this before and it doesn't work. I changed Thread to Process and it causes a memory error when the 2nd process is spawned. Can you give a working example script?
Also, tf.GPUOptions is not available. I think that's because I'm using TF2?
This worked for me; two processes ran at 45% each with this particular syntax: gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.45,allow_growth=True) sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

Apart from setting the GPU memory fraction, you need to enable MPS (CUDA Multi-Process Service) to get better speed if you are running more than one model on the GPU simultaneously. Otherwise, inference will be slower compared to a single model running on the GPU.

sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d

Here 0 is your GPU index.
When you are finished, stop the MPS daemon:

echo quit | sudo nvidia-cuda-mps-control

4 Comments

Interesting. I use Windows, and if I wanted to do this in the Python code itself, where would these commands go?
Not sure if it's supported on Windows. The official docs only mention Linux support. On Linux, you need to run these commands in a terminal.
The commands would need to run automatically, though, because I want production-ready code that will run on the client's desktop. If it supports only Linux, that's quite a problem. Maybe I need another way to deploy the model.
You can try WSL from Windows to check if it works. You can write a bash script that runs at startup, or use the Python subprocess module to run the commands from a Python script, as sketched below.
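
For example, a minimal sketch of the subprocess approach (assuming a Linux host where these commands can run under sudo):

import subprocess

GPU_INDEX = "0"  # GPU index, as in the answer above

def start_mps():
    # Put the GPU in exclusive-process mode and start the MPS control daemon.
    subprocess.run(["sudo", "nvidia-smi", "-i", GPU_INDEX, "-c", "EXCLUSIVE_PROCESS"], check=True)
    subprocess.run(["sudo", "nvidia-cuda-mps-control", "-d"], check=True)

def stop_mps():
    # Equivalent of: echo quit | sudo nvidia-cuda-mps-control
    subprocess.run(["sudo", "nvidia-cuda-mps-control"], input=b"quit\n", check=True)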

OK. I think I've found the solution now.

I use TensorFlow 2, and there are essentially 2 methods to manage GPU memory usage:

  1. set memory growth to true
  2. set memory limit to some number

You can use either method; ignore all the warning messages about running out of memory. I still don't know exactly what they mean, but the model keeps running and that's what I care about. I measured the time the model takes to run and it's a lot better than running on CPU. If I run both processes at the same time, the speed drops a bit, but it's still a lot better than running on CPU.

With the memory growth approach, my GPU has 3 GB, so the first process tries to allocate everything and then the 2nd process reports out of memory. But it still works.

With the memory limit approach, I set the limit to some number, e.g. 1024 MB. Both processes work.

So what is the right minimum number that you can set?

I tried reducing the memory limit until I found that my model works fine with a 64 MB limit. The prediction speed is still the same as when I set the memory limit to 1024 MB. When I set the memory limit to 32 MB, I noticed a 50% speed drop. When I set it to 16 MB, the model refuses to run because it does not have enough memory to store the image tensor. This means my model requires a minimum of 64 MB, which is very little considering that I have 3 GB to spare. This also allows me to run the model while playing video games.

Conclusion: I chose the memory limit approach with a 64 MB limit. You can check how to use a memory limit here: https://www.tensorflow.org/guide/gpu

I suggest trying different memory limits to find the minimum your model needs. You will see the speed drop, or the model refuse to run, when the memory is not enough.
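
For reference, a minimal sketch of the memory limit approach from that guide, using the 64 MB value found above (adjust the value for your own model):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Cap this process's GPU allocation at 64 MB (one logical device on GPU 0).
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=64)])
    except RuntimeError as e:
        # Virtual devices must be configured before the GPU has been initialized.
        print(e)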

