
Is there an elegant way to do inference in 2 Python processes with 1 GPU in TensorFlow?

Suppose I have 2 processes: the first is classifying cats/dogs, the 2nd is classifying birds/planes. Each process runs a different TensorFlow model on the GPU, and the models are continuously fed images from different cameras. Usually, TensorFlow will occupy all the memory of the entire GPU, so when you start another process it crashes with OUT OF MEMORY, a failed CUDA convolution, or something along those lines. Is there a tutorial/article/sample code that shows how to load 2 models in different processes and run both in parallel? This is also very useful when you are running model inference while doing heavy graphics work, e.g. playing games. I also want to know how running the model affects the game.

I've tried using Python Thread and it works, but each model predicts 2 times slower (and Python threads don't utilize multiple CPU cores). I want to use Python Process but it's not working. If you have a few sample lines of code that work, I would appreciate that very much.

I've also attached my current Thread code (originally posted as an image).
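
The gist of it is something like the following; the model paths are placeholders and the dummy frame stands in for the real camera capture:

import threading
import numpy as np
import tensorflow as tf

def run_model(model_path):
    # Both threads share one process, so they share one TensorFlow runtime and its GPU memory.
    model = tf.keras.models.load_model(model_path)  # placeholder path
    for _ in range(100):
        frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # placeholder camera frame
        model.predict(frame)

threads = [
    threading.Thread(target=run_model, args=('cats_dogs.h5',)),
    threading.Thread(target=run_model, args=('birds_planes.h5',)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()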

3 Answers

As summarized here, you can specify the proportion of GPU memory allocated per process.

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Using Keras, it may be simpler to allow 'memory growth', which expands the allocated memory on demand, as described here.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth must be enabled before the GPUs are initialized.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

The following should work for TensorFlow 2.0:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
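
If you want to combine this with separate processes, a rough sketch could look like the following; the model paths and dummy input are placeholders, and each child imports TensorFlow itself so it gets its own CUDA context:

import multiprocessing as mp
import numpy as np

def run_model(model_path):
    # Import TensorFlow inside the child so each process builds its own GPU context.
    import tensorflow as tf
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)
    model = tf.keras.models.load_model(model_path)  # placeholder path
    # Placeholder loop: in practice, read frames from the camera instead.
    for _ in range(100):
        frame = np.zeros((1, 224, 224, 3), dtype=np.float32)
        model.predict(frame)

if __name__ == '__main__':
    # 'spawn' gives each child a clean interpreter rather than a forked copy of the parent.
    ctx = mp.get_context('spawn')
    procs = [
        ctx.Process(target=run_model, args=('cats_dogs.h5',)),
        ctx.Process(target=run_model, args=('birds_planes.h5',)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()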

3 Comments

Limiting memory usage as in the 2nd block doesn't seem to be enough; I've tried this before and it doesn't work. I changed Thread to Process and it causes a memory error when the 2nd process is spawned. Can you give a working example script?
Also, tf.GPUOptions is not available. I think that's because I'm using TF2?
This worked for me; two processes ran at 45% each with this particular syntax: gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=0.45,allow_growth=True) sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))

Apart from setting the GPU memory fraction, you need to enable MPS (CUDA Multi-Process Service) to get better speed if you are running more than one model on the GPU simultaneously. Otherwise, inference will be slower compared to a single model running on the GPU.

sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d

Here 0 is your GPU index.
When you are finished, stop the MPS daemon:

echo quit | sudo nvidia-cuda-mps-control

4 Comments

Interesting. I use Windows, and if I wanted to do this in the Python code itself, where would these commands go?
Not sure if it's supported on Windows. The official docs only mention Linux support. On Linux, you need to run these commands in a terminal.
The commands would need to run automatically, though, because I want production-ready code that will run on the client's desktop. If it supports only Linux, that's quite a problem. Maybe I need another way to deploy the model.
You can try WSL from Windows to check if it works. You can write a bash script that runs at startup, or use the Python subprocess module to run the commands from a Python script, as sketched below.
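
For example, a minimal sketch of the subprocess approach (assuming a Linux host where these commands can run under sudo):

import subprocess

GPU_INDEX = "0"  # GPU index, as in the answer above

def start_mps():
    # Put the GPU in exclusive-process mode and start the MPS control daemon.
    subprocess.run(["sudo", "nvidia-smi", "-i", GPU_INDEX, "-c", "EXCLUSIVE_PROCESS"], check=True)
    subprocess.run(["sudo", "nvidia-cuda-mps-control", "-d"], check=True)

def stop_mps():
    # Equivalent of: echo quit | sudo nvidia-cuda-mps-control
    subprocess.run(["sudo", "nvidia-cuda-mps-control"], input=b"quit\n", check=True)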

OK. I think I've found the solution now.

I use TensorFlow 2, and there are essentially 2 methods to manage GPU memory usage:

  1. set memory growth to true
  2. set memory limit to some number

You can use either method; ignore all the warning messages about running out of memory. I still don't know exactly what they mean, but the model keeps running and that's what I care about. I measured the time the model takes to run and it's a lot better than running on CPU. If I run both processes at the same time, the speed drops a bit, but it's still a lot better than running on CPU.

With the memory growth approach, my GPU has 3 GB, so the first process tries to allocate everything and then the 2nd process reports out of memory. But it still works.

With the memory limit approach, I set the limit to some number, e.g. 1024 MB. Both processes work.

So what is the right minimum number that you can set?

I tried reducing the memory limit until I found that my model works fine with a 64 MB limit. The prediction speed is still the same as when I set the memory limit to 1024 MB. When I set the memory limit to 32 MB, I noticed a 50% speed drop. When I set it to 16 MB, the model refuses to run because it does not have enough memory to store the image tensor. This means my model requires a minimum of 64 MB, which is very little considering that I have 3 GB to spare. This also allows me to run the model while playing video games.

Conclusion: I chose the memory limit approach with a 64 MB limit. You can check how to use a memory limit here: https://www.tensorflow.org/guide/gpu

I suggest trying different memory limits to find the minimum your model needs. You will see the speed drop, or the model refuse to run, when the memory is not enough.
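
For reference, a minimal sketch of the memory limit approach from that guide, using the 64 MB value found above (adjust the value for your own model):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Cap this process's GPU allocation at 64 MB (one logical device on GPU 0).
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=64)])
    except RuntimeError as e:
        # Virtual devices must be configured before the GPU has been initialized.
        print(e)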

