
I am using TensorFlow Serving to deploy my TensorFlow models. I have multiple GPUs available on the server, but as of now only one GPU is utilized during inference.

My idea for parallelizing classification of a large number of images is to spawn one tensorflow-serving instance per available GPU and run parallel "workers" that each grab an image from a generator, make a request, wait for the answer, then grab the next image from the generator, and so on (see the sketch below). This would mean implementing my own data handler, but it seems achievable.
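Roughly, this is what I have in mind. It is a minimal sketch only: the endpoint URLs and ports, the model name my_model, and the dummy generator are placeholders for my actual setup, and it talks to the servers over the REST predict API using the requests library.

# A minimal sketch of the "worker per serving endpoint" idea, assuming one
# tensorflow-serving container per GPU, each exposing the REST API on its
# own port. Endpoints, the model name "my_model", and the dummy generator
# are placeholders.
import itertools
import queue
import threading

import numpy as np
import requests

# One REST endpoint per tensorflow-serving container (one container per GPU).
ENDPOINTS = [
    "http://localhost:8501/v1/models/my_model:predict",
    "http://localhost:8601/v1/models/my_model:predict",
]


def image_generator():
    """Stand-in for a real data handler; yields preprocessed images."""
    for _ in range(100):
        yield np.random.rand(224, 224, 3).astype(np.float32)


def worker(images, endpoint, results):
    """Grab an image, send a predict request, wait for the answer, repeat."""
    while True:
        try:
            img = images.get_nowait()
        except queue.Empty:
            return
        resp = requests.post(endpoint, json={"instances": [img.tolist()]})
        resp.raise_for_status()
        results.append(resp.json()["predictions"][0])


def run(workers_per_gpu=4):
    images = queue.Queue()
    for img in image_generator():
        images.put(img)

    results = []
    threads = []
    # Round-robin the worker threads over the available serving endpoints.
    num_workers = workers_per_gpu * len(ENDPOINTS)
    for endpoint in itertools.islice(itertools.cycle(ENDPOINTS), num_workers):
        t = threading.Thread(target=worker, args=(images, endpoint, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(len(run()))

The same structure should also work with gRPC requests (via the tensorflow-serving-api package) instead of REST.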

I have also read about the SharedBatchScheduler in the TensorFlow Serving batching documentation, but I do not know whether it would be worth exploring further.
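From what I can tell from the batching guide, the model server's request batching (which, as far as I understand, is built on top of those batch schedulers) is enabled with the --enable_batching flag plus a --batching_parameters_file. A rough example of such a file, with purely illustrative values:

# batching_parameters.txt -- text-format protobuf, values are illustrative
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }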

I am fairly new to tensorflow-serving in general and I am wondering if this is the most straightforward way to accomplish what I want.

Thanks in advance for any help/suggestions!


Edit: Thanks for the clarification question. I am aware of issue 311, github.com/tensorflow/serving/issues/311. Does anyone have a workaround for it?


1 Answer


It is totally doable with Docker and nvidia-docker 2.0 (the docker run --runtime=nvidia in the issue suggests nvidia-docker 2 is already being used). I did experiment with multiple GPUs and Serving; however, I didn't end up running Serving across multiple GPUs myself.

Nevertheless, I have a host with 4 GPUs and currently schedule one GPU per custom image running TensorFlow for training, so that each user gets a GPU in an isolated environment. Previously I used Kubernetes for device provisioning and container management, but it was overkill for what I needed. Currently I use docker-compose to do all the magic. Here is an example:

version: '3'
services:
    lab:
        build: ./tensorlab
        image: centroida/tensorlab:v1.1
        ports:
            - "30166:8888"
            - "30167:6006"
        environment:
            NVIDIA_VISIBLE_DEVICES: 0,1,2
        ...

The key part here is the NVIDIA_VISIBLE_DEVICES variable, where the GPU indices correspond to the output of nvidia-smi.
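For your use case the same pattern should carry over to Serving itself: one tensorflow/serving GPU container per device, each pinned with NVIDIA_VISIBLE_DEVICES and exposed on its own host ports. A rough, untested sketch; it assumes nvidia is the default Docker runtime (as in the compose file above), the official tensorflow/serving:latest-gpu image, and a model exported under ./models/my_model. Service names, ports, and paths are placeholders.

version: '3'
services:
    serving_gpu0:
        image: tensorflow/serving:latest-gpu
        ports:
            - "8500:8500"   # gRPC
            - "8501:8501"   # REST
        volumes:
            - ./models/my_model:/models/my_model
        environment:
            MODEL_NAME: my_model
            NVIDIA_VISIBLE_DEVICES: "0"
    serving_gpu1:
        image: tensorflow/serving:latest-gpu
        ports:
            - "8600:8500"
            - "8601:8501"
        volumes:
            - ./models/my_model:/models/my_model
        environment:
            MODEL_NAME: my_model
            NVIDIA_VISIBLE_DEVICES: "1"

Your worker pool would then simply point at the per-container ports (8501 and 8601 for REST, or 8500 and 8600 for gRPC).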


2 Comments

Alright, thanks for your answer! Have you heard of the NVIDIA TensorRT Inference Server (developer.nvidia.com/tensorrt)? I just stumbled upon it, and they write "TensorRT Inference Server: Maximizes utilization by enabling inference for multiple models on one or more GPUs". It seemed promising for a multi-GPU situation.
You are welcome, hopefully that helps. And yes, I have heard of and used TensorRT extensively. Note, though, that the product is quite raw if your models are not linear, have branches, or use some of the newer TensorFlow ops. They don't currently support all operations, but I see more being added with each release. It is generally hard to set up, and it makes more sense on architectures Pascal or later with good float16 and int8 support. Ultimately, it does give you a good boost, 2-3x in my experience (even though half of the model was not converted).
