I am using TensorFlow Serving to deploy my TensorFlow models. I have multiple GPUs available on the servers, but as of now only one GPU is utilized during inference.
My current idea for parallelizing classification of a large number of images is to spawn one TensorFlow Serving instance per available GPU and run parallel "workers" that each grab an image from a generator, send a request, wait for the answer, then grab the next image from the generator, and so on. This would mean implementing my own data handler, but it seems achievable.
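To make the idea concrete, here is a rough sketch of the worker pattern I have in mind, using the REST predict endpoint and a thread pool. The ports, the model name `my_model`, and the dummy image generator are placeholders rather than my real setup, and I am assuming each serving instance is pinned to its own GPU (e.g. via NVIDIA_VISIBLE_DEVICES):

```python
import concurrent.futures
import itertools

import requests

# Assumption: one TensorFlow Serving instance per GPU, each pinned to its own
# GPU and listening on its own REST port. Ports and model name are placeholders.
SERVER_URLS = [
    "http://localhost:8501/v1/models/my_model:predict",  # instance on GPU 0
    "http://localhost:8502/v1/models/my_model:predict",  # instance on GPU 1
]


def image_generator():
    """Yield preprocessed images as nested lists (dummy placeholder data)."""
    for _ in range(100):
        yield [[0.0] * 224 for _ in range(224)]  # fake 224x224 input


def classify(url, image):
    """Send a single image to one serving instance and return its predictions."""
    response = requests.post(url, json={"instances": [image]})
    response.raise_for_status()
    return response.json()["predictions"]


def run(num_workers=8):
    """Round-robin images from the generator across the per-GPU instances."""
    urls = itertools.cycle(SERVER_URLS)
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(classify, next(urls), image)
                   for image in image_generator()]
        return [future.result() for future in futures]


if __name__ == "__main__":
    results = run()
    print(f"Classified {len(results)} images")
```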
I have also read about the SharedBatchScheduler in the TensorFlow Serving batching documentation, but I am not sure whether it is worth looking into further.
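For reference, if I understand the batching documentation correctly, server-side batching is enabled with the `--enable_batching` flag plus a `--batching_parameters_file`, and the parameters file looks roughly like this (the values below are guesses on my part, not tuned):

```
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
```

As far as I can tell, though, this only batches requests within a single server process, so it would not by itself spread work across multiple GPUs.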
I am fairly new to TensorFlow Serving in general and I am wondering if this is the most straightforward way to accomplish what I want.
Thanks in advance for any help/suggestions!
Edit: Thanks for the clarifying question. I am aware of issue 311, github.com/tensorflow/serving/issues/311. Does anyone have a workaround for it?