
I'm currently developing a Rails application that interacts with a TorchServe instance for machine learning inference. The TorchServe server is hosted on-premises and equipped with 4 GPUs. We're working with stable diffusion models, and each inference request is expected to take around 30 seconds due to the complexity of the models.

Given the high latency per job, I'm exploring the best way to implement asynchronous request handling in TorchServe. The primary goal is to manage a large volume of incoming prediction requests efficiently without having each client blocked waiting for a response.

Here's the current setup and challenges:

  • Rails Application: This acts as the client sending prediction requests to TorchServe.
  • TorchServe Server: Running on an on-prem server with 4 GPUs.
  • Model Complexity: Due to stable diffusion processing, each request takes about 30 seconds.

I'm looking for insights or guidance on the following:

  1. Native Asynchronous Support: Does TorchServe natively support asynchronous request handling? If so, how can it be configured?
  2. Queue Management: If TorchServe does not support this natively, what are the best practices for implementing a queue system on the server side to handle requests asynchronously?
  3. Client-Side Implementation: Tips for managing asynchronous communication in the Rails application. Should I implement a polling mechanism, or are there better approaches?
  4. Resource Management: How to effectively utilize the 4 GPUs in an asynchronous setup to ensure optimal processing and reduced wait times for clients.

Any advice, experiences, or pointers to relevant documentation would be greatly appreciated. I'm aiming to make this process as efficient and scalable as possible, considering the high latency of each inference job.

Thank you in advance for your help!


1 Answer


I found a way to do it.

What we know:

  • TorchServe runs a server that you can reach on a few ports;
  • If you send it requests before it has finished processing, it queues them (up to 500 by default, overridable with an environment variable).
  • You CANNOT inspect the state of the queue; there is no way to tell whether it is full or not.
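For reference, the queue depth and worker count are set in TorchServe's `config.properties` (TorchServe also lets you override most properties with `TS_`-prefixed environment variables; verify the exact option names against your TorchServe version). A sketch:

```properties
# config.properties -- sketch; confirm option names for your TorchServe version
# Per-model job queue depth: requests buffered while all workers are busy
job_queue_size=500
# With 4 GPUs, one worker per GPU is a common starting point
default_workers_per_model=4
```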

So what is this queue for? Queuing lets TorchServe scale its workers and manage the hardware, so it's important to use it. You could send items one at a time, but then you would lose all of the parallelism TorchServe is built for.

Here is what I did:

(The original answer included an architecture diagram here.)

My API sends batches to another queue I created (INPUT_QUEUE). A microservice I wrote listens to this queue, which holds chunks of 500 items to process. It splits each chunk into 500 separate items and sends them all at once to the TorchServe microservice. Each item carries its chunkId with it.
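The fan-out step can be sketched like this (names are hypothetical; `send` stands in for whatever HTTP or queue client you use to post one item to TorchServe's inference endpoint):

```python
def dispatch_chunk(chunk_id, items, send):
    """Fan a chunk out as individual inference requests.

    Each item carries its chunkId so the service listening on
    OUTPUT_QUEUE can later count completions per chunk.
    """
    for payload in items:
        send({"chunkId": chunk_id, "payload": payload})
```

In production, `send` would typically POST each item to TorchServe's inference API (e.g. `POST /predictions/<model>` on port 8080), letting TorchServe's internal queue absorb the burst.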

In TorchServe, you can override the handler's methods to pre- and post-process however you like. There, I added a message sent to OUTPUT_QUEUE containing the processed data AND the chunkId.
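A minimal sketch of that handler idea, assuming requests arrive as JSON bodies carrying a `chunkId` (the publisher is a stand-in: swap in your broker's client, e.g. pika for RabbitMQ, and the inference step is a placeholder for the actual diffusion forward pass):

```python
import json


def publish_to_output_queue(message: dict) -> None:
    # Hypothetical publisher -- replace with a real call to your
    # message broker's client library.
    print(json.dumps(message))


class DiffusionHandler:
    """Follows TorchServe's handler contract: preprocess/inference/postprocess."""

    def preprocess(self, data):
        # Each request body carries the payload and its chunkId.
        items = [json.loads(row["body"]) for row in data]
        self.chunk_ids = [item["chunkId"] for item in items]
        return [item["payload"] for item in items]

    def inference(self, inputs):
        # Placeholder for the actual stable diffusion forward pass.
        return [f"image-for-{x}" for x in inputs]

    def postprocess(self, outputs):
        # Publish each result together with its chunkId so the
        # OUTPUT_QUEUE consumer can count completions per chunk.
        for chunk_id, out in zip(self.chunk_ids, outputs):
            publish_to_output_queue({"chunkId": chunk_id, "result": out})
        return outputs
```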

My microservice listening to OUTPUT_QUEUE then knows when it has received 500 messages with a given chunkId, and acknowledges the corresponding message on INPUT_QUEUE, which releases the next chunk of 500 to TorchServe.
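The consumer's bookkeeping is simple counting; a sketch (names hypothetical, and `on_chunk_complete` stands in for acknowledging the INPUT_QUEUE message):

```python
from collections import Counter


class OutputQueueConsumer:
    """Counts OUTPUT_QUEUE messages per chunkId and fires a callback
    once a whole chunk has completed."""

    def __init__(self, chunk_size=500, on_chunk_complete=None):
        self.chunk_size = chunk_size
        self.counts = Counter()
        self.on_chunk_complete = on_chunk_complete or (lambda chunk_id: None)

    def handle_message(self, message):
        chunk_id = message["chunkId"]
        self.counts[chunk_id] += 1
        if self.counts[chunk_id] == self.chunk_size:
            # Whole chunk done: drop the counter and ack the INPUT_QUEUE
            # message so the next chunk of 500 is released to TorchServe.
            del self.counts[chunk_id]
            self.on_chunk_complete(chunk_id)
```

One design note: this assumes every item eventually produces exactly one OUTPUT_QUEUE message; in practice you would also want a timeout or dead-letter path for failed inferences, or a chunk could stall forever.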

This way, with these two queues, I know TorchServe receives at most 500 items at a time: it stays fully loaded, but never overloaded.
