
I'm currently developing a Rails application that interacts with a TorchServe instance for machine learning inference. The TorchServe server is hosted on-premises and equipped with 4 GPUs. We're working with stable diffusion models, and each inference request is expected to take around 30 seconds due to the complexity of the models.

Given the high latency per job, I'm exploring the best way to implement asynchronous request handling in TorchServe. The primary goal is to manage a large volume of incoming prediction requests efficiently without having each client blocked waiting for a response.

Here's the current setup and challenges:

  • Rails Application: This acts as the client sending prediction requests to TorchServe.
  • TorchServe Server: Running on an on-prem server with 4 GPUs.
  • Model Complexity: Due to stable diffusion processing, each request takes about 30 seconds.

I'm looking for insights or guidance on the following:

  1. Native Asynchronous Support: Does TorchServe natively support asynchronous request handling? If so, how can it be configured?
  2. Queue Management: If TorchServe does not support this natively, what are the best practices for implementing a queue system on the server side to handle requests asynchronously?
  3. Client-Side Implementation: Tips for managing asynchronous communication in the Rails application. Should I implement a polling mechanism, or are there better approaches?
  4. Resource Management: How to effectively utilize the 4 GPUs in an asynchronous setup to ensure optimal processing and reduced wait times for clients.

Any advice, experiences, or pointers to relevant documentation would be greatly appreciated. I'm aiming to make this process as efficient and scalable as possible, considering the high latency of each inference job.

Thank you in advance for your help!


1 Answer


I found a way to do it.

What we know:

  • TorchServe runs a server that you can reach on a few ports;
  • If you send it requests before it has finished processing, it queues them (up to 500 by default, overridable with an environment variable).
  • You CANNOT inspect the state of the queue; there is no way to tell whether it is full or not.
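For reference, the queue depth and worker count are set in TorchServe's `config.properties` (TorchServe also lets you override most properties with `TS_`-prefixed environment variables; verify the exact option names against your TorchServe version). A sketch:

```properties
# config.properties -- sketch; confirm option names for your TorchServe version
# Per-model job queue depth: requests buffered while all workers are busy
job_queue_size=500
# With 4 GPUs, one worker per GPU is a common starting point
default_workers_per_model=4
```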

So what is this queue for? Queuing lets TorchServe scale its workers and manage the hardware, so it's important to use it. You could send items one at a time, but then you would lose all of the parallelism TorchServe is built for.

Here is what I did:

(The original answer included an architecture diagram here.)

My API sends batches to another queue I created (INPUT_QUEUE). A microservice I wrote listens to this queue, which holds chunks of 500 items to process. It splits each chunk into 500 separate items and sends them all at once to the TorchServe microservice. Each item carries its chunkId with it.
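The fan-out step can be sketched like this (names are hypothetical; `send` stands in for whatever HTTP or queue client you use to post one item to TorchServe's inference endpoint):

```python
def dispatch_chunk(chunk_id, items, send):
    """Fan a chunk out as individual inference requests.

    Each item carries its chunkId so the service listening on
    OUTPUT_QUEUE can later count completions per chunk.
    """
    for payload in items:
        send({"chunkId": chunk_id, "payload": payload})
```

In production, `send` would typically POST each item to TorchServe's inference API (e.g. `POST /predictions/<model>` on port 8080), letting TorchServe's internal queue absorb the burst.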

In TorchServe, you can override the handler's methods to pre- and post-process however you like. There, I added a message sent to OUTPUT_QUEUE containing the processed data AND the chunkId.
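A minimal sketch of that handler idea, assuming requests arrive as JSON bodies carrying a `chunkId` (the publisher is a stand-in: swap in your broker's client, e.g. pika for RabbitMQ, and the inference step is a placeholder for the actual diffusion forward pass):

```python
import json


def publish_to_output_queue(message: dict) -> None:
    # Hypothetical publisher -- replace with a real call to your
    # message broker's client library.
    print(json.dumps(message))


class DiffusionHandler:
    """Follows TorchServe's handler contract: preprocess/inference/postprocess."""

    def preprocess(self, data):
        # Each request body carries the payload and its chunkId.
        items = [json.loads(row["body"]) for row in data]
        self.chunk_ids = [item["chunkId"] for item in items]
        return [item["payload"] for item in items]

    def inference(self, inputs):
        # Placeholder for the actual stable diffusion forward pass.
        return [f"image-for-{x}" for x in inputs]

    def postprocess(self, outputs):
        # Publish each result together with its chunkId so the
        # OUTPUT_QUEUE consumer can count completions per chunk.
        for chunk_id, out in zip(self.chunk_ids, outputs):
            publish_to_output_queue({"chunkId": chunk_id, "result": out})
        return outputs
```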

My microservice listening to OUTPUT_QUEUE then knows when it has received 500 messages with a given chunkId, and acknowledges the corresponding message on INPUT_QUEUE, which releases the next chunk of 500 to TorchServe.
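The consumer's bookkeeping is simple counting; a sketch (names hypothetical, and `on_chunk_complete` stands in for acknowledging the INPUT_QUEUE message):

```python
from collections import Counter


class OutputQueueConsumer:
    """Counts OUTPUT_QUEUE messages per chunkId and fires a callback
    once a whole chunk has completed."""

    def __init__(self, chunk_size=500, on_chunk_complete=None):
        self.chunk_size = chunk_size
        self.counts = Counter()
        self.on_chunk_complete = on_chunk_complete or (lambda chunk_id: None)

    def handle_message(self, message):
        chunk_id = message["chunkId"]
        self.counts[chunk_id] += 1
        if self.counts[chunk_id] == self.chunk_size:
            # Whole chunk done: drop the counter and ack the INPUT_QUEUE
            # message so the next chunk of 500 is released to TorchServe.
            del self.counts[chunk_id]
            self.on_chunk_complete(chunk_id)
```

One design note: this assumes every item eventually produces exactly one OUTPUT_QUEUE message; in practice you would also want a timeout or dead-letter path for failed inferences, or a chunk could stall forever.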

This way, with these two queues, I know TorchServe receives at most 500 items at a time: it stays fully loaded, but never overloaded.
