
I have a use case where my JAX-RS REST API does a memory-intensive operation, like generating a PDF and sending it as the response. I want to check the pod's memory usage before starting the business logic, and if the current memory usage is > 50%, I would like to send an error response asking the user to try again later. How do I check Kubernetes pod memory usage inside my REST API? Is it even possible?

My current code looks somewhat like the below:

    @GET
    @Path("/doSomeMemoryStuff")
    @Produces("text/plain")
    public Response doStuff() {

        int memoryUsage = getPodCurrentMemoryUsage(); // get the pod's current memory usage as a percentage
        if (memoryUsage <= 50) {
            // do some memory-intensive operation (e.g. generate the PDF)
            return Response.ok(/* generated content */).build();
        } else {
            // memory usage is more than 50%: return an error asking the user to retry
            return Response.status(Response.Status.SERVICE_UNAVAILABLE)
                           .entity("Try again later.").build();
        }
    }

How do I find out the current memory usage of the pod from my REST API? Thanks in advance.

1 Answer


Strictly speaking, your question could be paraphrased as asking how to manage Kubernetes pod memory usage automatically, in the sense that if you have processes that, from time to time, consume a lot of resources that then need to be freed, you can use the Vertical Pod Autoscaler (VPA).

Some interesting references:

  1. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
  2. https://docs.aws.amazon.com/eks/latest/userguide/vertical-pod-autoscaler.html
  3. https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler

The use of VPA is appropriate when resource requirements differ greatly from one process or request to another.
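
For reference, a minimal VPA manifest looks roughly like the following; the Deployment name and update mode are illustrative assumptions, not taken from your setup:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: pdf-service-vpa            # hypothetical name
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: pdf-service              # hypothetical Deployment to autoscale
      updatePolicy:
        updateMode: "Auto"             # note: VPA may evict pods to apply new requests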

If this is not the case (more or less all requests use a known amount of resources), or even if you do use VPA, it is recommended that you limit your services so that they do not accept new requests while working on an expensive operation. Kubernetes will automatically increase or decrease the number of Pods depending on the load, and your users will receive a 503 error, which exists precisely to indicate that they cannot be served right now and should try again later.
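
For any of this to work, Kubernetes needs to know what your pods require. A Deployment sketch with explicit resource requests and limits (all names and numbers are illustrative assumptions):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: pdf-service                # hypothetical service name
    spec:
      replicas: 2                      # an "adequate number of pods" for your baseline load
      selector:
        matchLabels:
          app: pdf-service
      template:
        metadata:
          labels:
            app: pdf-service
        spec:
          containers:
            - name: pdf-service
              image: registry.example.com/pdf-service:latest   # placeholder image
              resources:
                requests:              # what the scheduler reserves for the pod
                  memory: "512Mi"
                  cpu: "250m"
                limits:                # hard ceiling; exceeding memory gets the pod OOM-killed
                  memory: "1Gi"
                  cpu: "500m"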

That is to say:

  1. do not use VPA unless strictly necessary.
  2. configure your deployment with an adequate number of pods.
  3. restrict your services within the pods to a single concurrent request (or as many as fit in your resource configuration); see the filter sketch after this list.
  4. don't do anything special: if your system has reached the limit you have set, just let the users receive a 503 (your user interface will translate the error as "Try again later").
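
For point 3, a minimal sketch of capping a JAX-RS service at one concurrent request and rejecting the rest with a 503; the class and property names here are mine, not from any framework:

    import java.util.concurrent.Semaphore;
    import javax.ws.rs.container.ContainerRequestContext;
    import javax.ws.rs.container.ContainerRequestFilter;
    import javax.ws.rs.container.ContainerResponseContext;
    import javax.ws.rs.container.ContainerResponseFilter;
    import javax.ws.rs.core.Response;
    import javax.ws.rs.ext.Provider;

    // Caps in-flight requests; excess callers immediately get a 503.
    @Provider
    public class ConcurrencyLimitFilter implements ContainerRequestFilter, ContainerResponseFilter {

        private static final Semaphore PERMITS = new Semaphore(1); // tune to your resource configuration
        private static final String ACQUIRED = "concurrency.permit.acquired"; // hypothetical property key

        @Override
        public void filter(ContainerRequestContext request) {
            if (PERMITS.tryAcquire()) {
                request.setProperty(ACQUIRED, Boolean.TRUE);
            } else {
                request.abortWith(Response.status(Response.Status.SERVICE_UNAVAILABLE)
                        .entity("Try again later.").build());
            }
        }

        @Override
        public void filter(ContainerRequestContext request, ContainerResponseContext response) {
            // Response filters also run for aborted requests, so only release a permit we actually took.
            if (Boolean.TRUE.equals(request.getProperty(ACQUIRED))) {
                PERMITS.release();
            }
        }
    }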

The details of a deployment may vary, but basically, by acting at three levels you can give your infrastructure some adaptability to the type of load:

  1. Application Level: for each application instance you can define an HTTP request rate limit. It must be aligned with the requests/limits of your Pods. If you cannot modify your application (e.g. to use bucket4j), you can add an adapter to your Pod using, for example, Nginx (see point 3 for the specific configuration).
  2. Deployment Level: once your application no longer breaks under request overload, you should be able to scale your infrastructure: vertically with requests/limits for each Pod, or horizontally with replicas in your Deployment using the Horizontal Pod Autoscaler (https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/); there are many metrics that can be used for auto-scaling. See the sketch after this list.
  3. Load Balancer Level: for simple scenarios you can simply configure an ingress controller with a rate limit (e.g. using the Nginx ingress controller; see the sketch after this list). But if you are able to segment your requests (e.g. ...?queryType=hard&...), you can segregate your configuration (points 1 and 2) and maintain multiple infrastructures (each with its own vertical scaling), each one already prepared to handle a specific type of request. This can be done easily with Nginx (Istio might be overkill).
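
As a sketch of points 2 and 3 together, assuming the same hypothetical pdf-service: a HorizontalPodAutoscaler plus a rate-limit annotation for the Nginx ingress controller (all thresholds are illustrative):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: pdf-service
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: pdf-service
      minReplicas: 1
      maxReplicas: 5
      metrics:
        - type: Resource
          resource:
            name: cpu                  # memory or custom metrics can also drive scaling
            target:
              type: Utilization
              averageUtilization: 70
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: pdf-service
      annotations:
        nginx.ingress.kubernetes.io/limit-rps: "5"   # requests per second per client IP
    spec:
      rules:
        - host: api.example.com        # placeholder host
          http:
            paths:
              - path: /doSomeMemoryStuff
                pathType: Prefix
                backend:
                  service:
                    name: pdf-service
                    port:
                      number: 8080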

With this strategy, suppose you have two zones: "LR: Low Resources" and "HR: High Resources". If there is no load on your system, neither of the two zones consumes resources (i.e. minReplicas: 1); if there are many "LR" requests, resources are used in that zone; if they are "HR" requests, they are used in the other; and if there are both, they are distributed between the two. Logically, the maximum load will be LR.maxReplicas + HR.maxReplicas (you can build more complex rules, e.g. using Istio, but always use the simplest scheme you think will work for you).


4 Comments

Hey @josejuan, thanks for pointing to the resources, very useful info. But this Vertical Pod Autoscaler automatically restarts the pod, which requires at least 1-2 minutes of downtime, right? And even if multiple pods are running, in-flight requests to the pod being restarted will be impacted. How do we handle this? Users hardly use my API, so I am sceptical about introducing the autoscaler feature just for this use case.
You are right @Stunner, that's why I said "do not use VPA" and recommend limiting the concurrent rate and the infra limits, simply letting Kubernetes reject further requests (#2 to #4). You can adjust the parameters to your needs. Of course you can check Runtime.getRuntime().freeMemory() (see the sketch after these comments), but a better approach is to multiplex expensive requests to pods with more resources (which is what I already said, but in layers). Applying these strategies would require knowing more details about your requests and users (and would not be trivial).
Got it. Could you shed some light on what parameters I need to set in my resource configuration to restrict it to a single concurrent request? Thanks.
"Could you throw me..." sure, I've updated my answer.
