Strictly speaking, we could paraphrase your statement as "Kubernetes pod memory usage automatically" in the sense that, if you have processes that from time to time consume a lot of resources that later need to be freed, you can use the Vertical Pod Autoscaler (VPA).
Some interesting references:
- https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- https://docs.aws.amazon.com/eks/latest/userguide/vertical-pod-autoscaler.html
- https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
Using the VPA is appropriate when resource requirements vary greatly from one process to another.
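For illustration, a minimal VPA manifest could look like the sketch below (the Deployment name `my-app` is hypothetical; adjust it to your workload):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # hypothetical Deployment to be resized
  updatePolicy:
    updateMode: "Auto"    # VPA evicts and recreates Pods with updated resource requests
```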
If this is not the case (more or less every request uses a known amount of resources), or even if you do use VPA, it is recommended that you limit your services so that they do not accept new requests while they are working on an expensive operation. Kubernetes will automatically increase or decrease the number of Pods depending on the load, and your users will receive a 503 error, which exists precisely to indicate that they cannot be served right now and should try again later.
That is to say:
- do not use VPA unless it is strictly necessary.
- configure your Deployment with an adequate number of Pods (see the sketch after this list).
- restrict your services within the pods to a single concurrent request (or as many as fit in your resource configuration).
- don't do anything special: if your system has reached the limit you have set, just let users receive a 503 (your user interface will translate the error as "Try again later").
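As a rough sketch of the second and third points (all names, the image and the resource values are assumptions; size them for your actual workload), the Deployment could look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  replicas: 3                 # "adequate number of pods" for your normal load
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0   # hypothetical image
        resources:
          requests:           # what the scheduler reserves for each Pod
            cpu: 500m
            memory: 512Mi
          limits:             # hard ceiling; size it for one expensive request
            cpu: "1"
            memory: 1Gi
```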
The details of a deployment may vary, but basically by acting at three levels, you can give your infrastructure some adaptability to the type of load:
- Application Level: for every application instance you can define the HTTP request rate limit. It must be aligned with the requests/limits of your Pods. If you cannot modify your applications (e.g. to use bucket4j), you can add an adapter container to your Pod using, for example, Nginx (see the Load Balancer Level for specific configuration).
- Deployment Level: once your application will not break under request overload, you should be able to scale your infrastructure: size the requests/limits of each Pod appropriately and scale horizontally through the replicas of your Deployment, using https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ (there are a lot of metrics that can be used for auto-scaling). See the sketch after this list.
- Load Balancer Level: for simple scenarios you can simply configure rate limiting in an ingress controller (e.g. the NGINX ingress controller), as in the sketch after this list. But if you are able to segment your requests (e.g. ...?queryType=hard&...), you can segregate your configuration (the previous two levels) to maintain multiple infrastructures (each with its own vertical sizing), each one already prepared to handle a specific type of request. This can be done easily with Nginx (Istio might be overkill).
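A sketch of the Deployment and Load Balancer levels (the CPU metric, the rate limit value, the host and all names are assumptions; the `limit-rps` annotation is specific to the NGINX ingress controller):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu             # many other metrics can be used for auto-scaling
      target:
        type: Utilization
        averageUtilization: 70
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    # NGINX ingress controller: limit each client IP to 5 requests per second
    nginx.ingress.kubernetes.io/limit-rps: "5"
spec:
  ingressClassName: nginx
  rules:
  - host: my-app.example.com        # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80
```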
With this strategy, suppose you have two zones: "LR: Low Resources" and "HR: High Resources". If there is no load on your system, neither of the two zones consumes resources beyond its minimum (i.e. minReplicas: 1). If there are many "LR" requests, resources are used in that zone; if they are "HR" requests, they are used in the other one; if there are both, they are distributed between the two. Logically, the maximum load will be LR.maxReplicas + HR.maxReplicas (you can build more complex rules, e.g. using Istio, but always use the simplest scheme that you think will work for you).
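A minimal sketch of the two zones, assuming two hypothetical Deployments called lr-zone and hr-zone (how requests are routed to each zone, e.g. by queryType, is left to your Nginx configuration):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lr-zone-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lr-zone           # hypothetical Deployment with small requests/limits
  minReplicas: 1            # almost no resources consumed when the zone is idle
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hr-zone-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hr-zone           # hypothetical Deployment with large requests/limits
  minReplicas: 1
  maxReplicas: 4            # total capacity = LR.maxReplicas + HR.maxReplicas
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```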