
TLDR: How can my load balancer efficiently and transparently forward an incoming connection to a container/VM in Linux?

Problem: For educational purposes, and maybe to write a patch for liburing in case some APIs are missing, I would like to learn how to implement a load balancer capable of scaling a target service from zero to hero. The LB and the target services run on the same physical node.

I would like for this approach to be:

  • Efficient: as little memory copying as possible, as little CPU utilization as possible
  • Transparent: the target service should not be aware of what's happening

I looked at systemd socket activation, but it seems it can scale from 0 to 1 while not handling further scaling. Also, the socket hand-off code felt a bit hard to follow, but maybe I'm just a noob.
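
For reference, the receiving side of the socket-activation protocol is small: systemd binds the listening sockets itself, sets LISTEN_PID and LISTEN_FDS, and passes the FDs to the spawned service starting at fd 3. A minimal sketch of that receiving side (error handling elided):

```rust
use std::env;
use std::net::TcpListener;
use std::os::fd::{FromRawFd, RawFd};

// systemd passes activated sockets starting at this fd (SD_LISTEN_FDS_START).
const LISTEN_FDS_START: RawFd = 3;

fn listener_from_systemd() -> Option<TcpListener> {
    // The fds are only meant for us if LISTEN_PID matches our own pid.
    let pid: u32 = env::var("LISTEN_PID").ok()?.parse().ok()?;
    if pid != std::process::id() {
        return None;
    }
    // LISTEN_FDS is the number of fds passed, starting at fd 3.
    let nfds: usize = env::var("LISTEN_FDS").ok()?.parse().ok()?;
    if nfds < 1 {
        return None;
    }
    // SAFETY: if the env vars are set, fd 3 is an open, already-bound socket.
    Some(unsafe { TcpListener::from_raw_fd(LISTEN_FDS_START) })
}
```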

Current status: After experimenting a bit, I managed to do this either efficiently or transparently, but not both. I would like to do both.

The load balancer process is written in Rust and uses io_uring.
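
For context, the accept side of such a loop looks roughly like the sketch below, using the io-uring crate's AcceptMulti opcode (multishot accept needs kernel 5.19+; error handling elided):

```rust
use std::net::TcpListener;
use std::os::fd::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    let mut ring = IoUring::new(256)?;

    // One multishot accept SQE keeps producing a CQE per incoming connection.
    let accept = opcode::AcceptMulti::new(types::Fd(listener.as_raw_fd()))
        .build()
        .user_data(0xACC);
    unsafe { ring.submission().push(&accept).expect("submission queue full") };

    loop {
        ring.submit_and_wait(1)?;
        for cqe in ring.completion() {
            // For a multishot accept, result() is the new connection's fd
            // (negative on error); business logic / hand-off would go here.
            let conn_fd = cqe.result();
            println!("accepted fd {conn_fd}");
        }
    }
}
```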

Efficient approach:

  • The LB binds to a socket and fires a multishot accept
  • On client connection, the LB performs some business logic to decide which container should handle the incoming request
  • If the service is scaled to zero, the LB fires up the first container
  • If the service is overloaded, it fires up more instances
  • The LB passes the socket file descriptor to the container via sendmsg (see the sketch below)
  • The container receives the FD and fires a multishot receive to handle incoming data

This approach is VERY efficient (no memory copying, very little CPU usage), but the receiving process needs to be aware of what's happening in order to receive and correctly handle the socket FD.

If, say, I want to run an arbitrary Node.js container, this approach won't work.
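
The hand-off itself is a standard SCM_RIGHTS message over a Unix domain socket. A minimal sketch of the sending side with the nix crate (the control channel into the container is assumed to already exist):

```rust
use std::io::IoSlice;
use std::os::fd::{AsRawFd, RawFd};
use std::os::unix::net::UnixStream;

use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags};

// `channel` is a UnixStream already connected to a helper inside the
// container; `conn_fd` is the accepted client connection.
fn pass_fd(channel: &UnixStream, conn_fd: RawFd) -> nix::Result<()> {
    let fds = [conn_fd];
    // SCM_RIGHTS makes the kernel duplicate the fd into the receiver.
    let cmsg = [ControlMessage::ScmRights(&fds)];
    // At least one byte of ordinary data has to ride along.
    let iov = [IoSlice::new(b"x")];
    sendmsg::<()>(channel.as_raw_fd(), &iov, &cmsg, MsgFlags::empty(), None)?;
    Ok(())
}
```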

Transparent approach:

  • The LB binds to a socket and fires a multishot accept
  • On client connection, the LB performs some business logic to decide which container should handle the incoming request
  • If the service is scaled to zero, the LB fires up the first container
  • If the service is overloaded, it fires up more instances
  • The LB connects to the container and fires a multishot receive (sketched below)
  • Incoming data is sent to the container via zero-copy send

This approach is less efficient because:

  • The target container copies the data once on receive (but this also happens in the efficient case)
  • We double the number of active connections: for each client<>LB connection there is a matching LB<>service connection

The advantage of this approach is that the target service is not aware of what's happening.
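
Stripped of io_uring and zero-copy details, this transparent approach is an ordinary user-space proxy. A minimal blocking sketch (the backend address is made up; a real LB would pick it per request):

```rust
use std::io;
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for client in listener.incoming() {
        let client = client?;
        thread::spawn(move || -> io::Result<()> {
            // One extra connection per client: this is the doubling of
            // active connections pointed out above.
            let backend = TcpStream::connect("10.0.0.7:9000")?;
            let (mut c_rd, mut b_rd) = (client.try_clone()?, backend.try_clone()?);
            let (mut c_wr, mut b_wr) = (client, backend);
            // Shuttle bytes in both directions until either side closes.
            let up = thread::spawn(move || io::copy(&mut c_rd, &mut b_wr));
            io::copy(&mut b_rd, &mut c_wr)?;
            up.join().expect("uplink thread panicked")?;
            Ok(())
        });
    }
    Ok(())
}
```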

Questions:

  • What can I use to efficiently forward the connection from the LB to the container? Some kind of pipe?
  • Is there a way to make the container think there is a new accept event even though the connection was already accepted and without opening a new connection between the LB and the container?
  • If the connection is TCP, can I use the fact that both the LB and the container are on the same physical node and use some kind of lightweight protocol? For example, I could use Unix Domain Sockets, but then the target app would have to be aware of this, breaking transparency
  • I arrived here from your other question - interesting project! This question might be better received if you were to condense or prune your three questions down to one :)
  • I mean, you're doing something extremely efficient on the LB, but then hand off to Node.js; I'd take a wild guess at how important it is that the frontend is very efficient in that case. Now, that being said: your second approach probably won't work in general, because the container process can't do to the socket what it could do with a socket that's actually "connected all the way through", namely close it and set socket options.
  • @MarcusMüller I used Node.js as an example because I wanted to support generic containers. I read in the systemd patches for socket activation that they had to patch the receiving service, and I wanted to avoid that. The close event can be handled, but you are right about socket options not being propagated. How could I deal with that? Maybe the LB could connect first, read the options, and then apply them to the other socket? Good catch anyway.
  • @Mascarpone there's no "other socket" that the LB could manipulate! That socket is created by the container, and owned by it. So your LB can't do anything about it.

1 Answer


What can I use to efficiently forward the connection from the LB to the container? Some kind of pipe?

Stick with the original transport layer protocol. Transport protocols are handled very efficiently in Linux, and you avoid having to parse and repackage.
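
For the "some kind of pipe?" part: one mechanism worth measuring alongside plain reads and writes is splice(2), which moves bytes from one socket to the other through a pipe without copying them into userspace (io_uring exposes the same operation as IORING_OP_SPLICE). A blocking sketch of one direction via the libc crate; EAGAIN handling and the reverse direction are elided:

```rust
use std::io;
use std::os::fd::RawFd;
use std::ptr;

// Relay one direction of a proxied connection inside the kernel:
// `from` socket -> pipe -> `to` socket, without copying into userspace.
fn splice_one_way(from: RawFd, to: RawFd) -> io::Result<()> {
    let mut pipe_fds: [libc::c_int; 2] = [0; 2];
    if unsafe { libc::pipe(pipe_fds.as_mut_ptr()) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let (rd, wr) = (pipe_fds[0], pipe_fds[1]);
    loop {
        // Socket -> pipe. SPLICE_F_MOVE hints the kernel to move pages.
        let n = unsafe {
            libc::splice(from, ptr::null_mut(), wr, ptr::null_mut(),
                         64 * 1024, libc::SPLICE_F_MOVE)
        };
        if n == 0 {
            break; // peer closed the connection
        }
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
        // Pipe -> other socket; drain everything we just buffered.
        let mut left = n as usize;
        while left > 0 {
            let m = unsafe {
                libc::splice(rd, ptr::null_mut(), to, ptr::null_mut(),
                             left, libc::SPLICE_F_MOVE)
            };
            if m <= 0 {
                return Err(io::Error::last_os_error());
            }
            left -= m as usize;
        }
    }
    unsafe {
        libc::close(rd);
        libc::close(wr);
    }
    Ok(())
}
```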

Is there a way to make the container think there is a new accept event even though the connection was already accepted and without opening a new connection between the LB and the container?

No. That's the beauty of a socket! It's a state machine with its own, isolated life.

If the connection is TCP, can I use the fact that both the LB and the container are on the same physical node

Yes, that makes TCP very low-overhead.

and use some kind of lightweight protocol?

Yep, TCP.

For example I could use Unix Domain Sockets

which is likely to be heavier than TCP between processes.

Try this: run iperf3 on localhost, then between your host and a container that's connected via one of the usual transports, for example the macvlan driver (which you'll find in docker and podman) or ipvlan, and then try macvlan between two containers sharing the same container network.

To my biggest surprise, I found that on my Linux 6.x, iperf3 to localhost (i.e., on the lo interface) was significantly slower than container-to-container networking via macvlan. That makes you discard the idea that asserting "a is more performant than b" without having measured it is even a viable approach here.

  • When you say "stick with the original protocol", do you mean my transparent approach? So, open another connection? And should I open one LB<>container connection for each client<>LB connection, or can I somehow multiplex them?
  • I disagree on UDS being slower than TCP; at least, the benchmarks built on top of my custom io_uring event loop say otherwise, and you can find others confirming my opinion with a quick Google search. Other than that, THANK YOU, and I'm full of admiration for your knowledge.
  • I mean, it really depends on how you use it, right? A Unix domain socket simply serves a fundamentally different set of problems, and hence piping TCP packets through one is probably going to be heavier than just directly forwarding them (note that I'm not aware of a single piece of software that does TCP-payload-via-Unix-domain-sockets in deployment).
  • I really think the takeaway here should mostly be that you need to measure prototypes :)
