Cloud Native L4 Load Balancer: MetalLB, NSX-T and Maglev

Something magical happens when MetalLB is used in the following fashion:

  1. MetalLB is deployed in a dedicated LB cluster;
  2. the LB cluster is deployed in front of all workload clusters;
  3. all Services of type=LoadBalancer are projected into the LB cluster (a minimal projection sketch follows this list).

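The projection in step 3 amounts to a small controller that watches Services of type=LoadBalancer in every workload cluster and mirrors them into the LB cluster. Below is a minimal, one-shot sketch of that idea, not an existing MetalLB feature: it assumes a single kubeconfig with hypothetical context names `workload` and `lb`, and it ignores updates, deletions, and the endpoint wiring that would point the projected Service back at the workload cluster.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// clientFor builds a clientset for one named context in a kubeconfig file.
func clientFor(kubeconfig, ctxName string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		&clientcmd.ClientConfigLoadingRules{ExplicitPath: kubeconfig},
		&clientcmd.ConfigOverrides{CurrentContext: ctxName},
	).ClientConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	ctx := context.Background()
	kubeconfig := "/path/to/kubeconfig" // hypothetical path with both contexts defined

	workload, err := clientFor(kubeconfig, "workload") // hypothetical context names
	if err != nil {
		log.Fatal(err)
	}
	lb, err := clientFor(kubeconfig, "lb")
	if err != nil {
		log.Fatal(err)
	}

	// List every Service in the workload cluster and mirror the LoadBalancer ones.
	svcs, err := workload.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		projected := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:      svc.Name,
				Namespace: svc.Namespace, // the namespace must already exist in the LB cluster
				Labels:    map[string]string{"projected-from": "workload"},
			},
			Spec: corev1.ServiceSpec{
				Type:  corev1.ServiceTypeLoadBalancer,
				Ports: svc.Spec.Ports,
				// A real controller would also steer traffic back to the workload
				// cluster, e.g. via manually managed EndpointSlices; omitted here.
			},
		}
		if _, err := lb.CoreV1().Services(svc.Namespace).Create(ctx, projected, metav1.CreateOptions{}); err != nil {
			log.Printf("projecting %s/%s: %v", svc.Namespace, svc.Name, err)
		}
	}
}
```
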
Compare this setup with a traditional proprietary SDN such as NSX-T, and with a cloud load balancer such as Maglev as used in GCP.

|  | MetalLB in LB Cluster with Services Projected | NSX-T | Maglev |
| --- | --- | --- | --- |
| Control plane | K8s API server | NSX-T Manager | not mentioned; Borg? |
| Control plane concurrency limit | 1 million per second? | 199 per second (NSX-T 2.5) | not mentioned; Borg? |
| CP database | etcd | Corfu | not mentioned; Chubby? |
| Deployment form | VMs or, most commonly, containers | VMs | unclear; the paper mentions that Maglev shares machines with other applications, so Borg? |
| South-north data plane | K8s nodes | NSX-T Edge nodes | not mentioned; Borg nodes? |
| South-north data plane technology | kube-proxy (iptables/ipvs) | NGINX | optimized kernel-bypass datapath module |
| South-north datapath | DNAT only; two hops in total: VIP → NodeIP → PodIP | DNAT, DSR, etc.; one hop: VIP → PodIP | DSR, with a hardware encapsulator between the router and Maglev for fast overlay; one hop: VIP → service endpoint |
| Data plane programmability | K8s controllers + CRs/core objects | NSX-T data model: LB + VirtualServer + ServerPool | Maglev config objects committed atomically (implies a CP system like etcd or ZooKeeper; Google uses Chubby) |
| State management | none | Edge Active/Standby deployment | Maglev consistent hashing: truly distributed, minimizing connection disruption while scaling out; the disruption rate is tunable via parameters of the consistent hashing algorithm |
| Cluster scalability | handles no state, so effectively unlimited | at most 10 nodes per Edge cluster and at most 160 Edge nodes in total; each LB maps to at most one pair of Edge nodes | Maglev handles state in a stateless way (consistent hashing), so effectively unlimited |

Clearly, the opportunity to build an enterprise-grade Distributed Software LB lies in the Dataplane.

Note:

  1. Antrea serves as a lightweight version of the NSX-T Open vSwitch-based dataplane agent;
  2. Cilium optimizes the dataplane by using eBPF to replace vanilla kube-proxy, which means we could potentially use Cilium in the dedicated MetalLB K8s cluster to achieve better performance;

Proposal:

  1. Use a Cilium-like eBPF-based module to optimize the dataplane:
    1. it could be deployed as a DaemonSet;
    2. it could be used to replace kube-proxy;
  2. Use Maglev Consistent Hashing to build a truly distributed LB with state handled (see the sketch after this list), meaning:
    1. connection stickiness is preserved as much as possible;
    2. it scales like the cloud; no longer the traditional Active-Standby or Active-Active model!

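To make proposal 2 concrete, here is a small sketch of the lookup-table construction described in the Maglev paper, with FNV-1a standing in for the paper's hash functions. Every LB instance that sees the same backend list computes the same table independently, and adding or removing a backend only remaps a small fraction of slots, which is what keeps connection disruption low without shared state.

```go
package maglev

import "hash/fnv"

// hash64 hashes a string with a salt using FNV-1a (standing in for the
// hash functions used in the Maglev paper).
func hash64(s, salt string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(salt))
	h.Write([]byte(s))
	return h.Sum64()
}

// BuildTable builds a Maglev lookup table of size m; m should be a prime
// larger than the number of backends. table[i] is the index into backends
// that owns slot i.
func BuildTable(backends []string, m uint64) []int {
	table := make([]int, m)
	for i := range table {
		table[i] = -1
	}
	n := uint64(len(backends))
	if n == 0 {
		return table
	}

	// Each backend walks its own permutation of slots: offset + j*skip mod m.
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	for i, b := range backends {
		offset[i] = hash64(b, "offset") % m
		skip[i] = hash64(b, "skip")%(m-1) + 1
	}

	next := make([]uint64, n) // next preference index for each backend
	var filled uint64
	for {
		for i := uint64(0); i < n; i++ {
			// Backend i claims its next still-empty preferred slot.
			c := (offset[i] + next[i]*skip[i]) % m
			for table[c] >= 0 {
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			table[c] = int(i)
			next[i]++
			filled++
			if filled == m {
				return table
			}
		}
	}
}

// Lookup maps a connection key (e.g. a serialized 5-tuple) to a backend;
// every instance with the same backend list picks the same one.
func Lookup(table []int, backends []string, key string) string {
	return backends[table[hash64(key, "flow")%uint64(len(table))]]
}
```

A larger prime table size (the paper evaluates 65537 and 655373) spreads load more evenly at the cost of memory and build time; local connection tracking layered on top keeps most existing flows on their backend when the backend set changes.
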
Besides, based on the Maglev paper, we would need to add the following improvements on top of MetalLB to implement something like Maglev:

  1. QoS: divide Services between multiple shards of LBs in the same cluster in order to achieve performance isolation (a toy sharding helper follows this list);
  2. Aggregation of VIPs by a component such as a Route Reflector sitting in front of all MetalLB BGP speakers, before the VIPs are published to the ToR router/gateway.
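
As a rough illustration of improvement 1, shard selection can be as simple as mapping each Service to an LB shard, where each shard would own a dedicated MetalLB address pool and speaker set, using an explicit QoS class label with a hash fallback. The label key and shard names below are hypothetical.

```go
package sharding

import "hash/fnv"

// Hypothetical shard names; each would own a dedicated MetalLB address
// pool and set of speakers in the LB cluster.
var shards = []string{"lb-shard-gold", "lb-shard-silver", "lb-shard-bronze"}

// ShardFor picks the LB shard for a Service. An explicit QoS class label
// (hypothetical key "lb.example.com/qos-class") wins; otherwise the Service
// is spread across shards by a stable hash of its namespace/name.
func ShardFor(namespace, name string, labels map[string]string) string {
	if class, ok := labels["lb.example.com/qos-class"]; ok {
		for _, s := range shards {
			if s == "lb-shard-"+class {
				return s
			}
		}
	}
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name))
	return shards[h.Sum32()%uint32(len(shards))]
}
```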