Something magical happens when MetalLB is used in the following fashion:
- MetalLB is deployed in a dedicated LB cluster;
- the LB cluster is deployed in front of all workload clusters;
- all Services of type=LoadBalancer are projected into the LB cluster (a minimal projection sketch follows this list);
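To make the "projection" idea concrete, here is a minimal Go sketch of a hypothetical standalone controller; it is not a MetalLB feature or API. It copies Service type=LoadBalancer objects from a workload cluster into the dedicated LB cluster, where MetalLB allocates and announces the VIPs. The kubeconfig paths and the `projected` namespace are made-up names.

```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func clientFor(kubeconfig string) *kubernetes.Clientset {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	return cs
}

func main() {
	workload := clientFor("workload.kubeconfig") // hypothetical paths
	lb := clientFor("lb.kubeconfig")
	ctx := context.Background()

	// List every Service in the workload cluster ("" = all namespaces).
	svcs, err := workload.CoreV1().Services("").List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, s := range svcs.Items {
		if s.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		// Project the Service into the LB cluster. A real controller would
		// also watch for changes and mirror EndpointSlices pointing back at
		// the workload cluster's node IPs, since the backing pods do not
		// exist in the LB cluster.
		mirror := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-%s", s.Namespace, s.Name),
				Namespace: "projected",
			},
			Spec: corev1.ServiceSpec{
				Type:  corev1.ServiceTypeLoadBalancer,
				Ports: s.Spec.Ports,
			},
		}
		if _, err := lb.CoreV1().Services("projected").Create(ctx, mirror, metav1.CreateOptions{}); err != nil {
			log.Printf("mirror %s/%s: %v", s.Namespace, s.Name, err)
		}
	}
}
```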
Compare this setup with a traditional proprietary SDN, e.g. NSX-T, and with a cloud load balancer like Maglev, which is used in GCP.
| | MetalLB in LB Cluster with Services Projected | NSX-T | Maglev |
| --- | --- | --- | --- |
| Control plane | K8s API server | NSX-T Manager | not mentioned; Borg? |
| Control plane concurrency limit | 1 million requests per second? | 199 per second (NSX-T 2.5) | not mentioned; Borg? |
| CP database | etcd | Corfu | not mentioned; Chubby? |
| Deployment form | VMs or, most commonly, containers | VMs | unclear; the paper mentions that Maglev shares machines with other applications, so Borg? |
| North-south data plane | K8s nodes | NSX-T Edge nodes | not mentioned; Borg nodes? |
| North-south data plane technology | kube-proxy: iptables/ipvs | NGINX | optimized kernel-bypass datapath module |
| North-south datapath | DNAT only; two hops in total: VIP → NodeIP → PodIP | DNAT, DSR, etc.; one hop: VIP → PodIP | DSR; a hardware encapsulator between the router and Maglev for fast overlay; one hop: VIP → service endpoint |
| Data plane programmability | K8s controllers + CRs/core objects | NSX-T data model: LB + VirtualServer + ServerPool | Maglev config objects, committed atomically (implies a CP system like etcd or ZooKeeper; Google's Chubby) |
| State management | none | Edge active + standby deployment | Maglev consistent hashing: minimizes disruption while scaling as much as possible, truly distributed; the disruption rate is tunable via parameters of the consistent-hashing algorithm |
| Cluster scalability | handles no state, so unlimited | at most 10 nodes per Edge cluster and at most 160 Edge nodes in total; one LB maps to at most one pair of Edge nodes | Maglev handles state in a stateless way (consistent hashing), so unlimited |
Clearly, the opportunity to build an enterprise-grade distributed software LB lies in the data plane.
Note:
- Antrea serves as a lightweight version of NSX-T's Open vSwitch-based data plane agent;
- Cilium optimizes the data plane by using eBPF to replace vanilla kube-proxy. That means we could potentially run Cilium in the dedicated MetalLB K8s cluster to achieve better performance;
Proposal:
- Use a Cilium-like eBPF-based module to optimize the data plane:
    - could be deployed as a DaemonSet;
    - could be used to replace kube-proxy;
- Use Maglev consistent hashing to build a truly distributed LB with state handled (see the lookup-table sketch after this list), meaning:
    - connection stickiness is preserved as much as possible;
    - it scales like the cloud; no more traditional active-standby or active-active models!
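Below is a minimal Go sketch of the lookup-table population from Algorithm 1 of the Maglev paper ("Maglev: A Fast and Reliable Software Network Load Balancer", NSDI 2016). The FNV hashing and the table size M = 65537 (a prime) are my own stand-ins for the paper's two hash functions and its choice of M.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const tableSize = 65537 // M: must be prime and much larger than the backend count

func hash(seed, name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(seed))
	h.Write([]byte(name))
	return h.Sum64()
}

// permutation computes one backend's preference list over all table slots.
// Because M is prime, (offset + j*skip) mod M visits every slot exactly once.
func permutation(name string) []uint64 {
	offset := hash("offset", name) % tableSize
	skip := hash("skip", name)%(tableSize-1) + 1
	p := make([]uint64, tableSize)
	for j := uint64(0); j < tableSize; j++ {
		p[j] = (offset + j*skip) % tableSize
	}
	return p
}

// populate fills the lookup table so every slot maps to a backend index.
// Backends take turns claiming their next preferred slot; occupied slots
// fall through to the backend's next preference.
func populate(backends []string) []int {
	perms := make([][]uint64, len(backends))
	for i, b := range backends {
		perms[i] = permutation(b)
	}
	next := make([]uint64, len(backends))
	entry := make([]int, tableSize)
	for j := range entry {
		entry[j] = -1
	}
	filled := 0
	for {
		for i := range backends {
			c := perms[i][next[i]]
			for entry[c] >= 0 {
				next[i]++
				c = perms[i][next[i]]
			}
			entry[c] = i
			next[i]++
			filled++
			if filled == tableSize {
				return entry
			}
		}
	}
}

func main() {
	table := populate([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"})
	fmt.Println("slot 12345 ->", table[12345])
}
```

A packet's 5-tuple hash indexes this table to pick a backend; when a backend is added or removed, only a small fraction of slots remap, which is what preserves connection stickiness across a fleet of LB nodes without any shared connection state.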
Besides, based on the Maglev paper, we need to add the following improvements on top of MetalLB to implement something like Maglev:
- QoS: divide Services among multiple LB shards within the same cluster in order to achieve performance isolation (a deterministic sharding sketch follows this list);
- aggregation of VIPs by a component like a route reflector sitting in front of all MetalLB BGP peers, before the VIPs are published to the ToR router/gateway.
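Here is a hypothetical sketch of the QoS sharding idea: deterministically pin each projected Service to one of a fixed number of LB shards, so a heavy tenant can only saturate its own shard. The shard count and the namespace/name keying are assumptions, not an existing MetalLB feature.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numShards = 4 // e.g. 4 MetalLB shards in the LB cluster

// shardFor maps a Service to a stable shard index; the projection
// controller would then label the mirrored Service so that only the
// chosen shard announces its VIP.
func shardFor(namespace, name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name))
	return h.Sum32() % numShards
}

func main() {
	fmt.Println(shardFor("payments", "checkout"))
	fmt.Println(shardFor("analytics", "ingest"))
}
```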