Cloud Native L4 Load Balancer: MetalLB, NSX-T and Maglev

Something magical happens when MetalLB is used in the following fashion:

  1. MetalLB is deployed in a dedicated LB cluster;
  2. the LB cluster is deployed in front of all workload clusters;
  3. all Services of type=LoadBalancer are projected into the LB cluster (a minimal projection sketch follows this list).

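The projection in step 3 amounts to a small controller that watches Services of type=LoadBalancer in every workload cluster and mirrors them into the LB cluster. Below is a minimal, one-shot sketch of that idea, not an existing MetalLB feature: it assumes a single kubeconfig with hypothetical context names `workload` and `lb`, and it ignores updates, deletions, and the endpoint wiring that would point the projected Service back at the workload cluster.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// clientFor builds a clientset for one named context in a kubeconfig file.
func clientFor(kubeconfig, ctxName string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		&clientcmd.ClientConfigLoadingRules{ExplicitPath: kubeconfig},
		&clientcmd.ConfigOverrides{CurrentContext: ctxName},
	).ClientConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}

func main() {
	ctx := context.Background()
	kubeconfig := "/path/to/kubeconfig" // hypothetical path with both contexts defined

	workload, err := clientFor(kubeconfig, "workload") // hypothetical context names
	if err != nil {
		log.Fatal(err)
	}
	lb, err := clientFor(kubeconfig, "lb")
	if err != nil {
		log.Fatal(err)
	}

	// List every Service in the workload cluster and mirror the LoadBalancer ones.
	svcs, err := workload.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, svc := range svcs.Items {
		if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		projected := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:      svc.Name,
				Namespace: svc.Namespace, // the namespace must already exist in the LB cluster
				Labels:    map[string]string{"projected-from": "workload"},
			},
			Spec: corev1.ServiceSpec{
				Type:  corev1.ServiceTypeLoadBalancer,
				Ports: svc.Spec.Ports,
				// A real controller would also steer traffic back to the workload
				// cluster, e.g. via manually managed EndpointSlices; omitted here.
			},
		}
		if _, err := lb.CoreV1().Services(svc.Namespace).Create(ctx, projected, metav1.CreateOptions{}); err != nil {
			log.Printf("projecting %s/%s: %v", svc.Namespace, svc.Name, err)
		}
	}
}
```
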
Compare this setup with a traditional proprietary SDN such as NSX-T, and with a cloud load balancer such as Maglev as used in GCP.

|  | MetalLB in LB Cluster with Services Projected | NSX-T | Maglev |
| --- | --- | --- | --- |
| Control plane | K8s API server | NSX-T Manager | not mentioned; Borg? |
| Control plane concurrency limit | 1 million per second? | 199 per second (NSX-T 2.5) | not mentioned; Borg? |
| CP database | etcd | Corfu | not mentioned; Chubby? |
| Deployment form | VMs or, most commonly, containers | VMs | unclear; the paper mentions that Maglev shares machines with other applications, so Borg? |
| South-north data plane | K8s nodes | NSX-T Edge nodes | not mentioned; Borg nodes? |
| South-north data plane technology | kube-proxy (iptables/ipvs) | NGINX | optimized kernel-bypass datapath module |
| South-north datapath | DNAT only; two hops in total: VIP → NodeIP → PodIP | DNAT, DSR, etc.; one hop: VIP → PodIP | DSR, with a hardware encapsulator between the router and Maglev for fast overlay; one hop: VIP → service endpoint |
| Data plane programmability | K8s controllers + CRs/core objects | NSX-T data model: LB + VirtualServer + ServerPool | Maglev config objects committed atomically (implies a CP system like etcd or ZooKeeper; Google uses Chubby) |
| State management | none | Edge Active/Standby deployment | Maglev consistent hashing: truly distributed, minimizing connection disruption while scaling out; the disruption rate is tunable via parameters of the consistent hashing algorithm |
| Cluster scalability | handles no state, so effectively unlimited | at most 10 nodes per Edge cluster and at most 160 Edge nodes in total; each LB maps to at most one pair of Edge nodes | Maglev handles state in a stateless way (consistent hashing), so effectively unlimited |

Clearly, the opportunity to build an enterprise-grade Distributed Software LB lies in the Dataplane.

Note:

  1. Antrea serves as a lightweight version of the NSX-T Open vSwitch-based dataplane agent;
  2. Cilium optimizes the dataplane by using eBPF to replace vanilla kube-proxy, which means we could potentially use Cilium in the dedicated MetalLB K8s cluster to achieve better performance;

Proposal:

  1. Use a Cilium-like eBPF-based module to optimize the dataplane:
    1. it could be deployed as a DaemonSet;
    2. it could be used to replace kube-proxy;
  2. Use Maglev Consistent Hashing to build a truly distributed LB with state handled (see the sketch after this list), meaning:
    1. connection stickiness is preserved as much as possible;
    2. it scales like the cloud; no longer the traditional Active-Standby or Active-Active model!

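To make proposal 2 concrete, here is a small sketch of the lookup-table construction described in the Maglev paper, with FNV-1a standing in for the paper's hash functions. Every LB instance that sees the same backend list computes the same table independently, and adding or removing a backend only remaps a small fraction of slots, which is what keeps connection disruption low without shared state.

```go
package maglev

import "hash/fnv"

// hash64 hashes a string with a salt using FNV-1a (standing in for the
// hash functions used in the Maglev paper).
func hash64(s, salt string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(salt))
	h.Write([]byte(s))
	return h.Sum64()
}

// BuildTable builds a Maglev lookup table of size m; m should be a prime
// larger than the number of backends. table[i] is the index into backends
// that owns slot i.
func BuildTable(backends []string, m uint64) []int {
	table := make([]int, m)
	for i := range table {
		table[i] = -1
	}
	n := uint64(len(backends))
	if n == 0 {
		return table
	}

	// Each backend walks its own permutation of slots: offset + j*skip mod m.
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	for i, b := range backends {
		offset[i] = hash64(b, "offset") % m
		skip[i] = hash64(b, "skip")%(m-1) + 1
	}

	next := make([]uint64, n) // next preference index for each backend
	var filled uint64
	for {
		for i := uint64(0); i < n; i++ {
			// Backend i claims its next still-empty preferred slot.
			c := (offset[i] + next[i]*skip[i]) % m
			for table[c] >= 0 {
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			table[c] = int(i)
			next[i]++
			filled++
			if filled == m {
				return table
			}
		}
	}
}

// Lookup maps a connection key (e.g. a serialized 5-tuple) to a backend;
// every instance with the same backend list picks the same one.
func Lookup(table []int, backends []string, key string) string {
	return backends[table[hash64(key, "flow")%uint64(len(table))]]
}
```

A larger prime table size (the paper evaluates 65537 and 655373) spreads load more evenly at the cost of memory and build time; local connection tracking layered on top keeps most existing flows on their backend when the backend set changes.
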
Besides, based on the Maglev paper, we would need to add the following improvements on top of MetalLB to implement something like Maglev:

  1. QoS: divide Services between multiple shards of LBs in the same cluster in order to achieve performance isolation (a toy sharding helper follows this list);
  2. Aggregation of VIPs by a component such as a Route Reflector sitting in front of all MetalLB BGP speakers, before the VIPs are published to the ToR router/gateway.
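
As a rough illustration of improvement 1, shard selection can be as simple as mapping each Service to an LB shard, where each shard would own a dedicated MetalLB address pool and speaker set, using an explicit QoS class label with a hash fallback. The label key and shard names below are hypothetical.

```go
package sharding

import "hash/fnv"

// Hypothetical shard names; each would own a dedicated MetalLB address
// pool and set of speakers in the LB cluster.
var shards = []string{"lb-shard-gold", "lb-shard-silver", "lb-shard-bronze"}

// ShardFor picks the LB shard for a Service. An explicit QoS class label
// (hypothetical key "lb.example.com/qos-class") wins; otherwise the Service
// is spread across shards by a stable hash of its namespace/name.
func ShardFor(namespace, name string, labels map[string]string) string {
	if class, ok := labels["lb.example.com/qos-class"]; ok {
		for _, s := range shards {
			if s == "lb-shard-"+class {
				return s
			}
		}
	}
	h := fnv.New32a()
	h.Write([]byte(namespace + "/" + name))
	return shards[h.Sum32()%uint32(len(shards))]
}
```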