Fangyuan Li's Blog

What makes k8s better

Posted on 2021-07-16 Edited on 2021-07-17

It’s easy to create a process, but it’s hard to manage its lifecycle. What if it
dies, what if resources are insufficient, what if you want to create more
replicas, etc.

It’s easy to make a REST request, but it’s hard to manage its lifecycle. What if
it fails, what if you need to make many to achieve one logical goal, what if
some of them fail, what if the order matters and changes later, etc.

Therefore:
Q: What’s the difference between a Linux process and a K8s operator running in
a Pod?

A: Roughly speaking, a Pod is a managed process, whereas a process is a pure
compute unit that could be managed by systemd, etc.

Q: What’s the difference between managing some infra resources through making a
bunch of REST calls and creating a CRD for a k8s operator?

A: Roughly speaking, a CRD is a managed request. Also, YAML is a better user
interface.

Intention reads better than implementation. The system that helps you to focus
on intention wins eventually.
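As a toy illustration (the types and names here are hypothetical, not a real operator): the user only declares an intention, and a reconcile loop owns the messy lifecycle of making it true.

package main

import "fmt"

// ReplicaSpec is the intention: "keep N replicas of this workload running".
type ReplicaSpec struct {
    Name     string
    Replicas int
}

// reconcile drives the world toward the intention. Crash recovery and
// scaling up or down live here, not in the caller.
func reconcile(spec ReplicaSpec, running int) int {
    for running < spec.Replicas {
        running++
        fmt.Printf("%s: started replica %d\n", spec.Name, running)
    }
    for running > spec.Replicas {
        fmt.Printf("%s: stopped replica %d\n", spec.Name, running)
        running--
    }
    return running
}

func main() {
    // The caller states intent; the loop handles implementation.
    reconcile(ReplicaSpec{Name: "web", Replicas: 3}, 1)
}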

Contradiction and Unity

Posted on 2021-01-03

If the alienation of man and the imbalance between "grand rationality" and "petty rationality" are the Marxist scholars' indictment of modern capitalism, then:

  1. Is there any state built purely on "grand rationality" or "petty rationality"?
  2. If so, say certain periods of ancient Greece and Rome :), can we learn from their trajectories that the so-called "contradiction" is precisely what constituted the foundation of social stability in other states or eras?
  3. If such "contradiction" forms the foundation of social stability, how do we keep its conflicts within a controllable range? For example: if the American Dream and the ascetic ethic of Protestantism guide the values behind capitalist wealth accumulation, how do we balance the pan-democratic conflicts, protests, and even violence that follow from them? Then again, every country probably has its own "American Dream", e.g. "in books there are houses of gold" :)

Objectivity

Posted on 2021-01-03

Whenever the man of science introduces his personal value judgment, a full
understanding of the facts ceases. – Max Weber

The seriousness of scholarship and science necessarily demands that all
polemics and attacks be removed from the mechanisms of intellectual
communication. – Max Weber

Weber argued that intellectuals must, in their own lives, separate their
identity as scholars from their identity as actors. Conflating the judgments
the two pursue inevitably endangers the validity of both.

Keep this in mind.

ToB or ToC

Posted on 2020-10-24

Even though you’re building a to-C service, you should think about how your
service could one day be consumed by other companies.

This encourages modularity, and the industry is shifting towards a more agile
model where new products and services can easily be built upon existing ones.

Try to make a building block instead of a silo.

How to come up with good abstractions

Posted on 2020-10-12 Edited on 2020-10-16

When thinking about abstraction, do not model objects, instead model intentions.
In other words, do not abstract nouns, abstract verbs.

Maybe that’s why composition is preferred over inheritance in so many places.

And I think the former scales far better than the latter.
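Here is a minimal Go sketch of the point (the names are made up for illustration): abstract the verb "notify" rather than a hierarchy of notifier nouns, and compose extra behavior instead of inheriting it.

package main

import "fmt"

// Notifier abstracts the verb: anything that can notify satisfies it.
type Notifier interface {
    Notify(msg string) error
}

type EmailNotifier struct{ Addr string }

func (e EmailNotifier) Notify(msg string) error {
    fmt.Println("email to", e.Addr+":", msg)
    return nil
}

// RetryNotifier composes retry behavior around any Notifier,
// instead of inheriting from a concrete sender.
type RetryNotifier struct {
    Inner   Notifier
    Retries int
}

func (r RetryNotifier) Notify(msg string) error {
    var err error
    for i := 0; i <= r.Retries; i++ {
        if err = r.Inner.Notify(msg); err == nil {
            return nil
        }
    }
    return err
}

func main() {
    var n Notifier = RetryNotifier{Inner: EmailNotifier{Addr: "a@b.c"}, Retries: 2}
    n.Notify("hello")
}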

Random Thoughts

Posted on 2020-10-06

Curiosity drives DFS, while entrepreneurship demands BFS

Cloud Native L4 Load Balancer: MetalLB, NSX-T and Maglev

Posted on 2020-06-15

Something magical happens when MetalLB is used in the following fashion:

  1. MetalLB is deployed in a dedicated LB cluster;
  2. LB cluster is deployed in front of all workload clusters;
  3. all Services of type LoadBalancer are projected into the LB cluster;

Compare this setup with a traditional proprietary SDN, e.g., NSX-T, and a cloud load balancer like Maglev, as used in GCP.

Comparison of MetalLB (in an LB cluster, with Services projected), NSX-T, and Maglev:

  1. Control plane: K8s API Server (MetalLB); NSX-T Manager (NSX-T); not mentioned, Borg? (Maglev)
  2. Control-plane concurrency limit: 1 million per second? (MetalLB); 199 per second in NSX-T 2.5 (NSX-T); not mentioned, Borg? (Maglev)
  3. Control-plane database: etcd (MetalLB); Corfu (NSX-T); not mentioned, Chubby? (Maglev)
  4. Deployment form: VMs or, most commonly, containers (MetalLB); VMs (NSX-T); unclear, the paper mentions Maglev shares machines with other applications, Borg? (Maglev)
  5. South-north data plane: K8s nodes (MetalLB); NSX-T Edge nodes (NSX-T); not mentioned, Borg nodes? (Maglev)
  6. South-north data-plane technology: kube-proxy with iptables/IPVS (MetalLB); Nginx (NSX-T); an optimized kernel-bypass datapath module (Maglev)
  7. South-north datapath: DNAT only, two hops in total, VIP→NodeIP→PodIP (MetalLB); DNAT, DSR, etc., one hop, VIP→PodIP (NSX-T); DSR with a hardware encapsulator between the router and Maglev for fast overlay, one hop, VIP→service endpoint (Maglev)
  8. Data-plane programmability: K8s controllers + CRs/core objects (MetalLB); the NSX-T data model of LB + VirtualServer + ServerPool (NSX-T); config objects committed atomically, which implies a CP system like etcd or ZooKeeper (Google's Chubby) (Maglev)
  9. State management: none (MetalLB); active + standby Edge deployment (NSX-T); consistent hashing that minimizes disruption while scaling as much as possible, truly distributed, with the disruption rate tunable via the consistent-hashing parameters (Maglev)
  10. Cluster scalability: no state handled, so unlimited (MetalLB); at most 10 nodes per Edge cluster and 160 Edge nodes in total, with one LB mapped to at most one pair of Edge nodes (NSX-T); stateless, since state is handled in a stateless way via consistent hashing, so unlimited (Maglev)

Clearly, the opportunity to build an enterprise-grade distributed software LB lies in the dataplane.

Note:

  1. Antrea serves as a lightweight version of NSX-T's Open vSwitch-based dataplane agent;
  2. Cilium optimizes the dataplane by using eBPF to replace vanilla kube-proxy. That means we could potentially use Cilium in the dedicated MetalLB K8s cluster to achieve better performance;

Proposal:

  1. Use a Cilium-like eBPF-based module to optimize the dataplane
    1. could be deployed as a DaemonSet;
    2. could be used to replace kube-proxy;
  2. Use Maglev consistent hashing to build a truly distributed LB with state handled (see the sketch right after this list), meaning:
    1. connection stickiness is preserved as much as possible;
    2. scalable like the cloud; no more traditional Active-Standby or Active-Active models!
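To make the consistent-hashing part concrete, here is a minimal sketch of the lookup-table population algorithm from the Maglev paper. The table size and hash function here are my assumptions; the paper only requires a prime table size and two independent hashes per backend.

package main

import (
    "fmt"
    "hash/fnv"
)

const tableSize = 65537 // must be prime; a larger table lowers the disruption rate

func hashName(name string, seed byte) uint64 {
    h := fnv.New64a()
    h.Write([]byte{seed})
    h.Write([]byte(name))
    return h.Sum64()
}

// buildLookup fills a lookup table mapping each slot to a backend index.
// Each backend claims slots following its own permutation, so adding or
// removing one backend only remaps a small fraction of the slots.
func buildLookup(backends []string) []int {
    n := len(backends)
    offset := make([]uint64, n)
    skip := make([]uint64, n)
    for i, b := range backends {
        offset[i] = hashName(b, 0) % tableSize
        skip[i] = hashName(b, 1)%(tableSize-1) + 1
    }
    next := make([]uint64, n) // next index into each backend's permutation
    table := make([]int, tableSize)
    for i := range table {
        table[i] = -1
    }
    for filled := 0; ; {
        for i := 0; i < n; i++ {
            // Walk backend i's permutation to its next empty slot.
            c := (offset[i] + next[i]*skip[i]) % tableSize
            for table[c] >= 0 {
                next[i]++
                c = (offset[i] + next[i]*skip[i]) % tableSize
            }
            table[c] = i
            next[i]++
            filled++
            if filled == tableSize {
                return table
            }
        }
    }
}

func main() {
    table := buildLookup([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"})
    // Hash the connection 5-tuple to pick a slot; stickiness follows.
    flow := hashName("192.168.1.7:443->10.0.0.0:80", 2) % tableSize
    fmt.Println("flow goes to backend", table[flow])
}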

Besides, based on the Maglev paper, we would need to add the following improvements to MetalLB to implement something like Maglev:

  1. QoS: divide Services between multiple shards of LBs in the same cluster in order to achieve performance isolation;
  2. Aggregation of VIPs by a component like a route reflector sitting in front of all MetalLB BGP peers, before all VIPs are published to the ToR router/gateway;

CAS, Level-based API and Agile

Posted on 2020-01-13 Edited on 2020-01-14

semaphore

Let’s take a look at an over-simplified version of Golang runtime’s semaphore lock() implementation. If you’re interested, see the original version here.

Before you start, let’s make some assumptions to ease your understanding:

  1. a mutex l is an unsigned integer;
  2. it’s unlocked by default, having value 0;
  3. it’s locked by setting its value to 1;
func lock(l *mutex) {
    // Speculative grab for lock.
    if atomic.Casuintptr(&l.key, 0, locked) {
        return
    }

    // On uniprocessors, no point spinning.
    // On multiprocessors, spin for ACTIVE_SPIN attempts.
    spin := 0
    if ncpu > 1 {
        spin = active_spin
    }
    for i := 0; ; i++ {
        v := atomic.Loaduintptr(&l.key)
        if v&locked == 0 {
            // Unlocked. Try to lock.
            if atomic.Casuintptr(&l.key, v, v|locked) {
                return
            }
            i = 0
        }
        if i < spin {
            // Busy-wait on the CPU for a few cycles.
            procyield(active_spin_cnt)
        } else if i < spin+passive_spin {
            // Yield the OS thread and try again.
            osyield()
        } else {
            // Still locked after spinning. Queue up and sleep until woken.
            if v&locked != 0 {
                semasleep(-1)
                i = 0
            }
        }
    }
}

What happens in the lock function can be summarized in the following points:

  1. if it is unlocked, grab it by setting it to 1;
  2. otherwise, we do some active spinning to wait;
  3. before we start spinning, if it is unlocked, grab it by setting it to 1;
  4. spin if we haven’t spent too much time actively waiting;
  5. otherwise, go to sleep;
  6. once woken up, start over from step 1;

That’s Compare-and-Set (CAS).
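The same compare-and-set primitive is exposed to user code by the standard sync/atomic package. Here is a toy spinlock sketch built on it (illustrative only; this is not how sync.Mutex is implemented):

package main

import (
    "runtime"
    "sync/atomic"
)

// spinlock follows the same convention as above: 0 = unlocked, 1 = locked.
type spinlock struct{ key uint32 }

func (l *spinlock) lock() {
    // Keep trying to flip 0 -> 1; yield the processor between attempts.
    for !atomic.CompareAndSwapUint32(&l.key, 0, 1) {
        runtime.Gosched()
    }
}

func (l *spinlock) unlock() {
    atomic.StoreUint32(&l.key, 0)
}

func main() {
    var l spinlock
    l.lock()
    // ... critical section ...
    l.unlock()
}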

Now, let me try to write some ugly pseudo code to summarize.

run() {
beginning:
    # Make observations
    o := observe()
    # Act based on the observations and make a new proposal
    p := act(o)
    # Commit the proposal only if the past observations still hold;
    # if not, go back to the beginning
    goback := false
    atomic {
        if o == observe() {
            setp(p)
        } else {
            goback = true
        }
    }
    if goback {
        goto beginning
    }
}

Leveled API

Feels familiar? This is actually a very typical controller at a high level. And it’s called a level-based API in Kubernetes. By “level-based”, we mean:

  1. we make no other assumptions about the world except our last observations;
  2. we make decisions completely based on our assumptions;
  3. we try to commit our decisions if our assumptions still hold, otherwise, we update our assumptions;
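In Kubernetes, this loop rests on optimistic concurrency: every object carries a resourceVersion, and an update is rejected with a Conflict error if the object has changed since it was read. Here is a hedged sketch using client-go's retry helper; the scaleUp intent and the clientset wiring are assumptions for illustration.

package demo

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

// scaleUp bumps a Deployment's replica count using the
// observe -> act -> commit-or-go-back loop from above.
func scaleUp(clientset kubernetes.Interface, ns, name string) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Observe: read the latest object (and its resourceVersion).
        d, err := clientset.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        // Act: propose a change based on the observation.
        replicas := *d.Spec.Replicas + 1
        d.Spec.Replicas = &replicas
        // Commit: the API server rejects this with a Conflict error if
        // someone updated the object in the meantime, and we go back.
        _, err = clientset.AppsV1().Deployments(ns).Update(context.TODO(), d, metav1.UpdateOptions{})
        return err
    })
}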

Why?

Some points I can think of:

  1. optimistic concurrency control, thereby we get “optimal”;
  2. remain as stateless as possible;
    1. scalable;
    2. better manageability;

Distinct thought pattern? Or repeating history?

CAS?

This is all I know, let’s move the world forward; otherwise, allow me to refresh my knowledge and try again.

Agile?

Some items copied from Wikipedia: Agile Overview

  1. Iterative, incremental;
  2. Efficient and face-to-face communication
  3. Very short feedback loop and adaptation cycle;

Similar?

Conclusion

In multithreaded programming, CAS, or optimistic concurrency control, is used to:

  1. capture my moment;
  2. make my contribution based on my understanding;
  3. try really hard to catch up with the real world;

Here, data is modified all the time by our peers, and we need to stay up to date with it to make changes.

In a team, Standup or Sprint Planning is used to achieve almost the same thing:

  1. (to market or customers or PM) what do you think?
  2. here is what I propose to solve your problem, does it still make sense?
  3. what’s on your mind now? let me know!

Yeah, if humans were just more advanced machines, why not manage them the same way :) Of course, don’t forget to add Happiness on top.

Custom Resource and Controllers: The new paradigm for programming

Posted on 2020-01-13 Edited on 2020-01-14

Paradigm?

According to Wikipedia, the explanation for paradigm is “In science and philosophy, a paradigm (/ˈpærədaɪm/) is a distinct set of concepts or thought patterns, including theories, research methods, postulates, and standards for what constitutes legitimate contributions to a field.” And this is exactly what Custom Resource + Controllers provide.

Interfaces

Duck typing

Watch this talk on Kubecon: Extending Knative for Fun and Profit - Matt Moore & Ville Aikas, Google

Polymorphism

Watch this talk on Kubecon: Polymorphic Reconcilers in Kubernetes - Advanced DuckTyping - Scott Nichols & Matt Moore, Google

Subresource

See the scale subresource used by the HPA.

Subresources can be either data or services. Think of subresources as virtualized objects, interfaces of a CR. If the subresource is data, it is a subset of the CR’s fields describing a specific aspect; if the subresource is a service, it is a verb used to interact with the CR object.

  1. status and scale are examples of the former;
  2. log, exec, and portforward are examples of the latter.
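As a sketch of that service/interface view (GetScale and UpdateScale are real client-go calls; the resize function and its wiring are illustrative), resizing a Deployment through the scale subresource touches only that virtualized object, not the rest of the Deployment:

package demo

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// resize changes only the replica count via the scale subresource.
func resize(clientset kubernetes.Interface, ns, name string, replicas int32) error {
    // Read the Scale object: a projection of the Deployment exposing
    // just the scaling-related fields.
    s, err := clientset.AppsV1().Deployments(ns).GetScale(context.TODO(), name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    s.Spec.Replicas = replicas
    // Write back through the same subresource.
    _, err = clientset.AppsV1().Deployments(ns).UpdateScale(context.TODO(), name, s, metav1.UpdateOptions{})
    return err
}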

See https://github.com/kubernetes/kubernetes/issues/72637 for discussions on the support of arbitrary subresources for custom resources.

subresource + controller

Is supporting arbitrary subresources + customized controllers too generic to be useful?

Design Patterns

Watch this talk on Kubecon: Growth and Design Patterns in the Extensions Ecosystem - Eric Tune, Google

Cloud Native Virtualization Technologies

Posted on 2020-01-03

I’ve been investigating cloud native virtualization technologies in the past week. More specifically, I tried to:

  1. create/power on/ssh into/power off/delete a virtual machine using qemu/kvm;
  2. create/power on/ssh into/power off/delete a virtual machine through libvirt + esxi;
  3. create/power on/ssh into/power off/delete a virtual machine through libvirt + qemu/kvm;
  4. create/ssh into/suspend/delete a container through libvirt + lxc;
  5. try out kata-container by following its quickstart guide:
    1. build the kernel;
    2. build the rootfs;
    3. add kata-container as an extra runtime to docker;
    4. run docker to start a container through kata-container so the container is actually running as a VM;
    5. only qemu/kvm is used;
  6. try out firecracker by following its quickstart guide, which basically boots up a VM using a provided kernel and rootfs;
  7. try out ignite by following its quickstart guide to import a kernel from a docker image, and deploy a container vm, ssh into it and delete it;
  8. also read some documentations on:
    1. kubevirt
    2. virtlet

After which, I got a rough idea of:

  1. what libvirt is;
  2. how kubevirt and virtlet are architected and how they use libvirt;
  3. what kata-containers, firecracker, cloud-hypervisor, qemu/kvm, and lxc/lxd are;

To summarize:

  1. kubevirt and virtlet both leverage libvirt to provision VMs, mainly using libvirt’s qemu/kvm mode;
    1. kubevirt includes an operator, a handler per K8s node, and a launcher per VM;
    2. virtlet instead implements a CRI proxy that calls the virtlet process to provision the VM, besides the in-band calls to docker;
  2. firecracker provides a binary that quickly boots a microVM with a provided kernel and rootfs. It opens up a socket and listens for REST requests (see the sketch right after this list);
  3. ignite is a thin wrapper over firecracker as far as I can tell, adding the ability to easily manage kernel images and rootfs in an organized, docker-image way. But it’s not OCI-compliant, so it can only be used standalone for now. Besides, it seems to be integrated with the weave net CNI by default;
  4. kata-containers is pretty well interfaced compared with ignite. It’s an OCI-compliant container runtime, so it can easily be swapped in to replace runc in docker, and through cri-o it can also be used as the runtime in K8s. On the backend, it leverages the virtcontainers library to abstract hypervisor calls into an interface. qemu/kvm, cloud-hypervisor, firecracker, and acrn are supported now, according to the docs;
  5. cloud-hypervisor and firecracker are built on the rust-vmm library, which includes many useful components for building a VMM quickly, effectively, and securely. Among them is a kvm ioctl wrapper: kvm-bindings.
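To make that REST interface concrete, here is a minimal Go sketch that boots a microVM through Firecracker's API over its Unix socket. The endpoints follow the quickstart guide, while the socket path and image paths are assumptions.

package main

import (
    "bytes"
    "context"
    "net"
    "net/http"
)

func main() {
    // HTTP client that dials Firecracker's Unix domain socket.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return net.Dial("unix", "/tmp/firecracker.socket")
            },
        },
    }
    put := func(path, body string) {
        req, err := http.NewRequest(http.MethodPut, "http://localhost"+path, bytes.NewBufferString(body))
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/json")
        resp, err := client.Do(req)
        if err != nil {
            panic(err)
        }
        resp.Body.Close()
    }
    // Point the microVM at a kernel and a rootfs, then start it.
    put("/boot-source", `{"kernel_image_path": "./vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}`)
    put("/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "./rootfs.ext4", "is_root_device": true, "is_read_only": false}`)
    put("/actions", `{"action_type": "InstanceStart"}`)
}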

All my findings are recorded in my personal wiki, and I’ll see if I can host it somewhere.

The following picture demonstrates their relationships:
[Figure: knowledge graph]
