How to design a Kubernetes cluster
- There is no best architecture: it all depends on your requirements and resources
- Clusters with 1 to 3 nodes are good for testing, but for production you should have at least 5 nodes
- Feel free to do things differently from what I say
Since its launch in 2015, Kubernetes has become the new bitcoin: every company wants to migrate all its infrastructure (which is obviously made of hundreds of microservices) to show how cool it is. Even when we cut the bullshit, I believe Kubernetes is still a revolution, and if you are managing servers, you should take a look. If you are unsure what Kubernetes is, or what its components are, you should probably read a little about it first. I might write an article about that, but in the meantime you can get an overview from this schema1 and this article. This article is not meant as an analysis of the pros and cons of Kubernetes, but rather as an answer to the question "what should my Kubernetes cluster look like?". I will take for granted that Kubernetes is worth looking at; if you are interested in the whys, come back later, as I will probably write such a post in the future.
When I design my clusters (especially the first ones), I always find it hard to know how to architect the cluster and where to set the cursor between high resilience/availability, high performance and low complexity (complexity driving the time and human resources required to operate the cluster).
In this article, I will try to design and explain in detail the best Kubernetes cluster. Obviously, there is no best cluster architecture: it all depends on your requirements and resources. Instead, I will show several possible architectures, with their pros and cons. You should find one that fits your use case.
Disclaimer: these architectures are what I came up with after some thinking: I don’t claim to be right. As with any other subject, you should always double check what you read and form your own opinion.
Kubernetes leaves a lot of choices to the user: there are tons of different possible clusters. I find it difficult to find which one is the most secure / resilient / performant. So here are some thoughts and tips to help you make your choice.
Note: this article focuses on self-hosted clusters. For AWS/GCE/Azure there are a few differences, but for this high-level article it should not matter.
First, let’s talk about Kubernetes nodes: the different kinds, rough sizing advice, and some hardware suggestions.
A Kubernetes cluster is made of several node types: master, worker and proxy.
The node type is mostly arbitrary: all nodes have the same software installed and can be quickly reconfigured to be one or another type.
The difference is purely a matter of Kubernetes configuration.
It is based on Taints2, which are applied when configuring the node with
For instance, all master nodes will have the Taint node-role.kubernetes.io/master="", which will prevent Pods from being scheduled on them (unless the Pod has a Toleration, used for instance by Pods that are part of the Kubernetes control plane and must run on the master node).
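To make the Taint/Toleration relationship more concrete, here is a minimal sketch in Python. It is a simplified model of the matching rule and not the real scheduler code: the `Taint`, `Toleration`, `tolerates` and `can_schedule` names are made up for this illustration, and only the `Equal` and `Exists` operators are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # empty means "any effect"


def tolerates(toleration, taint):
    """Return True if this toleration matches this taint (simplified)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with "Exists" tolerates every taint.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(node_taints, pod_tolerations):
    """A Pod fits on a node if every blocking taint is tolerated."""
    blocking = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in blocking)


master_taint = Taint(key="node-role.kubernetes.io/master")
control_plane_toleration = Toleration(
    key="node-role.kubernetes.io/master", operator="Exists", effect="NoSchedule"
)

print(can_schedule([master_taint], []))                          # regular Pod
print(can_schedule([master_taint], [control_plane_toleration]))  # control plane Pod
```

A regular Pod without any Toleration is rejected by the tainted master, while a Pod carrying the matching Toleration is accepted.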
Let’s describe these node types a little more in depth.
The master node holds the Kubernetes control plane. The Kubernetes control plane includes all the components required for Kubernetes to function properly: kube-apiserver, kube-scheduler, kube-controller-manager, kube-dns and etcd.
On one hand, all of the kube-* Pods are stateless and therefore not worth replicating: when one of them crashes, it will restart in a few seconds; and while it is down, the running Pods won’t be disturbed: you don’t have downtime.
On the other hand, etcd, which holds all the state of the cluster, is critical. As with the kube-* components, if it crashes, no Pods will get killed. Your cluster will only enter a *read-only* state, where you won’t be able to interact with it (to create new Pods, for instance).
While kube-* components are probably not worth replicating (hard to replicate and low benefits), etcd is really worth replicating: it is easy to replicate and has immediate benefits.
The worker nodes are the workhorse of the cluster: they run the Pods.
Kubernetes encourages the cattle approach, as opposed to the pet approach. For worker nodes, it is easy to adhere to this principle, as each worker is non-critical to the health of the cluster: if a worker dies, Kubernetes will reschedule the Pods elsewhere. Moreover, worker nodes can join a cluster very easily. Master and Proxy nodes are harder to think of as cattle, but they have some properties of it: most components are stateless, and the etcd cluster can be resized at will.
You should take care when sizing the worker nodes: if one fails, the others could be overloaded when its Pods are rescheduled onto them by kube-scheduler. To prevent overloading a worker, you should apply limits: on the node and per Pod (and/or per Namespace).
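As an illustration of per-Pod limits, here is a small sketch that builds a Pod manifest with `resources.requests` and `resources.limits` set on its container. The helper name `pod_with_limits`, the Pod name, the image and the chosen values are all assumptions made for the example; pick values that match your own workload.

```python
import json


def pod_with_limits(name, image, cpu_limit="500m", memory_limit="256Mi"):
    """Build a Pod manifest whose single container has requests and limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        # What the scheduler reserves for the container.
                        "requests": {"cpu": "250m", "memory": "128Mi"},
                        # Hard ceiling enforced on the node.
                        "limits": {"cpu": cpu_limit, "memory": memory_limit},
                    },
                }
            ]
        },
    }


manifest = pod_with_limits("web", "nginx")
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -` on a test cluster.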
The proxy nodes will only run a kube-proxy instance (and an ingress-controller to minimize the number of network hops for a request).
This kind of node will be exposed externally, and therefore should expose as little as possible. This will reduce the attack surface, and help protect the integrity of the cluster.
Here is a rough guide for node sizing. Note that if you want to run a very large cluster (hundreds of nodes), you probably already have data for smaller clusters and therefore know what to expect.
|Node type||Cpu||Ram||Limiting resource|
|Master||1-2||2-8 GB||Cpu (kube-*) / Ram (etcd)|
|Worker||1+||2+ GB||Depends on the load|
|Proxy||1||1-4 GB||Network / Cpu|
The most critical resource is memory: 2 GB is a minimum, but if you are running a test cluster and you have very few resources, you might get away with 1.5 GB. Most probably not much less, and I would not advise it, as processes might be killed by the OOM killer3.
Obviously, you should choose your worker nodes according to the load that you need to withstand.
Building a Kubernetes cluster requires a few servers, and if you want some decent security and availability with the ability to execute your workload, it quickly gets to 5 or 10 servers.
- The most efficient cluster would be made of some very powerful servers (worker nodes) and some less powerful ones (master and proxy nodes).
- If you have a few very powerful servers, you could have VMs on them.
- If you are tight on budget, you can use Raspberry Pis, but you will be very constrained (they might work well for proxy and master nodes though, and maybe even for worker nodes if you have lightweight apps to run).
When designing any system, you always have several goals to optimize for. In our case, we will try to have a cluster with as much security and high availability as we can, while keeping the cluster (relatively) simple to set up and operate. The line I will draw between these constraints will probably not be the one you will draw for your use case, but I will try to explain my choices as much as possible to help you make your own decision.
To ensure the security and high availability of your cluster, you should have several nodes of each type. This number depends on the node type:
- For master nodes, the number is driven by etcd, which requires 3 nodes (or 1 if you don’t need high availability, or 5 if you need to be resilient to two node failures).
- For proxy nodes, you only need redundancy, which 2 nodes achieve. One node might be enough though, if you want to keep it simple.4
- For worker nodes, it depends on your workload, but you need at least 2 nodes for redundancy.
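The etcd numbers above follow from majority quorum: a cluster of n members needs more than half of them alive to accept writes. Here is a short sketch of that arithmetic (the function names are mine, chosen for the illustration):

```python
def quorum(members):
    """Smallest majority of an etcd cluster with this many members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still alive."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} member(s): quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

This also shows why even member counts are rarely worth it: 4 members tolerate no more failures than 3.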
Note that you should provision resources (mainly Cpu and Ram, but don’t forget Disk and Network) so that N-1 worker nodes handle the load (in case one node fails). The more nodes you have, the smaller the impact will be in case of a failure. For example, if you have two worker nodes, and one fails, the other one should be able to handle the load (or at least have quota and limits set up so that it does not crash miserably).
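The N-1 rule can be checked with a few lines of arithmetic. This is only a sketch with hypothetical helper names: it treats capacity and load as a single number (for example Cpu cores or GB of Ram) and assumes the largest node is the one that fails, which is the worst case.

```python
def survives_one_failure(node_capacities, total_load):
    """True if the remaining workers can carry the load after losing the biggest one."""
    if len(node_capacities) < 2:
        return False  # a single worker has no redundancy at all
    remaining = sum(node_capacities) - max(node_capacities)
    return remaining >= total_load


def max_safe_utilization(worker_count):
    """Fraction of total capacity usable with identical workers under the N-1 rule."""
    return (worker_count - 1) / worker_count


print(survives_one_failure([8, 8], total_load=8))  # two 8-core workers, 8 cores of load
print(survives_one_failure([8, 8], total_load=9))
print(max_safe_utilization(2), max_safe_utilization(4))
```

With two identical workers you can only use half of the total capacity, while with four you can use three quarters: more nodes means a smaller impact per failure.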
Concerning persistent data, you have several choices when creating a Kubernetes
The choice is done with the
- hostPath: this mounts a folder from the host system. Note that if the Pod is restarted on another node, the data will stay on the first node and will be unavailable to the Pod. To solve this, you need to share a filesystem over the network.
- nfs: this is a proven system, but it has no redundancy whatsoever. If the NFS server fails, all your data is unavailable until it is up again. Also, all the disk load is concentrated on the NFS server.
- ceph: this shares a filesystem across several nodes. You get failure tolerance and distributed disk load.
- More options are available, refer to the full list.
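To give an idea of what such a volume looks like, here is a sketch that builds a PersistentVolume manifest backed by NFS. The helper name `persistent_volume`, the volume name, the server address `10.0.0.10` and the export path are placeholders I made up; only the manifest field names come from the Kubernetes API.

```python
import json


def persistent_volume(name, size, source):
    """Build a PersistentVolume manifest; `source` selects the storage backend."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            **source,
        },
    }


# Hypothetical NFS server and export path.
nfs_volume = persistent_volume(
    "shared-data", "10Gi", {"nfs": {"server": "10.0.0.10", "path": "/exports/data"}}
)
print(json.dumps(nfs_volume, indent=2))
```

Switching backend only changes the `source` argument, which is why the choice can be revisited later without touching the Pods that claim the volume.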
Note: don’t forget backups and remember that untested backups are not backups.
User and administration access
Finally, for security purposes, you should not expose your master and worker nodes. This is where the proxy node finds its purpose.
You don’t want to expose the master node to minimize the chances of unauthorized access to kube-apiserver and etcd (which could allow denial of service or lead to write access).
You don’t want to expose the worker nodes, to prevent a malicious actor from connecting to an arbitrary port that could have been opened by a rogue Pod. By forcing Pods to be exposed through the Ingress5, you can have a very small surface exposed to the outside world.
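Here is a sketch of an Ingress manifest that routes one hostname to one Service, so that the Service itself never needs to be reachable from outside. I am assuming a cluster recent enough to serve the `networking.k8s.io/v1` API; the helper name `ingress`, the hostname, the Service name and the port are placeholders for the example.

```python
import json


def ingress(name, host, service, port):
    """Build an Ingress manifest sending all traffic for `host` to one Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {"name": service, "port": {"number": port}}
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }


# Hypothetical hostname and Service.
web_ingress = ingress("web", "www.example.com", service="web", port=80)
print(json.dumps(web_ingress, indent=2))
```

Only the ingress controller on the proxy node listens on the external IP; everything behind it stays on the private network.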
Let’s break down the various architecture options by the number of nodes you have at your disposal. Note that you can create VMs if you don’t have enough nodes.
Small: <5 nodes
With fewer than 5 nodes you don’t have many choices if you don’t want to waste all your servers on Kubernetes plumbing. Your best bet is to have 1 master and N workers, as described by the following schema.
- For this architecture, 3 nodes (1 master + 2 workers) is the minimum for a viable cluster if you want to benefit from Kubernetes features (automatic scheduling and self healing of your applications).
- Redundancy for the master node is not the priority:
- With such a small cluster, you probably have few resources (human and hardware) at your disposal.
- You can run an instance of etcd on your worker nodes. Be sure to enable HTTPS-only connections and to enforce authentication with certificates.
Large: 5+ nodes
With at least 5 nodes, you can afford higher security and higher availability.
The architecture for this cluster size will be 3 master, 1-2 proxy and N worker.
On the master nodes, you can replicate only etcd or the whole Kubernetes control plane. Replicating the whole control plane removes another single point of failure, but whether it is worth it depends on whether you can afford the added complexity.
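The two layouts can be summed up in a small helper. This is only my reading of the rules above, written as a sketch: the function name `plan_cluster` is made up, and the threshold of 7 nodes for adding a second proxy is an arbitrary assumption (the text only says 1-2 proxies).

```python
def plan_cluster(nodes):
    """Suggest how many master, proxy and worker nodes to use for a node count."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if nodes < 5:
        # Small cluster: one master, everything else runs the workload.
        return {"master": 1, "proxy": 0, "worker": nodes - 1}
    # Large cluster: 3 masters for etcd, 1-2 proxies, the rest as workers.
    proxies = 2 if nodes >= 7 else 1  # assumption: second proxy from 7 nodes up
    return {"master": 3, "proxy": proxies, "worker": nodes - 3 - proxies}


for count in (3, 5, 7, 10):
    print(count, plan_cluster(count))
```

Note that with exactly 5 nodes only one worker is left, so worker redundancy needs at least 6 nodes (or VMs, as described in the next section).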
The best architecture
The best architecture would leverage VMs on 3 powerful servers. You would have:
- 3 physical machines with (at least) 8 Cpu cores and 16 GB Ram.
- On each of them, configure 2 or 3 VMs (master, proxy, worker) as follows:
- 3 master for etcd and Kubernetes control plane
- 1 proxy for security and high availability (exposed externally)
- As many workers as we have Cpu and Ram left
|Node type||Cpu||Ram (GB)||Reason|
|Master||2||8||High availability for etcd|
|Proxy||2||4||Security and high availability (exposed externally)|
|Worker||4+||4+||As many as possible|
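As a quick sanity check of these numbers, here is a sketch that computes what is left for the worker VM on one physical machine once the master and proxy VMs are carved out. The host size and VM sizes come from the list and table above; the names `HOST`, `VM_SIZES` and `worker_budget` are mine, and hypervisor overhead is ignored.

```python
HOST = {"cpu": 8, "ram_gb": 16}

VM_SIZES = {
    "master": {"cpu": 2, "ram_gb": 8},
    "proxy": {"cpu": 2, "ram_gb": 4},
}


def worker_budget(host, other_vms):
    """Cpu and Ram left for the worker VM after the other VMs are allocated."""
    return {
        "cpu": host["cpu"] - sum(VM_SIZES[vm]["cpu"] for vm in other_vms),
        "ram_gb": host["ram_gb"] - sum(VM_SIZES[vm]["ram_gb"] for vm in other_vms),
    }


print(worker_budget(HOST, ["master", "proxy"]))  # machine hosting the proxy
print(worker_budget(HOST, ["master"]))           # machine without a proxy
```

On the machine that also hosts the proxy, the worker gets exactly the 4 Cpu and 4 GB minimum from the table; on the others it gets 6 Cpu and 8 GB.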
Some general tips:
- 1 private IP for every node
- 1 external IP for the proxy nodes
- 1 Disk for the system (OS) and Kubernetes database (etcd)
- 1 Disk for the network filesystem for the Pods
Kubernetes allows a lot of freedom, and it is not always easy to know which option is the best.
Before setting up a highly available Kubernetes cluster, you should know the different failures and their impacts (for example, if the master node dies, what happens? The answer is: pretty much nothing). You should really read this article about high availability and the absence of a single point of failure. In short: the first goal is to work on etcd, as Kubernetes itself is very resilient.
Don’t expect Kubernetes to solve all your currently unsolved problems: high availability, security, monitoring, etc. Everything is about trade-offs, but in any case, if you migrate to a Kubernetes cluster, you will be improving. Probably not everywhere but at least on some topics.
This article comes to an end; I hope you found it useful. Your questions/comments/complaints are welcome.
Source: https://en.wikipedia.org/wiki/File:Kubernetes.png
For more information, see https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
For more information, see https://en.wikipedia.org/wiki/Out_of_memory#Out_of_memory_management
If you choose the 2 proxy version, take a look at https://github.com/kubernetes/contrib/tree/master/keepalived-vip
If you don’t know what it is, I will tell you: this is dope! Go check https://kubernetes.io/docs/concepts/services-networking/ingress/ and https://github.com/kubernetes/ingress-nginx