How to design a Kubernetes cluster



TL;DR:

Introduction

Since its launch in 2015, Kubernetes has been the new bitcoin: every company wants to migrate all its infrastructure (which is obviously made of hundreds of microservices) to it to show how cool they are. Even when we cut the bullshit, I believe Kubernetes is still a revolution, and if you are managing servers, you should take a look. If you are unsure what Kubernetes is, or what its components are, you should probably read a little about it. I might write an article about that, but in the meantime you can get an overview in this schema1 and in this article. This article is not meant as an analysis of the pros and cons of Kubernetes, but rather as an answer to the question: what should my Kubernetes cluster look like? Here, I will take it for granted that Kubernetes is worth looking at; if you are interested in the whys, come back later, I will probably write such a post in the future.

When I design my clusters (especially the first ones), I always find it hard to know how to architect the cluster and where to draw the line between high resilience/availability, high performance and low complexity (complexity driving the time and human resources required to operate the cluster).

Kubernetes logo
Kubernetes logo - Source

In this article, I will try to design and explain in detail the best Kubernetes cluster. Obviously, there is no best cluster architecture; it all depends on your requirements and resources. Instead, I will show several possible architectures, with their pros and cons. You should find one that fits your use case.

Disclaimer: these architectures are what I came up with after some thinking; I don’t claim to be right. As with any other subject, you should always double-check what you read and form your own opinion.

Kubernetes leaves a lot of choices to the user: there are tons of different possible clusters, and it is difficult to tell which one is the most secure / resilient / performant. So here are some thoughts and tips to help you make your choice.

Note: this article focuses on self-hosted clusters. For AWS/GCE/Azure there are a few differences, but for this high-level article it should not matter.

Kubernetes nodes

First, let’s talk about Kubernetes nodes: the different kinds, rough sizing advice, and some hardware suggestions.

Node types

A Kubernetes cluster is made of several node types: master, worker and proxy.

The node type is mostly arbitrary: all nodes have the same software installed and can be quickly reconfigured to be one or another type. The difference is purely a matter of Kubernetes configuration. It is based on Taints2, which are applied when configuring the node with kubeadm. For instance, all master nodes carry the Taint node-role.kubernetes.io/master="" which prevents Pods from being scheduled on them (unless the Pod has a matching Toleration, used for instance by Pods that are part of the Kubernetes control plane and must run on the master nodes).
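To make this concrete, here is a minimal sketch (node, Pod and image names are illustrative) of how the master taint appears in a Node object, and of the Toleration a Pod needs in order to be scheduled on such a node anyway:

```yaml
# How the taint set by kubeadm shows up in the Node object (name is illustrative).
apiVersion: v1
kind: Node
metadata:
  name: master-1
spec:
  taints:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
---
# A Pod that tolerates this taint and may therefore run on the master node.
apiVersion: v1
kind: Pod
metadata:
  name: toleration-example
spec:
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  containers:
  - name: main
    image: k8s.gcr.io/pause:3.1   # placeholder container image
```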

Let’s describe these node types a little more in depth.

Master

The master node holds the Kubernetes control plane. The Kubernetes control plane includes all the components required for Kubernetes to function properly: kube-apiserver, kube-scheduler, kube-controller-manager, kube-dns and etcd.

On the one hand, all of the kube-* Pods are stateless and therefore not worth replicating: when one of them crashes, it restarts in a few seconds; and while it is down, the running Pods are not disturbed: you have no downtime.

On the other hand, etcd, which holds all the state of the cluster, is critical. Similarly to the kube-* components, if it crashes, no Pods will get killed. Your cluster will only enter a read-only state, where you won’t be able to interact with it (to create new Pods, for instance).

While the kube-* components are probably not worth replicating (hard to replicate and low benefits), etcd is really worth replicating: it is easy to replicate and has immediate benefits.

Worker

The worker nodes are the workhorses of the cluster: they run the Pods.

Kubernetes encourages the cattle approach, as opposed to the pet approach. For worker nodes, it is easy to adhere to this principle, as each worker is non-critical to the health of the cluster: if a worker dies, Kubernetes will reschedule its Pods elsewhere. Moreover, worker nodes can join a cluster very easily (a sketch follows below). Master and proxy nodes are harder to think of as cattle, but they have some properties of it: most components are stateless, and the etcd cluster can be resized at will.
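As a hedged illustration of how simple joining is with kubeadm, here is a minimal JoinConfiguration sketch; the API server endpoint, token and node name are placeholders, and a real setup would pin the CA certificate hash instead of skipping verification:

```yaml
# Minimal kubeadm join configuration for a new worker (all values are placeholders).
# Assumed usage: kubeadm join --config join.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: "10.0.0.1:6443"
    token: "abcdef.0123456789abcdef"
    unsafeSkipCAVerification: true   # in production, set caCertHashes instead
nodeRegistration:
  name: worker-3
```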

You should take care when sizing the worker nodes: if one fails, the others could be DDoS-ed when its Pods are rescheduled by kube-scheduler. To prevent overloading a worker, you should apply limits: on the node and per Pod (and/or per Namespace).
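For example, here is a sketch (namespace, Pod and image names are illustrative) combining per-container requests/limits with a per-Namespace ResourceQuota:

```yaml
# Per-container requests/limits on a Pod (all names and values are illustrative).
apiVersion: v1
kind: Pod
metadata:
  name: api
  namespace: team-a
spec:
  containers:
  - name: api
    image: nginx:1.17
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
---
# A ResourceQuota capping what the whole Namespace may request and consume.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
```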

Proxy

The proxy nodes will only run a kube-proxy instance (and an ingress-controller to minimize the number of network hops for a request).

These nodes will be exposed externally, so their purpose is to present the smallest possible exposed surface. This reduces the attack surface and helps protect the integrity of the cluster.

Node sizing

Here is a rough guide for node sizing. Note that if you want to run a very large cluster (hundreds of nodes), you probably already have data from smaller clusters and therefore know what to expect.

Node type | CPU | RAM    | Bottleneck
Master    | 1-2 | 2-8 GB | CPU (kube-*) / RAM (etcd)
Worker    | 1+  | 2+ GB  | Depends on the load
Proxy     | 1   | 1-4 GB | Network / CPU

The most critical resource is memory: 2 GB is a minimum, but if you are running a test cluster and have very few resources, you might get away with 1.5 GB. Most probably not much less, and I would not advise it, as processes might get killed by the OOM killer3.

Obviously, you should choose your worker nodes according to the load that you need to withstand.

Hardware suggestions

Building a Kubernetes cluster requires a few servers, and if you want decent security and availability with the ability to execute your workload, it quickly gets to 5 or 10 servers.

Some suggestions:

Goals

When designing any system, you always have several goals to optimize for. In our case, we will try to have a cluster with as much security and high availability as we can, while keeping the cluster (relatively) simple to set up and operate. The line I will draw between these constraints will probably not be the one you would draw for your use case, but I will try to explain my choices as much as possible to help you make your own decision.

Nodes

To ensure the security and high availability of your cluster, you should have several nodes of each type. This number depends on the node type:

Note that you should provision resources (mainly CPU and RAM, but don’t forget disk and network) so that N-1 worker nodes can handle the load (in case one node fails). The more nodes you have, the smaller the impact of a failure will be. For example, if you have two worker nodes and one fails, the other one should be able to handle the load (or at least have quotas and limits set up so that it does not crash miserably).
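One way to set this up, assuming a namespace named team-a, is a LimitRange that gives every container a default request and limit, so a burst of rescheduled Pods cannot silently eat a whole node:

```yaml
# Default requests/limits applied to containers that don't specify any (sketch).
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
```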

Persistent data

Concerning persistent data, you have several choices when creating a Kubernetes PersistentVolume. The choice is made with the volume type set in the PersistentVolume spec:
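As an illustration, a PersistentVolume backed by an NFS export could look like this sketch (the server address and export path are placeholders):

```yaml
# A PersistentVolume whose type is nfs (server and path are placeholders).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.20
    path: /exports/data
```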

Note: don’t forget backups and remember that untested backups are not backups.

User and administration access

Finally, for security purposes, you should not expose your master and worker nodes. This is where the proxy node finds its purpose.

You don’t want to expose the master node to minimize the chances of unauthorized access to kube-apiserver and etcd (which could allow denial of service or lead to write access).

You don’t want to expose the worker nodes, to prevent a malicious actor from connecting to an arbitrary port that could have been opened by a rogue Pod. By forcing Pods to be exposed through an Ingress5, you can have a very small surface exposed to the outside world.
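For instance, a single Ingress resource can expose an internal Service through the ingress-controller running on the proxy nodes; this sketch assumes a recent apiVersion (older clusters used extensions/v1beta1) and illustrative host and Service names:

```yaml
# Expose an internal Service through the ingress-controller (names are illustrative).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-app
            port:
              number: 80
```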

Architectures

Let’s break down the various architecture options by the number of nodes you have at your disposal. Note that you can create VMs if you don’t have enough nodes.

Small: <5 nodes

With fewer than 5 nodes you don’t have many choices if you don’t want to waste all your servers on Kubernetes plumbing. Your best bet is to have 1 master and N workers, as described by the following schema.

One node is master and is exposed to the internet. The other nodes are workers and are behind the master.
Kubernetes small cluster

Large: 5+ nodes

With at least 5 nodes, you can afford higher security and higher availability.

The architecture for this cluster size will be 3 masters, 1-2 proxies4 and N workers.

On the master nodes, you can replicate only etcd, or the whole Kubernetes control plane. Replicating the whole control plane removes another single point of failure, but whether to do it depends on whether you can afford the added complexity.
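If you only replicate etcd, one hedged way to wire it in is to point kubeadm at an external etcd cluster; in this sketch the endpoints are placeholders and the certificate paths follow kubeadm’s usual layout:

```yaml
# kubeadm ClusterConfiguration using an already-replicated external etcd (sketch).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
    - https://10.0.0.11:2379
    - https://10.0.0.12:2379
    - https://10.0.0.13:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

Replicating the whole control plane is then mostly a matter of joining the extra masters as control-plane nodes with kubeadm, at the cost of more moving parts to operate.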

One node is proxy and is exposed to the internet. The other nodes are three master and several workers and are behind the proxy. All the nodes are connected to the first master.
Kubernetes large cluster

The best architecture

The best architecture would leverage VMs on 3 powerful servers. You would have:

Node type | CPU | RAM (GB) | Reason
Master    | 2   | 8        | High availability for etcd
Proxy     | 2   | 4        | Security and high availability (exposed externally)
Worker    | 4+  | 4+       | As many as possible

Some general tips:

Conclusion

Kubernetes allows a lot of freedom, and it is not always easy to know which option is the best.

Before setting up a highly available Kubernetes cluster, you should know the different failure modes and their impacts (for example, if the master node dies, what happens? The answer: pretty much nothing). You should really read this article about high availability and the absence of a single point of failure. In short: the first thing to work on is etcd; Kubernetes itself is very resilient.

Don’t expect Kubernetes to solve all your currently unsolved problems: high availability, security, monitoring, etc. Everything is about trade-offs, but in any case, if you migrate to a Kubernetes cluster, you will be improving. Probably not everywhere but at least on some topics.

This article comes to an end; I hope you found it useful. Your questions/comments/complaints are welcome.


  1. Source: https://en.wikipedia.org/wiki/File:Kubernetes.png ↩︎

  2. For more information, see https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/ ↩︎

  3. For more information, see https://en.wikipedia.org/wiki/Out_of_memory#Out_of_memory_management ↩︎

  4. If you choose the 2 proxy version, take a look at https://github.com/kubernetes/contrib/tree/master/keepalived-vip ↩︎

  5. If you don’t know what it is, I will tell you: this is dope! Go check https://kubernetes.io/docs/concepts/services-networking/ingress/ and https://github.com/kubernetes/ingress-nginx ↩︎