This guide explains how to build a highly-available, hyperconverged Kubernetes cluster using MicroK8s, Ceph and MetalLB on commodity hardware or virtual machines. This could be useful for small production deployments, dev/test clusters, or a nerdy toy.
Other guides are available – this one is written from a sysadmin point of view, focusing on stability and ease of maintenance. I prefer to avoid running random scripts or fetching binaries that are then unmanaged and unmanageable. This guide uses package managers and operators wherever possible. I’ve also attempted to explain each step so readers can gain some understanding instead of just copying and pasting the commands. However, this does not absolve you from having a decent background of the components, and it is strongly recommended that you are familiar with kubectl/Kubernetes and Ceph in particular.
The technological landscape moves so fast that these instructions may become outdated quickly. I’ll link to upstream documentation wherever possible so you can check for updated versions.
Finally, this is a fairly simplistic guide that gives you the minimum possible configuration. There are many other components and configurations that you can add, and it also takes no account of security with RBAC etc.
Hardware
There are a few considerations when choosing your hardware or virtual “hardware” for use as Kubernetes nodes.
- MicroK8s requires at least 3 nodes to work in HA mode, so we’ll start with 3 VMs
- While MicroK8s is quite lightweight, by the time you start adding the storage capability you will need a reasonable amount of memory. Recommended minimum spec for this guide is 2 CPUs and 4GB RAM. More is obviously better, depending on your workload.
- Each VM will need two block devices (disks). One should be partitioned, formatted and used as a normal OS disk, and the other should be left untouched so it can be claimed by Ceph later. The OS disk will also contain cached container images so could get quite large. I’ve allowed 16GB for the OS disk, and Ceph requires a minimum of 10GB for its disk.
- If running in VirtualBox, place all VMs either in the same NAT network, or bridged to the host network. Ideally have static IPs.
- If you are running on bare metal, make sure the machines are on the same network, or at least on networks that can talk to each other.
In my case, I used VirtualBox and created 3 identical VMs: kube01, kube02 and kube03.
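Before going further, it’s worth confirming on each node that the second disk is present and completely untouched, since Rook will only claim clean, empty disks. This is just a suggested check, assuming the spare disk shows up as /dev/sdb:
# The Ceph disk should show no partitions, filesystem or mountpoint
lsblk -f /dev/sdb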
Operating system
This guide focuses on CentOS/Fedora but should be applicable to many distributions with minor tweaks. I have started with a CentOS 8 minimal installation. Fedora Server or Ubuntu Server would work just as well, but you’ll need to tweak some of the commands.
- Don’t create a swap partition on these machines
- Make sure ntp is enabled for accurate time
- Make sure the VMs have static IPs or DHCP reservations, so their IPs won’t change
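On CentOS 8, the points above can be handled with something along these lines (chronyd is the default time service on CentOS 8; adjust if your distribution uses something else):
# Disable swap immediately and stop it being re-enabled at boot
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Make sure time synchronisation is enabled and running
sudo systemctl enable --now chronyd
timedatectl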
Snap
Reference: https://snapcraft.io/docs/installing-snap-on-centos
Snap is the package manager used to distribute MicroK8s. It comes preinstalled on Ubuntu, but if you’re on CentOS, Fedora or others, you’ll need to install it on all your nodes.
sudo dnf -y install epel-release
sudo dnf -y install snapd
sudo systemctl enable --now snapd
sudo ln -s /var/lib/snapd/snap /snap
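Optionally, you can check that snapd has finished seeding before installing anything with it:
sudo snap wait system seed.loaded
snap version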
MicroK8s
Reference: https://microk8s.io/
MicroK8s is a lightweight, pre-packaged Kubernetes distribution which is easy to use and works well for small deployments. It’s a lot more straightforward than following Kubernetes the hard way.
Install
Install MicroK8s 1.19.1 or greater from Snap on all your nodes:
sudo snap install microk8s --classic --channel=latest/edge
microk8s status --wait-ready
echo 'alias kubectl="microk8s kubectl"' >> ~/.bashrc
The first time you run microk8s status, you will be prompted to add your user to the microk8s group. Follow the instructions and log in again.
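For reference, the prompt normally boils down to something like the following; follow whatever MicroK8s actually prints on your version:
# Add your user to the microk8s group and take ownership of the kubectl cache directory
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
# Log out and back in (or run 'newgrp microk8s') for the group change to take effect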
Enable HA mode
Reference: https://microk8s.io/docs/high-availability
Enable MicroK8s HA mode on all nodes, which allows any node to act as a master instead of just a worker. This must be enabled before nodes are joined to the master. On some versions of MicroK8s this is enabled by default.
microk8s enable ha-cluster
Add firewall rules
Reference: https://microk8s.io/docs/ports
Create firewall rules for your nodes, so they can communicate with each other.
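As an illustration, on CentOS with firewalld the rules could look something like this; treat the port list as indicative and check the ports reference above for the authoritative list for your MicroK8s version:
sudo firewall-cmd --permanent --add-port=16443/tcp   # Kubernetes API server
sudo firewall-cmd --permanent --add-port=10250/tcp   # kubelet
sudo firewall-cmd --permanent --add-port=25000/tcp   # cluster agent (used by join)
sudo firewall-cmd --permanent --add-port=19001/tcp   # dqlite (HA datastore)
sudo firewall-cmd --permanent --add-port=4789/udp    # Calico VXLAN overlay
sudo firewall-cmd --reload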
Enable clustering
Reference: https://microk8s.io/docs/clustering
Enable MicroK8s clustering, which allows you to add multiple worker nodes to your existing master node.
Run this on the first node only:
[jonathan@kube01 ~]$ microk8s add-node
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.0.41:25000/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Then execute the join command on the second node, to join it to the master.
[jonathan@kube02 ~]$ microk8s join 192.168.0.41:25000/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Contacting cluster at 192.168.0.41
Waiting for this node to finish joining the cluster. ..
Repeat for the third node, and remember to run the add-node command for each node you add, so they all get a unique token.
Verify that they are correctly joined:
[jonathan@kube01 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube01.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
kube03.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
kube02.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
Finally make sure that full HA mode is enabled:
[jonathan@kube01 ~]$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 192.168.0.41:19001 192.168.0.42:19001 192.168.0.43:19001
datastore standby nodes: none
addons:
enabled:
...
Addons
Reference: https://microk8s.io/docs/addon-dns
Reference: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Enable some basic addons across the cluster to provide a usable experience. Run this on any one node.
microk8s enable dns rbac
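A quick way to confirm the addons took effect is to re-run the status command and check that CoreDNS pods appear in the kube-system namespace:
microk8s status
kubectl -n kube-system get pods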
Check
We’ve already checked that all 3 nodes are up. Now let’s make sure pods are being scheduled on all nodes:
[jonathan@kube01 ~]$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system calico-node-bqqqd 1/1 Running 0 112m 192.168.0.41 kube01.jonathangazeley.com
kube-system calico-node-z4sxd 1/1 Running 0 110m 192.168.0.43 kube03.jonathangazeley.com
kube-system calico-kube-controllers-847c8c99d-4qblz 1/1 Running 0 115m 10.1.58.1 kube01.jonathangazeley.com
kube-system coredns-86f78bb79c-t2sgt 1/1 Running 0 109m 10.1.111.65 kube02.jonathangazeley.com
kube-system calico-node-t5skc 1/1 Running 0 111m 192.168.0.42 kube02.jonathangazeley.com
With the cluster in a healthy and operational state, let’s add the hyperconverged storage. From now on, all steps can be run on kube01.
Ceph
Ceph is a clustered storage engine which can present its storage to Kubernetes as block storage or a filesystem. We will use the Rook operator to manage our Ceph deployment.
Install
Reference: https://rook.io/docs/rook/v1.4/ceph-quickstart.html
These steps are taken verbatim from the official Rook docs. Check the link above to make sure you are using the latest version of Rook.
First we install the Rook operator, which automates the rest of the Ceph installation.
git clone --single-branch --branch release-1.4 https://github.com/rook/rook.git
cd rook/cluster/examples/kubernetes/ceph
kubectl create -f common.yaml
kubectl create -f operator.yaml
kubectl -n rook-ceph get pod
Wait until the rook-ceph-operator pod and the rook-discover pods are all Running. This took a few minutes for me. Then we can create the actual Ceph cluster.
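If you’d rather not poll the pods by hand, kubectl can block until the operator deployment reports ready before you proceed (the deployment name below is the default from operator.yaml):
kubectl -n rook-ceph wait --for=condition=Available deployment/rook-ceph-operator --timeout=600s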
kubectl create -f cluster.yaml
kubectl -n rook-ceph get pod
This command will probably take a while – be patient. The operator creates various pods including canaries, monitors, a manager, and provisioners. There will be periods where it looks like it isn’t doing anything, but don’t be tempted to intervene. You can check what the operator is doing by reading its log:
kubectl -n rook-ceph logs rook-ceph-operator-775d4b6c5f-52r87
Check
Reference: https://rook.io/docs/rook/v1.4/ceph-toolbox.html
Install the Ceph toolbox and connect to it so we can run some checks.
kubectl create -f toolbox.yaml
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
OSDs are the individual pieces of storage. Make sure all 3 are available and check the overall health of the cluster.
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph status
cluster:
id: e37a9364-b2e4-42ba-a7c0-c7276bc2083d
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,d (age 2m)
mgr: a(active, since 33s)
osd: 3 osds: 3 up (since 89s), 3 in (since 89s)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 45 GiB / 48 GiB avail
pgs: 1 active+clean
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 kube03.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
1 kube02.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
2 kube01.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
Block storage
Reference: https://rook.io/docs/rook/v1.4/ceph-block.html
Ceph can provide persistent block storage to Kubernetes as a storage class which can be consumed by one pod at any one time.
kubectl create -f csi/rbd/storageclass.yaml
Verify that the block storageclass is available:
[jonathan@kube01 ~]$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
rook-ceph-block rook-ceph.rbd.csi.ceph.com Delete Immediate true 3m53s
Filesystem
Reference: https://rook.io/docs/rook/v1.4/ceph-filesystem.html
By adding a filesystem layer, Ceph can provide persistent storage which can be consumed by multiple pods simultaneously.
kubectl create -f filesystem.yaml
Use the toolbox again to verify that there is a metadata service (mds) available:
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph status
cluster:
id: e37a9364-b2e4-42ba-a7c0-c7276bc2083d
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,d (age 36m)
mgr: a(active, since 34m)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 35m), 3 in (since 35m)
task status:
scrub status:
mds.myfs-a: idle
mds.myfs-b: idle
data:
pools: 4 pools, 97 pgs
objects: 22 objects, 2.2 KiB
usage: 3.0 GiB used, 45 GiB / 48 GiB avail
pgs: 97 active+clean
io:
client: 852 B/s rd, 1 op/s rd, 0 op/s wr
Now we can create a new storageclass based on the filesystem:
kubectl create -f csi/cephfs/storageclass.yaml
Verify the storageclass is present:
[jonathan@kube01 ceph]$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
rook-ceph-block (default) rook-ceph.rbd.csi.ceph.com Delete Immediate true 49m
rook-cephfs rook-ceph.cephfs.csi.ceph.com Delete Immediate true 34m
Consume
It’s easy to consume the new Ceph storage. Use the storageClassName rook-ceph-block in ReadWriteOnce mode for persistent storage for a single pod, or rook-cephfs in ReadWriteMany mode for persistent storage that can be shared between pods.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
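These claims are mounted into pods like any other PVC. As a minimal illustration (the pod name, container image and mount path here are arbitrary):
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-demo
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ceph-rbd-pvc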
Ingress
Reference: https://microk8s.io/docs/addon-ingress
Probably the simplest way to expose web applications on your cluster is to use an Ingress. This binds to ports 80 and 443 on all your nodes and listens for HTTP and HTTPS requests. It effectively does name-based virtual hosting, terminates TLS, and directs your web traffic to a Kubernetes Service with an internal ClusterIP, which acts as a simple load balancer. You will need to set up external round-robin DNS to point your A record at all 3 of the node IPs.
microk8s enable ingress
sudo firewall-cmd --permanent --add-service http
sudo firewall-cmd --permanent --add-service https
sudo firewall-cmd --reload
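A minimal Ingress resource then looks something like this; the hostname and backend Service name are placeholders for your own application:
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80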
MetalLB
Reference: https://microk8s.io/docs/addon-metallb
If you want to set up more advanced load balancing, consider using MetalLB. It will load balance your Kubernetes Service and present it on a single virtual IP.
Install
MetalLB will prompt you for one or more ranges of IPs that it can use for load-balancing. It should be fine to accept the default suggestion.
[jonathan@kube01 ~]$ microk8s enable metallb
Enabling MetalLB
Enter each IP address range delimited by comma (e.g. '10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111'): 10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
Consume
Once MetalLB is installed and configured, to expose a service externally, simply create it with spec.type set to LoadBalancer, and MetalLB will do the rest.
It’s important to note that in the default configuration, the virtual IP will only appear on one of your nodes, and that node will act as the entry point for all traffic before it gets load balanced between nodes, so this could be a bottleneck in busy environments.
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
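Once the Service is created, MetalLB should assign an address from the range you configured; it appears in the EXTERNAL-IP column:
kubectl get service nginx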
Summary
You now have a fully-featured Kubernetes cluster with high availability, clustered storage, ingress, and load balancing. The possibilities are endless!
If you spot any mistakes, improvements or versions that need updating, please drop a comment below.
Hello,
that helped me very much, thank you.
I had to set ROOK_CSI_KUBELET_DIR_PATH: to “/var/snap/microk8s/common/var/lib/kubelet”
Otherwise, it tries to use /var/lib/kubelet, in which case you would need to manually create the expected dir layout.
At least, that’s true for v1.5.9.
Hi Jonathan,
Thank you very much for your very comprehensive article.
Based on it, and as part of my “research” for an upcoming production deployment, I ended up with this PoC project: https://github.com/el95149/vagrant-microk8s-ceph
Hope it proves useful to somebody. Once again, thanks!!!