This guide explains how to build a highly-available, hyperconverged Kubernetes cluster using MicroK8s, Ceph and MetalLB on commodity hardware or virtual machines. This could be useful for small production deployments, dev/test clusters, or a nerdy toy.
Other guides are available – this one is written from a sysadmin point of view, focusing on stability and ease of maintenance. I prefer to avoid running random scripts or fetching binaries that are then unmanaged and unmanageable. This guide uses package managers and operators wherever possible. I’ve also attempted to explain each step so readers can gain some understanding instead of just copying and pasting the commands. However, this does not absolve you from having a decent background of the components, and it is strongly recommended that you are familiar with kubectl/Kubernetes and Ceph in particular.
The technological landscape moves so fast that these instructions may become outdated quickly. I’ll link to upstream documentation wherever possible so you can check for updated versions.
Finally, this is a fairly simplistic guide that gives you the minimum possible configuration. There are many other components and configurations that you can add, and it also takes no account of security with RBAC etc.
Hardware
There are a few considerations when choosing your hardware or virtual “hardware” for use as Kubernetes nodes.
- MicroK8s requires at least 3 nodes to work in HA mode, so we’ll start with 3 VMs
- While MicroK8s is quite lightweight, by the time you start adding the storage capability you will need a reasonable amount of memory. Recommended minimum spec for this guide is 2 CPUs and 4GB RAM. More is obviously better, depending on your workload.
- Each VM will need two block devices (disks). One should be partitioned, formatted and used as a normal OS disk, and the other should be left untouched so it can be claimed by Ceph later. The OS disk will also contain cached container images so could get quite large. I’ve allowed 16GB for the OS disk, and Ceph requires a minimum of 10GB for its disk.
- If running in VirtualBox, place all VMs either in the same NAT network, or bridged to the host network. Ideally have static IPs.
- If you are running on bare metal, make sure the machines are on the same network, or at least on networks that can talk to each other.
In my case, I used VirtualBox and created 3 identical VMs: kube01, kube02 and kube03.
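Before going further, it’s worth confirming on each node that the second disk is present and completely untouched, since Rook will only claim clean, empty disks. This is just a suggested check, assuming the spare disk shows up as /dev/sdb:
# The Ceph disk should show no partitions, filesystem or mountpoint
lsblk -f /dev/sdb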
Operating system
This guide focuses on CentOS/Fedora but should be applicable to many distributions with minor tweaks. I have started with a CentOS 8 minimal installation. Fedora Server or Ubuntu Server would work just as well, but you’ll need to tweak some of the commands.
- Don’t create a swap partition on these machines
- Make sure ntp is enabled for accurate time
- Make sure the VMs have static IPs or DHCP reservations, so their IPs won’t change
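On CentOS 8, the points above can be handled with something along these lines (chronyd is the default time service on CentOS 8; adjust if your distribution uses something else):
# Disable swap immediately and stop it being re-enabled at boot
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Make sure time synchronisation is enabled and running
sudo systemctl enable --now chronyd
timedatectl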
Snap
Reference: https://snapcraft.io/docs/installing-snap-on-centos
Snap is the package manager used to distribute MicroK8s. It comes preinstalled on Ubuntu, but if you’re on CentOS, Fedora or others, you’ll need to install it on all your nodes.
sudo dnf -y install epel-release
sudo dnf -y install snapd
sudo systemctl enable --now snapd
sudo ln -s /var/lib/snapd/snap /snap
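Optionally, you can check that snapd has finished seeding before installing anything with it:
sudo snap wait system seed.loaded
snap version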
MicroK8s
Reference: https://microk8s.io/
MicroK8s is a lightweight, pre-packaged Kubernetes distribution which is easy to use and works well for small deployments. It’s a lot more straightforward than following Kubernetes the hard way.
Install
Install MicroK8s 1.19.1 or greater from Snap on all your nodes:
sudo snap install microk8s --classic --channel=latest/edge
microk8s status --wait-ready
echo 'alias kubectl="microk8s kubectl"' >> ~/.bashrc
The first time you run microk8s status, you will be prompted to add your user to the microk8s group. Follow the instructions and log in again.
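For reference, the prompt normally boils down to something like the following; follow whatever MicroK8s actually prints on your version:
# Add your user to the microk8s group and take ownership of the kubectl cache directory
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
# Log out and back in (or run 'newgrp microk8s') for the group change to take effect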
Enable HA mode
Reference: https://microk8s.io/docs/high-availability
Enable MicroK8s HA mode on all nodes, which allows any node to act as a master instead of just a worker. This must be enabled before nodes are joined to the master. On some versions of MicroK8s this is enabled by default.
microk8s enable ha-cluster
Add firewall rules
Reference: https://microk8s.io/docs/ports
Create firewall rules for your nodes, so they can communicate with each other.
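As an illustration, on CentOS with firewalld the rules could look something like this; treat the port list as indicative and check the ports reference above for the authoritative list for your MicroK8s version:
sudo firewall-cmd --permanent --add-port=16443/tcp   # Kubernetes API server
sudo firewall-cmd --permanent --add-port=10250/tcp   # kubelet
sudo firewall-cmd --permanent --add-port=25000/tcp   # cluster agent (used by join)
sudo firewall-cmd --permanent --add-port=19001/tcp   # dqlite (HA datastore)
sudo firewall-cmd --permanent --add-port=4789/udp    # Calico VXLAN overlay
sudo firewall-cmd --reload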
Enable clustering
Reference: https://microk8s.io/docs/clustering
Enable MicroK8s clustering, which allows you to add multiple worker nodes to your existing master node.
Run this on the first node only:
[jonathan@kube01 ~]$ microk8s add-node
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.0.41:25000/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Then execute the join command on the second node, to join it to the master.
[jonathan@kube02 ~]$ microk8s join 192.168.0.41:25000/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Contacting cluster at 192.168.0.41
Waiting for this node to finish joining the cluster. ..
Repeat for the third node, and remember to run the add-node command for each node you add, so they all get a unique token.
Verify that they are correctly joined:
[jonathan@kube01 ~]$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kube01.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
kube03.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
kube02.jonathangazeley.com Ready <none> 35h v1.19.1-34+08a87c75adb55c
Finally make sure that full HA mode is enabled:
[jonathan@kube01 ~]$ microk8s status
microk8s is running
high-availability: yes
datastore master nodes: 192.168.0.41:19001 192.168.0.42:19001 192.168.0.43:19001
datastore standby nodes: none
addons:
enabled:
...
Addons
Reference: https://microk8s.io/docs/addon-dns
Reference: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
Enable some basic addons across the cluster to provide a usable experience. Run this on any one node.
microk8s enable dns rbac
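A quick way to confirm the addons took effect is to re-run the status command and check that CoreDNS pods appear in the kube-system namespace:
microk8s status
kubectl -n kube-system get pods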
Check
We’ve already checked that all 3 nodes are up. Now let’s make sure pods are being scheduled on all nodes:
[jonathan@kube01 ~]$ kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system calico-node-bqqqd 1/1 Running 0 112m 192.168.0.41 kube01.jonathangazeley.com
kube-system calico-node-z4sxd 1/1 Running 0 110m 192.168.0.43 kube03.jonathangazeley.com
kube-system calico-kube-controllers-847c8c99d-4qblz 1/1 Running 0 115m 10.1.58.1 kube01.jonathangazeley.com
kube-system coredns-86f78bb79c-t2sgt 1/1 Running 0 109m 10.1.111.65 kube02.jonathangazeley.com
kube-system calico-node-t5skc 1/1 Running 0 111m 192.168.0.42 kube02.jonathangazeley.com
With the cluster in a healthy and operational state, let’s add the hyperconverged storage. From now on, all steps can be run on kube01.
Ceph
Ceph is a clustered storage engine which can present its storage to Kubernetes as block storage or a filesystem. We will use the Rook operator to manage our Ceph deployment.
Install
Reference: https://rook.io/docs/rook/v1.4/ceph-quickstart.html
These steps are taken verbatim from the official Rook docs. Check the link above to make sure you are using the latest version of Rook.
First we install the Rook operator, which automates the rest of the Ceph installation.
git clone --single-branch --branch release-1.4 https://github.com/rook/rook.git
cd rook/cluster/examples/kubernetes/ceph
kubectl create -f common.yaml
kubectl create -f operator.yaml
kubectl -n rook-ceph get pod
Wait until the rook-ceph-operator pod and the rook-discover pods are all Running. This took a few minutes for me. Then we can create the actual Ceph cluster.
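If you’d rather not poll the pods by hand, kubectl can block until the operator deployment reports ready before you proceed (the deployment name below is the default from operator.yaml):
kubectl -n rook-ceph wait --for=condition=Available deployment/rook-ceph-operator --timeout=600s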
kubectl create -f cluster.yaml
kubectl -n rook-ceph get pod
This command will probably take a while – be patient. The operator creates various pods including canaries, monitors, a manager, and provisioners. There will be periods where it looks like it isn’t doing anything, but don’t be tempted to intervene. You can check what the operator is doing by reading its log:
kubectl -n rook-ceph logs rook-ceph-operator-775d4b6c5f-52r87
Check
Reference: https://rook.io/docs/rook/v1.4/ceph-toolbox.html
Install the Ceph toolbox and connect to it so we can run some checks.
kubectl create -f toolbox.yaml
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
OSDs are the individual pieces of storage. Make sure all 3 are available and check the overall health of the cluster.
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph status
cluster:
id: e37a9364-b2e4-42ba-a7c0-c7276bc2083d
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,d (age 2m)
mgr: a(active, since 33s)
osd: 3 osds: 3 up (since 89s), 3 in (since 89s)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 45 GiB / 48 GiB avail
pgs: 1 active+clean
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph osd status
ID HOST USED AVAIL WR OPS WR DATA RD OPS RD DATA STATE
0 kube03.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
1 kube02.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
2 kube01.jonathangazeley.com 1027M 14.9G 0 0 0 0 exists,up
Block storage
Reference: https://rook.io/docs/rook/v1.4/ceph-block.html
Ceph can provide persistent block storage to Kubernetes as a storage class which can be consumed by one pod at any one time.
kubectl create -f csi/rbd/storageclass.yaml
Verify that the block storageclass is available:
[jonathan@kube01 ~]$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
rook-ceph-block rook-ceph.rbd.csi.ceph.com Delete Immediate true 3m53s
Filesystem
Reference: https://rook.io/docs/rook/v1.4/ceph-filesystem.html
By adding a filesystem layer, Ceph can provide persistent storage which can be consumed by multiple pods simultaneously.
kubectl create -f filesystem.yaml
Use the toolbox again to verify that there is a metadata service (mds) available:
[root@rook-ceph-tools-6967fc698d-5f4sh /]# ceph status
cluster:
id: e37a9364-b2e4-42ba-a7c0-c7276bc2083d
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,d (age 36m)
mgr: a(active, since 34m)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 3 osds: 3 up (since 35m), 3 in (since 35m)
task status:
scrub status:
mds.myfs-a: idle
mds.myfs-b: idle
data:
pools: 4 pools, 97 pgs
objects: 22 objects, 2.2 KiB
usage: 3.0 GiB used, 45 GiB / 48 GiB avail
pgs: 97 active+clean
io:
client: 852 B/s rd, 1 op/s rd, 0 op/s wr
Now we can create a new storageclass based on the filesystem:
kubectl create -f csi/cephfs/storageclass.yaml
Verify the storageclass is present:
[jonathan@kube01 ceph]$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
rook-ceph-block (default) rook-ceph.rbd.csi.ceph.com Delete Immediate true 49m
rook-cephfs rook-ceph.cephfs.csi.ceph.com Delete Immediate true 34m
Consume
It’s easy to consume the new Ceph storage. Use the storageClassName rook-ceph-block in ReadWriteOnce mode for persistent storage for a single pod, or rook-cephfs in ReadWriteMany mode for persistent storage that can be shared between pods.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
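These claims are mounted into pods like any other PVC. As a minimal illustration (the pod name, container image and mount path here are arbitrary):
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-demo
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ceph-rbd-pvc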
Ingress
Reference: https://microk8s.io/docs/addon-ingress
Probably the simplest way to expose web applications on your cluster is to use an Ingress. This binds to ports 80 and 443 on all your nodes and listens for HTTP and HTTPS requests. It effectively does name-based virtual hosting, terminates TLS, and directs your web traffic to a Kubernetes Service with an internal ClusterIP, which acts as a simple load balancer. You will need to set up external round-robin DNS to point your A record at all 3 of the node IPs.
microk8s enable ingress
sudo firewall-cmd --permanent --add-service http
sudo firewall-cmd --permanent --add-service https
sudo firewall-cmd --reload
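A minimal Ingress resource then looks something like this; the hostname and backend Service name are placeholders for your own application:
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80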
MetalLB
Reference: https://microk8s.io/docs/addon-metallb
If you want to set up more advanced load balancing, consider using MetalLB. It will load balance your Kubernetes Service and present it on a single virtual IP.
Install
MetalLB will prompt you for one or more ranges of IPs that it can use for load-balancing. It should be fine to accept the default suggestion.
[jonathan@kube01 ~]$ microk8s enable metallb
Enabling MetalLB
Enter each IP address range delimited by comma (e.g. '10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111'): 10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
Consume
Once MetalLB is installed and configured, to expose a service externally, simply create it with spec.type set to LoadBalancer, and MetalLB will do the rest.
It’s important to note that in the default configuration, the virtual IP will only appear on one of your nodes, and that node will act as the entry point for all traffic before it gets load balanced between nodes, so this could be a bottleneck in busy environments.
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
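Once the Service is created, MetalLB should assign an address from the range you configured; it appears in the EXTERNAL-IP column:
kubectl get service nginx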
Summary
You now have a fully-featured Kubernetes cluster with high availability, clustered storage, ingress, and load balancing. The possibilities are endless!
If you spot any mistakes, improvements or versions that need updating, please drop a comment below.
Hello,
that helped me very much, thank you.
I had to set ROOK_CSI_KUBELET_DIR_PATH: to “/var/snap/microk8s/common/var/lib/kubelet”
Otherwise, it tries to use /var/lib/kubelet, in which case you would need to manually create the expected dir layout.
At least, that’s true for v1.5.9.
Hi Jonathan,
Thank you very much for your very comprehensive article.
Based on it, and as part of my “research” for an upcoming production deployment, I ended up with this PoC project: https://github.com/el95149/vagrant-microk8s-ceph
Hope it proves useful to somebody. Once again, thanks!!!