Sorry it’s taken a while to get to the next part of my blog series. This section was supposed to be about hyperconverged clustered storage in my cluster, but I unfortunately ran into a total cluster loss due to some bugs in MicroK8s and/or dqlite that maintainers haven’t managed to get to the bottom of.
I was able to re-associate the volumes that were provisioned on my off-cluster storage with my rebuilt-from-scratch cluster. The volumes that were provisioned on the containerised, clustered storage were irrecoverably lost.
Therefore, I have decided to rework this part of my blog series into a cautionary tale – partly about specific technologies, and partly to push my pro-backup agenda.
It’s worth looking at the previous posts in the series for some background, especially the overview.
Let’s have a look at my original design for hyperconverged, containerised, clustered storage. And before we get stuck in, let me quickly demystify some of the jargon:
- hyperconverged means the storage runs on the same nodes as the compute
- containerised means the storage controller runs as Kubernetes pods inside the cluster
- clustered means many nodes provide some kind of storage hardware, and your volumes are split up into replicas on more than one node, so you can tolerate a node failure without losing a volume
Several clustered storage solutions are available. Perhaps Rook/Ceph is the best known, but as MicroK8s packages OpenEBS, I decided to use that. The default setup you get if you simply do
microk8s enable openebs
creates a file on the root filesystem and provisions block volumes out of that file. In my case, that file would have ended up on the same SATA SSD as the OS, and I didn’t want that.
So I went poking at OpenEBS, and found that it offers various storage backends: Mayastor, cStor or Jiva. Mayastor is the newest engine, but has higher hardware requirements. In the end I decided on cStor as it appeared to be lightweight (i.e. didn’t consume much CPU or memory) and was also based on ZFS, which is a technology I already rely on in my TrueNAS storage. I ended up deploying OpenEBS from its Helm chart.
This diagram is quite complex, so let me walk you through it – starting at the bottom. Each physical node has an M.2 NVMe storage device installed, and this is separate from the SATA SSD that runs the OS. When you install OpenEBS, it creates a DaemonSet of a component called Node Disk Manager (NDM) which runs on each node and looks for available storage devices, and makes them available to OpenEBS as BlockDevices. When you have several BlockDevices, you can create a storage cluster. From this cluster, you can provision Volumes which will be replicated across multiple NVMe devices (by default you get 3 replicas). Creating a Volume also creates a Pod that acts as an iSCSI target for the volume. The Volume can now be mounted by workload Pods from any node in the usual way. It’s important to note that the workload Pod does not have to be on the same node as the Volume Target, and the three VolumeReplicas are placed according to the nodes with most capacity.
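To make the placement rule concrete, here is a toy Python sketch of "put the replicas on the nodes with the most capacity". It only illustrates the idea as described above and is not cStor's actual scheduling code. The function name `place_replicas`, the node names and the capacity figures are all made up for the example.

```python
from __future__ import annotations


def place_replicas(free_capacity_by_node: dict[str, int], replica_count: int = 3) -> list[str]:
    """Pick the nodes that should hold a volume's replicas.

    Toy model only: rank nodes by free capacity (largest first) and take
    the top `replica_count`. Ties are broken by node name so the result
    is deterministic. This is NOT how cStor is implemented internally.
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    if replica_count > len(free_capacity_by_node):
        raise ValueError("not enough nodes to hold one replica each")
    ranked = sorted(
        free_capacity_by_node,
        key=lambda node: (-free_capacity_by_node[node], node),
    )
    return ranked[:replica_count]


if __name__ == "__main__":
    # Hypothetical free capacity per node, in GiB.
    free = {"node-a": 100, "node-b": 400, "node-c": 250, "node-d": 300}
    print(place_replicas(free))
```

With the hypothetical figures above, the three replicas land on node-b, node-d and node-c, and node-a holds none. A workload Pod could still run on node-a and reach the volume through the iSCSI target.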
MicroK8s uses dqlite as its cluster datastore, instead of etcd like most other Kubernetes distributions. I ran into some problems with MicroK8s where dqlite started consuming all the CPU and running at high latency, and eventually silently lost quorum. The Kubernetes API server then also silently went read-only, so any requests to change cluster state would silently fail, and any requests to read cluster state would effectively return snapshots from the moment the cluster went read-only, which might vary depending on which of the dqlite replicas was being queried.
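For anyone unfamiliar with quorum, the arithmetic is simple. dqlite replicates using the Raft consensus protocol, which needs a strict majority of voting members to agree before a write is committed. The Python sketch below shows that arithmetic only. The helper names are invented for illustration, and the three-voter example is an assumption, since the number of voting nodes is not stated above.

```python
def quorum_size(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - quorum_size(voters)


def has_quorum(voters: int, responsive: int) -> bool:
    """True if the responsive voters still form a majority."""
    return responsive >= quorum_size(voters)


if __name__ == "__main__":
    # Assumed cluster sizes, purely for illustration.
    for n in (1, 3, 4, 5):
        print(f"{n} voters: quorum={quorum_size(n)}, can lose {tolerated_failures(n)}")
```

Under these assumptions a three-voter cluster needs two responsive members and can lose only one. Note that a member does not have to be powered off to count as lost: one that is too slow to answer in time is just as unavailable for the purposes of a vote.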
The further complication is that as a clustered storage engine, cStor uses CRDs to represent its objects and therefore relies on the Kubernetes API server and the underlying datastore to track its own volumes, replicas, block devices, etc. By extension, cStor then also lost quorum.
I followed through the how to restore lost quorum guide for MicroK8s, several times, but it never worked for me. I worked with MicroK8s developers for a while on recovery.
Even without cluster quorum, I attempted to recover my cStor volumes. However, actions like creating a pod to mount a volume rely on having a kube API that is not read-only!
Eventually I had no other choice but to reset the cluster and start from scratch. I made sure I did not wipe the NVMe devices, and assumed I would be able to reassociate them on a new cluster. I exported all of the OpenEBS/cStor CRs to yaml files as a backup.
After the cluster was rebuilt, I reimported the BlockDevice resources but doing so did not discover the NVMe drives as they seemed to change UUID in the new cluster. I tweaked my yaml to adopt them under their new names, but I was not able to rebuild them as an OpenEBS cluster and rediscover my old volumes.
The documentation for cStor is quite minimal, and focuses on installing it rather than fixing it. The only relevant page is the Troubleshooting page, and it didn’t cater to my problem. Which seems surprising, because a common question with any storage system must be “how do I get my stuff back when it goes wrong?”
I contacted the OpenEBS community via Slack and my question was ignored for a week, despite my nudges. Eventually, an engineer contacted me and we worked through some steps, but were not able to reassociate a previous cluster’s cStor volumes with a new cluster.
All my cStor volumes were either MariaDB or PostgreSQL databases, and fortunately I had recent backups of all of them and was able to create new volumes on TrueNAS external storage (slower, but reliable) and restore the databases.
- First and foremost, take backups. Backups saved my bacon here in what would otherwise have been a significant data loss. I’ll cover my backup solutions in a later part of this blog post series.
- Volume snapshots are not backups. cStor provides volume snapshot functionality and it is very easy to take snapshots automatically. However, using those snapshots requires a functioning kube API.
- The control plane is fragile. It doesn’t take a lot for your datastore to lose quorum, and then all bets are off.
- I advise against hyperconverged storage in your homelab, unless you really need it. As soon as there is persistent data stored in your cluster, it stops being ephemeral and you need to treat it as carefully as a NAS. It’s fine for caches and things that can be regenerated.
- Check support arrangements before you commit to a product. MicroK8s developers have been responsive and helpful. However, cStor support has been useless. The product seems mature, and the website looks shiny and claims it is enterprise-grade, but the recovery documentation was useless and nobody was willing to help. Most of the chatter in the Slack channel is about Mayastor, so this must be the new shiny that gets all the attention.
The root cause of this problem was dqlite and MicroK8s quorum. At the moment, I don’t yet understand why this incident happened and I don’t know how to prevent it from happening again. I’m not the only person to have been bitten by it.
For the time being, I restored like-for-like on MicroK8s even though I don’t really trust dqlite any more. I’ve upped the frequency of my backups in the expectation that it will probably happen again at some point.
I think I’ve decided that if this happens again, I will consider rebuilding on K3s instead of MicroK8s, as it supports the more standard etcd datastore.
I’m not currently using the NVMe disks, but it seems a waste just to leave them there doing nothing. I will probably fiddle with hyperconverged storage again one day – maybe either Mayastor or Rook/Ceph, both of which seem to get more attention than cStor.
4 thoughts on “Kubernetes Homelab Part 4: Hyperconverged Storage”
Ouch. Sorry to hear about your cluster failure. While MicroK8s uses dqlite as storage, is that not still exposed via the Kubernetes API as if it was etcd? Are you not able to extract this out as a snapshot? I’m using K3s with proper etcd, snapshots every 6 hours stored to Minio on my TrueNAS server. I wasn’t aware MicroK8s had this limitation.
MicroK8s uses kine as a compatibility layer which supports SQL engines including dqlite in the backend. The API server itself is the same. I don’t know what options there are to take snapshots either from kine or from dqlite but it must be possible.
In general I’ve always avoided situations where I’d need to restore etcd from snapshot – my data is on TrueNAS and the workloads are easily redeployable. Having on-cluster storage broke that model.
I use Velero with the Kopia plugin (Restic is also an option) to back up my in-cluster storage (rook-ceph) externally to TrueNAS (via the MinIO S3 service). Velero can also be used to generate CSI snapshots of your PVs as part of the backups. You can restore from snapshots or from external S3.
It has other uses too: you can back up one PV type and then restore the data as another PV type, which is handy if you need to convert data from one PV type to another.
It can even restore PV data to an unrelated cluster easily. Just install Velero on the other cluster, point it at the external S3 storage and within minutes you can browse the backups and select what to restore. The restore will create the PVs and all the Kubernetes objects needed, or you can filter to just the small subset of what you need restored. It can be as if you had installed the application locally on that cluster.
You get the amazing benefits of in-cluster storage and keep a copy of your entire application outside your cluster. I create a backup schedule per namespace, and back up all objects deployed within that namespace. On some pods I put annotations to indicate which volumes to ignore (don’t back up): either data that is easy to recreate or NFS mounts already on TrueNAS, etc.
I don’t plan on restoring etcd snapshots or doing full application restores, but they are options I have. I follow GitOps with ArgoCD and GitHub, which will restore my entire cluster via a bootstrap. But the data I’d pull in via Velero from TrueNAS MinIO S3.
A recommendation to consider.
Coincidence, I literally just got a notification about a GitHub comment where someone has provided a script to dump dqlite out in sqlite format for inspection. Doesn’t look quite as simple as an etcd snapshot…