When Kubernetes Collapses: Lessons from a Home Cluster Meltdown

The Issue

During an intense weekend of running workloads for Coderone, an AI game programming tournament, and Headbot, a project involving generative AI avatars, my 6-node MicroK8s cluster unexpectedly crashed. Symptoms ranged from stuck pods and workloads to crippling network errors and sluggish kubectl commands.
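For illustration, here is a small Python sketch of one way to list stuck pods when symptoms like these appear. It assumes the pod list has already been fetched as JSON (for example with `kubectl get pods --all-namespaces -o json`). The function name, the set of "healthy" phases, and the sample data are hypothetical and not taken from the incident itself.

```python
# Phases treated as healthy in this sketch (an assumption; adjust to taste).
HEALTHY_PHASES = {"Running", "Succeeded"}


def find_stuck_pods(pod_list):
    """Return 'namespace/name (phase)' for every pod not in a healthy phase."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # A pod with no reported phase is treated as "Unknown".
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            stuck.append(f"{meta.get('namespace')}/{meta.get('name')} ({phase})")
    return stuck


# Hypothetical sample shaped like the JSON that kubectl returns for a pod list.
sample = {
    "items": [
        {"metadata": {"namespace": "default", "name": "web-0"},
         "status": {"phase": "Running"}},
        {"metadata": {"namespace": "default", "name": "worker-1"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "kube-system", "name": "dns-2"},
         "status": {}},
    ]
}

stuck_pods = find_stuck_pods(sample)
for entry in stuck_pods:
    print(entry)
```

On a live cluster the same function could be fed the parsed output of the kubectl command above. If kubectl itself is sluggish or times out, that already points at the control plane rather than at individual workloads.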

SwoleBama by Headbot AI

The Investigation

After digging in, I traced the problems to corruption in the MicroK8s control plane's data store. MicroK8s uses dqlite for this, a lightweight data store that is less mature than etcd, the usual choice in production-grade clusters.

Bomberland preview

Why Dqlite and Etcd Matter

Dqlite is praised for its simplicity and low resource use, but in my experience it falls short on maturity and high availability. Etcd is designed for high availability and data consistency, but it demands more resources and a deeper understanding of how to operate it.
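Both dqlite and etcd rely on Raft consensus, so the availability of either store comes down to how many voting members can fail before a majority is lost. The sketch below works through that arithmetic. The function names are illustrative, and the number of voting members in a given cluster is an assumption here, since not every node in a 6-node cluster necessarily takes part in the data store's voting.

```python
def quorum(voters: int) -> int:
    """Smallest majority of a Raft group with the given number of voters."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still remains."""
    return voters - quorum(voters)


# Compare a few group sizes, including an even one.
for n in (1, 3, 5, 6):
    print(f"{n} voters: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

An even-sized group gains nothing over the next smaller odd size: six voters tolerate the same two failures as five, which is why odd member counts are the usual recommendation for Raft-based stores.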

Exploring Alternatives: K3s and RKE2

After abandoning MicroK8s, I evaluated other Kubernetes distributions: k3s and RKE2.

K3s was promising but didn't meet my high-availability requirements. RKE2, on the other hand, proved reliable and secure, and it ran without issues on my 6-node cluster.

Key Takeaways

For single-node or experimental setups, MicroK8s is a great choice, especially with its range of free add-ons. For more robust, scalable solutions, RKE2 stands out as a better option. For a detailed account of my experience and further insights, check out the full article.
