Introduction and Mistake #1
On that shiny day, I got a project that required deploying a MongoDB cluster. After a few searches, I found the Percona Operator, moved to the installation section, and copied the helm install command.
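For reference, the install looked roughly like this; the chart names come from the Percona Helm charts repo, while the release names and the mongodb namespace are just placeholders from my setup:

```bash
# add the Percona Helm repository and install the MongoDB operator plus a database release
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
helm install psmdb-operator percona/psmdb-operator -n mongodb --create-namespace
helm install psmdb-db percona/psmdb-db -n mongodb
```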
After installing the required charts, I noticed that the pods weren't in the Running state, so as a civilized Kubernetes developer I ran 'kubectl describe pod <pod-name> -n <namespace>', and it turned out the problem was that the MongoDB cluster requires either 3 or 5 nodes.
That's easy, right? I'm using Proxmox for my on-prem VMs and Talos as my Kubernetes OS, so I created a new VM with Talos as the OS, disabled DHCP, and gave it the static IP 192.168.1.116. Then, to add the VM as a new worker node, we use the Talos magic wand:
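A minimal sketch of the join, assuming worker.yaml is the worker machine config generated earlier with 'talosctl gen config':

```bash
# apply the worker machine config to the fresh Talos VM (still in maintenance mode)
talosctl apply-config --insecure \
  --nodes 192.168.1.116 \
  --file worker.yaml
```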
But life always hides a few surprises for you: I mistakenly ran another command, which added the new VM as a control plane node instead. The result was two etcd members fighting for their survival, and our old master Kubernetes wasn't happy with the outcome, because both etcd instances went down and took their kube-apiserver instances with them.
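The mistake was presumably applying the control plane config instead of the worker one, i.e. something along these lines:

```bash
# the wrong command: this registers the VM as a second control plane node
talosctl apply-config --insecure \
  --nodes 192.168.1.116 \
  --file controlplane.yaml
```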
Mistake #2
Because I am so smart, I figured the two control plane nodes contradicted etcd's happy state, so I searched for a solution, which led me to either removing the newly created node or adding a third one to balance the cluster. And hell yeah, I'm removing the second idiot VM.
I deleted the node from Proxmox, thinking the cluster would return to a healthy state and I would kiss my IT girlfriend, saying "we're back to normal" and making her think I had been mad at her for no reason. Hmm, but life surprises you once again, my dear.
This time, etcd remained in an unhealthy state, claiming it couldn't find the joined node 192.168.1.116.
NOTE: you can run 'talosctl -n 192.168.1.110 dmesg' to view a node's kernel logs.
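Beyond dmesg, the etcd service on the node can be inspected directly with talosctl:

```bash
# status of the etcd service on the control plane node
talosctl -n 192.168.1.110 service etcd

# and its logs
talosctl -n 192.168.1.110 logs etcd
```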
I remembered seeing a talosctl command that lists etcd members, and I said to myself: if a list subcommand exists, then a remove or delete one must exist too. Well, it was there of course, just under a different name: "remove-member". However, it didn't work; etcd wasn't responding to my requests, and even listing the members returned nothing.
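For reference, these are the subcommands I mean, sketched against my control plane node (as far as I recall, the member to remove is addressed by its hostname):

```bash
# list the etcd members as etcd sees them
talosctl -n 192.168.1.110 etcd members

# remove a member from the cluster -- needs a healthy, responsive etcd
talosctl -n 192.168.1.110 etcd remove-member <hostname-of-dead-node>
```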
Solution: Edit the Snapshotted etcd DB
After long hours of reading GitHub issues, walking on the beach, and talking with friends about rap songs, I realized there was no solution other than to reset the control plane node along with the etcd data directory.
While reading the Talos documentation on Disaster Recovery, I became aware of the snapshot idea but wasn't yet thinking outside the box. Then I thought of editing the etcd database itself. talosctl doesn't have a built-in command for this kind of operation, so I went for snapshotting the database and inspecting it to see what I could edit to remove the reference to our beloved dead node.
Let's start by taking a snapshot. There are two commands referenced in the documentation, but we will go with the latter because etcd is unreachable:
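A sketch of both options from the disaster recovery docs, using my control plane node's IP:

```bash
# option 1: ask etcd for a snapshot -- only works while etcd is healthy
talosctl -n 192.168.1.110 etcd snapshot db.snapshot

# option 2: copy the raw database straight off the node -- works even when etcd is down
talosctl -n 192.168.1.110 cp /var/lib/etcd/member/snap/db .
```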
I ran the 'file' command to check the file type, and it returned "data". Hmm, well, that isn't enough. Thanks for your time, Linux. On my second Google search I found out it was a bbolt database, and that there is a tool, bbolt, for inspecting bbolt databases. Cool, now we're playing our cards right.
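Roughly what that looked like; the bbolt CLI ships with the go.etcd.io/bbolt module:

```bash
file ./db
# ./db: data

# install the bbolt command-line tool
go install go.etcd.io/bbolt/cmd/bbolt@latest
```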
After a few tries, I found a bucket called "members":
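Listing the buckets in the snapshot with the bbolt CLI; "members" is the interesting one here:

```bash
# list all buckets in the snapshot and look for "members"
bbolt buckets ./db
```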
Hmm, I proceeded to list the members bucket keys and inspect each value:
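The commands I used, sketched with my member ID; each value is a small JSON document with the member's name and peerURLs:

```bash
# list the keys (etcd member IDs) stored in the "members" bucket
bbolt keys ./db members

# dump one member record; this one turned out to point at the dead node 192.168.1.116
bbolt get ./db members 920c1b791dddb17e
```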
Here we go, Kogoro Mouri, I found the culprit: the member with key 920c1b791dddb17e must be deleted. So let's call the police (ChatGPT) to expel him. We asked ChatGPT for a way to delete the key 920c1b791dddb17e from the members bucket.
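The answer was along these lines: a tiny Go program using the go.etcd.io/bbolt library (as far as I know, the bbolt CLI has no delete command), run against the copied snapshot. The file name and path are just from my setup:

```go
// delete_member.go: drop the dead member's record from the snapshotted etcd bolt DB.
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// open the database file copied from /var/lib/etcd/member/snap/db
	db, err := bolt.Open("./db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// delete the culprit key from the "members" bucket in a read-write transaction
	err = db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("members"))
		if b == nil {
			return fmt.Errorf("members bucket not found")
		}
		return b.Delete([]byte("920c1b791dddb17e"))
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("member 920c1b791dddb17e deleted")
}
```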
Now we can reset the etcd node and then recover from the backed-up db:
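A sketch of the reset, assuming that wiping the EPHEMERAL partition (which holds /var, and therefore /var/lib/etcd) is enough to clear the old etcd state while keeping the machine config:

```bash
# wipe the etcd data and reboot the control plane node
talosctl -n 192.168.1.110 reset \
  --graceful=false \
  --reboot \
  --system-labels-to-wipe=EPHEMERAL
```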
Wait until the node comes back up in the preparing state, then run:
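This is the disaster recovery bootstrap; the hash-check flag is needed here because the db was copied straight from disk (and then edited) rather than produced by 'etcd snapshot':

```bash
# bootstrap a fresh single-member etcd from the edited snapshot
talosctl -n 192.168.1.110 bootstrap \
  --recover-from=./db \
  --recover-skip-hash-check
```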
Finally, run 'kubectl get nodes -o wide' and you should see your nodes, then 'kubectl get pods' to check that your cluster has returned to its previous state.
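A last sanity check, roughly:

```bash
kubectl get nodes -o wide
kubectl get pods --all-namespaces

# confirm etcd is back to a single healthy member
talosctl -n 192.168.1.110 etcd members
```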