Skip to content

Recovery from Quorum Loss

In an Etcd cluster, quorum is a majority of nodes/members that must agree on updates to a cluster state before the cluster can authorise the DB modification. For a cluster with n members, quorum is (n/2)+1. An Etcd cluster is said to have lost quorum when majority of nodes (greater than or equal to (n/2)+1) are unhealthy or down and as a consequence cannot participate in consensus building.

For a multi-node Etcd cluster quorum loss can either be Transient or Permanent.

Transient quorum loss

If quorum is lost through transient network failures (e.g. n/w partitions), spike in resource usage which results in OOM, etcd automatically and safely resumes (once the network recovers or the resource consumption has come down) and restores quorum. In other cases like transient power loss, etcd persists the Raft log to disk and replays the log to the point of failure and resumes cluster operation.

Permanent quorum loss

In case the quorum is lost due to hardware failures or disk corruption etc, automatic recovery is no longer possible and it is categorized as a permanent quorum loss.

Note: If one has capability to detect Failed nodes and replace them, then eventually new nodes can be launched and etcd cluster can recover automatically. But sometimes this is just not possible.

Recovery

At present, recovery from a permanent quorum loss is achieved by manually executing the steps listed in this section.

Note: In the near future etcd-druid will offer capability to automate the recovery from a permanent quorum loss via Out-Of-Band Operator Tasks. An operator only needs to ascertain that there is a permanent quorum loss and the etcd-cluster is beyond auto-recovery. Once that is established then an operator can invoke a task whose status an operator can check.

Warning

Please note that manually restoring etcd can result in data loss. This guide is the last resort to bring an Etcd cluster up and running again.

00-Identify the etcd cluster

It is possible to shard the etcd cluster based on resource types using --etcd-servers-overrides CLI flag of kube-apiserver. Any sharding results in more than one etcd-cluster.

Info

In gardener, each shoot control plane has two etcd clusters, etcd-events which only stores events and etcd-main - stores everything else except events.

Identify the etcd-cluster which has a permanent quorum loss. Most of the resources of an etcd-cluster can be identified by its name. The resources of interest to recover from permanent quorum loss are: Etcd CR, StatefulSet, ConfigMap and PVC.

To identify the ConfigMap resource use the following command:

Bash
 kubectl get sts <sts-name> -o jsonpath='{.spec.template.spec.volumes[?(@.name=="etcd-config-file")].configMap.name}'

01-Prepare Etcd Resource to allow manual updates

To ensure that only one actor (in this case an operator) makes changes to the Etcd resource and also to the Etcd cluster resources, following must be done:

Add the annotation to the Etcd resource:

Bash
kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/suspend-etcd-spec-reconcile=

The above annotation will prevent any reconciliation by etcd-druid for this Etcd cluster.

Add another annotation to the Etcd resource:

Bash
kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/disable-etcd-component-protection=

The above annotation will allow manual edits to Etcd cluster resources that are managed by etcd-druid.

02-Scale-down Etcd StatefulSet resource to 0

Bash
kubectl scale sts <sts-name> --replicas=0 -n <namespace>

03-Delete all PVCs for the Etcd cluster

Bash
kubectl delete pvc -l instance=<sts-name> -n <namespace>

04-Delete All Member Leases

For a n member Etcd cluster there should be n member Lease objects. The lease names should start with the Etcd name.

Example leases for a 3 node Etcd cluster:

Text Only
 NAME          HOLDER                  AGE
 <etcd-name>-0 4c37667312a3912b:Member 1m
 <etcd-name>-1 75a9b74cfd3077cc:Member 1m
 <etcd-name>-2 c62ee6af755e890d:Leader 1m

Delete all the member leases.

Bash
kubectl delete lease <space separated lease names>
# Alternatively you can use label selector. From v0.23.0 onwards leases will have common set of labels
kubectl delete lease -l app.kubernetes.io.component=etcd-member-lease, app.kubernetes.io/part-of=<etcd-name> -n <namespace>

05-Modify ConfigMap

Prerequisite to scale up etcd-cluster from 0->1 is to change initial-cluster in the ConfigMap. Assuming that prior to scale-down to 0, there were 3 members, the initial-cluster field would look like the following (assuming that the name of the etcd resource is etcd-main):

YAML
# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380,etcd-main-1=https://etcd-main-1.etcd-main-peer.default.svc:2380,etcd-main-2=https://etcd-main-2.etcd-main-peer.default.svc:2380

Change the initial-cluster field to have only one member (in this case etc-main-0). After the change it should look like:

Bash
# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380

06-Scale up Etcd cluster to size 1

Bash
kubectl scale sts <sts-name> -n <namespace> --replicas=1 

07-Wait for Single-Member etcd cluster to be completely ready

To check if the single-member etcd cluster is ready check the status of the pod.

Bash
kubectl get pods <etcd-name-0> -n <namespace>
NAME            READY   STATUS    RESTARTS   AGE
<etcd-name>-0   2/2     Running   0          1m

If both containers report readiness (as seen above), then the etcd-cluster is considered ready.

08-Enable Etcd reconciliation and resource protection

All manual changes are now done. We must now re-enable etcd-cluster resource protection and also enable reconciliation by etcd-druid by doing the following:

Bash
kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/suspend-etcd-spec-reconcile-
kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/disable-etcd-component-protection-

09-Scale-up Etcd Cluster to 3 and trigger reconcile

Scale etcd-cluster to its original size (we assumed 3 below).

Bash
kubectl scale sts <sts-name> -n namespace --replicas=3

If etcd-druid has been set up with --enable-etcd-spec-auto-reconcile switched-off then to ensure reconciliation one must annotate Etcd resource with the following command:

Bash
# Annotate etcd-test CR to reconcile
kubectl annotate etcd <etcd-name> -n <namespace> gardener.cloud/operation="reconcile"

10-Verify Etcd cluster health

Check if all the member pods have both of their containers in Running state.

Bash
kubectl get pods -n <namespace> -l app.kubernetes.io/part-of=<etcd-name>
NAME            READY   STATUS    RESTARTS   AGE
<etcd-name>-0   2/2     Running   0          5m
<etcd-name>-1   2/2     Running   0          1m
<etcd-name>-2   2/2     Running   0          1m

Additionally, check if the Etcd CR is ready:

Bash
kubectl get etcd <etcd-name> -n <namespace>
NAME          READY   AGE
<etcd-name>   true    13d

Check member leases, whose holderIdentity should reflect the member role. Check if all members are voting members (their role should either be Member or Leader). Monitor the leases for some time and check if the leases are getting updated. You can monitor the AGE field.

Bash
NAME          HOLDER                  AGE
<etcd-name>-0 4c37667312a3912b:Member 1m
<etcd-name>-1 75a9b74cfd3077cc:Member 1m
<etcd-name>-2 c62ee6af755e890d:Leader 1m