
Setting up etcd-druid in Production

You can get familiar with etcd-druid and all the resources it creates by setting it up locally, following the detailed guide. This document lists recommendations for a production setup of etcd-druid.

Helm Charts

You can use the helm charts at this location to deploy druid. Values for the charts are present here and can be configured as per your requirements. The following templates are present:

  • deployment.yaml - defines a Kubernetes Deployment for etcd-druid. To configure the CLI flags for druid, you can refer to this document, which explains the flags in detail.
  • serviceaccount.yaml - defines a Kubernetes ServiceAccount which serves as a technical user to which Roles/ClusterRoles can be bound.

  • clusterrole.yaml - etcd-druid can manage multiple etcd clusters. In a hosted control plane setup (e.g. Gardener), one would typically create a separate namespace per control plane. This requires a ClusterRole which gives etcd-druid permissions to operate across namespaces. Packing control planes into namespaces gives you better resource utilisation while isolating them from the data plane (where the actual workload is scheduled).

  • rolebinding.yaml - binds the ClusterRole defined in clusterrole.yaml to the ServiceAccount defined in serviceaccount.yaml.
  • service.yaml - defines a ClusterIP Service allowing other control-plane components to communicate with the HTTP endpoints exposed by etcd-druid (e.g. it enables Prometheus to scrape metrics, and the validating webhook to be invoked upon changes to an Etcd CR).
  • secret-ca-crt.yaml - contains the base64-encoded CA certificate used by the etcd-druid webhook server.
  • secret-server-tls-crt.yaml - contains the base64-encoded server certificate used by the etcd-druid webhook server.
  • validating-webhook-config.yaml - configuration for all webhooks that etcd-druid registers with the webhook server. At the time of writing, the EtcdComponents webhook is registered.

Etcd cluster size

The recommendation from upstream etcd is to always have an odd number of members in an Etcd cluster. A cluster of n members tolerates (n-1)/2 member failures (rounded down): 3 members tolerate 1 failure and 5 tolerate 2, whereas a 4th member still leaves the failure tolerance at 1 while adding consensus overhead.

Mounted Volume

All Etcd cluster member Pods provisioned by etcd-druid mount a Persistent Volume. Mounted persistent storage enables faster recovery in case of single-member transient failures. etcd is I/O intensive and its performance depends heavily on the underlying Storage Class. It is therefore recommended that high-performance SSD drives be used.

At the time of writing, etcd-druid provisions the following volume types:

Cloud Provider   Type                                          Size
AWS              GP3                                           25Gi
Azure            Premium SSD                                   33Gi
GCP              Performance (SSD) Persistent Disks (pd-ssd)   25Gi

Also refer: Etcd Disk recommendation.

Additionally, each cloud provider offers redundancy for managed disks. You should choose the redundancy according to your availability requirements.
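
If the defaults do not fit your performance or redundancy requirements, both can be overridden per cluster. A minimal sketch of the relevant part of the Etcd custom resource, using the storageClass and storageCapacity fields of the Etcd spec and assuming a pre-created storage class named fast-ssd (a placeholder):

YAML
  apiVersion: druid.gardener.cloud/v1alpha1
  kind: Etcd
  metadata:
    name: etcd-main
  spec:
    # "fast-ssd" is a placeholder; the storage class must exist in the cluster
    storageClass: fast-ssd
    # size of the PersistentVolume provisioned for each member
    storageCapacity: 25Gi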

Backup & Restore

A permanent quorum loss or data-volume corruption is a reality in production clusters, and one must ensure that data loss is minimized. Etcd clusters provisioned via etcd-druid offer two levels of data protection:

Via etcd-backup-restore, all clusters started via etcd-druid get the capability to regularly take delta & full snapshots. These snapshots are stored in an object store. Additionally, a snapshot-compaction job is run to compact and defragment the latest snapshot, thereby reducing the time it takes to restore a cluster in case of a permanent quorum loss. You can read the detailed guide on how to restore from permanent quorum loss.

It is therefore recommended that you configure an object store with the cloud/infra provider of your choice and enable backup & restore functionality by filling in the store configuration of the Etcd custom resource.
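
A minimal sketch of such a store configuration, assuming an AWS S3 bucket named etcd-backups and a Secret named etcd-backup-credentials holding the provider credentials (both names are placeholders):

YAML
  spec:
    backup:
      store:
        provider: S3                # object store provider, e.g. S3, ABS, GCS
        container: etcd-backups     # bucket/container name; placeholder
        prefix: etcd-main           # path prefix under which snapshots are stored
        secretRef:
          name: etcd-backup-credentials  # Secret with provider credentials; placeholder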

Ransomware protection

Ransomware is a form of malware designed to encrypt files on a device, rendering the files and the systems that rely on them unusable. All cloud providers (AWS, GCP, Azure) offer an immutability feature that can be set at the bucket/object level and provides WORM (write once, read many) access to objects for the duration of the bucket/object lock retention period.

All delta & full snapshots that are periodically taken by etcd-backup-restore are stored in an object store provided by a cloud provider. It is recommended that these backups be protected from ransomware by turning on locking at the bucket/object level.

Security

Use Distroless Container Images

It is generally recommended to use a minimal base image, which reduces the attack surface. Google's Distroless is one way to achieve this while also minimizing the size of the base image. It provides the following benefits:

  • Reduces the attack surface
  • Minimizes vulnerabilities
  • No shell
  • Reduced size - only includes what is necessary

For every Etcd cluster provisioned by etcd-druid, distroless images are used as base images.

Enable TLS for Peer and Client communication

You should generally enable TLS for peer and client communication of an Etcd cluster. To enable TLS, a CA certificate as well as server and client certificates need to be generated. You can refer to the list of TLS artifacts that are generated for an Etcd cluster provisioned by etcd-druid here.
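
A sketch of how the TLS configuration can look on the Etcd resource; the secret names are placeholders, and the field names should be verified against the version of the Etcd CRD you deploy:

YAML
  spec:
    etcd:
      clientUrlTls:
        tlsCASecretRef:
          name: etcd-ca                # CA bundle for client communication; placeholder
          dataKey: bundle.crt
        serverTLSSecretRef:
          name: etcd-server-tls        # server certificate & key; placeholder
        clientTLSSecretRef:
          name: etcd-client-tls        # client certificate & key; placeholder
      peerUrlTls:
        tlsCASecretRef:
          name: etcd-peer-ca           # separate CA for peer communication; placeholder
          dataKey: bundle.crt
        serverTLSSecretRef:
          name: etcd-peer-server-tls   # peer server certificate & key; placeholder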

Enable TLS for Druid Webhooks

If you choose to enable webhooks in etcd-druid, then it is necessary to create a separate CA and server certificate to be used by the webhook server.

Rotate TLS artifacts

It is generally recommended to rotate all TLS certificates regularly, to limit the impact of a leaked certificate and to avoid expiry. Kubernetes does not support revocation of certificates (see issue#18982). One possible way to effectively revoke a certificate is to rotate the entire chain, including the CA certificate.

Scaling etcd pods

etcd clusters cannot be scaled out horizontally to meet increased traffic/storage demand, for the following reasons:

  • There is a soft limit of 8GB and a hard limit of 10GB for the etcd DB, beyond which the performance and stability of etcd are not guaranteed.
  • Every etcd member maintains a full replica of the entire DB, so scaling out does not help when the storage demand grows.
  • Increasing the number of cluster members beyond 5 increases the cost of reaching consensus amongst the now larger quorum, and increases the load on the single leader, which also needs to participate in bringing up etcd learners.

Therefore the following is recommended:

  • To meet increased compute demand, configure a VPA (see the sketch after this list). You have to be careful with the selection of containerPolicies and targetRef.
  • To meet increased storage demand, etcd-druid already configures each etcd member to auto-compact, and it also configures periodic defragmentation of the etcd DB. The only case where this does not help is a workload that consists almost entirely of unique writes.
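
A minimal VPA sketch for a druid-managed Etcd cluster, assuming the StatefulSet and its containers are named etcd-main, etcd and backup-restore (verify the names in your setup); note how targetRef points at the StatefulSet and containerPolicies restricts scaling to the etcd container:

YAML
  apiVersion: autoscaling.k8s.io/v1
  kind: VerticalPodAutoscaler
  metadata:
    name: etcd-main-vpa
  spec:
    targetRef:
      apiVersion: apps/v1
      kind: StatefulSet            # the StatefulSet managed by etcd-druid
      name: etcd-main              # placeholder; must match your Etcd cluster
    updatePolicy:
      updateMode: Auto
    resourcePolicy:
      containerPolicies:
      - containerName: etcd        # only scale the etcd container
        controlledValues: RequestsOnly
        minAllowed:
          memory: 300Mi
        maxAllowed:
          cpu: "4"
          memory: 30Gi
      - containerName: backup-restore   # leave the sidecar alone; placeholder name
        mode: "Off"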

Note

Care should be taken with the usage of VPA. While it helps to vertically scale up etcd member pods, it can also cause transient quorum loss. This is a direct consequence of the design of VPA: the Recommender component computes recommendations, the Updater evicts pods whose resources do not match the recommendation, and the Admission Controller updates the resources on the Pods. These three components act asynchronously and can fail independently. So while VPA respects PDBs, it can easily enter a state where the Updater evicts a pod (respecting the PDB) but the Admission Controller fails to apply the recommendation; the pod then comes up with its default resources, which still differ from the recommended values, causing a repeat eviction. Other race conditions can also occur, so one needs to be careful when using VPA for quorum-based workloads.

High Availability

To ensure that an Etcd cluster is highly available, the following is recommended:

Ensure that the Etcd cluster members are spread

Etcd cluster members should always be spread across nodes; this provides failure tolerance at the node level. For failure tolerance at the zone level, it is recommended that you spread the Etcd cluster members across zones as well. We recommend using a combination of TopologySpreadConstraints and Pod Anti-Affinity. To set the scheduling constraints, you can either specify them via SchedulingConstraints in the Etcd custom resource or use a MutatingWebhook to dynamically inject them into the pods.

An example of scheduling constraints for a multi-node cluster with zone failure tolerance would be:

YAML
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/component: etcd-statefulset
        app.kubernetes.io/managed-by: etcd-druid
        app.kubernetes.io/name: etcd-main
        app.kubernetes.io/part-of: etcd-main
    maxSkew: 1
    minDomains: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchLabels:
        app.kubernetes.io/component: etcd-statefulset
        app.kubernetes.io/managed-by: etcd-druid
        app.kubernetes.io/name: etcd-main
        app.kubernetes.io/part-of: etcd-main
    maxSkew: 1
    minDomains: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

For a 3-member Etcd cluster, the above TopologySpreadConstraints ensure that the members are spread across zones (assuming there are 3 zones, hence minDomains: 3) and that no two members are placed on the same node.
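
Since the example above only covers TopologySpreadConstraints, here is a sketch of the complementary Pod Anti-Affinity mentioned earlier, using the same labels; it hard-enforces that no two members share a node:

YAML
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: etcd-statefulset
            app.kubernetes.io/managed-by: etcd-druid
            app.kubernetes.io/name: etcd-main
            app.kubernetes.io/part-of: etcd-main
        topologyKey: kubernetes.io/hostname   # no two members on the same node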

Optimize Network Cost

With most cloud providers, there is no network cost (ingress/egress) for traffic that is confined to a single zone. For zonal failure tolerance, it becomes imperative to spread the Etcd cluster members across zones within a region. Since Etcd cluster members are quite chatty (leader election, consensus building for writes, linearizable reads, etc.), this can add to the network cost.

One could evaluate using TopologyAwareRouting, which reduces cross-zonal traffic, thus saving costs and reducing latencies.
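
In Kubernetes, topology-aware routing is enabled per Service via an annotation; a sketch for the client Service of the Etcd cluster (the annotation key depends on your Kubernetes version):

YAML
  metadata:
    annotations:
      # Kubernetes v1.27+; on older clusters use
      # service.kubernetes.io/topology-aware-hints: auto
      service.kubernetes.io/topology-mode: Auto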

Tip

You can read about how it is done in Gardener here.

Metrics & Alerts

Monitoring etcd metrics is essential for fine-tuning Etcd clusters. etcd already exports a lot of metrics. You can see the complete list of metrics exposed by an Etcd cluster provisioned by etcd-druid here. It is also recommended that you configure an alert for etcd space quota alarms.
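
A sketch of such an alert as a Prometheus alerting rule, based on the upstream etcd metrics etcd_mvcc_db_total_size_in_bytes and etcd_server_quota_backend_bytes; the 80% threshold and the duration are illustrative:

YAML
  groups:
  - name: etcd-backend-quota
    rules:
    - alert: EtcdDBSpaceQuotaNearlyExhausted
      # fires when the DB size crosses 80% of the configured backend quota
      expr: (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes) > 0.80
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: etcd DB size is above 80% of the backend quota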

Hibernation

If you have a concept of hibernating Kubernetes clusters, then the following should be kept in mind:

  • Before you bring down the Etcd cluster, leverage the capability to take a full snapshot, which captures the state of the etcd DB and stores it in the configured object store. This ensures that when the cluster is woken up from hibernation, it can be restored to the last state with no data loss.
  • To save costs, you should consider deleting the PersistentVolumeClaims associated with the StatefulSet pods. However, it must be ensured that you take a full snapshot as highlighted in the previous point.
  • When the cluster is woken up from hibernation, do the following (assuming the cluster had 3 members prior to hibernation; see the sketch after this list):
    1. Start the Etcd cluster with 1 replica and let it restore from the last full snapshot.
    2. Once the cluster reports that it is ready, and only then, increase the replicas to the original value (e.g. 3). The other two members each start up as learners and, once they have caught up, join as voting members (followers).
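
A sketch of the wake-up sequence, expressed as two successive edits of the replicas field in the Etcd spec (values assume a 3-member cluster):

YAML
  # step 1: wake up with a single member that restores from the last full snapshot
  spec:
    replicas: 1
  ---
  # step 2: only after the cluster reports Ready, scale back to the original size
  spec:
    replicas: 3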

Reference

  • A nicely written blog post on High Availability and Zone Outage Toleration contains many recommendations that one can borrow from.