Managing Kubernetes Webhook Failures

About me

  • AWS Sr. Cloud Support Engineer (Container and DevOps)

  • Earned all five AWS certifications (2017)

  • Started my Kubernetes journey in 2017 (EKS)

  • Author of "Mastering Elastic Kubernetes Service on AWS"

  • Certifications: CKA, CKS, CKAD (CNCF) + Solutions Architect Professional, DevOps Engineer Professional (AWS)

✍️ Created EasonTechTalk.com

🏃 Marathon runner

🤿 PADI AOW diver

https://easoncao.com/about/

Outline

  1. Overview of Kubernetes admission webhooks
  2. Understanding Admission Webhooks
  3. Common Failure Patterns
  4. Detection and Monitoring
  5. Best Practices and Solutions

What are Admission Webhooks?

Common use cases

  • Image security scanning
  • Namespace management
  • Sidecar injection

Understanding Admission Webhooks

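Validating Webhook

  • Validates whether a resource request complies with specific rules. It can accept or reject the request, but it cannot modify the request content.

Mutating Webhook

  • Besides validating requests, it can also modify the request content, for example by adding default values or injecting extra configuration.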
$ kubectl describe MutatingWebhookConfiguration/aws-load-balancer-webhook
Name:         aws-load-balancer-webhook
API Version:  admissionregistration.k8s.io/v1
Kind:         MutatingWebhookConfiguration
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Service:
      Name:        aws-load-balancer-webhook-service
      Namespace:   kube-system
      Path:        /mutate-v1-pod
      Port:        443
  Failure Policy:  Fail
  Name:            mpod.elbv2.k8s.aws
  Namespace Selector:
    Match Expressions:
      Key:       elbv2.k8s.aws/pod-readiness-gate-inject
      Operator:  In
      Values:
        enabled
  Object Selector:
    Match Expressions:
      Key:       app.kubernetes.io/name
      Operator:  NotIn
      Values:
        aws-load-balancer-controller
  Rules:
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
  ...
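
The difference between the two webhook types shows up in the response each one returns to the API server. The following is a minimal sketch, assuming the `admission.k8s.io/v1` AdmissionReview format (the configuration above advertises `v1beta1`, whose response has the same overall shape). The function names, the UID, the deny message, and the injected label are hypothetical values chosen for illustration.

```python
import base64
import json


def validating_response(uid, allowed, message=""):
    """Build a validating response: it can only allow or deny the request."""
    response = {"uid": uid, "allowed": allowed}
    if not allowed and message:
        # Optional human-readable reason returned to the client.
        response["status"] = {"message": message}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


def mutating_response(uid, json_patch):
    """Build a mutating response: allow the request and attach a JSON Patch."""
    patch_bytes = json.dumps(json_patch).encode("utf-8")
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            # The patch is sent base64-encoded.
            "patch": base64.b64encode(patch_bytes).decode("ascii"),
        },
    }


# Hypothetical values, for illustration only.
deny = validating_response("abc-123", False, "image has not been scanned")
inject = mutating_response(
    "abc-123",
    [{"op": "add", "path": "/metadata/labels/injected", "value": "true"}],
)
```

A real webhook would also copy the `uid` from the incoming request and serve the response over HTTPS; both are omitted here.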

Common Failure Patterns

Network connectivity issues (Timeout)

error when patching "istio-gateway.yaml": Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.default.svc:443/validate/fail?timeout=10s": context deadline exceeded

Certificate expiration and TLS problems

Events:
Type     Reason             Age                From     Message
----     ------             ----               ----     -------
Warning  FailedDeployModel  53m (x9 over 63m)  ingress  (combined from similar events): Failed deploy model due to Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=30s": x509: certificate has expired or is not yet valid: current time 2022-03-03T07:37:16Z is after 2022-02-26T11:24:26Z
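
The x509 message above already contains both timestamps needed to see how long the certificate had been expired. The following is a rough sketch that extracts them, assuming the message keeps the `current time ... is after ...` wording shown in the event. The helper name `expired_for` is hypothetical.

```python
import re
from datetime import datetime

# Matches the two RFC 3339 timestamps in the x509 expiry message.
X509_PATTERN = re.compile(r"current time (\S+) is after (\S+)")
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def expired_for(message):
    """Return how long the certificate had been expired, or None if no match."""
    match = X509_PATTERN.search(message)
    if match is None:
        return None
    current, not_after = (
        datetime.strptime(value, TIME_FORMAT) for value in match.groups()
    )
    return current - not_after


message = (
    "x509: certificate has expired or is not yet valid: "
    "current time 2022-03-03T07:37:16Z is after 2022-02-26T11:24:26Z"
)
delta = expired_for(message)
print(delta)  # 4 days, 20:12:50
```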

Performance degradation (Control Plane)

E0119 11:37:53.532226       1 shared_informer.go:243] unable to sync caches for garbage collector
E0119 11:37:53.532261       1 garbagecollector.go:228] timed out waiting for dependency graph builder sync during GC sync (attempt 73)

I0119 11:37:54.680276       1 request.go:645] Throttling request took 1.047002085s, request: GET:https://10.150.233.43:6443/apis/configuration.konghq.com/v1beta1?timeout=32s
I0119 11:37:54.831942       1 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0119 11:38:04.722878       1 request.go:645] Throttling request took 1.860914441s, request: GET:https://10.150.233.43:6443/apis/acme.cert-manager.io/v1alpha2?timeout=32s
E0119 11:38:04.861576       1 shared_informer.go:243] unable to sync caches for resource quota
E0119 11:38:04.861687       1 resource_quota_controller.go:447] timed out waiting for quota monitor sync

Parameter                            Default
--concurrent-resource-quota-syncs    5
--resource-quota-sync-period         5m0s

Resource constraints and scaling issues (Data plane)

  • CPU
  • Memory
  • I/O performance (e.g. Disk)
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: pod-resources-example
spec:
  resources:
    requests:
      memory: "100Mi"
    limits:
      memory: "200Mi"
  containers:
  - name: memory-demo-ctr
    image: nginx
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]

Case studies

Case 1: System resource exhaustion

$ kubectl get events -n curl
...
23m Normal   SuccessfulCreate replicaset/curl-9454cc476   Created pod: curl-9454cc476-khp45
22m Warning  FailedCreate     replicaset/curl-9454cc476   Error creating: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": dial tcp 10.96.44.51:443: connect: connection refused
$ kubectl get pods -n kube-system
NAME                 READY   STATUS      RESTARTS   AGE
example-pod-1234     0/1     Evicted     0          5m
example-pod-5678     0/1     Evicted     0          10m
example-pod-9012     0/1     Evicted     0          7m
example-pod-3456     0/1     Evicted     1          15m
example-pod-7890     0/1     Evicted     0          3m

Resource exhaustion (OOM, DiskPressure or PIDPressure)

-> Node-pressure Eviction

-> Webhook Pod terminated

Case 2: Job controller blocked cluster status

I0623 12:15:42.123456       1 job_controller.go:256] Syncing Job default/example-job
E0623 12:15:42.124789       1 job_controller.go:276] Error syncing Job "default/example-job": Internal error occurred: failed calling webhook "validate.jobs.example.com": Post "https://webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.96.0.42:443: connect: connection refused
W0623 12:15:42.125000       1 controller.go:285] Retrying webhook request after failure
E0623 12:15:52.130123       1 job_controller.go:276] Error syncing Job "default/example-job": Internal error occurred: failed calling webhook "validate.jobs.example.com": Post "https://webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.96.0.42:443: connect: connection refused
W0623 12:15:52.130456       1 controller.go:285] Retrying webhook request after failure
...

Case 3: Calico + Kyverno: Cluster down

Events:
  Type     Reason        Age                 From                  Message
  ----     ------        ----                ----                  -------
  Warning  FailedCreate  18s (x14 over 60s)  daemonset-controller  Error creating: Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s": no endpoints available for service "kyverno-svc"
{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "RequestResponse",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/calico-system/services/calico-typha",
    "verb": "update",
    "responseStatus": {
        "metadata": {},
        "status": "Failure",
        "message": "Internal error occurred: failed calling webhook \"validate.kyverno.svc-fail\": failed to call webhook: Post \"https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s\": no endpoints available for service \"kyverno-svc\"",
        "reason": "InternalError",
        "details": {
            "causes": [{
                "message": "failed calling webhook \"validate.kyverno.svc-fail\": failed to call webhook: Post \"https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s\": no endpoints available for service \"kyverno-svc\""
            }]
        },
        "code": 500
    }
}
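
The error strings seen in the failure patterns and case studies can be triaged by substring before digging into a specific webhook. The following is a minimal sketch; the labels on the right are hypothetical names for the failure patterns, not official Kubernetes error categories.

```python
# Substring -> failure pattern. The labels are illustrative, not official terms.
FAILURE_PATTERNS = [
    ("context deadline exceeded", "timeout"),
    ("x509: certificate has expired", "expired certificate"),
    ("connection refused", "webhook backend not listening"),
    ("no endpoints available", "no ready webhook Pods"),
]


def classify(message):
    """Map a 'failed calling webhook' message to a rough failure pattern."""
    for needle, label in FAILURE_PATTERNS:
        if needle in message:
            return label
    return "unknown"


event = (
    'failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: '
    'Post "https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s": '
    'no endpoints available for service "kyverno-svc"'
)
print(classify(event))  # no ready webhook Pods
```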

Detection & Monitoring

Key Metrics to Monitor (API Server)

  • apiserver_admission_webhook_rejection_count
  • apiserver_admission_webhook_request_total
  • apiserver_admission_webhook_fail_open_count
$ kubectl get --raw /metrics | grep "apiserver_admission_webhook"
apiserver_admission_webhook_request_total{code="400",name="mpod.elbv2.k8s.aws",operation="CREATE",rejected="true",type="admit"} 17

$ kubectl get --raw /metrics | grep "apiserver_admission_webhook_rejection"
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="mpod.elbv2.k8s.aws",operation="CREATE",rejection_code="400",type="admit"} 17
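
To turn the raw `/metrics` output into something you can alert on, the rejection counter can be summed per webhook name. The following is a small sketch that parses the text format shown above with regular expressions. It assumes simple label values without escaped quotes, and the function name is hypothetical. A production setup would normally let Prometheus do this.

```python
import re
from collections import defaultdict

# One sample line: metric_name{label="value",...} number
SAMPLE_LINE = re.compile(
    r"^(?P<metric>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')
REJECTION_METRIC = "apiserver_admission_webhook_rejection_count"


def rejections_by_webhook(metrics_text):
    """Sum the rejection counter per webhook name."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match is None or match.group("metric") != REJECTION_METRIC:
            continue  # Skip comments, blank lines, and other metrics.
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("name", "")] += float(match.group("value"))
    return dict(totals)


sample = (
    'apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",'
    'name="mpod.elbv2.k8s.aws",operation="CREATE",rejection_code="400",type="admit"} 17'
)
print(rejections_by_webhook(sample))  # {'mpod.elbv2.k8s.aws': 17.0}
```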

Monitoring Tools

  • kube-prometheus-stack

Monitoring

Application level

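For critical applications and services, check whether they expose Prometheus (or equivalent) metrics for monitoring availability, and define specific metrics and alert thresholds for them. Also monitor related resource usage, such as CPU and memory utilization, and network latency.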

Kubernetes API Server logs

fields @timestamp, @message, @logStream
| filter @logStream like /kube-apiserver/
| filter @message like 'failed to call webhook'

Failure Recovery

Quick Fixes

  • Fix webhook service (Review Pod events and logs)
  • Adjust timeoutSeconds (default: 10 seconds, maximum: 30 seconds)
  • Set failurePolicy: Ignore so requests are admitted when the webhook cannot be reached
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook
webhooks:
  - clientConfig:
      service:
        name: aws-load-balancer-webhook-service
        namespace: kube-system
        path: /mutate-v1-pod
    failurePolicy: Fail   # <--- Replace "Fail" with "Ignore"
    name: mpod.elbv2.k8s.aws
    ...
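
When a configuration lists several webhooks, it helps to change only the one that is failing. The following is a minimal sketch that builds a JSON Patch (RFC 6902) for a single webhook entry. It assumes the configuration has already been loaded into a dictionary, and the function name `failure_policy_patch` is hypothetical.

```python
import json


def failure_policy_patch(config, webhook_name, policy="Ignore"):
    """Return a JSON Patch that sets failurePolicy for one named webhook."""
    if policy not in ("Fail", "Ignore"):
        raise ValueError("failurePolicy must be 'Fail' or 'Ignore'")
    for index, webhook in enumerate(config.get("webhooks", [])):
        if webhook.get("name") == webhook_name:
            return [{
                "op": "replace",
                "path": f"/webhooks/{index}/failurePolicy",
                "value": policy,
            }]
    raise KeyError(webhook_name)


# Trimmed-down copy of the configuration shown above.
config = {
    "kind": "MutatingWebhookConfiguration",
    "metadata": {"name": "aws-load-balancer-webhook"},
    "webhooks": [{"name": "mpod.elbv2.k8s.aws", "failurePolicy": "Fail"}],
}
patch = failure_policy_patch(config, "mpod.elbv2.k8s.aws")
print(json.dumps(patch))
# [{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]
```

The printed patch can be passed to `kubectl patch mutatingwebhookconfiguration aws-load-balancer-webhook --type=json -p '<patch>'`. If a controller or Helm release owns the configuration, it may revert the change, so treat this as a temporary mitigation.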

Best Practices

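Resource Configuration

  • Set appropriate resource requests and limits for the webhook service
  • Implement Horizontal Pod Autoscaling (HPA) to handle load changes
  • Use a Pod Disruption Budget to ensure service availability
  • Rotate TLS certificates regularly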

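Reliability and Performance

  • If you run a custom webhook service, make sure to define dedicated Namespaces for it, so that a failure does not affect every resource
  • ValidatingAdmissionPolicy (Kubernetes >= v1.30)
  • MutatingAdmissionPolicy (KEP-3962)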

Validating Admission Policy

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
    - expression: "object.spec.replicas <= 5"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  validationActions: [Deny]
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test

Kubernetes v1.30 [stable]
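
To make the policy and binding above concrete, the following sketch mirrors their logic in plain Python. It is only an illustration of what the CEL expression and the namespace selector express. The API server evaluates CEL in-process and does not run anything like this, and the function name and sample objects are hypothetical.

```python
def violates_policy(deployment, namespace_labels):
    """Return True when the demo policy would deny this Deployment."""
    # Binding: only namespaces labeled environment=test are in scope.
    if namespace_labels.get("environment") != "test":
        return False
    # Validation: the request passes only if object.spec.replicas <= 5.
    return not (deployment["spec"]["replicas"] <= 5)


# Hypothetical objects, for illustration only.
small = {"spec": {"replicas": 3}}
large = {"spec": {"replicas": 8}}

print(violates_policy(small, {"environment": "test"}))  # False
print(violates_policy(large, {"environment": "test"}))  # True
print(violates_policy(large, {"environment": "prod"}))  # False
```

Because the check runs inside the API server, no webhook call is involved, which removes the network and certificate failure modes covered earlier.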

Thank you