Managing Kubernetes Webhook Failures

About me
- AWS Sr. Cloud Support Engineer (Container and DevOps)
- All five AWS certifications (2017)
- Started my Kubernetes journey in 2017 (EKS)
- Author of《Mastering Elastic Kubernetes Service on AWS》
- Certifications: CKA, CKS, CKAD (CNCF) + Solutions Architect Professional, DevOps Engineer Professional (AWS)

✍️ Created EasonTechTalk.com
🏃 Marathon runner
🤿 PADI AOW diver

https://easoncao.com/about/
Outline
- Overview of Kubernetes admission webhooks
- Understanding Admission Webhooks
- Common Failure Patterns
- Detection and Monitoring
- Best Practices and Solutions

What are Admission Webhooks?
Common use cases
- Image security scanning
- Namespace management
- Sidecar injection


Understanding Admission Webhooks

Validating webhooks
- Check whether a resource request complies with specific rules; they can accept or reject the request, but cannot modify its contents.
Mutating webhooks
- In addition to validating a request, they can also modify its contents, for example adding default values or injecting extra configuration.
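
Both kinds are registered with the API server through ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects. A quick way to see which webhooks are active in a cluster:

$ kubectl get validatingwebhookconfigurations
$ kubectl get mutatingwebhookconfigurations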



$ kubectl describe MutatingWebhookConfiguration/aws-load-balancer-webhook
Name:          aws-load-balancer-webhook
API Version:   admissionregistration.k8s.io/v1
Kind:          MutatingWebhookConfiguration
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Service:
      Name:       aws-load-balancer-webhook-service
      Namespace:  kube-system
      Path:       /mutate-v1-pod
      Port:       443
  Failure Policy:  Fail
  Name:            mpod.elbv2.k8s.aws
  Namespace Selector:
    Match Expressions:
      Key:       elbv2.k8s.aws/pod-readiness-gate-inject
      Operator:  In
      Values:
        enabled
  Object Selector:
    Match Expressions:
      Key:       app.kubernetes.io/name
      Operator:  NotIn
      Values:
        aws-load-balancer-controller
  Rules:
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
...
Common Failure Patterns

Network connectivity issues (Timeout)

error when patching "istio-gateway.yaml": Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.default.svc:443/validate/fail?timeout=10s": context deadline exceeded
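
When a webhook call times out like this, a useful first check is whether the webhook Service has ready endpoints and is reachable from inside the cluster. A sketch reusing the service name and namespace from the error above (the probe pod name and curl image are illustrative; the webhook will reject the empty request, which is fine for a reachability test):

$ kubectl -n default get endpoints kyverno-svc
$ kubectl run webhook-probe --rm -it --restart=Never --image=curlimages/curl --command -- \
    curl -skv --max-time 10 https://kyverno-svc.default.svc:443/validate/fail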

Certificate expiration and TLS problems
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedDeployModel 53m (x9 over 63m) ingress (combined from similar events): Failed deploy model due to Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=30s": x509: certificate has expired or is not yet valid: current time 2022-03-03T07:37:16Z is after 2022-02-26T11:24:26Z
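
To confirm whether the serving certificate really has expired, inspect the certificate presented by the webhook service itself. A sketch using a local port-forward (the local port 8443 is arbitrary; service name and namespace are taken from the event above):

$ kubectl -n kube-system port-forward service/aws-load-balancer-webhook-service 8443:443 &
$ echo | openssl s_client -connect 127.0.0.1:8443 2>/dev/null | openssl x509 -noout -dates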


Performance degradation (Control Plane)
E0119 11:37:53.532226 1 shared_informer.go:243] unable to sync caches for garbage collector
E0119 11:37:53.532261 1 garbagecollector.go:228] timed out waiting for dependency graph builder sync during GC sync (attempt 73)
I0119 11:37:54.680276 1 request.go:645] Throttling request took 1.047002085s, request: GET:https://10.150.233.43:6443/apis/configuration.konghq.com/v1beta1?timeout=32s
I0119 11:37:54.831942 1 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0119 11:38:04.722878 1 request.go:645] Throttling request took 1.860914441s, request: GET:https://10.150.233.43:6443/apis/acme.cert-manager.io/v1alpha2?timeout=32s
E0119 11:38:04.861576 1 shared_informer.go:243] unable to sync caches for resource quota
E0119 11:38:04.861687 1 resource_quota_controller.go:447] timed out waiting for quota monitor sync

Parameter                          | Default
-----------------------------------|--------
--concurrent-resource-quota-syncs  | 5
--resource-quota-sync-period       | 5m0s
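
One common cause of these garbage-collector and quota-monitor sync timeouts is an API group that can no longer be served (for example, an APIService left behind by a removed add-on). A quick check for API services that are not reporting Available:

$ kubectl get apiservices | grep -v True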


Resource constraints and scaling issues (Data plane)
- CPU
- Memory
- I/O performance (e.g. Disk)
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: pod-resources-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress   # image that ships the `stress` binary
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]

Case studies

Case 1: System resource exhaustion
$ kubectl get events -n curl
...
23m Normal SuccessfulCreate replicaset/curl-9454cc476 Created pod: curl-9454cc476-khp45
22m Warning FailedCreate replicaset/curl-9454cc476 Error creating: Internal error occurred: failed calling webhook "namespace.sidecar-injector.istio.io": failed to call webhook: Post "https://istiod.istio-system.svc:443/inject?timeout=10s": dial tcp 10.96.44.51:443: connect: connection refused
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
example-pod-1234 0/1 Evicted 0 5m
example-pod-5678 0/1 Evicted 0 10m
example-pod-9012 0/1 Evicted 0 7m
example-pod-3456 0/1 Evicted 1 15m
example-pod-7890 0/1 Evicted 0 3m
Resource exhaustion (OOM, DiskPressure or PIDPressure)
-> Node-pressure Eviction
-> Webhook Pod terminated
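
Node-pressure evictions are visible in the node conditions, so it is worth correlating the evicted webhook pods with the node state (the node name is a placeholder):

$ kubectl get nodes
$ kubectl describe node <node-name> | grep -A 8 "Conditions:"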

Case 2: Blocked Job controller stalls cluster status
I0623 12:15:42.123456 1 job_controller.go:256] Syncing Job default/example-job
E0623 12:15:42.124789 1 job_controller.go:276] Error syncing Job "default/example-job": Internal error occurred: failed calling webhook "validate.jobs.example.com": Post "https://webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.96.0.42:443: connect: connection refused
W0623 12:15:42.125000 1 controller.go:285] Retrying webhook request after failure
E0623 12:15:52.130123 1 job_controller.go:276] Error syncing Job "default/example-job": Internal error occurred: failed calling webhook "validate.jobs.example.com": Post "https://webhook-service.default.svc:443/validate?timeout=10s": dial tcp 10.96.0.42:443: connect: connection refused
W0623 12:15:52.130456 1 controller.go:285] Retrying webhook request after failure
...
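
Because the controller keeps retrying, these failures also surface as recurring events. A quick way to scan the whole cluster for webhook-related create failures (the reason filter is one example):

$ kubectl get events -A --field-selector reason=FailedCreate --sort-by=.lastTimestamp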


Case 3: Calico + Kyverno: Cluster down
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreate 18s (x14 over 60s) daemonset-controller Error creating: Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s": no endpoints available for service "kyverno-svc"
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/calico-system/services/calico-typha",
  "verb": "update",
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "message": "Internal error occurred: failed calling webhook \"validate.kyverno.svc-fail\": failed to call webhook: Post \"https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s\": no endpoints available for service \"kyverno-svc\"",
    "reason": "InternalError",
    "details": {
      "causes": [{
        "message": "failed calling webhook \"validate.kyverno.svc-fail\": failed to call webhook: Post \"https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s\": no endpoints available for service \"kyverno-svc\""
      }]
    },
    "code": 500
  }
}
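
The deadlock here is a fail-closed webhook that also intercepts the CNI's own resources: once the CNI is down, neither the CNI nor the webhook pods can be recreated. One mitigation is to exclude cluster-critical namespaces from the webhook's scope via a namespaceSelector. A sketch using the automatic kubernetes.io/metadata.name label (the exact namespace list depends on your cluster):

namespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values: ["kube-system", "calico-system", "tigera-operator", "kyverno"]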

Detection & Monitoring

Key Metrics to Monitor (API Server)
- apiserver_admission_webhook_rejection_count
- apiserver_admission_webhook_request_total
- apiserver_admission_webhook_fail_open_count
$ kubectl get --raw /metrics | grep "apiserver_admission_webhook"
apiserver_admission_webhook_request_total{code="400",name="mpod.elbv2.k8s.aws",operation="CREATE",rejected="true",type="admit"} 17
$ kubectl get --raw /metrics | grep "apiserver_admission_webhook_rejection"
apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error",name="mpod.elbv2.k8s.aws",operation="CREATE",rejection_code="400",type="admit"} 17

Monitoring Tools
- kube-prometheus-stack
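
With kube-prometheus-stack, the API server metrics above can feed an alert rule. A minimal sketch (rule name, namespace, and threshold are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: admission-webhook-failures
  namespace: monitoring
spec:
  groups:
  - name: admission-webhooks
    rules:
    - alert: AdmissionWebhookCallFailures
      expr: sum by (name) (rate(apiserver_admission_webhook_rejection_count{error_type="calling_webhook_error"}[5m])) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Admission webhook {{ $labels.name }} is failing to respond"
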
Monitoring


Application level
For critical applications and services, check whether they expose Prometheus (or equivalent) metrics for availability, and define dedicated monitoring metrics and alert thresholds for them. Also monitor related resource usage, such as CPU and memory utilization, as well as network latency.
Kubernetes API Server logs
fields @timestamp, @message, @logStream
| filter @logStream like /kube-apiserver/
| filter @message like 'failed to call webhook'


Failure Recovery
Quick Fixes
- Fix webhook service (Review Pod events and logs)
- timeoutSeconds adjustments (default: 10 seconds)
- failurePolicy: Ignore option
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook
webhooks:
- clientConfig:
    service:
      name: aws-load-balancer-webhook-service
      namespace: kube-system
      path: /mutate-v1-pod
  failurePolicy: Fail   # <--- Replace "Fail" with "Ignore"
  name: mpod.elbv2.k8s.aws
...
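
If the webhook backend cannot be restored quickly and the blocked resources matter more than the policy, a one-off mitigation is to patch the configuration to fail open (names taken from the manifest above; revert once the webhook is healthy again):

$ kubectl patch mutatingwebhookconfiguration aws-load-balancer-webhook \
    --type=json \
    -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'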

Best Practices
Resource configuration optimization
- Set appropriate resource requests and limits for the webhook service
- Implement Horizontal Pod Autoscaling (HPA) to handle load changes
- Use a Pod Disruption Budget to keep the service available (see the sketch below)
- Rotate TLS certificates regularly
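
A minimal Pod Disruption Budget sketch for a webhook Deployment (the selector label is illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-webhook-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-webhook
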
Reliability and performance
- If you run a custom webhook service, scope it to dedicated Namespaces so that a webhook failure does not affect every resource in the cluster
- ValidatingAdmissionPolicy (Kubernetes >= v1.30)
- MutatingAdmissionPolicy (KEP-3962)

Validating Admission Policy

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
  - expression: "object.spec.replicas <= 5"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding-test.example.com"
spec:
  policyName: "demo-policy.example.com"
  validationActions: [Deny]
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test
Kubernetes v1.30 [stable]
Thank you


