Best Practices on writing
a Kubernetes Operator
my lessons from Capsule
Dario Tranchitella - prometherion
GoLab Goes Online! 2020
$ kubectl describe prometherion dario-0
apiVersion: human.kind.io/v1alpha1
kind: Tranchitella
metadata:
  name: prometherion
  labels:
    haproxy.org/maintainer: k8s-ingress
    clastix.io/maintainer: capsule
spec:
  replicas: 1
  image: quay.io/tranchitella.io:0.0.1-rc7
  eyes: blue
  hair: null
status:
  conditions:
  - type: Initialized
    Status: "True"
    lastTransitionTime: "1989-07-03 ??:??:??"
  - type: Ready
    Status: "True"
    lastTransitionTime: "2014-12-06 03:24:00"
  - type: Drumming
    Status: "False"
  - type: KickBoxing
    Status: "False"
WARNING: memes alert!
What is Capsule?
A Kubernetes multi-tenant operator, aiming to provide strong isolation between Namespace resources.
It is not intended to be yet another PaaS: instead, it takes a minimalistic approach, as a lightweight tool leveraging Kubernetes default resources.
Kubernetes (K8s)
An open-source system for automating deployment, scaling, and management of containerized applications.
It groups containers that make up an application into logical units for easy management and discovery.
Features
- Automated rollouts and rollbacks
- Service discovery and load balancing
- Service Topology
- Storage orchestration
- Secret and configuration management
- Batch execution
- Horizontal scaling
- Self-healing
- Isolate workloads via Namespaces
- Runs everywhere, even on your machine!
Namespaces
- Namespaces are intended for use in environments with many users spread across multiple teams, or projects.
- Namespaces are a way to divide cluster resources between multiple users.
# kubectl get namespaces
NAME              STATUS   AGE
oil-development   Active   3d22h
oil-staging       Active   3d22h
oil-production    Active   3d22h
gas-development   Active   2d33h
gas-staging       Active   2d33h
gas-production    Active   2d33h
How to isolate Namespaces?
- LimitRanges
Policies to constrain resource allocations (to Pods or Containers) in a namespace.
- NetworkPolicy
Set of rules that specify how groups of pods are allowed to communicate with each other and other network endpoints.
- ResourceQuota
Policies that constrain aggregate resource consumption per namespace, limiting by type the number of objects that can be created or the total amount of computing resources that can be consumed.
Operator
Based on the Operator Pattern and the reconciliation/control loop.
Translating the human operator's knowledge into code, automating procedures and ensuring the actual state matches the desired one.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&group.Kind{}).
		Owns(&corev1.Namespace{}).
		Owns(&networkingv1.NetworkPolicy{}).
		Owns(&corev1.LimitRange{}).
		Owns(&corev1.ResourceQuota{}).
		Owns(&rbacv1.RoleBinding{}).
		Complete(r)
}

func (r *Reconciler) Reconcile(request ctrl.Request) (result ctrl.Result, err error) {
	instance := &group.Kind{}
	err = r.Get(context.Background(), request.NamespacedName, instance)
	if err != nil {
		if errors.IsNotFound(err) {
			r.Log.Info("Request object not found, could have been deleted after reconcile request")
			return reconcile.Result{}, nil
		}
		r.Log.Error(err, "Error reading the object")
		return reconcile.Result{}, err
	}
	// do your business logic here:
	//
	// if you need to reconcile periodically:
	// return reconcile.Result{Requeue: true, RequeueAfter: time.Minute}, nil
	return reconcile.Result{}, nil
}
Reconciliation Loop
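The idea behind the loop can be sketched without any Kubernetes dependency. The snippet below is only an illustration: the state type and the reconcile function are hypothetical names, and a plain map stands in for the cluster. It shows the core property of a reconciler: compare desired and actual state, act on the difference, and do nothing once they match.

```go
package main

import "fmt"

// state maps a resource name to its spec (hypothetical, simplified model).
type state map[string]string

// reconcile compares desired and actual, applies the difference to actual,
// and returns the list of actions it took.
func reconcile(desired, actual state) []string {
	var actions []string
	for name, spec := range desired {
		current, ok := actual[name]
		switch {
		case !ok:
			actions = append(actions, "create "+name)
			actual[name] = spec
		case current != spec:
			actions = append(actions, "update "+name)
			actual[name] = spec
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name)
			delete(actual, name)
		}
	}
	return actions
}

func main() {
	desired := state{"oil-development": "quota=3"}
	actual := state{"gas-staging": "quota=1"}
	fmt.Println(reconcile(desired, actual)) // first pass converges actual to desired
	fmt.Println(reconcile(desired, actual)) // second pass has nothing left to do
}
```

Running the function twice in a row is the interesting part: a reconciler should be idempotent, so the second pass returns no actions.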
# kubectl get networkpolicy capsule-oil-0 -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  ownerReferences:
  - apiVersion: capsule.clastix.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Tenant
    name: oil
    uid: 57987c02-e883-4642-9c7f-259f6d50ddca # <<< it's a match!
...
# kubectl get tenant oil -o yaml
apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
  uid: 57987c02-e883-4642-9c7f-259f6d50ddca # <<< it's a match!
...
Owned Resources
Finally, Capsule!
Leveraging Kubernetes default resources to handle the multi-tenancy hassle.
The Tenant definition is a Custom Resource Definition (tl;dr: yet another YAML file).
Grouping Namespaces under a new definition: the Tenant.
How does it look
apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
spec:
  owner:
    kind: User
    name: alice
  namespaceQuota: 3
  ingressClasses:
    allowed:
    - default
    allowedRegex: ''
  storageClasses:
    allowed:
    - default
    allowedRegex: ''
  limitRanges:
  - limits:
    - max:
        cpu: '1'
        memory: 1Gi
      min:
        cpu: 50m
        memory: 5Mi
      type: Pod
    - default:
        cpu: 200m
        memory: 100Mi
      defaultRequest:
        cpu: 100m
        memory: 10Mi
      max:
        cpu: '1'
        memory: 1Gi
      min:
        cpu: 50m
        memory: 5Mi
      type: Container
    - max:
        storage: 10Gi
      min:
        storage: 1Gi
      type: PersistentVolumeClaim
  networkPolicies:
  - egress:
    - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
          - 192.168.0.0/12
    ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            capsule.clastix.io/tenant: oil
      - podSelector: {}
      - ipBlock:
          cidr: 192.168.0.0/12
    podSelector: {}
    policyTypes:
    - Ingress
    - Egress
  nodeSelector:
    kubernetes.io/os: linux
  resourceQuotas:
  - hard:
      limits.cpu: '8'
      limits.memory: 16Gi
      requests.cpu: '8'
      requests.memory: 16Gi
    scopes:
    - NotTerminating
  - hard:
      pods: '10'
  - hard:
      requests.storage: 100Gi
Tenant definition
controllerutil.CreateOrUpdate
CreateOrUpdate creates or updates the given object in the Kubernetes cluster.
The object's desired state must be reconciled with the existing state inside the passed in callback MutateFn.
The MutateFn is called regardless of creating or updating an object.
It returns the executed operation and an error.
Package
sigs.k8s.io/controller-runtime/pkg/controller/controllerutil
controllerutil.CreateOrUpdate
// wrong
var err error
err = r.Get(context.TODO(), types.NamespacedName{Name: namespace}, ns)
// any error other than NotFound is silently ignored here
if errors.IsNotFound(err) {
	err = r.Create(context.TODO(), ns, &client.CreateOptions{})
	if err != nil {
		return err
	}
}
if ns.Annotations == nil {
	ns.Annotations = make(map[string]string)
}
var selector []string
for k, v := range selectorMap {
	selector = append(selector, fmt.Sprintf("%s=%s", k, v))
}
ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"] = strings.Join(selector, ",")
return r.Update(context.TODO(), ns, &client.UpdateOptions{})
controllerutil.CreateOrUpdate
// right
var err error
ns := &corev1.Namespace{
	ObjectMeta: metav1.ObjectMeta{
		Name: namespace,
	},
}
_, err = controllerutil.CreateOrUpdate(context.TODO(), r, ns, func() error {
	if ns.Annotations == nil {
		ns.Annotations = make(map[string]string)
	}
	var selector []string
	for k, v := range selectorMap {
		selector = append(selector, fmt.Sprintf("%s=%s", k, v))
	}
	annotationName := "scheduler.alpha.kubernetes.io/node-selector"
	ns.Annotations[annotationName] = strings.Join(selector, ",")
	return nil
})
return err
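One detail worth keeping in mind with the snippet above: Go randomizes map iteration order, so joining the pairs as they come out of the range loop can yield a different annotation value on each run for the same selector. The sketch below is an assumption about how one might stabilize it, not code taken from Capsule; nodeSelectorAnnotation is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nodeSelectorAnnotation renders a selector map as comma-separated "k=v"
// pairs. The pairs are sorted, so the same map always yields the same string.
func nodeSelectorAnnotation(selector map[string]string) string {
	pairs := make([]string, 0, len(selector))
	for k, v := range selector {
		pairs = append(pairs, fmt.Sprintf("%s=%s", k, v))
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

func main() {
	selector := map[string]string{
		"kubernetes.io/os": "linux",
		"disk":             "ssd",
	}
	fmt.Println(nodeSelectorAnnotation(selector))
}
```

A stable value also means the mutate function leaves an already up-to-date object untouched, which avoids needless updates.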
Resource is there (update)
- getting the Resource (from the cache)
- mutating it as fn(obj)
- performing update
Resource is not there (create)
- using the empty object
- mutating it as fn(obj)
- performing creation
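The two flows above can be sketched with an in-memory map standing in for the API server. This is not the controller-runtime implementation: the object, store and createOrUpdate names are hypothetical, and the "unchanged" result is modeled with reflect.DeepEqual as a rough approximation. It only illustrates why the mutate function runs in both cases and why it should start from the stored object.

```go
package main

import (
	"fmt"
	"reflect"
)

// object is a hypothetical, minimal stand-in for a Kubernetes resource.
type object struct {
	Name        string
	Annotations map[string]string
}

// clone returns a copy that does not share the annotations map.
func (o object) clone() object {
	c := object{Name: o.Name}
	if o.Annotations != nil {
		c.Annotations = make(map[string]string, len(o.Annotations))
		for k, v := range o.Annotations {
			c.Annotations[k] = v
		}
	}
	return c
}

// store is a hypothetical in-memory stand-in for the API server.
type store map[string]object

// createOrUpdate fetches the object by name, runs mutate in both cases,
// then creates it, updates it, or leaves it alone if nothing changed.
func createOrUpdate(s store, obj *object, mutate func() error) (string, error) {
	existing, found := s[obj.Name]
	if found {
		*obj = existing.clone() // start from the stored state
	}
	if err := mutate(); err != nil {
		return "unchanged", err
	}
	if found && reflect.DeepEqual(existing, *obj) {
		return "unchanged", nil
	}
	s[obj.Name] = obj.clone()
	if found {
		return "updated", nil
	}
	return "created", nil
}

func main() {
	s := store{}
	ns := &object{Name: "oil-development"}
	mutate := func() error {
		if ns.Annotations == nil {
			ns.Annotations = make(map[string]string)
		}
		ns.Annotations["tenant"] = "oil"
		return nil
	}
	for i := 0; i < 2; i++ {
		op, err := createOrUpdate(s, ns, mutate)
		fmt.Println(op, err)
	}
}
```

Because the mutate function only sets the fields it cares about, whatever else is stored on the object survives the update.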
mutateFn benefits
- Isolating data
- Wrapping functions (middleware)
- Accessing data that would otherwise not be available
Do's
Use the mutateFn to encapsulate your resource reconciliation loop.
Don'ts
Don't pass the expected resource and just apply it: it's cache backed!
And don't overwrite the whole object (metadata or status): you would be losing data!
errors.IsNotFound
If objects have been deleted, they're not found! ¯\_(ツ)_/¯
We got the Kubernetes Garbage Collector doing the dirty job: just let it go.
Package: k8s.io/apimachinery/pkg/api/errors
...but if deletion is complex?
Let's use the Kubernetes finalizers: asynchronous pre-deletion hooks.
You can set them in .metadata.finalizers
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-09-07T10:48:53Z"
  name: default
  resourceVersion: "152"
  selfLink: /api/v1/namespaces/default
  uid: 79fb69c3-a81e-42e0-9476-85d6fb247ce4
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
---
apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
  finalizers:
  - capsule/close-handler
spec:
  ...
func (r *CronJobReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	ctx := context.Background()
	log := r.Log.WithValues("cronjob", req.NamespacedName)

	// allocate the object: passing a nil pointer to Get would fail
	cronJob := &batchv1.CronJob{}
	if err := r.Get(ctx, req.NamespacedName, cronJob); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// name of our custom finalizer
	myFinalizerName := "storage.finalizers.tutorial.kubebuilder.io"

	// examine DeletionTimestamp to determine if object is under deletion
	if cronJob.ObjectMeta.DeletionTimestamp.IsZero() {
		// The object is not being deleted, so if it does not have our finalizer,
		// then let's add the finalizer and update the object. This is equivalent
		// to registering our finalizer.
		if !containsString(cronJob.ObjectMeta.Finalizers, myFinalizerName) {
			cronJob.ObjectMeta.Finalizers = append(cronJob.ObjectMeta.Finalizers, myFinalizerName)
			if err := r.Update(ctx, cronJob); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// The object is being deleted
		if containsString(cronJob.ObjectMeta.Finalizers, myFinalizerName) {
			// our finalizer is present, so let's handle any external dependency
			if err := r.deleteExternalResources(cronJob); err != nil {
				// if we fail to delete the external dependency here, return
				// with error so that it can be retried
				log.Error(err, "unable to delete external resources")
				return ctrl.Result{}, err
			}
			// remove our finalizer from the list and update it.
			cronJob.ObjectMeta.Finalizers = removeString(cronJob.ObjectMeta.Finalizers, myFinalizerName)
			if err := r.Update(ctx, cronJob); err != nil {
				return ctrl.Result{}, err
			}
		}
		// Stop reconciliation as the item is being deleted
		return ctrl.Result{}, nil
	}

	// Your reconcile logic
	return ctrl.Result{}, nil
}
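The reconciler above relies on two small helpers, containsString and removeString, that the slide does not show. A minimal version, assuming they operate on the plain []string stored in .metadata.finalizers, could look like this:

```go
package main

import "fmt"

// containsString reports whether s is present in slice.
func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a new slice holding every item of slice except s.
func removeString(slice []string, s string) []string {
	var result []string
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}

func main() {
	finalizers := []string{"kubernetes", "capsule/close-handler"}
	fmt.Println(containsString(finalizers, "capsule/close-handler"))
	finalizers = removeString(finalizers, "capsule/close-handler")
	fmt.Println(finalizers)
}
```

Since removeString builds a new slice, the original list is left untouched until the result is assigned back to the object.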
Do's
Use finalizers to handle long-running deletions.
Ignore NotFound errors.
Don'ts
Don't keep requeuing on NotFound errors: you're wasting time and cycles.
controllerutil.SetControllerReference
Sets owner as a Controller OwnerReference on controlled. This is used for garbage collection of the controlled object and for reconciling the owner object on changes to controlled (with a Watch + EnqueueRequestForOwner).
controllerutil.SetOwnerReference
A helper method to make sure the given object contains an object reference to the object provided. This allows you to declare that owner has a dependency on the object without specifying it as a controller. If a reference to the same object already exists, it'll be overwritten with the newly provided version.
_, err := controllerutil.CreateOrUpdate(context.TODO(), r.Client, t, func() error {
	t.ObjectMeta.Labels = map[string]string{
		tl: tenant.Name,
		ll: strconv.Itoa(i),
	}
	t.Spec = spec
	return controllerutil.SetControllerReference(tenant, t, r.Scheme)
})
if err != nil {
	return err
}
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&group.Kind{}).
		Owns(&corev1.Namespace{}).
		Owns(&networkingv1.NetworkPolicy{}).
		Owns(&corev1.LimitRange{}).
		Owns(&corev1.ResourceQuota{}).
		Owns(&rbacv1.RoleBinding{}).
		Complete(r)
}
EnqueueRequestForOwner
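What happens when an owned object changes can be sketched in a few lines. The types and the requestsForOwner function below are hypothetical simplifications, not the controller-runtime handler: the point is only that an event on a NetworkPolicy is translated into a reconcile request for the Tenant referenced in its ownerReferences, and (as far as I understand the default behaviour of Owns) only the reference flagged as controller counts.

```go
package main

import "fmt"

// ownerReference keeps only the fields relevant to the mapping (simplified).
type ownerReference struct {
	Kind       string
	Name       string
	Controller bool
}

// object is a hypothetical stand-in for any owned resource.
type object struct {
	Name   string
	Owners []ownerReference
}

// requestsForOwner returns the names of the owners of the given kind that
// should be reconciled when obj changes. Non-controller references are skipped.
func requestsForOwner(obj object, ownerKind string) []string {
	var requests []string
	for _, ref := range obj.Owners {
		if ref.Kind == ownerKind && ref.Controller {
			requests = append(requests, ref.Name)
		}
	}
	return requests
}

func main() {
	policy := object{
		Name:   "capsule-oil-0",
		Owners: []ownerReference{{Kind: "Tenant", Name: "oil", Controller: true}},
	}
	// A change to the NetworkPolicy enqueues its owning Tenant.
	fmt.Println(requestsForOwner(policy, "Tenant"))
}
```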
Do's
Kubernetes GC is so powerful: take advantage of it!
Don'ts
OwnerReference can lead to data loss, use it wisely!
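To see why an OwnerReference can lead to data loss, here is a rough simulation of cascading deletion. It is a simplification under stated assumptions, not the real garbage collector: object and garbageCollect are hypothetical names, owners are matched by UID only, and a dependent is removed once every one of its owners is gone.

```go
package main

import "fmt"

// object is a hypothetical resource identified by UID, with optional owners.
type object struct {
	Name      string
	UID       string
	OwnerUIDs []string
}

// garbageCollect simulates deleting the object with deletedUID and returns
// the objects that survive. Dependents are removed transitively, but only
// when all of their owners are gone.
func garbageCollect(objects []object, deletedUID string) []object {
	gone := map[string]bool{deletedUID: true}
	for changed := true; changed; {
		changed = false
		for _, o := range objects {
			if gone[o.UID] || len(o.OwnerUIDs) == 0 {
				continue
			}
			orphaned := true
			for _, uid := range o.OwnerUIDs {
				if !gone[uid] {
					orphaned = false
					break
				}
			}
			if orphaned {
				gone[o.UID] = true
				changed = true
			}
		}
	}
	var kept []object
	for _, o := range objects {
		if !gone[o.UID] {
			kept = append(kept, o)
		}
	}
	return kept
}

func main() {
	cluster := []object{
		{Name: "oil", UID: "tenant-1"},
		{Name: "oil-production", UID: "ns-1", OwnerUIDs: []string{"tenant-1"}},
		{Name: "capsule-oil-0", UID: "np-1", OwnerUIDs: []string{"ns-1"}},
	}
	// Deleting the Tenant takes the Namespace and everything under it along.
	fmt.Println(len(garbageCollect(cluster, "tenant-1")))
}
```

In this model, deleting a single Tenant wipes its Namespaces and whatever those own in turn, which is exactly the convenience and the risk.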
retry.RetryOnConflict
Used to make an update to a resource when you have to worry about conflicts caused by other code making unrelated updates to the resource at the same time.
fn should fetch the resource to be modified, make appropriate changes to it, try to update it, and return (unmodified) the error from the update function.
On a successful update, RetryOnConflict will return nil. If the update function returns a "Conflict" error, RetryOnConflict will wait some amount of time as described by backoff, and then try again.
On a non-"Conflict" error, or if it retries too many times and gives up, RetryOnConflict will return an error to the caller.
Package
k8s.io/client-go/util/retry
return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	// Retrieving from the cache the actual ResourceQuota
	found := &corev1.ResourceQuota{}
	key := types.NamespacedName{Namespace: rq.Namespace, Name: rq.Name}
	if err := r.Get(context.Background(), key, found); err != nil {
		return err
	}
	// Ensuring annotation map is there to avoid uninitialized map error and
	// assigning the overall usage
	if found.Annotations == nil {
		found.Annotations = make(map[string]string)
	}
	found.Labels = rq.Labels
	found.Annotations[capsulev1alpha1.UsedQuotaFor(resourceName)] = qt.String()
	// Updating the Resource according to the qt.Cmp result
	found.Spec.Hard = rq.Spec.Hard
	return r.Update(context.Background(), found, &client.UpdateOptions{})
})
RetryOnConflict in action
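The mechanism behind the snippet above is optimistic locking on the resource version. The sketch below simulates it with the standard library only; server, quota and retryOnConflict are hypothetical names, and the backoff delay between attempts is left out to keep it short. It shows why the object has to be re-read inside the closure on every attempt.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: object has been modified")

// quota is a hypothetical resource carrying a version for optimistic locking.
type quota struct {
	ResourceVersion int
	Hard            string
}

// server is a hypothetical in-memory API server.
type server struct{ stored quota }

func (s *server) get() quota { return s.stored }

// update rejects writes based on a stale ResourceVersion.
func (s *server) update(q quota) error {
	if q.ResourceVersion != s.stored.ResourceVersion {
		return errConflict
	}
	q.ResourceVersion++
	s.stored = q
	return nil
}

// retryOnConflict re-runs fn while it returns errConflict, at most steps times.
func retryOnConflict(steps int, fn func() error) error {
	var err error
	for i := 0; i < steps; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &server{stored: quota{ResourceVersion: 1, Hard: "pods=5"}}
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++
		found := s.get() // always re-read inside the closure
		if attempts == 1 {
			// Simulate another writer sneaking in between our read and write.
			_ = s.update(quota{ResourceVersion: found.ResourceVersion, Hard: "pods=7"})
		}
		found.Hard = "pods=10"
		return s.update(found)
	})
	fmt.Println(err, attempts, s.get().Hard)
}
```

If the read happened once outside the closure, every retry would resubmit the same stale version and fail again.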
Combine RetryOnConflict with CreateOrUpdate
// CreateOrUpdate needs the object key: set name and namespace up front
tls := &corev1.Secret{
	ObjectMeta: metav1.ObjectMeta{
		Namespace: r.Namespace,
		Name:      tlsSecretName,
	},
}
err = retry.RetryOnConflict(retry.DefaultBackoff, func() error {
	// CreateOrUpdate fetches the current Secret on every attempt
	_, err := controllerutil.CreateOrUpdate(context.Background(), r.Client, tls, func() error {
		tls.Data = map[string][]byte{}
		return nil
	})
	return err
})
if err != nil {
	return err
}
client.Client
Not a versioned clientset: it uses the NamespacedName struct and a typed object to interact with any versioned API, CRDs included.
Combined with the Controller Manager, reads are served from a local cache.
Package
sigs.k8s.io/controller-runtime/pkg/client
ns := &corev1.Namespace{}
if err := r.Client.Get(context.TODO(), types.NamespacedName{Name: namespace}, ns); err != nil {
	return err
}
Retrieving any object
import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"

	capsulev1alpha1 "github.com/clastix/capsule/api/v1alpha1"
)

// scheme holds the Go types the client is able to handle
var scheme = runtime.NewScheme()

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(capsulev1alpha1.AddToScheme(scheme))
}
// and avoid the GVK pain with code-generator!
Also CRDs... but first:
client.FieldIndexer
The FieldIndexer knows how to index over a particular field such that it can later be used by a field selector.
tl;dr: don't scan in O(n), use an Indexer for O(1) lookups
Package
sigs.k8s.io/controller-runtime/pkg/client
type CustomIndexer interface {
	Object() runtime.Object
	Field() string
	Func() client.IndexerFunc
}

type NamespacesReference struct {
}

func (o NamespacesReference) Object() runtime.Object {
	return &v1alpha1.Tenant{}
}

func (o NamespacesReference) Field() string {
	return ".status.namespaces"
}

func (o NamespacesReference) Func() client.IndexerFunc {
	return func(object runtime.Object) (res []string) {
		tenant := object.(*v1alpha1.Tenant)
		return tenant.Status.Namespaces.DeepCopy()
	}
}

// register the Indexer to the manager
var AddToIndexerFuncs []CustomIndexer

func AddToManager(mgr manager.Manager) error {
	for _, f := range AddToIndexerFuncs {
		err := mgr.GetFieldIndexer().IndexField(context.TODO(), f.Object(), f.Field(), f.Func())
		if err != nil {
			return err
		}
	}
	return nil
}

// retrieve from the manager Client
tl := &v1alpha1.TenantList{}
if err := c.List(ctx, tl, client.MatchingFieldsSelector{
	Selector: fields.OneTermEqualSelector(".status.namespaces", object.Namespace()),
}); err != nil {
	return err
}
if len(tl.Items) == 0 {
	return fmt.Errorf("No tenants are handling the %s Namespace", object.Namespace())
}
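Conceptually, the field index above is a map from an extracted value to the objects exposing it. The following sketch uses hypothetical names (tenant, index, buildIndex) and plain maps rather than the controller-runtime cache, just to show how the indexer function turns a scan over every Tenant into a single lookup by Namespace name.

```go
package main

import "fmt"

// tenant is a hypothetical, trimmed-down version of the Tenant resource.
type tenant struct {
	Name       string
	Namespaces []string
}

// index maps an indexed value (a Namespace name) to the tenants exposing it.
type index map[string][]string

// indexerFunc extracts the values to index from a tenant.
type indexerFunc func(t tenant) []string

// buildIndex runs fn once per tenant and records every returned value.
func buildIndex(tenants []tenant, fn indexerFunc) index {
	idx := index{}
	for _, t := range tenants {
		for _, value := range fn(t) {
			idx[value] = append(idx[value], t.Name)
		}
	}
	return idx
}

func main() {
	tenants := []tenant{
		{Name: "oil", Namespaces: []string{"oil-development", "oil-production"}},
		{Name: "gas", Namespaces: []string{"gas-staging"}},
	}
	idx := buildIndex(tenants, func(t tenant) []string { return t.Namespaces })

	// A single map lookup replaces a loop over every Tenant.
	fmt.Println(idx["oil-production"])
	fmt.Println(len(idx["unknown"]))
}
```

The price is memory: every indexed value is kept in the map for as long as the objects live in the cache.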
Do's
Useful for simple entries.
Don'ts
Doesn't work well with complex types, which can make the logic hard to understand.
Be aware of memory consumption: indexed values are stored in maps.
manager.Runnable
If you need to run something after Manager start-up, you have to implement the Runnable interface.
Package
sigs.k8s.io/controller-runtime/pkg/manager
// Runnable allows a component to be started.
// It's very important that Start blocks until
// it's done running.
type Runnable interface {
	// Start starts running the component. The component will stop running
	// when the channel is closed. Start blocks until the channel is closed
	// or an error occurs.
	Start(<-chan struct{}) error
}
Our problem
Capsule intercepts the User requests and filters them by Organization/Group, which can be set at runtime via a CLI flag: this means editing all the RBAC settings.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: capsule-namespace-provisioner
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: capsule-namespace-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: capsule-namespace-provisioner
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: capsule.clastix.io
  # ^^^ this is the Group/Org name
func (r *Manager) Start(<-chan struct{}) (err error) {
	for roleName := range clusterRoles {
		if err = r.EnsureClusterRole(roleName); err != nil {
			return
		}
	}
	err = r.EnsureClusterRoleBinding()
	return
}
// but it could run anything you want, like an HTTP server!
func (n kubeFilter) Start(stop <-chan struct{}) error {
	http.HandleFunc("/api/v1/namespaces", func(writer http.ResponseWriter, request *http.Request) {
		if request.Method == "GET" || n.isWatchEndpoint(request) {
			if err := n.decorateRequest(writer, request); err != nil {
				n.handleError(err, writer)
				return
			}
		}
		n.reverseProxyFunc(writer, request)
	})
	http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
		n.reverseProxyFunc(writer, request)
	})
	http.HandleFunc("/_healthz", func(writer http.ResponseWriter, request *http.Request) {
		writer.WriteHeader(200)
		_, _ = writer.Write([]byte("ok"))
	})
	go func() {
		if err := http.ListenAndServe(fmt.Sprintf("0.0.0.0:%d", n.listeningPort), nil); err != nil {
			panic(err)
		}
	}()
	<-stop
	return nil
}
predicate.Predicate
Predicate filters events before enqueuing the keys.
Package
sigs.k8s.io/controller-runtime/pkg/predicate
type Predicate interface {
	// Create returns true if the Create event should be processed
	Create(event.CreateEvent) bool
	// Delete returns true if the Delete event should be processed
	Delete(event.DeleteEvent) bool
	// Update returns true if the Update event should be processed
	Update(event.UpdateEvent) bool
	// Generic returns true if the Generic event should be processed
	Generic(event.GenericEvent) bool
}
Our problem
Capsule (ab)uses Dynamic Admission Control resources (aka webhooks), which must be secured with TLS/HTTPS: we need a CA and a certificate, each of them stored in a Kubernetes Secret resource.
// Certificate
func (r *TlsReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}, forOptionPerInstanceName(tlsSecretName)).
		Complete(r)
}

// CA
func (r *CaReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}, forOptionPerInstanceName(caSecretName)).
		Complete(r)
}

// For defines the type of Object being *reconciled*, and configures the ControllerManagedBy
// to respond to create / delete / update events by *reconciling the object*.
// This is the equivalent of calling
// Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{})
func (blder *Builder) For(object runtime.Object, opts ...ForOption) *Builder {
	input := ForInput{object: object}
	for _, opt := range opts {
		opt.ApplyToFor(&input)
	}
	blder.forInput = input
	return blder
}

func forOptionPerInstanceName(instanceName string) builder.ForOption {
	return builder.WithPredicates(predicate.Funcs{
		CreateFunc: func(event event.CreateEvent) bool {
			return filterByName(event.Meta.GetName(), instanceName)
		},
		DeleteFunc: func(deleteEvent event.DeleteEvent) bool {
			return filterByName(deleteEvent.Meta.GetName(), instanceName)
		},
		UpdateFunc: func(updateEvent event.UpdateEvent) bool {
			return filterByName(updateEvent.MetaNew.GetName(), instanceName)
		},
		GenericFunc: func(genericEvent event.GenericEvent) bool {
			return filterByName(genericEvent.Meta.GetName(), instanceName)
		},
	})
}

func filterByName(objName, desired string) bool {
	return objName == desired
}
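The effect of such a predicate can be illustrated without the controller-runtime types. In this sketch the event and predicate types are hypothetical simplifications, and the Secret names capsule-tls and capsule-ca are made up for the example (the real values of tlsSecretName and caSecretName are not shown in the slides). Two reconcilers watch the same kind, but each one only sees the events for its own Secret.

```go
package main

import "fmt"

// event is a hypothetical, simplified event carrying only the object name.
type event struct{ Name string }

// predicate decides whether an event should reach the work queue.
type predicate func(event) bool

// byName builds a predicate that accepts only events for the desired object.
func byName(desired string) predicate {
	return func(e event) bool { return e.Name == desired }
}

// filter returns the names that would be enqueued for reconciliation.
func filter(events []event, p predicate) []string {
	var queue []string
	for _, e := range events {
		if p(e) {
			queue = append(queue, e.Name)
		}
	}
	return queue
}

func main() {
	// Every Secret in the cluster generates events, not only ours.
	events := []event{{"capsule-tls"}, {"default-token"}, {"capsule-ca"}, {"capsule-tls"}}

	fmt.Println(filter(events, byName("capsule-tls"))) // certificate reconciler
	fmt.Println(filter(events, byName("capsule-ca")))  // CA reconciler
}
```

Filtering before enqueuing keeps unrelated Secrets from ever triggering a reconciliation.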
Final thoughts