Best Practices on writing
a Kubernetes Operator

my lessons from Capsule


Dario Tranchitella - prometherion

GoLab Goes Online! 2020

$ kubectl describe prometherion dario-0

apiVersion: human.kind.io/v1alpha1
kind: Tranchitella
metadata:
  name: prometherion
  labels:
    haproxy.org/maintainer: k8s-ingress
    clastix.io/maintainer: capsule
spec:
  replicas: 1
  image: quay.io/tranchitella.io:0.0.1-rc7
  eyes: blue
  hair: null
status:
  conditions:
  - type: Initialized
    Status: "True"
    lastTransitionTime: "1989-07-03 ??:??:??"
  - type: Ready
    Status: "True"
    lastTransitionTime: "2014-12-06 03:24:00"
  - type: Drumming
    Status: "False"
  - type: KickBoxing
    Status: "False"

WARNING: memes alert!

What is Capsule?

A Kubernetes multi-tenant operator, aiming to provide strong isolation between Namespace resources.

 

Not intended to be yet another PaaS: instead, it takes a minimalistic approach, as a lightweight tool leveraging Kubernetes default resources.

CAPSULE
PRESENTATION

AUDIENCE

Kubernetes (K8s)

An open-source system for automating deployment, scaling, and management of containerized applications.

 

It groups containers that make up an application into logical units for easy management and discovery.

Features

  • Automated rollouts and rollbacks
  • Service discovery and load balancing
  • Service Topology
  • Storage orchestration
  • Secret and configuration management
  • Batch execution
  • Horizontal scaling
  • Self-healing
  • Isolate workloads via Namespaces
  • Run everywhere, also your machine!

Namespaces

  • Namespaces are intended for use in environments with many users spread across multiple teams, or projects.
  • Namespaces are a way to divide cluster resources between multiple users.
# kubectl get namespaces

NAME                 STATUS        AGE
oil-development      Active        3d22h
oil-staging          Active        3d22h
oil-production       Active        3d22h
gas-development      Active        2d33h
gas-staging          Active        2d33h
gas-production       Active        2d33h

How to isolate Namespaces?

  • LimitRanges
    Policies to constrain resource allocations (to Pods or Containers) in a namespace
  • NetworkPolicy
    Set of rules that specify how groups of pods are allowed to communicate with each other and other network endpoints
  • ResourceQuota
    Policies that limit aggregate resource consumption per namespace, either by the number of objects that can be created (by type) or by the total amount of computing resources they may consume.

Operator

Based on the Operator Pattern and the reconciliation/control loop.

 

Translating the human operator's knowledge into code, automating procedures and ensuring the actual state matches the desired one.
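To make the control loop idea concrete, here is a minimal, stdlib-only sketch. It is not controller-runtime code: the State type and the reconcile function are hypothetical stand-ins, assuming the cluster can be modelled as a plain map.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps a resource name to its spec: a hypothetical stand-in for cluster objects.
type State map[string]string

// reconcile compares desired and actual state, applies the changes needed
// to make actual converge, and returns the list of actions performed.
func reconcile(desired, actual State) []string {
	var actions []string
	for name, spec := range desired {
		current, found := actual[name]
		switch {
		case !found:
			actual[name] = spec
			actions = append(actions, "create "+name)
		case current != spec:
			actual[name] = spec
			actions = append(actions, "update "+name)
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			delete(actual, name)
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // map iteration order is random: sort for stable output
	return actions
}

func main() {
	desired := State{"oil-development": "quota=3", "oil-staging": "quota=3"}
	actual := State{"oil-development": "quota=1", "gas-legacy": "quota=1"}

	fmt.Println(reconcile(desired, actual)) // first pass converges
	fmt.Println(reconcile(desired, actual)) // second pass has nothing left to do
}
```

The second call is the important one: a reconciliation must be idempotent, since it runs again and again.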

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
  return ctrl.NewControllerManagedBy(mgr).
    For(&group.Kind{}).
    Owns(&corev1.Namespace{}).
    Owns(&networkingv1.NetworkPolicy{}).
    Owns(&corev1.LimitRange{}).
    Owns(&corev1.ResourceQuota{}).
    Owns(&rbacv1.RoleBinding{}).
    Complete(r)
}

func (r *Reconciler) Reconcile(request ctrl.Request) (result ctrl.Result, err error) {
  instance := &group.Kind{}
  err = r.Get(context.Background(), request.NamespacedName, instance)
  if err != nil {
    if errors.IsNotFound(err) {
      r.Log.Info("Request object not found, could have been deleted after reconcile request")
      return reconcile.Result{}, nil
    }
    r.Log.Error(err, "Error reading the object")
    return reconcile.Result{}, err
  }
    
  // do your business logic here:
  //
  // if you need to reconcile periodically, requeue after a fixed interval:
  // return reconcile.Result{Requeue: true, RequeueAfter: time.Minute}, nil
    
  return reconcile.Result{}, nil
}

Reconciliation Loop

# kubectl get networkpolicy capsule-oil-0 -o yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  ownerReferences:
  - apiVersion: capsule.clastix.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Tenant
    name: oil
    uid: 57987c02-e883-4642-9c7f-259f6d50ddca # <<< it's a match!
...


# kubectl get tenant oil -o yaml
apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
  uid: 57987c02-e883-4642-9c7f-259f6d50ddca  # <<< it's a match!
...

Owned Resources

Finally, Capsule!

Leveraging Kubernetes default resources to handle the multi-tenancy hassle.

 

The Tenant definition is a Custom Resource Definition (tl;dr; yet another YAML file).

 

Grouping Namespaces under a new definition: the Tenant.

How does it look

apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
spec:
  owner:
    kind: User
    name: alice
  namespaceQuota: 3
  ingressClasses:
    allowed:
      - default
    allowedRegex: ''
  storageClasses:
    allowed:
      - default
    allowedRegex: ''
  limitRanges:
    - limits:
        - max:
            cpu: '1'
            memory: 1Gi
          min:
            cpu: 50m
            memory: 5Mi
          type: Pod
        - default:
            cpu: 200m
            memory: 100Mi
          defaultRequest:
            cpu: 100m
            memory: 10Mi
          max:
            cpu: '1'
            memory: 1Gi
          min:
            cpu: 50m
            memory: 5Mi
          type: Container
        - max:
            storage: 10Gi
          min:
            storage: 1Gi
          type: PersistentVolumeClaim
  networkPolicies:
    - egress:
        - to:
            - ipBlock:
                cidr: 0.0.0.0/0
                except:
                  - 192.168.0.0/12
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  capsule.clastix.io/tenant: oil
            - podSelector: {}
            - ipBlock:
                cidr: 192.168.0.0/12
      podSelector: {}
      policyTypes:
        - Ingress
        - Egress
  nodeSelector:
    kubernetes.io/os: linux
  resourceQuotas:
    - hard:
        limits.cpu: '8'
        limits.memory: 16Gi
        requests.cpu: '8'
        requests.memory: 16Gi
      scopes:
        - NotTerminating
    - hard:
        pods: '10'
    - hard:
        requests.storage: 100Gi

Tenant definition

controllerutil.CreateOrUpdate

CreateOrUpdate creates or updates the given object in the Kubernetes cluster.

The object's desired state must be reconciled with the existing state inside the passed in callback MutateFn.

The MutateFn is called regardless of creating or updating an object.

It returns the executed operation and an error.


Package
sigs.k8s.io/controller-runtime/pkg/controller/controllerutil
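The flow described above can be sketched without a cluster. This is only an in-memory model under stated assumptions: Object, Store and createOrUpdate are hypothetical names, not the controller-runtime implementation, and the returned strings merely mimic the idea of an operation result.

```go
package main

import "fmt"

// Object is a hypothetical, simplified stand-in for a Kubernetes object.
type Object struct {
	Name        string
	Annotations map[string]string
}

// Store is an in-memory stand-in for the API server.
type Store map[string]Object

// clone copies the object, annotations included, preserving a nil map.
func clone(o Object) Object {
	c := Object{Name: o.Name}
	if o.Annotations != nil {
		c.Annotations = make(map[string]string, len(o.Annotations))
		for k, v := range o.Annotations {
			c.Annotations[k] = v
		}
	}
	return c
}

// equal compares two objects by name and annotations.
func equal(a, b Object) bool {
	if a.Name != b.Name || len(a.Annotations) != len(b.Annotations) {
		return false
	}
	for k, v := range a.Annotations {
		if w, ok := b.Annotations[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// createOrUpdate fetches the object, runs mutate in both cases, then
// creates it, updates it, or does nothing when mutate changed nothing.
func createOrUpdate(s Store, obj *Object, mutate func() error) (string, error) {
	existing, found := s[obj.Name]
	if found {
		*obj = clone(existing) // start from the live object, not the caller's copy
	}
	if err := mutate(); err != nil {
		return "", err
	}
	switch {
	case !found:
		s[obj.Name] = clone(*obj)
		return "created", nil
	case equal(existing, *obj):
		return "unchanged", nil
	default:
		s[obj.Name] = clone(*obj)
		return "updated", nil
	}
}

func main() {
	s := Store{}
	ns := &Object{Name: "oil-development"}
	mutate := func() error {
		if ns.Annotations == nil {
			ns.Annotations = make(map[string]string)
		}
		ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"] = "kubernetes.io/os=linux"
		return nil
	}
	for i := 0; i < 2; i++ {
		op, err := createOrUpdate(s, ns, mutate)
		fmt.Println(op, err)
	}
}
```

Since the mutate function only touches the fields it cares about, running it on the live object keeps everything else intact.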

controllerutil.CreateOrUpdate

# wrong

var err error

err = r.Get(context.TODO(), types.NamespacedName{Name: namespace}, ns)
if errors.IsNotFound(err) {
	err = r.Create(context.TODO(), ns, &client.CreateOptions{})
	if err != nil {
		return err
	}
}

if ns.Annotations == nil {
	ns.Annotations = make(map[string]string)
}
var selector []string
for k, v := range selectorMap {
	selector = append(selector, fmt.Sprintf("%s=%s", k, v))
}
ns.Annotations["scheduler.alpha.kubernetes.io/node-selector"] = strings.Join(selector, ",")

return r.Update(context.TODO(), ns, &client.UpdateOptions{})

controllerutil.CreateOrUpdate

# right

var err error

ns := &corev1.Namespace{
	ObjectMeta: metav1.ObjectMeta{
		Name: namespace,
	},
}

_, err = controllerutil.CreateOrUpdate(context.TODO(), r, ns, func() error {
	if ns.Annotations == nil {
		ns.Annotations = make(map[string]string)
	}
	var selector []string
	for k, v := range selectorMap {
		selector = append(selector, fmt.Sprintf("%s=%s", k, v))
	}
	annotationName := "scheduler.alpha.kubernetes.io/node-selector"
	ns.Annotations[annotationName] = strings.Join(selector, ",")
	return nil
})

return err

Resource is there (update)

  1. getting the Resource (from the cache)
  2. mutating it as fn(obj)
  3. performing update

Resource is not there (create)

  1. using the empty object
  2. mutating it as fn(obj)
  3. performing creation

mutateFn benefits

  • Isolating data
  • Wrapping functions (middleware)
  • Accessing data that would otherwise not be available

Do's

Use the mutateFn to encapsulate your resource reconciliation loop.

Don'ts

Don't pass the expected resource and just apply it: it's cache backed!

And don't overwrite the whole object (metadata or status): you'd lose data!

errors.IsNotFound

If objects have been deleted, they're not found! ¯\_(ツ)_/¯

 

We got the Kubernetes Garbage Collector doing the dirty job: just let it go.

 

 

Package: k8s.io/apimachinery/pkg/api/errors

...but if deletion is complex?

Let's use the Kubernetes finalizers: asynchronous pre-deletion hooks.

 

You can set them in .metadata.finalizers

apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-09-07T10:48:53Z"
  name: default
  resourceVersion: "152"
  selfLink: /api/v1/namespaces/default
  uid: 79fb69c3-a81e-42e0-9476-85d6fb247ce4
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
---
apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  name: oil
  finalizers:
  - capsule/close-handler
spec:
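  ...

func (r *CronJobReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
  ctx := context.Background()
  log := r.Log.WithValues("cronjob", req.NamespacedName)

  // allocate the object: passing a nil pointer to Get would fail
  cronJob := &batchv1.CronJob{}
  if err := r.Get(ctx, req.NamespacedName, cronJob); err != nil {
    // NotFound means the object is gone: nothing left to reconcile
    return ctrl.Result{}, client.IgnoreNotFound(err)
  }
  log.V(1).Info("reconciling CronJob")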

  // name of our custom finalizer
  myFinalizerName := "storage.finalizers.tutorial.kubebuilder.io"

  // examine DeletionTimestamp to determine if object is under deletion
  if cronJob.ObjectMeta.DeletionTimestamp.IsZero() {
    // The object is not being deleted, so if it does not have our finalizer,
    // then let's add the finalizer and update the object. This is equivalent
    // to registering our finalizer.
    if !containsString(cronJob.ObjectMeta.Finalizers, myFinalizerName) {
      cronJob.ObjectMeta.Finalizers = append(cronJob.ObjectMeta.Finalizers, myFinalizerName)
      if err := r.Update(context.Background(), cronJob); err != nil {
        return ctrl.Result{}, err
      }
    }
  } else {
    // The object is being deleted
    if containsString(cronJob.ObjectMeta.Finalizers, myFinalizerName) {
      // our finalizer is present, so lets handle any external dependency
      if err := r.deleteExternalResources(cronJob); err != nil {
        // if fail to delete the external dependency here, return with error
        // so that it can be retried
        return ctrl.Result{}, err
      }

      // remove our finalizer from the list and update it.
      cronJob.ObjectMeta.Finalizers = removeString(cronJob.ObjectMeta.Finalizers, myFinalizerName)
      if err := r.Update(context.Background(), cronJob); err != nil {
        return ctrl.Result{}, err
      }
    }

    // Stop reconciliation as the item is being deleted
    return ctrl.Result{}, nil
  }

  // Your reconcile logic

  return ctrl.Result{}, nil
}
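The snippet above relies on two helpers, containsString and removeString, that are not shown on the slide. A minimal version could look like this (a sketch: the signatures are assumed from the way they are called above).

```go
package main

import "fmt"

// containsString reports whether s is present in slice.
func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

// removeString returns a new slice holding every item of slice except s.
func removeString(slice []string, s string) []string {
	result := make([]string, 0, len(slice))
	for _, item := range slice {
		if item != s {
			result = append(result, item)
		}
	}
	return result
}

func main() {
	finalizers := []string{"kubernetes", "capsule/close-handler"}

	fmt.Println(containsString(finalizers, "capsule/close-handler"))

	// once the clean-up is done, drop our finalizer and keep the others
	finalizers = removeString(finalizers, "capsule/close-handler")
	fmt.Println(finalizers)
}
```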

Do's

Use finalizers to handle long-running deletion tasks.

Ignore NotFound errors.

Don'ts

Don't keep requeuing on NotFound errors: you're wasting time and cycles.

controllerutil.SetControllerReference

Sets owner as a Controller OwnerReference on controlled. This is used for garbage collection of the controlled object and for reconciling the owner object on changes to controlled (with a Watch + EnqueueRequestForOwner).

controllerutil.SetOwnerReference

A helper method to make sure the given object contains an object reference to the object provided. This allows you to declare that owner has a dependency on the object without specifying it as a controller. If a reference to the same object already exists, it'll be overwritten with the newly provided version.

_, err := controllerutil.CreateOrUpdate(context.TODO(), r.Client, t, func() (err error) {
	t.ObjectMeta.Labels = map[string]string{
		tl: tenant.Name,
		ll: strconv.Itoa(i),
	}
	t.Spec = spec
	// the Tenant becomes the controller owner of t
	return controllerutil.SetControllerReference(tenant, t, r.Scheme)
})

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
  return ctrl.NewControllerManagedBy(mgr).
    For(&group.Kind{}).
    Owns(&corev1.Namespace{}).
    Owns(&networkingv1.NetworkPolicy{}).
    Owns(&corev1.LimitRange{}).
    Owns(&corev1.ResourceQuota{}).
    Owns(&rbacv1.RoleBinding{}).
    Complete(r)
}

EnqueueRequestForOwner

Do's

Kubernetes GC is so powerful: take advantage of it!

Don'ts

OwnerReference can lead to data loss, use it wisely!
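Why data loss? Deleting the owner takes every dependent with it. The sketch below is a simplified, in-memory model of that cascade: Object and cascadeDelete are hypothetical names, and the real garbage collector handles far more (multiple owners, deletion policies).

```go
package main

import "fmt"

// Object is a hypothetical, flattened view of a resource and its owner.
type Object struct {
	Kind, Name, UID string
	OwnerUID        string // empty when the object has no owner
}

// cascadeDelete removes the owner identified by ownerUID together with
// every object that depends on it, directly or transitively.
func cascadeDelete(objects []Object, ownerUID string) (kept []Object, deleted []string) {
	gone := map[string]bool{ownerUID: true}
	// keep marking dependents until a full pass finds nothing new
	for changed := true; changed; {
		changed = false
		for _, o := range objects {
			if !gone[o.UID] && gone[o.OwnerUID] {
				gone[o.UID] = true
				changed = true
			}
		}
	}
	for _, o := range objects {
		if gone[o.UID] {
			deleted = append(deleted, o.Kind+"/"+o.Name)
		} else {
			kept = append(kept, o)
		}
	}
	return kept, deleted
}

func main() {
	objects := []Object{
		{Kind: "Tenant", Name: "oil", UID: "57987c02"},
		{Kind: "Namespace", Name: "oil-development", UID: "a1", OwnerUID: "57987c02"},
		{Kind: "NetworkPolicy", Name: "capsule-oil-0", UID: "b2", OwnerUID: "57987c02"},
		{Kind: "Tenant", Name: "gas", UID: "c3"},
	}
	kept, deleted := cascadeDelete(objects, "57987c02")
	fmt.Println("deleted:", deleted)
	fmt.Println("kept:", kept)
}
```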

retry.RetryOnConflict

Used to make an update to a resource when you have to worry about conflicts caused by other code making unrelated updates to the resource at the same time.

fn should fetch the resource to be modified, make appropriate changes to it, try to update it, and return (unmodified) the error from the update function.

On a successful update, RetryOnConflict will return nil. If the update function returns a "Conflict" error, RetryOnConflict will wait some amount of time as described by backoff, and then try again.

On a non-"Conflict" error, or if it retries too many times and gives up, RetryOnConflict will return an error to the caller.


Package
k8s.io/client-go/util/retry
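The mechanics can be shown with a tiny model of optimistic concurrency. Everything here is an assumption made for illustration: Resource, Store and retryOnConflict are hypothetical, the version counter stands in for the resourceVersion, and the backoff wait is left out.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: the resource was modified")

// Resource is a hypothetical object carrying a version counter.
type Resource struct {
	Version int
	Labels  map[string]string
}

// Store is an in-memory stand-in for the API server.
type Store struct{ current Resource }

// Get returns a deep copy of the stored resource.
func (s *Store) Get() Resource {
	labels := make(map[string]string, len(s.current.Labels))
	for k, v := range s.current.Labels {
		labels[k] = v
	}
	return Resource{Version: s.current.Version, Labels: labels}
}

// Update succeeds only when the caller worked on the latest version.
func (s *Store) Update(r Resource) error {
	if r.Version != s.current.Version {
		return errConflict
	}
	r.Version++
	s.current = r
	return nil
}

// retryOnConflict re-runs fn while it reports a conflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err // nil or a non-conflict error: stop retrying
		}
	}
	return err
}

func main() {
	s := &Store{current: Resource{Version: 1, Labels: map[string]string{}}}
	attempt := 0
	err := retryOnConflict(5, func() error {
		attempt++
		found := s.Get() // always re-read inside the retried function
		if attempt == 1 {
			// simulate another writer sneaking in between Get and Update
			other := s.Get()
			other.Labels["writer"] = "other"
			_ = s.Update(other)
		}
		found.Labels["capsule.clastix.io/tenant"] = "oil"
		return s.Update(found)
	})
	fmt.Println(err, attempt, s.current.Labels)
}
```

The key point is that the fetch lives inside the retried function: retrying with a stale copy would conflict forever.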

retry.RetryOnConflict(retry.DefaultBackoff, func() error {
  // Retrieving from the cache the actual ResourceQuota
  found := &corev1.ResourceQuota{}
  if err := r.Get(context.Background(), types.NamespacedName{Namespace: rq.Namespace, Name: rq.Name}, found); err != nil {
    // a non-Conflict error makes RetryOnConflict stop and return it
    return err
  }
  // Ensuring annotation map is there to avoid uninitialized map error and
  // assigning the overall usage
  if found.Annotations == nil {
    found.Annotations = make(map[string]string)
  }
  found.Labels = rq.Labels
  found.Annotations[capsulev1alpha1.UsedQuotaFor(resourceName)] = qt.String()
  // Updating the Resource according to the qt.Cmp result
  found.Spec.Hard = rq.Spec.Hard
  return r.Update(context.Background(), found, &client.UpdateOptions{})
})

RetryOnConflict in action

Combine RetryOnConflict with CreateOrUpdate

tls := &corev1.Secret{}
_ = r.Get(context.TODO(), types.NamespacedName{
  Namespace: r.Namespace,
  Name:      tlsSecretName,
}, tls)
		
err = retry.RetryOnConflict(retry.DefaultBackoff, func() error {
  _, err = controllerutil.CreateOrUpdate(context.Background(), r.Client, tls, func() error {
    tls.Data = map[string][]byte{}
    return nil
  })
  return err
})

if err != nil {
  return err
}

client.Client

Not a versioned clientset: it uses the NamespacedName struct to interact with any versioned API, CRDs included.

 

Combined with the Controller Manager, reads are backed by a local cache.


Package
sigs.k8s.io/controller-runtime/pkg/client

ns := &corev1.Namespace{}
if err := r.Client.Get(context.TODO(), types.NamespacedName{Name: namespace}, ns); err != nil {
	return err
}

Retrieving any object

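import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"

	capsulev1alpha1 "github.com/clastix/capsule/api/v1alpha1"
)

// scheme maps Go types to their GroupVersionKind
var scheme = runtime.NewScheme()

func init() {
	// register the built-in Kubernetes types and the Capsule CRD types
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(capsulev1alpha1.AddToScheme(scheme))
}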

// and avoid the GVK pain with code-generator!

Also CRDs... but first:

client.FieldIndexer

The FieldIndexer knows how to index over a particular field such that it can later be used by a field selector.

 

Tl;dr: don't scan lists in O(n), look them up through the Indexer in O(1)


Package
sigs.k8s.io/controller-runtime/pkg/client
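To see what the index buys you, compare a linear scan with a map lookup. This is a stdlib-only sketch: Tenant, scan and buildIndex are hypothetical names, and the indexer function plays the role that an indexing function plays for the FieldIndexer.

```go
package main

import "fmt"

// Tenant is a hypothetical, reduced version of the Capsule Tenant.
type Tenant struct {
	Name       string
	Namespaces []string
}

// scan walks every tenant and every namespace: O(n) per lookup.
func scan(tenants []Tenant, namespace string) []string {
	var owners []string
	for _, t := range tenants {
		for _, ns := range t.Namespaces {
			if ns == namespace {
				owners = append(owners, t.Name)
			}
		}
	}
	return owners
}

// Index maps an indexed value (here a namespace) to the tenants exposing it.
type Index map[string][]string

// buildIndex runs the indexer once per object and stores the result in a map.
func buildIndex(tenants []Tenant, indexer func(Tenant) []string) Index {
	idx := Index{}
	for _, t := range tenants {
		for _, key := range indexer(t) {
			idx[key] = append(idx[key], t.Name)
		}
	}
	return idx
}

func main() {
	tenants := []Tenant{
		{Name: "oil", Namespaces: []string{"oil-development", "oil-staging", "oil-production"}},
		{Name: "gas", Namespaces: []string{"gas-development", "gas-staging", "gas-production"}},
	}

	fmt.Println(scan(tenants, "gas-staging")) // walks all the tenants

	idx := buildIndex(tenants, func(t Tenant) []string { return t.Namespaces })
	fmt.Println(idx["gas-staging"]) // single map lookup
}
```

The price is paid up front and in memory: the map has to be built and kept around.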

type CustomIndexer interface {
  Object() runtime.Object
  Field() string
  Func() client.IndexerFunc
}

type NamespacesReference struct {
}

func (o NamespacesReference) Object() runtime.Object {
	return &v1alpha1.Tenant{}
}

func (o NamespacesReference) Field() string {
	return ".status.namespaces"
}

func (o NamespacesReference) Func() client.IndexerFunc {
	return func(object runtime.Object) (res []string) {
		tenant := object.(*v1alpha1.Tenant)
		return tenant.Status.Namespaces.DeepCopy()
	}
}

// register the Indexer to the manager

var AddToIndexerFuncs []CustomIndexer

func AddToManager(mgr manager.Manager) error {
  for _, f := range AddToIndexerFuncs {
    err := mgr.GetFieldIndexer().IndexField(context.TODO(), f.Object(), f.Field(), f.Func())
    if err != nil {
      return err
    }
  }
  return nil
}


// retrieve from the manager Client

tl := &v1alpha1.TenantList{}
if err := c.List(ctx, tl, client.MatchingFieldsSelector{
	Selector: fields.OneTermEqualSelector(".status.namespaces", object.Namespace()),
}); err != nil {
	return err
}

if len(tl.Items) == 0 {
	return fmt.Errorf("no tenants are handling the %s Namespace", object.Namespace())
}

Do's

Useful for simple entries.

Don'ts

Doesn't work well with complex types: the indexing logic quickly becomes hard to understand.

Be aware of memory consumption: the indexes are stored in in-memory maps.

manager.Runnable

If you need to run something after Manager start-up, you have to implement the Runnable interface.


Package
sigs.k8s.io/controller-runtime/pkg/manager

// Runnable allows a component to be started.
// It's very important that Start blocks until
// it's done running.
type Runnable interface {
	// Start starts running the component.  The component will stop running
	// when the channel is closed.  Start blocks until the channel is closed
	// or an error occurs.
	Start(<-chan struct{}) error
}
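Here is a minimal sketch of a component honouring that contract, runnable without a Manager. The ticker type and the runFor helper are hypothetical: runFor just plays the part of the Manager, closing the stop channel after a while.

```go
package main

import (
	"fmt"
	"time"
)

// Runnable mirrors the interface shown above.
type Runnable interface {
	Start(<-chan struct{}) error
}

// ticker is a hypothetical component doing periodic work.
type ticker struct {
	interval time.Duration
	ticks    int
}

// Start blocks until the stop channel is closed, as the contract requires.
func (t *ticker) Start(stop <-chan struct{}) error {
	tk := time.NewTicker(t.interval)
	defer tk.Stop()
	for {
		select {
		case <-tk.C:
			t.ticks++ // the periodic work goes here
		case <-stop:
			return nil
		}
	}
}

// runFor starts r, closes the stop channel after d and waits for Start to return.
func runFor(r Runnable, d time.Duration) error {
	stop := make(chan struct{})
	done := make(chan error, 1)
	go func() { done <- r.Start(stop) }()
	time.Sleep(d)
	close(stop)
	return <-done
}

func main() {
	t := &ticker{interval: 10 * time.Millisecond}
	err := runFor(t, 100*time.Millisecond)
	fmt.Println("stopped, error:", err, "ticked at least once:", t.ticks > 0)
}
```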

Our problem

Capsule intercepts user requests and filters them by Organization/Group, which can be set at runtime via a CLI flag: this means editing all the RBAC settings.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: capsule-namespace-provisioner
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: capsule-namespace-provisioner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: capsule-namespace-provisioner
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: capsule.clastix.io
  # ^^^ this is the Group/Org name

func (r *Manager) Start(<-chan struct{}) (err error) {
  for roleName := range clusterRoles {
    if err = r.EnsureClusterRole(roleName); err != nil {
      return
    }
  }
  err = r.EnsureClusterRoleBinding()
  return
}

// but it could run anything you want, like an HTTP server!
func (n kubeFilter) Start(stop <-chan struct{}) error {
  http.HandleFunc("/api/v1/namespaces", func(writer http.ResponseWriter, request *http.Request) {
    if request.Method == "GET" || n.isWatchEndpoint(request) {
      if err := n.decorateRequest(writer, request); err != nil {
        n.handleError(err, writer)
        return
      }
    }
    n.reverseProxyFunc(writer, request)
  })
  http.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
    n.reverseProxyFunc(writer, request)
  })
  http.HandleFunc("/_healthz", func(writer http.ResponseWriter, request *http.Request) {
    writer.WriteHeader(200)
    _, _ = writer.Write([]byte("ok"))
  })
  go func() {
    if err := http.ListenAndServe(fmt.Sprintf("0.0.0.0:%d", n.listeningPort), nil); err != nil {
      panic(err)
    }
  }()
  <-stop
  return nil
}

predicate.Predicate

Predicate filters events before enqueuing the keys. 


Package
sigs.k8s.io/controller-runtime/pkg/predicate

type Predicate interface {
	// Create returns true if the Create event should be processed
	Create(event.CreateEvent) bool

	// Delete returns true if the Delete event should be processed
	Delete(event.DeleteEvent) bool

	// Update returns true if the Update event should be processed
	Update(event.UpdateEvent) bool

	// Generic returns true if the Generic event should be processed
	Generic(event.GenericEvent) bool
}
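What "filtering before enqueuing" means can be modelled in a few lines. This is a stdlib-only sketch with hypothetical names (Event, Predicate, byName, enqueue), assuming an event is enqueued only when every predicate accepts it.

```go
package main

import "fmt"

// Event is a hypothetical, reduced event: its type and the object name.
type Event struct {
	Type string // "create", "update", "delete" or "generic"
	Name string
}

// Predicate decides whether an event deserves a reconciliation.
type Predicate func(Event) bool

// byName accepts only the events regarding the desired object.
func byName(desired string) Predicate {
	return func(e Event) bool { return e.Name == desired }
}

// enqueue returns the keys of the events accepted by all the predicates.
func enqueue(events []Event, predicates ...Predicate) []string {
	var queue []string
	for _, e := range events {
		allowed := true
		for _, p := range predicates {
			if !p(e) {
				allowed = false
				break
			}
		}
		if allowed {
			queue = append(queue, e.Type+":"+e.Name)
		}
	}
	return queue
}

func main() {
	events := []Event{
		{Type: "create", Name: "capsule-tls"},
		{Type: "update", Name: "some-other-secret"},
		{Type: "update", Name: "capsule-tls"},
	}
	// only the events for the Secret we care about reach the queue
	fmt.Println(enqueue(events, byName("capsule-tls")))
}
```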

Our problem

Capsule (ab)uses Dynamic Admission Control resources (aka webhooks), which must be secured with TLS: we need a CA and a certificate, each stored in a Kubernetes Secret resource.

// Certificate
func (r *TlsReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}, forOptionPerInstanceName(tlsSecretName)).
		Complete(r)
}

// CA
func (r *CaReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}, forOptionPerInstanceName(caSecretName)).
		Complete(r)
}



// For defines the type of Object being *reconciled*, and configures the ControllerManagedBy
// to respond to create / delete / update events by *reconciling the object*.
// This is the equivalent of calling
// Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{})
func (blder *Builder) For(object runtime.Object, opts ...ForOption) *Builder {
	input := ForInput{object: object}
	for _, opt := range opts {
		opt.ApplyToFor(&input)
	}

	blder.forInput = input
	return blder
}

func forOptionPerInstanceName(instanceName string) builder.ForOption {
	return builder.WithPredicates(predicate.Funcs{
		CreateFunc: func(event event.CreateEvent) bool {
			return filterByName(event.Meta.GetName(), instanceName)
		},
		DeleteFunc: func(deleteEvent event.DeleteEvent) bool {
			return filterByName(deleteEvent.Meta.GetName(), instanceName)
		},
		UpdateFunc: func(updateEvent event.UpdateEvent) bool {
			return filterByName(updateEvent.MetaNew.GetName(), instanceName)
		},
		GenericFunc: func(genericEvent event.GenericEvent) bool {
			return filterByName(genericEvent.Meta.GetName(), instanceName)
		},
	})
}

func filterByName(objName, desired string) bool {
	return objName == desired
}

Final thoughts
