Cluster API Runtime SDK

CAPI Runtime
Runtime SDK

Runtime Hook
Runtime Extension

Motivation

With the growing adoption of Cluster API as a common layer to manage fleets of Kubernetes Clusters, there is now a new category of systems, products and services built on top of Cluster API that require strict interactions with the lifecycle of Clusters. At the same time, they do not want to replace any “low-level” components in Cluster API, because they benefit from all the features available in the existing providers (built on top vs. plug-in/swap).

CAPI follows an extensibility model based on providers:

  • bootstrap providers
  • control plane providers
  • external remediation providers
  • infrastructure providers

A common approach to this problem has been to watch Cluster API resources; another has been to implement API Server admission webhooks to alter CAPI resources. Both approaches are limited by the fact that a system built on top of Cluster API is forced to treat it as an opaque system, with limited visibility and an almost total lack of control. For example, you can watch a Machine being provisioned, but you cannot block the provisioning from starting if a quota management system signals that you have exhausted all the resources assigned to you.


An example implemented and used by CCD is the machine deletion webhook.

Goals

  • To define the rules ensuring Runtime Hooks can evolve over time
  • To define the fundamental capabilities/tooling to be implemented in CAPI in order to allow the implementation of Runtime Hooks
  • To provide an initial set of guidelines for Runtime Extension developers
  • To define how external Runtime Extensions can be registered within the Cluster API Runtime

Future work

  1. Identify and specify the list of Runtime Hooks to be implemented; this will be addressed iteratively by a set of future proposals, all of them building on top of the foundational capabilities introduced by this document
  2. Eventually consider deprecation of machine deletion hooks and replacement with a Runtime Hook
  3. Improve the Runtime Extension developer guide based on experience and feedback
  4. Add metrics about Runtime Extension calls (usage, usage vs. deprecated versions, duration, error rate, etc.)

User stories

  1. As a developer building systems on top of Cluster API, I would like Runtime Extensions to provide a certain degree of control over the Cluster’s lifecycle, e.g. blocking or deferring the start of an operation
  2. As a developer building systems on top of Cluster API, I would like to implement a Runtime Extension in a simple way (simpler than writing controllers)

Runtime Hooks vs K8s admission webhook

Runtime Hooks are inspired by Kubernetes admission webhooks, but there is one key difference that sets them apart:

  • Admission webhooks are strictly linked to Kubernetes API Server/etcd CRUD operations, e.g. Create or Update of a Cluster in etcd.
  • Runtime Hooks can be used to define arbitrary operations, e.g. Cluster.BeforeUpgrade, Machine.Remediate, etc.

Each Runtime Hook will be defined by one (or more) RESTful APIs implemented as a POST operation; each operation receives an input parameter as the request body and returns an output value as the response body, both application/json encoded and with a schema of arbitrary complexity that should be considered an integral part of the Runtime Hook definition.
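To make the request/response contract concrete, here is a minimal Go sketch of an HTTP handler for a single Runtime Hook operation. The `BeforeUpgradeRequest` and `BeforeUpgradeResponse` types, their fields and the URL path are hypothetical and only for illustration; the real schema is whatever the Runtime Hook definition specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// BeforeUpgradeRequest is a hypothetical request body; the real schema is
// part of the Runtime Hook definition.
type BeforeUpgradeRequest struct {
	ClusterName string `json:"clusterName"`
	ToVersion   string `json:"toVersion"`
}

// BeforeUpgradeResponse is a hypothetical response body.
type BeforeUpgradeResponse struct {
	Allowed bool   `json:"allowed"`
	Message string `json:"message,omitempty"`
}

// decide holds the extension's logic, kept apart from the HTTP plumbing.
func decide(req BeforeUpgradeRequest) BeforeUpgradeResponse {
	if req.ToVersion == "" {
		return BeforeUpgradeResponse{Allowed: false, Message: "toVersion is required"}
	}
	return BeforeUpgradeResponse{Allowed: true}
}

// handleBeforeUpgrade accepts a JSON POST and replies with a JSON body.
func handleBeforeUpgrade(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "only POST is supported", http.StatusMethodNotAllowed)
		return
	}
	var req BeforeUpgradeRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	// Encoding errors are ignored here for brevity.
	_ = json.NewEncoder(w).Encode(decide(req))
}

func main() {
	// Exercise the handler in-process instead of opening a real listener.
	body := strings.NewReader(`{"clusterName":"demo","toVersion":"v1.30.0"}`)
	req := httptest.NewRequest(http.MethodPost, "/hooks/beforeUpgrade", body)
	rec := httptest.NewRecorder()
	handleBeforeUpgrade(rec, req)
	fmt.Print(rec.Code, " ", rec.Body.String())
}
```

Keeping `decide` free of HTTP types makes the hook logic easy to unit test, while the handler only deals with decoding and encoding JSON.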


It is also worth noting that more than one version of the same Runtime Hook might be supported at the same time; e.g. the Cluster.BeforeUpgrade operation might exist in version v1alpha1 (old version) and v1alpha2 (current).
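One way to picture several versions of a hook being served side by side is a small lookup table keyed by API version and hook name. The Go sketch below is illustrative only: the `hookKey` and `catalog` types and the path layout are assumptions, not part of the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// hookKey identifies one version of a Runtime Hook (hypothetical type).
type hookKey struct {
	APIVersion string
	Hook       string
}

// catalog maps each served hook version to a "deprecated" flag.
type catalog map[hookKey]bool

// path returns an assumed URL path for a served hook version.
func (c catalog) path(k hookKey) (string, bool) {
	if _, ok := c[k]; !ok {
		return "", false
	}
	return "/" + k.APIVersion + "/" + k.Hook, true
}

// versions lists, in sorted order, every API version serving the given hook.
func (c catalog) versions(hook string) []string {
	var out []string
	for k := range c {
		if k.Hook == hook {
			out = append(out, k.APIVersion)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	const group = "cluster.runtime.cluster.x-k8s.io"
	c := catalog{
		{group + "/v1alpha1", "beforeUpgrade"}: true,  // old version, still served
		{group + "/v1alpha2", "beforeUpgrade"}: false, // current version
	}
	fmt.Println(c.versions("beforeUpgrade"))
}
```

Tracking a deprecation flag per version would also make it straightforward to count calls against deprecated versions later on.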

Runtime SDK rules

As a developer building systems on top of Cluster API, if you want to interact with the Cluster’s lifecycle via a Runtime Extension, you are required to implement an HTTP server handling requests according to the OpenAPI specification for the Runtime Hook you are interested in.

Runtime Extensions developer guide

A Runtime Extension can be deployed in the Management Cluster by:

  • Packaging the Runtime Extension in a container image;
  • Using a Kubernetes Deployment to run the above container inside the Management Cluster;
  • Using a Cluster IP service to make the Runtime Extension instances accessible via a stable DNS name;
  • Using a cert-manager generated Certificate to protect the endpoint.
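On the extension side, the last bullet amounts to serving HTTPS with the certificate and key that cert-manager writes into a Secret (keys `tls.crt` and `tls.key`). The following Go sketch assumes the Secret is mounted at a hypothetical `/etc/runtime-extension/certs` directory and uses port 8082 purely as an example; neither is prescribed here.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// certDir is a hypothetical mount point for the cert-manager generated Secret.
const certDir = "/etc/runtime-extension/certs"

// newServer builds an HTTPS server with conservative defaults.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 5 * time.Second,
		TLSConfig:         &tls.Config{MinVersion: tls.VersionTLS12},
	}
}

func main() {
	certFile := filepath.Join(certDir, "tls.crt")
	keyFile := filepath.Join(certDir, "tls.key")
	// Outside the cluster the Secret is not mounted, so exit cleanly.
	if _, err := os.Stat(certFile); err != nil {
		log.Printf("certificate not mounted, not starting: %v", err)
		return
	}
	mux := http.NewServeMux() // hook handlers would be registered here
	log.Fatal(newServer(":8082", mux).ListenAndServeTLS(certFile, keyFile))
}
```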

 

Other options include:

  • deploying the HTTP Server as a part of another component, e.g. a controller;
  • deploying the HTTP Server outside of the Management Cluster.

Deploy Runtime Extensions

By registering a Runtime Extension, the Cluster API Runtime becomes aware of a Runtime Extension implementing a Runtime Hook; as a consequence, the runtime starts calling the extension at well-defined moments of the Workload Cluster’s lifecycle.

 

Register Runtime Extensions

apiVersion: cluster.x-k8s.io/v1beta1 
kind: RuntimeExtensionConfiguration
metadata:
  name: "my-amazing-product-runtime-extension"
webhooks:

  # Name should be fully qualified and unique in the Cluster,
  # thus usage of sub domains, version qualifiers is recommended.
- name: "my-amazing-runtime-extension.v1.5.panda.com" 
 
  # List of group/version/hook the RuntimeExtension implements. Required
  operations:
  - apiVersion: "cluster.runtime.cluster.x-k8s.io/v1alpha2" 
    hook: "beforeUpgrade" 
  - apiVersion: "cluster.runtime.cluster.x-k8s.io/v1alpha2" 
    hook: "afterUpgrade" 
 
  # ClientConfig defines how to communicate with the RuntimeExtension. Required
  clientConfig:
    # `url` gives the location of the RuntimeExtension, in standard URL
    # form (`scheme://host:port/path`). Exactly one of `url` or `service` must be specified.
    url: "..."
    # `service` is a reference to the Service for the RuntimeExtension; both
    # fields appear here only to document them.
    service:
      namespace: "example-namespace"
      name: "example-service"
      # `path` is an optional path prefix which can be sent in any
      # request to this service.
      path: "runtime-extensions/"
      # If specified, the port on the service that hosts the RuntimeExtension.
      # Default to 443. `port` should be a valid port number (1-65535, inclusive)
      # or a port name of the referenced service.
      port: 8082
    caBundle: "..."
  
  # If specified, define the timeout for each RuntimeExtension call to
  # complete (Default 5s, Max 10s).
  timeoutSeconds: 2
  
  # FailurePolicy defines how errors from RuntimeExtension calls are
  # handled - allowed values are Ignore or Fail. Defaults to Fail.
  failurePolicy: Fail 
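How the Cluster API Runtime consumes these fields is not spelled out above; the Go sketch below shows one plausible reading of the `service`, `timeoutSeconds` and `failurePolicy` comments. All function names are hypothetical, the `<name>.<namespace>.svc` host is the usual in-cluster Service DNS form, and only numeric ports are handled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errUnreachable is a sample error used to demonstrate the failure policy.
var errUnreachable = errors.New("extension unreachable")

// serviceURL builds an in-cluster HTTPS URL from a service reference.
// A port of 0 means "unset" and falls back to 443.
func serviceURL(namespace, name string, port int, path string) string {
	if port == 0 {
		port = 443
	}
	return fmt.Sprintf("https://%s.%s.svc:%d/%s", name, namespace, port, strings.TrimPrefix(path, "/"))
}

// callTimeout resolves timeoutSeconds: 0 means unset (default 5s), max 10s.
func callTimeout(timeoutSeconds int) (time.Duration, error) {
	if timeoutSeconds == 0 {
		return 5 * time.Second, nil
	}
	if timeoutSeconds < 0 || timeoutSeconds > 10 {
		return 0, fmt.Errorf("timeoutSeconds must be between 1 and 10, got %d", timeoutSeconds)
	}
	return time.Duration(timeoutSeconds) * time.Second, nil
}

// applyFailurePolicy decides whether a failed call blocks the operation.
func applyFailurePolicy(policy string, callErr error) error {
	if callErr == nil {
		return nil
	}
	switch policy {
	case "Ignore":
		return nil // swallow the error and let the operation proceed
	case "", "Fail":
		return callErr // Fail is the default
	default:
		return fmt.Errorf("unknown failurePolicy %q", policy)
	}
}

func main() {
	fmt.Println(serviceURL("example-namespace", "example-service", 8082, "runtime-extensions/"))
	timeout, _ := callTimeout(2)
	fmt.Println(timeout)
	fmt.Println(applyFailurePolicy("Ignore", errUnreachable))
}
```

With `failurePolicy: Fail`, as in the configuration above, an unreachable extension would block the operation; `Ignore` trades that safety for availability.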

By fmuyassarov
