Osh + DRA .....

Osh without DRA

There’s one type of osh for everyone. You only request how much you want.
You get what’s cooked: "Give me one portion."
You can’t ask for:
- extra meat
- less oil
- to'y (wedding) or choyxona (teahouse) osh

Osh with DRA

You request exactly what you want.
You can customize based on your taste.
"Give me one portion with:"
- extra meat
- less oil
- eggs

Before

You ask for: nvidia.com/gpu: 1

resources:
  limits:
    nvidia.com/gpu: 1

You could not say: "I need a GPU with these capabilities, prepared in this way."

After

Describe what you actually need. You can ask for:

nvidia.com/gpu: 1
product ID: A100-SXM4-40GB
memory: 40 GB
cores: 3456 FP64

# Simplified illustration of the idea, not the exact ResourceClaim schema
spec:
  requirements:
    - deviceClassName: gpu
      selectors:
        - name: model
          value: A100
        - name: memory
          value: "40Gi"
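
For comparison, the same request could be written against the resource.k8s.io/v1 API roughly as follows. This is a sketch under assumptions: the claim name, the DeviceClass name gpu.nvidia.com, and the attribute and capacity names productName and memory are illustrative. The real names depend on what the installed driver publishes in its ResourceSlices.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-a100                      # hypothetical claim name
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com  # assumed DeviceClass name
        selectors:
        - cel:
            # productName and memory are assumed names; check the
            # driver's ResourceSlices for what it actually exposes.
            expression: |-
              device.attributes["gpu.nvidia.com"].productName == "A100-SXM4-40GB" &&
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

The selector is a CEL expression over the attributes and capacities that the driver advertises, so anything the driver exposes can become part of the request.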

Roles

- Driver developer
- Cluster admin
- Application developer/DevOps

Driver developer

Someone who understands how a piece of hardware works and writes the software that controls and allocates that hardware.

- Decides what attributes of the hardware to expose to DRA
- Implements the interfaces that configure node resources on the fly


Cluster admin

Installs the DRA driver, sets up DeviceClasses, and configures nodes (e.g., attaching GPUs) so workloads can use the hardware.

App developer/DevOps

Someone who knows the application's needs and defines its resource requirements as ResourceClaims.

Let's play those roles.

We’re not writing the driver today :)

Cluster admin

- Install the driver
- Analyze the resources
- Create DeviceClasses

helm install dra-driver-pizza....

kubectl get resourceSlices
NAME                                         NODE                 DRIVER              POOL                 AGE
kind-control-plane-dra.pizza-9q2ls           kind-control-plane   dra.pizza           kind-control-plane   20h

apiVersion: pizza.kitchen/v1
kind: ResourceSlice
metadata:
  name: pizzahut-matinkyla
spec:
  pizzas:
  - name: margherita-pan-pizza
    attributes:
      kitchen.pizza.example/dough:
        string: pan
      kitchen.pizza.example/sauce:
        string: tomato
      kitchen.pizza.example/cheese:
        string: mozzarella
      kitchen.pizza.example/toppings:
        string: "basil"
      kitchen.pizza.example/extraCheeseAvailable:
        bool: true
      kitchen.pizza.example/availableSlices:
        string: "4,6,8"
  - name: pepperoni-pan-pizza
    attributes:
      kitchen.pizza.example/dough:
        string: pan
      kitchen.pizza.example/sauce:
        string: tomato
      kitchen.pizza.example/cheese:
        string: mozzarella
      kitchen.pizza.example/toppings:
        string: "pepperoni"
      kitchen.pizza.example/extraCheeseAvailable:
        bool: true
      kitchen.pizza.example/availableSlices:
        string: "6,8"

Cluster admin: create DeviceClasses (for device filtering)

For example, DeviceClasses that select x86_64 CPUs by vendor:

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: amd-cpu
spec:
  selectors:
  - cel:
      expression: |
        device.attributes["hardware.cpu"].architecture == "x86_64" &&
        device.attributes["hardware.cpu"].vendor == "amd"

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: intel-cpu
spec:
  selectors:
  - cel:
      expression: |
        device.attributes["hardware.cpu"].architecture == "x86_64" &&
        device.attributes["hardware.cpu"].vendor == "intel"

Cluster admin: DeviceClasses

When a user creates a ResourceClaim, they don’t pick devices directly; they reference a DeviceClass, which selects devices by attribute (for CPUs, for example, by architecture such as x86_64 or ARM64).

"I want a vegetarian pizza with mushrooms, basil, and extra cheese"

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: pizza-vegetarian
...
spec:
  selectors:
  - cel:
      expression: |
        device.attributes["kitchen.pizza.example"].cheese == "mozzarella" &&
        device.attributes["kitchen.pizza.example"].toppings.contains("mushroom") &&
        device.attributes["kitchen.pizza.example"].extraCheeseAvailable == true



apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.amd.com
spec:
  selectors:
  - cel: 
      expression: "device.driver == 'gpu.amd.com'"

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mig.nvidia.com
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type == 'mig'"

DevOps/app dev

- Is given access to a cluster
- Has DeviceClasses that describe categories of devices
- Creates ResourceClaims (the resources a workload needs) and attaches them to the workload (Pod)
- DeviceClass CEL: a filter over the available devices
- ResourceClaim CEL: workload-specific hardware attributes. From the acceptable devices, which exact ones do I want right now?

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: my-pizza-order
spec:
  devices:
    requests:
    - name: pizza
      exactly:
        deviceClassName: pizza-vegetarian
        selectors:
        - cel:
            expression: |-
              device.attributes["kitchen.pizza.example"].cheese == "mozzarella" &&
              device.attributes["kitchen.pizza.example"].toppings.contains("mushroom") &&
              device.attributes["kitchen.pizza.example"].extraCheeseAvailable == true


apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: claim-cpu-capacity-20
spec:
  devices:
    requests:
    - name: numa0-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID == 0
    - name: numa1-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID == 1
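
To attach a claim to a workload, the Pod lists it under spec.resourceClaims and each container that needs the devices refers to that entry by name. A minimal sketch, assuming the claim above exists in the same namespace; the Pod name, the local claim name cpus, and the image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-aware-app                        # hypothetical Pod name
spec:
  resourceClaims:
  - name: cpus                                # name used inside this Pod only
    resourceClaimName: claim-cpu-capacity-20  # the ResourceClaim shown above
  containers:
  - name: app
    image: registry.example.com/app:latest    # placeholder image
    resources:
      claims:
      - name: cpus                            # must match spec.resourceClaims[].name
```

The Pod stays Pending until the scheduler can allocate every request in the claim on a single node.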

Pod: references a ResourceClaim
ResourceClaim: DeviceClass + CEL constraints
DeviceClass: cluster-wide device filter
ResourceSlice: devices advertised by the DRA driver

1. Scheduler: matches the claim to a node
2. ResourceClaim is allocated: its status is updated by the scheduler
3. DRA driver on the node: NodePrepareResources is called
4. Kubelet mounts the devices into the Pod sandbox
5. Pod starts: device ready, CDI injected

DRA isn't only about GPUs

CPU, memory, hugepages, NICs, RDMA network devices, etc.

Migration from Device plugins to DRA

DRA

By fmuyassarov
