Osh + DRA

Osh without DRA

There’s one type of osh for everyone. You only request how much you want.

You get what’s cooked: "Give me one portion."

  • You can’t ask for
    • extra meat
    • less oil
    • to'y (wedding) or choyxona (teahouse) osh

Osh with DRA

You request exactly what you want. You can customize based on your taste.

"Give me one portion with:"

  • extra meat
  • less oil
  • eggs

Before

You ask for: nvidia.com/gpu: 1

You could not say:

  • I need a GPU
  • with these capabilities
  • prepared in this way

After

Describe what you actually need...

You can ask for:

  • nvidia.com/gpu: 1
  • product ID: A100-SXM4-40GB
  • memory: 40 GB
  • cores: 3456 FP64

Before, in a Pod spec (a count only):

resources:
  limits:
    nvidia.com/gpu: 1

After (simplified sketch, not literal API syntax):

spec:
  requirements:
    - deviceClassName: gpu
      selectors:
        - name: model
          value: A100
        - name: memory
          value: "40Gi"
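
For comparison, here is a sketch of how the simplified "After" request might look as an actual ResourceClaim. The claim name, the DeviceClass name gpu.nvidia.com, the attribute productName, and the capacity memory are assumptions about what a GPU driver publishes; check the ResourceSlices in your own cluster before relying on them.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: a100-40gb            # hypothetical claim name
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        # Assumed DeviceClass name; use whatever your cluster admin created.
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Attribute and capacity names are assumptions, not confirmed by this deck.
            expression: |-
              device.attributes["gpu.nvidia.com"].productName.contains("A100") &&
              device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0
```

The count is implicit here: an "exactly" request asks for one device unless a count is given.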

Roles

  • Driver developer
  • Cluster admin
  • Application developer/DevOps

Driver developer

... is someone who understands how a piece of hardware works and writes the software that controls and allocates that hardware.

  • decides which attributes of the hardware to expose to DRA
  • implements the interfaces that configure node resources on the fly


Cluster admin

... is the person who installs the DRA driver, sets up device classes, and configures nodes (e.g., attaching GPUs) so workloads can use the hardware.

 

Application developer/DevOps

... is someone who knows what the application needs and defines its resource requirements as ResourceClaims.

Let's play those roles.

We’re not writing the driver today :)

Cluster admin

  • Install the driver
  • Analyze the resources
  • Create deviceClasses

helm install dra-driver-pizza....

kubectl get resourceSlices
NAME                                         NODE                 DRIVER              POOL                 AGE
kind-control-plane-dra.pizza-9q2ls           kind-control-plane   dra.pizza           kind-control-plane   20h

In pizza terms, a ResourceSlice looks like this:

apiVersion: pizza.kitchen/v1
kind: ResourceSlice
metadata:
  name: pizzahut-matinkyla
spec:
  pizzas:
  - name: margherita-pan-pizza
    attributes:
      kitchen.pizza.example/dough:
        string: pan
      kitchen.pizza.example/sauce:
        string: tomato
      kitchen.pizza.example/cheese:
        string: mozzarella
      kitchen.pizza.example/toppings:
        string: "basil"
      kitchen.pizza.example/extraCheeseAvailable:
        bool: true
      kitchen.pizza.example/availableSlices:
        string: "4,6,8"
  - name: pepperoni-pan-pizza
    attributes:
      kitchen.pizza.example/dough:
        string: pan
      kitchen.pizza.example/sauce:
        string: tomato
      kitchen.pizza.example/cheese:
        string: mozzarella
      kitchen.pizza.example/toppings:
        string: "pepperoni"
      kitchen.pizza.example/extraCheeseAvailable:
        bool: true
      kitchen.pizza.example/availableSlices:
        string: "6,8"
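
The pizza manifest above uses a made-up schema for the analogy. As a rough sketch, the same information in the real resource.k8s.io/v1 ResourceSlice shape might look like the following. The slice, driver, node, and pool names come from the kubectl output shown earlier; the generation and resourceSliceCount values are assumptions, and only one device with a subset of its attributes is shown.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: kind-control-plane-dra.pizza-9q2ls
spec:
  driver: dra.pizza
  nodeName: kind-control-plane
  pool:
    name: kind-control-plane
    generation: 1            # assumed value
    resourceSliceCount: 1    # assumed: the pool is published as a single slice
  devices:
  - name: margherita-pan-pizza
    attributes:
      kitchen.pizza.example/dough:
        string: pan
      kitchen.pizza.example/toppings:
        string: basil
      kitchen.pizza.example/extraCheeseAvailable:
        bool: true
```

ResourceSlices are normally published by the driver, not written by hand; the cluster admin only reads them to see which attributes are available for filtering.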
Create DeviceClasses (for device filtering). When a user creates a ResourceClaim, they don’t pick devices directly.

Example: a DeviceClass that selects CPUs with the x86_64 architecture.

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: amd-cpu
spec:
  selectors:
  - cel:
      expression: |-
        device.attributes["hardware.cpu"].architecture == "x86_64" &&
        device.attributes["hardware.cpu"].vendor == "amd"
---
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: intel-cpu
spec:
  selectors:
  - cel:
      expression: |-
        device.attributes["hardware.cpu"].architecture == "x86_64" &&
        device.attributes["hardware.cpu"].vendor == "intel"

"I want a vegetarian pizza with mushrooms, basil, and extra cheese"

As a DeviceClass, that order becomes a filter over the attributes the driver publishes:

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: pizza-vegetarian
# ...
spec:
  selectors:
  - cel:
      expression: |-
        device.attributes["kitchen.pizza.example"].cheese == "mozzarella" &&
        device.attributes["kitchen.pizza.example"].toppings.contains("mushroom") &&
        device.attributes["kitchen.pizza.example"].extraCheeseAvailable == true

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu.amd.com
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.amd.com'"
---
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mig.nvidia.com
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.nvidia.com' && device.attributes['gpu.nvidia.com'].type == 'mig'"

DevOps/app dev

  • is given access to a cluster
  • has DeviceClasses that describe categories of devices
  • creates ResourceClaims (the resources a workload needs) and attaches them to the workload (Pod)
    • deviceClass CEL: a filter over the available devices
    • resourceClaim CEL: workload-specific attributes of the hardware. From the acceptable devices, which exact ones do I want right now?
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: my-pizza-order
spec:
  devices:
    requests:
    - name: pizza
      exactly:
        deviceClassName: pizza-vegetarian
        selectors:
        - cel:
            expression: |-
              device.attributes["kitchen.pizza.example"].cheese == "mozzarella" &&
              device.attributes["kitchen.pizza.example"].toppings.contains("mushroom") &&
              device.attributes["kitchen.pizza.example"].extraCheeseAvailable == true

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: claim-cpu-capacity-20
spec:
  devices:
    requests:
    - name: numa0-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID == 0
    - name: numa1-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID ==1
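
A claim does nothing until a workload references it. As a minimal sketch, a Pod could attach the claim above like this; the Pod name, container name, and image are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-aware-app                  # hypothetical Pod name
spec:
  resourceClaims:
  - name: cpus                          # Pod-local name for the claim
    resourceClaimName: claim-cpu-capacity-20
  containers:
  - name: app                           # hypothetical container
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      claims:
      - name: cpus                      # must match spec.resourceClaims[].name
```

The Pod-level resourceClaims entry binds the claim to the Pod, and the container-level resources.claims entry decides which containers actually get the devices.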

Pod references ResourceClaim

ResourceClaim
deviceClass + CEL constraints

DeviceClass
cluster wide device filter

ResourceSlice
devices advertised by the driver

DRA driver

scheduler
matches the claim to the node

ResourceClaim is allocated
status is updated by scheduler

DRA driver on Node
NodePrepareResources called

Kubelet mounts devices into Pod sandbox

Pod starts
device ready, CDI injected
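
To make the "ResourceClaim is allocated" step in the flow above more concrete, here is a rough sketch of what the status of a claim such as my-pizza-order might look like once the scheduler has picked a device. The device name, Pod name, and UID are hypothetical placeholders; the driver, pool, and node names are taken from the kubectl output shown earlier.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: my-pizza-order
status:
  allocation:
    devices:
      results:
      - request: pizza                  # name of the request in the claim spec
        driver: dra.pizza
        pool: kind-control-plane
        device: mushroom-pan-pizza      # hypothetical device name
    nodeSelector:                       # where the allocated device is usable
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - kind-control-plane
  reservedFor:
  - resource: pods
    name: pizza-eater                   # hypothetical Pod name
    uid: 00000000-0000-0000-0000-000000000000   # placeholder UID
```

The status is written by the control plane, never by the user; it can be read with kubectl get resourceclaim my-pizza-order -o yaml.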
 

DRA isn't only about GPUs

CPU, memory, hugepages, NICs, RDMA network devices, etc.

Migration from Device plugins to DRA

 

DRA

By fmuyassarov
