@vipulgupta2048

I am Vipul!

We-pull

  • Product Owner & Documentation Lead at balena
  • Run a docs initiative called Mixster
  • Volunteer: PyCon India & ALiAS
  • Work remotely from Noida, India (burning up atm)
  • Pronouns: He/him/his

Meet our subject, balenaOS

  • Open-source Yocto-based embedded OS
  • Meant to run containers on IoT devices
  • Fault-tolerant, field-tested, completely free
  • Supports 100+ Single Board Computers (SBCs)

Haven't heard of it?

BalenaOS in action

BalenaOS runs containers on CO2 sensors, farms, smart dustbins, trucks...

First to run Docker containers in space.

BalenaOS making waves

... underwater drones, museums, photo booths, parking lots, forklifts, and your talking wireless speakers

The challenges

With smart cities and even your toaster getting firmware updates, the stakes have never been higher.

Meanwhile, testing software is hard.

Testing software on hardware is painful.

Challenges with embedded

Challenges building embedded operating systems

  • Absolute fault tolerance 
  • Deploying releases confidently
  • Maintaining backward compatibility
  • Adding support for new devices 
  • Patching issues in existing devices

Our solution, Autokit

Worker (controller)

Raspberry Pi 4 (Device Under Test)

With Autokit, we enabled a hardware-in-the-loop testing pipeline for balenaOS.

Autokit Features

  • Flashes storage media (such as SD cards and eMMC)
  • Controls power to the DUT
  • Detects whether the DUT is on or off
  • Selects the boot mode & customizes the boot process
  • Simulates USB connections
  • Supports display capture & serial communication
  • Works with, and can test, modems and HDDs
  • Provides a Wi-Fi or Ethernet connection to the DUT
  • Can interact with other interfaces on the DUT

Leaving the best for last!

  • STL files available for 3D printing cases
  • No custom hardware, all off-the-shelf components
  • Docs to build your own autokit
  • Free and open-source!

Hardware in the loop testing pipeline in a nutshell

GitHub Actions

The journey from PR to Production begins

I love big YAML, I cannot lie.

1. The Pull Request

A contributor opens a pull request on the balenaOS repository.

2. Build them all!

The PR triggers multiple GitHub Actions jobs that build balenaOS releases, one for each device type that has a caller workflow configured.

On PR open: Build + Test jobs

On Closed: Build + Deploy (Releases already tested)

On Dispatch: Whatever you desire.
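
The trigger-to-job mapping above can be sketched as a small lookup. This is only an illustration: the function name `jobsForEvent` and the job labels are hypothetical, and the real pipeline expresses this logic in GitHub Actions workflow YAML, not JavaScript. Treating new commits (`synchronize`) the same as `opened` is also an assumption.

```javascript
// Hypothetical sketch: maps a GitHub event to the jobs the slides describe.
function jobsForEvent(eventName, action, requestedJobs = []) {
  if (eventName === 'pull_request') {
    // New or updated PRs get built and tested on hardware.
    if (action === 'opened' || action === 'synchronize') {
      return ['build', 'test'];
    }
    // Closed PRs are rebuilt and deployed; tests already ran on the open PR.
    // The slides do not say how closed-but-unmerged PRs are handled,
    // so this sketch does not distinguish them.
    if (action === 'closed') {
      return ['build', 'deploy'];
    }
    return [];
  }
  if (eventName === 'workflow_dispatch') {
    // Manual runs: "whatever you desire", modeled here as caller-chosen jobs.
    return [...requestedJobs];
  }
  return [];
}

console.log(jobsForEvent('pull_request', 'opened'));
console.log(jobsForEvent('pull_request', 'closed'));
console.log(jobsForEvent('workflow_dispatch', undefined, ['build']));
```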

The Intel-NUC caller workflow calls the big composite workflow and controls the triggers, inputs, external contributor access & secrets.

3. Freshly baked balenaOS

Artifacts produced by the build job are uploaded to GitHub artifact storage, then downloaded by the test jobs, which are managed by our testing framework, Leviathan.

4. Leviathan takes over!

Leviathan is the software layer that interacts with balenaOS images, configures them, and talks to the device under test (here, an Intel NUC) connected to our hardware jig, the autokit.

The software glue, Leviathan

  • Finds an available autokit
  • Configures balenaOS images
  • Runs the tests by interfacing with the autokit
  • The same test suite supports 50+ device types
  • Tests written in node-tap, results in JSON
  • Free and open-source
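
As a rough illustration of the first bullet, "finding an available autokit" could look like the lookup below. Everything here is assumed for the sketch: the `findAvailableAutokit` name, the `status` field, and the inventory records are hypothetical and do not reflect Leviathan's actual scheduling code.

```javascript
// Hypothetical sketch: pick the first idle autokit wired to the requested device type.
function findAvailableAutokit(autokits, deviceType) {
  const match = autokits.find(
    (kit) => kit.deviceType === deviceType && kit.status === 'idle',
  );
  return match ?? null; // null means every matching autokit is busy (or none exists)
}

// Illustrative inventory; ids, device type names, and statuses are made up.
const autokits = [
  { id: 'kit-1', deviceType: 'raspberrypi4-64', status: 'busy' },
  { id: 'kit-2', deviceType: 'intel-nuc', status: 'idle' },
  { id: 'kit-3', deviceType: 'raspberrypi4-64', status: 'idle' },
];

console.log(findAvailableAutokit(autokits, 'raspberrypi4-64'));
console.log(findAvailableAutokit(autokits, 'beaglebone-black'));
```

A real scheduler would also need to reserve the chosen autokit so two test jobs cannot claim it at once; that part is left out of this sketch.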

Turning on the autopilot.

Leviathan abstracts away complexity by letting developers write one test suite that targets all devices.

No matter their power requirements, boot process, or flash procedure, Leviathan takes care of everything.

// Leviathan test to flash a device connected to the autokit (sample)
await this.worker.off(); // Ensure the test device is off before flashing
await this.worker.flash(this.os.image.path); // Flash the device with the balenaOS image
await this.worker.on(); // Turn on the device

// Reaching this point means none of the steps above threw an error
test.true(
  true,
  `${this.os.image.path} should be flashed properly`,
);
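
To see why one test can target every device, it may help to run the same off, flash, on sequence against a stand-in worker. The sketch below is not Leviathan code: `createMockWorker` and `flashDevice` are hypothetical names, and the mock only records calls where a real worker would drive the autokit hardware.

```javascript
// Hypothetical stand-in for a worker: records calls instead of touching hardware.
function createMockWorker() {
  const calls = [];
  return {
    calls,
    async off() { calls.push('off'); },
    async flash(imagePath) { calls.push(`flash:${imagePath}`); },
    async on() { calls.push('on'); },
  };
}

// Same off -> flash -> on sequence as the sample, written as a plain function.
// The test never needs to know how a given device is powered or flashed.
async function flashDevice(worker, imagePath) {
  await worker.off();
  await worker.flash(imagePath);
  await worker.on();
  return worker.calls;
}

flashDevice(createMockWorker(), '/tmp/balena.img').then((calls) => {
  console.log(calls.join(' -> ')); // off -> flash:/tmp/balena.img -> on
});
```

Swapping the mock for a device-specific worker changes what each call does, while the test body stays the same.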

5. Hardware-in-the-loop testing (HiLT)

Workers automatically provision the device under test with the OS, provide power and network, and execute our commands.

[
  {
    "suite": "Managed BalenaOS release suite",
    "stats": {
      "tests": 12,
      "ran": 10,
      "skipped": 2,
      "passed": 10,
      "failed": 0
    },
    "tests": {
      "Image preload test": "passed",
      "Move device to hostapd test App": "skipped",
      "Move device back to original app": "skipped",
      "Provisioning without deltas": "passed",
      "Override lock test": "passed",
      "Update supervisor randomized timer": "passed",
      "Set device environment variables": "passed",
      "Set service environment variables": "passed",
      "SSH authentication in production mode": "passed",
      "SSH authentication in development mode": "passed",
      "os-config service on boot": "passed",
      "os-config service randomized timer": "passed"
    },
    "dateTime": "Thu Apr 27 2023 11:43:52 GMT+0000 (Coordinated Universal Time)"
  },
  {
    "suite": "Hostapp update suite",
    "stats": {
      "tests": 5,
      "ran": 5,
      "skipped": 0,
      "passed": 5,
      "failed": 0
    },
    "tests": {
      "Broken balena-engine": "passed",
      "Broken VPN": "passed",
      "Rollback altboot (broken init) test": "passed",
      "HUP from previous release": "passed",
      "HUP from this release": "passed"
    },
    "dateTime": "Thu Apr 27 2023 11:58:25 GMT+0000 (Coordinated Universal Time)"
  },
  {
    "suite": "Unmanaged BalenaOS release suite",
    "stats": {
      "tests": 45,
      "ran": 36,
      "skipped": 9,
      "passed": 36,
      "failed": 0
    },
    "tests": {
      "check secure boot": "passed",
      "BeagleBone Black u-boot overlay test: deactivate HDMI": "skipped",
      "243390-rpi3 - CUS/EUS chipsets test": "skipped",
      "fingerprint file test": "passed",
      "ext4 filesystems are checked on boot": "passed",
      "OS-release file check": "passed",
      "Installer used migrator module": "passed",
      "issue file check": "passed",
      "issue.net file check": "passed",
      "Chronyd service": "passed",
      "Sync test": "passed",
      "Source test": "passed",
      "Offline sources test": "passed",
      "System time skew test": "passed",
      "kernel-overlap test": "passed",
      "Bluetooth scanning test": "skipped",
      "Container healthcheck test": "passed",
      "Container exposed variables test": "passed",
      "Identification test": "skipped",
      "Cellular tests": "passed",
      "hostname configuration test": "passed",
      "ntpServer test": "passed",
      "dnsServers test": "passed",
      "os.network.connectivity test": "passed",
      "os.network.wifi.randomMacAddressScan test": "passed",
      "udevRules test": "passed",
      "persistentLogging configuration test": "passed",
      "Reboot test": "skipped",
      "Wired test": "skipped",
      "Wireless test": "skipped",
      "Socks5 test": "passed",
      "Http-connect test": "passed",
      "Engine socket is exposed in development images": "passed",
      "Engine socket is not exposed in production images": "passed",
      "Engine watchdog recovery": "passed",
      "Engine healthcheck performance": "passed",
      "Under-voltage test": "passed",
      "Ramdisks, zram and loop devices are not scanned for rootfs": "passed",
      "by-state links are created": "passed",
      "DToverlay & DTparam tests": "skipped",
      "state partition reset": "passed",
      "data partition reset": "passed",
      "RevPi Core 3 DIO module test": "skipped",
      "zram is enabled and configured as swap": "passed",
      "Internet sharing iptables rules test": "passed"
    },
    "dateTime": "Thu Apr 27 2023 11:24:52 GMT+0000 (Coordinated Universal Time)"
  }
]

200+ assertions, 62 tests

Avg. runtime: 90-120 minutes
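
The 62-test figure can be checked by summing the per-suite `stats` objects from the report above. The sketch below copies only those three `stats` objects; the `totalStats` helper is a hypothetical name, not part of Leviathan.

```javascript
// Stats copied from the three suite reports above; test names are omitted.
const suites = [
  { suite: 'Managed BalenaOS release suite', stats: { tests: 12, ran: 10, skipped: 2, passed: 10, failed: 0 } },
  { suite: 'Hostapp update suite', stats: { tests: 5, ran: 5, skipped: 0, passed: 5, failed: 0 } },
  { suite: 'Unmanaged BalenaOS release suite', stats: { tests: 45, ran: 36, skipped: 9, passed: 36, failed: 0 } },
];

// Hypothetical helper: add up each counter across all suites.
function totalStats(reports) {
  const totals = { tests: 0, ran: 0, skipped: 0, passed: 0, failed: 0 };
  for (const { stats } of reports) {
    for (const key of Object.keys(totals)) {
      totals[key] += stats[key];
    }
  }
  return totals;
}

const totals = totalStats(suites);
console.log(totals); // tests: 62, ran: 51, skipped: 11, passed: 51, failed: 0
// A run is green only if nothing failed.
console.log(totals.failed === 0 ? 'all suites passed' : 'failures found');
```

Of the 62 tests defined, 51 ran and 11 were skipped. The skipped tests carry device-specific names such as the BeagleBone Black and RevPi Core 3 ones.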

Breaking things intentionally to see if the OS can recover & update

6. Actions reports back the results

Each job finishes in its own time and updates its status on the pull request, where the team can review the results.

7. PR is merged 

After all required checks pass and the PR is approved, it is merged automatically.
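
The merge gate described here boils down to a simple predicate. The sketch below is an assumption-laden illustration: `canAutoMerge` and the shape of the `pr` object are hypothetical, and in practice this rule is enforced by repository settings or automation, not hand-written code.

```javascript
// Hypothetical sketch of the merge gate: required checks green AND PR approved.
function canAutoMerge(pr) {
  const requiredChecksPass = pr.checks
    .filter((check) => check.required)
    .every((check) => check.conclusion === 'success');
  return requiredChecksPass && pr.approved;
}

// Illustrative PR state; check names are made up.
const pr = {
  approved: true,
  checks: [
    { name: 'build', required: true, conclusion: 'success' },
    { name: 'hardware tests', required: true, conclusion: 'success' },
    { name: 'optional lint', required: false, conclusion: 'failure' },
  ],
};

console.log(canAutoMerge(pr)); // true: the only failing check is not required
console.log(canAutoMerge({ ...pr, approved: false })); // false: approval missing
```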

8. Merge === Deploy

The closed event trigger re-runs the balenaOS builds and pushes them to S3 with the required artifacts, all using GitHub Actions. From there, they become available to our users.

BalenaOS CI/CD at scale!

  • 95 device types supported by balenaOS.
  • 47 devices are supported on the autokit.
  • For each PR or new commit, a new balenaOS draft release is created (20 such events a day).
  • Each balenaOS release triggers 3 test suite runs.

GitHub Actions builds 1000+ balenaOS releases daily and runs 3000+ test jobs.
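
The slides do not spell out how the daily totals are derived. One plausible reading, offered only as an assumption, is that every release event builds one release per device type and each release runs three test suites. The `dailyLoad` helper below is hypothetical and simply multiplies those figures.

```javascript
// Hypothetical back-of-the-envelope model:
// builds = device types x release events per day; test jobs = builds x suites.
function dailyLoad(deviceTypes, releaseEventsPerDay, suitesPerRelease) {
  const builds = deviceTypes * releaseEventsPerDay;
  return { builds, testJobs: builds * suitesPerRelease };
}

// With the 47 autokit-supported devices from the list above:
console.log(dailyLoad(47, 20, 3)); // { builds: 940, testJobs: 2820 }
// With a round 50 device types:
console.log(dailyLoad(50, 20, 3)); // { builds: 1000, testJobs: 3000 }
```

Under this reading, the daily figures quoted above land between the two cases, so they look like rounded estimates and not exact counts.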

Going all in on Actions!

  • Reliability, reusability, context retrieval & transparency of logs
  • Ease of getting started: using either GitHub's hosted runners or...
  • BYOR: Bring your own runners! Cost savings 💸💸💸
  • Setting up environments for testing, staging, and prod

*Also, why we are migrating from Jenkins

*Also, does help that this exists

The Journey

  • Propose changes
  • Build draft balenaOS releases from PR changes.
  • Run draft release on an actual device.
  • Test the release end-to-end by automating the device.
  • Gain feedback from the loop to make improvements
  • Make changes to the pull request.
  • Tests provide instant feedback on changes & reviews
  • PR gets merged & the release gets deployed. 

This is hardware in loop testing with Jenkins

Gettin' Exponential Gains

With hardware-in-the-loop testing & Jenkins for CI/CD, we have managed to:

  • Reduce our OS release cycle from weeks to hours
  • Achieve exponential scale to support new devices
  • Save thousands of hours in troubleshooting
  • Follow TDD, with new tests being added every day

Most importantly, we even added QEMU support for testing virtual devices, thanks to the 2022 chip shortage.

But Vipul, how can I use this now at work?

Worker (controller)

Remote devices must be directly controlled, automated, and maintained using scripts.

Worker (controller)

Feedback on changes

Testing software directly on hardware in a CI/CD pipeline

[Diagram: Worker (controller) surrounded by question marks]

Quality Assurance, stress testing, random testing, environmental testing

And, you can test Operating Systems

That's the talk.

What we learned

  • Testing operating systems is important
  • And, incredibly painful to scale
  • But doesn't have to be.
  • When you are actually testing on the hardware.
  • And, using GitHub Actions. 

Resources

And, that's about it!

Questions? Collaborate? Work with us? Reach out!

Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups

Feedback please + Link to the slides
