Off of the
shaky Grounds

the path towards stabilizing Substrate

benjamin kampmann

gnunicorn.org // ben@parity.io

Jun 24th, 2020

@Parity & Friends Meetup, Etherlands

Overview

  • Challenges
  • Metering, Measuring & Profiling
  • Freezing, cargo unleash
  • Testing & continuous releasing

Challenges

HARD PERFORMANCE

  • 🕳️ leaking memory
  • 🥵 CPU spikes
  • 👋 dropping peers
  • 😱 stalling consensus
  • 💸 bad weights

PROCESS

  • massive mono-repo with plenty of
    path = ../ and circular dev-dependencies – how even?
  • polkadot and substrate being out of sync
  • merge order breaking the master branch

  • unit tests cover too little – all tests pass, yet the node fails
  • big refactors lead to instability for months
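The path-dependency pain can be made concrete: for a crate to be publishable to crates.io, every in-tree dependency needs a version requirement next to its path, roughly like this (crate name and version are illustrative):

```toml
[dependencies]
# `path` is only used for local builds; crates.io strips it and
# falls back to `version` – so a bare `path = "../"` entry with no
# version makes the crate unpublishable.
sp-core = { version = "2.0.0", path = "../../primitives/core" }
```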

FIXING IT

🔧

Leaks, Spikes & Stalls

Measure, Measure, Measure

MOAR TELEMETRY

Better logging

in multi-color and with emojis!

METER EVERYTHING!!!

Adding plenty more Prometheus gauges

  • resource metrics: CPU, memory, load, file handles, etc.
  • networking: up-/downstream bandwidth, etc.
  • peer-to-peer: connected nodes, etc.
  • sync + import queue: incoming blocks, import times, etc.
  • GRANDPA: messages, etc.
  • database: I/O, cache, state, etc.
  • Tokio: tasks, channels, sizes, etc.
  • internals: caches, hash maps, etc.

> 500 metric parameters

INVESTIGATE!

Leaks, Spikes & Stalls

MEMORY PROFILING !!!

How to release, even?

Upgrading & Automating the Processes

Freeze on master

  • No more big refactors
  • hard 2-reviewers-rule
  • Polkadot-Companion-PR enforcement
  • focus on stabilizing, bug fixes, docs
  • breaking changes only for release-critical fixes and pre-defined features
  • cleaning up

Too many moving Parts

Cleaning up

  • started already last year: splitting dependencies, cleaning up the tree
  • reducing circular dependencies
  • updated licenses: now Apache-2.0 & GPL-3.0 + Classpath Exception
  • changelog generation
  • cleaning up Cargo.toml manifests
  • benchmarking pallets for proper weights
  • figuring out a release strategy
  • huge mono-repo = huge dependency trees; the release order of crates still matters
  • each crate must be checked before release against its to-be-released dependencies
  • checking these is already an 80-minute process
  • CI should release
  • squash merges mean your git tag is lost after merge
  • crates.io has a rate limit ...

Releasing massive Mono Repos

Cargo unleash Em 🐉

github.com/GNUnicorn/cargo-unleash

  • cargo subcommand to help manage and release massive (Rust) mono repos
  • checks all crates for crates.io requirements
  • packages, builds and publishes
  • helps manage versions, keeps the tree up to date
  • match packages on regexp

$ cargo unleash em-dragons

Alpha -> RC

  • We are now able to release by just tagging any commit
  • we've left the rather unstable alpha phase and are heading towards a final release
  • the final release still waits on fixing remaining bugs that come up during the Polkadot launch

The Future?

Testing and Continuous Releases

MONO-Release Repo

  • all crates share the same version, so we have to bump even those that didn't change
  • keeping versions clean and compatible is complex, especially for the outside
  • we'd have to major-bump a lot, and that might have to trickle down
  • doing that well is hard – it's easy to break things unintentionally
  • and cargo's semver pre-releases do not work as expected: cargo updates across pre-release names, though semver says it should not!
  • ensuring things are fine is hard already, and we constantly break stuff
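The pre-release pitfall fits in a single dependency line (crate name illustrative):

```toml
[dependencies]
# Cargo reads this as "^2.0.0-alpha.3", which also matches
# 2.0.0-alpha.4, 2.0.0-rc.1 and 2.0.0 – so `cargo update` happily
# jumps between pre-releases, even though semver gives no
# compatibility guarantee between them.
sc-service = "2.0.0-alpha.3"
```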

Instead: Continuous Releasing

1. Bump every PR

2. RELEASE Every Commit of master

as SemVer requires – automatically*

3. Profit

But Wait – what about...

'em version numbers

stability !!!

'Em Version Numbers

  • cargo doesn't care: it will use the latest minor and patch anyway
  • we break things a lot, in particular internal APIs
  • our non-framework approach makes those public, but for the majority this doesn't matter
  • you only care when your specific dependencies break
  • but what if you had only one dependency to care about?

in a world of github, what even is a 'release'?

substrate client meta crate

– think tokio 0.2 but substrate –

# Cargo.toml
[package]
name = "suprchain"
version = "2.0.0"
authors = ["Benjamin Kampmann <ben@parity.io>"]
edition = "2018"

[dependencies]
suprchain-runtime = "0.2"
substrate-client = "0.2"

// main.rs
use suprchain_runtime::Runtime;
use substrate_node;

fn main() {
    substrate_node::Runner::<Runtime>::new()
       .main();
}

# Cargo.toml
[package]
name = "suprchain"
version = "2.0.0"
authors = ["Benjamin Kampmann <ben@parity.io>"]
edition = "2018"

[dependencies]
suprchain-runtime = "0.2"
substrate-client = { version = "0.2", features = ["unstable-async-offchain"] }

// main.rs
use suprchain_runtime::Runtime;
use substrate_node;

fn main() {
    substrate_node::Runner::<Runtime>::new()
       .with_async_offchain(|cfg| {
          cfg.max_timeout = 360;
       })
       .main();
}

stability

  • we don't have any classic build-QA-release cycle @parity
  • we released whenever we felt like it –
    the main reason it didn't happen more frequently is that it's quite some work
  • but if it's automatic, it can be done more frequently – that doesn't really change much about stability

releases indicate stability!

Moar testing – Upcoming

QA happens before the PR is merged

  • Sticking with the 2-reviewers minimum
  • + CODE_OWNERS for special areas
  • Automatic deploying of some PRs to validator nodes and running the changes for a while
  • Benching of relevant PRs
  • Downstream testing of "nightly" build
  • New testing environment* – see the code example below
  • Increasing test coverage
  • Integration testing for Runtime updates of live chains
#[test]
fn transfer_smoketest() {
  // reuse provided test against local runtime
  pallet_balances::tests::transfer_smoke(Runtime.into());
}

// testing a specific feature ourselves
#[test]
fn transfer_should_trigger_event() {
  // given
  let mut test = test::deterministic(Runtime.into());
  // when
  test.read_state(|| {
    <Runtime as CreateTransaction>::create_transaction(
      balance_call,
      signer,
      account,
      nonce,
      )
  });
  // controlled run
  test.produce_blocks(1_u32);
  // then
  test.with_state(|| {
    let events = frame_system::Module::<Runtime>::events();
    assert_eq!(events.len(), 1);
  });
}

thanks!

questions?

benjamin kampmann

gnunicorn.org // ben@parity.io

Jun 24th, 2020

@Parity & Friends Meetup, Etherlands