Tendermint Monitor

Rob Hilgefort

What's Monitored?

  • Validator addresses appear in the signatures list of each new block.
  • HTTP Health checks.
    • Calls to `/health` come back `200`.
  • Validator addresses appear in `/validators` endpoint list.
  • Timely recent block.
    • Most recent block is no older than a set number of seconds.
    • Configurable warning threshold and error threshold.
  • Consistent block streaming.
    • Average time between blocks stays within a set number of seconds.
    • Configurable warning threshold and error threshold.
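The two time-based checks above can be sketched as pure functions. This is a minimal illustration, not the monitor's actual code: the names `Thresholds`, `classify`, `recentBlockStatus`, and `averageBlockIntervalSeconds` are hypothetical, and timestamps are assumed to be milliseconds since the epoch.

```typescript
type Status = "ok" | "warning" | "error";

// Hypothetical config shape: both thresholds are in seconds.
interface Thresholds {
  warningSeconds: number;
  errorSeconds: number;
}

// Compare a measured duration against the warning and error thresholds.
const classify = (seconds: number, t: Thresholds): Status =>
  seconds >= t.errorSeconds
    ? "error"
    : seconds >= t.warningSeconds
      ? "warning"
      : "ok";

// "Timely recent block": how old is the most recent block right now?
const recentBlockStatus = (
  latestBlockTimeMs: number,
  nowMs: number,
  t: Thresholds,
): Status => classify((nowMs - latestBlockTimeMs) / 1000, t);

// "Consistent block streaming": average gap between consecutive blocks.
// Returns undefined when there are fewer than two blocks to compare.
const averageBlockIntervalSeconds = (
  blockTimesMs: readonly number[],
): number | undefined => {
  if (blockTimesMs.length < 2) return undefined;
  const spanMs = Math.max(...blockTimesMs) - Math.min(...blockTimesMs);
  return spanMs / 1000 / (blockTimesMs.length - 1);
};
```

The average can then be passed through `classify` with its own thresholds, so both checks share one warning/error rule.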

How Monitored?

  • Scheduled interval jobs.
    • Runs analysis on a fixed schedule rather than relying only on websocket events, so checks still happen when no events arrive.
  • Websocket event analysis on block information
    • In memory block history caching for job analysis.
  • Monitors multiple nodes in parallel.
    • Each node gets its own jobs scheduled.
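As a rough sketch of the in-memory block history and the signature check it supports: the `BlockSummary` shape and the function names below are assumptions made for illustration, not Tendermint's event schema or the monitor's real types. Address comparison is case-insensitive here, which is also an assumption.

```typescript
// Hypothetical, already-parsed view of a new-block websocket event.
interface BlockSummary {
  height: number;
  timeMs: number;
  signerAddresses: readonly string[];
}

// Bounded history: append the new block and keep only the newest `limit`.
const appendBlock = (
  history: readonly BlockSummary[],
  block: BlockSummary,
  limit: number,
): BlockSummary[] => (limit <= 0 ? [] : [...history, block].slice(-limit));

// Watched validator addresses that are absent from the block's signatures.
const missingSigners = (
  block: BlockSummary,
  watched: readonly string[],
): string[] => {
  const signed = new Set(block.signerAddresses.map((a) => a.toUpperCase()));
  return watched.filter((a) => !signed.has(a.toUpperCase()));
};
```

Each monitored node would hold its own history, so a scheduled job can analyze recent blocks without waiting for the next websocket event.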

Reporting

  • STDOUT / STDERR
  • Slack
  • Pager Duty

STDOUT / STDERR

  • Logs valid states/status and app events (HTTP/WS Connections).
  • Helps verify app state when booting.
  • Useful for digging in after an alert
  • Also logs errors and warnings so they can be seen in the context of healthy logs.

Slack

  • Sends messages in a Slack workspace
  • Multiple channels based on severity (mute warnings).

Pager Duty

  • "Error" alerts log to Slack, as well as Pager Duty.
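The severity routing described in the Slack and Pager Duty slides could look roughly like this. The channel names, the `RoutingConfig` shape, and `destinationsFor` are hypothetical; the sketch only decides where an alert goes and does not call either service.

```typescript
type Severity = "warning" | "error";

type Destination =
  | { kind: "slack"; channel: string }
  | { kind: "pagerduty" };

// Hypothetical config: one Slack channel per severity, so warnings can be muted.
interface RoutingConfig {
  warningChannel: string;
  errorChannel: string;
}

// Warnings go to Slack only; errors go to Slack and Pager Duty.
const destinationsFor = (
  severity: Severity,
  cfg: RoutingConfig,
): Destination[] =>
  severity === "error"
    ? [{ kind: "slack", channel: cfg.errorChannel }, { kind: "pagerduty" }]
    : [{ kind: "slack", channel: cfg.warningChannel }];
```

Keeping the routing decision separate from the senders makes it easy to test without network access.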

Demo

Deployment

  • Local Tendermint Test Net
  • NodeJS on Google App Engine
  • Monitoring The Monitor

Local Test Net

  • Docker containers
  • Exposed via port forwarding to GAE

NodeJS Google App Engine

  • NodeJS deployed on Google App Engine
  • Manual scaling, 1 instance
  • Exposed via port forwarding to GAE instance

Monitoring The Monitor

  • ExpressJS health endpoint
  • Uptime robot
  • Slack reporting

Monitor Implementation

  • Language(s)
  • Build Tooling
  • Control Flow "Architecture"
  • Integrations
  • Libraries

Language(s)

  • NodeJS
    • Interpreter / Ecosystem
  • TypeScript
    • Fully and strictly statically typed
  • `fp-ts`
    • Library that facilitates functional programming in TypeScript

Build Tooling

  • TypeScript
    • Compiler
  • ESLint
    • Style standard
  • Prettier
    • Formatter
  • ESBuild
    • Build artifacts

Control Flow "Architecture"

  • `toad-scheduler`
    • Simple in-memory job scheduler
    • Used for interval jobs
    • While `node-schedule` is the most popular, `toad-scheduler` was sufficient for this project's needs.
  • `ws`
    • De facto WebSocket library.
    • Used to ingest new blocks.

Integrations

  • `@slack/web-api`
    • `chat.postMessage`
  • `@pagerduty/pdjs`
    • Only one with TS support
    • Hot garbage 🤮

Libraries

  • `dotenv`
    • Environment variables
  • `express`
    • HTTP health endpoint
  • `got`
    • HTTP abstraction
  • `luxon`
    • DateTime library
  • `ramda`
    • Utility functions
  • `ramda-adjunct`
    • Utility functions

Next Steps

  • If node goes down, you get LOTS of alerts
  • Add Winston logger and transport to GCP
  • Pager duty messages in Slack, ack inline.
  • Reconnect to node after WS disconnects. (Currently requires a restart to become healthy again.)
  • Write tests.
  • Move away from GAE (not a big fan); would rather use Docker and some other managed solution.
  • Push metrics to Prometheus, Grafana dashboarding / alerting.
  • Much better auto-recovery and resolution.
  • GitLab CD. (GAE secrets)
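Two of the items above, the alert flood when a node goes down and reconnecting after a websocket disconnect, are commonly handled with a per-alert cooldown and exponential backoff. The following is one possible sketch, not something the monitor does today; `createAlertThrottle`, `backoffDelayMs`, and the key format are hypothetical.

```typescript
// Returns a function that decides whether an alert with this key may be sent.
// Repeats of the same key inside the cooldown window are suppressed.
const createAlertThrottle = (cooldownMs: number) => {
  const lastSent = new Map<string, number>();
  return (key: string, nowMs: number): boolean => {
    const previous = lastSent.get(key);
    if (previous !== undefined && nowMs - previous < cooldownMs) return false;
    lastSent.set(key, nowMs);
    return true;
  };
};

// Delay before reconnect attempt `attempt` (0-based): doubles each time, capped.
const backoffDelayMs = (
  attempt: number,
  baseMs: number,
  maxMs: number,
): number => Math.min(maxMs, baseMs * 2 ** attempt);
```

A key such as `"node-a:health"` would let each node and check type cool down independently.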

Code Exploration

Thanks 👋

By rjhilgefort