Building an Observability Platform

With Crystal

Kirk Haines

wyhaines@gmail.com

@wyhaines

"Building" is a continuing action

 

Minion is far from finished

(but you can help; it's fully open source)
 

But now, I have a story to tell...

Prologue: What Is Observability?

 

New Relic (What Is Observability? - New Relic Blog) says:

Chapter 1

A New Job

Week 1; a Hackathon

Hack a POC product and pitch it.

The company might back it with resources to build it.

A New and Former Coworker

J. Austin had a product vision...

Build the tool that he always wanted when doing professional services and devops work

.    .    .    .    .    .    .    .    .    .    .    .    .    .

A Weekend With React

and Material UI

I don't have any images of the original mockup.

 

(but it was pretty....)

The Basic Concept

 

 

  • Collect Logs
  • Collect System Telemetry
  • Pair Aggregation and Search of Collected Data with...
  • Remote Command Execution that is Logged/Auditable/Limitable
  • Easily Defined Notifications
  • Keep It Light
  • Make it Flexible

The Demo Was Well Received!

Next.... Go forth and build something real!

Practical use case to keep in mind -- COVID-19 testing appliances that need monitoring of custom hardware, that require data gathering to be durable to intermittent internet connection failures, and that can facilitate remote management without providing full remote shell access.

Chapter 2

A Design, A Plan, And Crystal

The Moving Parts

An Agent

 

  • Collect logs/events/telemetry and stream it back to the service.
  • Durable to network failures -- attempt to minimize data loss.
  • Responsible for Command Execution.
  • Bidirectional Communication allows for management of agent configuration from a central UI.
  • Small footprint.
  • Easily Installed.

The Moving Parts

The Streamserver

 

  • Receives data from many Agents.
  • Flexible handling of data destinations -- arbitrary databases, ElasticSearch, flat files, proxied elsewhere.
  • Resource efficient.
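
One way to picture the "flexible handling of data destinations" point is a small interface that every backend implements. The sketch below is illustrative only: `Destination`, `MemoryDestination`, `IODestination`, and `Router` are hypothetical names, not taken from the minion-streamserver source.

```crystal
# Hypothetical sketch; the real streamserver may be organized differently.
abstract class Destination
  # Every backend (database, flat file, proxy...) accepts one record at a time.
  abstract def write(record : String) : Nil
end

# Keeps records in memory; handy for tests.
class MemoryDestination < Destination
  getter records = [] of String

  def write(record : String) : Nil
    @records << record
  end
end

# Appends records to any IO: a file, a socket, or an in-memory buffer.
class IODestination < Destination
  def initialize(@io : IO)
  end

  def write(record : String) : Nil
    @io.puts record
  end
end

# Fans each incoming record out to every configured destination.
class Router
  def initialize(@destinations : Array(Destination))
  end

  def dispatch(record : String) : Nil
    @destinations.each &.write(record)
  end
end

memory = MemoryDestination.new
buffer = IO::Memory.new
router = Router.new([memory, IODestination.new(buffer)] of Destination)
router.dispatch("host-1 load=0.42")
```

Under this assumption, adding a new destination means adding one subclass; the receiving loop never changes.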

The Moving Parts

The API Server

  • HTTP Interface
  • Intended as a backend for both web and CLI tooling

The Moving Parts

Web UI

  • Easy to Use
  • Not Too Ugly, please

The Moving Parts

As coincidence would have it...

 

In 2007, I wrote Analogger, a fast, asynchronous logging service, in Ruby

 

Stable, with capacity for a phenomenal number of concurrent connections and messages per second. Its core capabilities overlapped heavily with those needed by Minion.

Maybe it could be repurposed?

The Moving Parts

Ruby, for the Agent, has some drawbacks:

 

  • The Deployment Story isn't great
  • The RAM usage is difficult to keep really low

The Moving Parts

Ruby, for the Streamserver, was probably fine:

 

  • Analogger servers have been used in production for 13 years, handling billions of messages, so the Minion Streamserver, expanded from the same codebase, should be fine.

Along Comes Crystal (and Go?)

I first learned about Crystal at RubyKaigi 2015

 

I started using it in April. I loved it. TL;DR:

 

  • Lovely Ruby ergonomics for the programmer
  • Blazing fast execution speed
  • Strong typing + type inference helped me find bugs; I liked it
  • Strong deployment story because of the compiled nature of the language

Along Comes Crystal (and Go?)

I had learned Crystal by translating old Ruby Code to it.

 

I decided to translate Analogger, as that would give a huge head start to the project.

 

Austin was going to write the Agent, and he wanted to use Go, since he was learning it.

Along Comes Crystal (and Go?)

Fast Forward....

 

StreamServer worked, written in Crystal.

Agent lagged behind; writing it from scratch in Go just bogged down.

I had written an Agent skeleton in Crystal to use to test the StreamServer anyway.

So let's just use that!

Chapter 3: The Agent

https://github.com/joshsoftware/minion-agent

  • Small static size -- 3.5M unstripped binary size
  • Small dynamic size -- Running executable is generally < 10M
  • Deployment just requires binary compiled for the architecture -- pretty typical
  • Durable to network failure!
    If the streamserver disappears, the agent can be set up to cache communications locally and spool them back to the streamserver once it can be contacted again, minimizing data loss. This feature was inherited from Analogger.
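
The caching behaviour described above can be sketched as a store-and-forward buffer. This is a minimal illustration, not the agent's actual code: `SpoolingSender` and its `transport` proc are hypothetical, and the spool lives in memory here, whereas a real agent might persist it to disk.

```crystal
# Hypothetical store-and-forward buffer; not taken from the minion-agent source.
class SpoolingSender
  getter spool : Array(String)

  # `transport` stands in for the network write; it returns true on success.
  def initialize(@transport : Proc(String, Bool))
    @spool = [] of String
  end

  # Queue the message, then try to drain everything that is waiting.
  def deliver(message : String) : Nil
    @spool << message
    flush
  end

  # Send spooled messages oldest-first and stop at the first failure,
  # so nothing is dropped or reordered while the server is unreachable.
  def flush : Nil
    while message = @spool.first?
      break unless @transport.call(message)
      @spool.shift
    end
  end
end

online = false
received = [] of String

sender = SpoolingSender.new(->(msg : String) {
  if online
    received << msg
    true
  else
    false
  end
})

sender.deliver("cpu=12") # server unreachable: message stays in the spool
sender.deliver("cpu=15") # still unreachable: two messages spooled
online = true
sender.deliver("cpu=11") # server is back: all three go out, in order

puts received.join(", ") # prints "cpu=12, cpu=15, cpu=11"
```

The message is only removed from the spool after the transport reports success, which is what keeps a dropped connection from losing data in this sketch.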

Chapter 4: The StreamServer

https://github.com/joshsoftware/minion-streamserver

 

  • Small static size -- 7M unstripped binary size
  • Very fast -- largely limited by IO speeds

That Third Piece -- the API Server

(Chapter 5)

https://github.com/joshsoftware/minion-api

No plan of operations reaches with any certainty beyond the first encounter with the enemy's main force. -- Helmuth von Moltke

  • A Rails developer built the initial API server as a well-formed, very standard Rails API server
  • When I picked up what he had done, his work was great, but I was frustrated
  • Even when doing almost nothing, a single instance of the API server was > 200M
  • It just felt heavy for something that was just a simple API server

That Third Piece -- the API Server

Everything Else was in Crystal, so Why Not?

 

But...how to implement it?

That Third Piece -- the API Server

@blacksmoke16 was talking about Athena on the Crystal Gitter

 

I checked it out

 

Things I liked:

 

That Third Piece -- the API Server

module MinionAPI
  class AuthController < ART::Controller
    @[ART::Get("/api/v1/auth/")]
    def index : String
      "TODO: Return appropriate top level response."
    end

    @[ART::QueryParam("email")]
    @[ART::QueryParam("password")]
    @[ART::Get("/api/v1/auth/signin")]
    def signin(email : String = "", password : String = "") : ART::Response
      signin_impl(email, password)
    end

    @[ART::Post("/api/v1/auth/signin")]
    def signin(request : HTTP::Request) : ART::Response
      raise ART::Exceptions::BadRequest.new "Missing request body." unless body = request.body

      data = JSON.parse(body.gets_to_end)

      handle_invalid_auth_credentials unless email = data["email"]?
      handle_invalid_auth_credentials unless password = data["password"]?

      signin_impl(email.not_nil!.as_s, password.not_nil!.as_s)
    end
    
    # REDACTED
  end
end

(Athena Controller Snippet)

That Third Piece -- the API Server

  • I reimplemented the API server in Crystal
  • Demand Driven Development -- API features got built as UI features were implemented that needed them
  • No ORM - it just isn't needed here
  • I kept the Rails Migrations!
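
-- Distinct data_key values for the given servers. Each recursive step jumps
-- to the next larger key (a "loose index scan"), which can avoid reading
-- every row when data_key is indexed.
WITH RECURSIVE t AS (
  (
    SELECT data_key
    FROM telemetries
    WHERE server_id IN ($1, $2, $3)
    ORDER BY data_key
    LIMIT 1
  )
  UNION ALL
  SELECT (
    SELECT data_key
    FROM telemetries
    WHERE
      data_key > t.data_key AND
      server_id IN ($1, $2, $3)
    ORDER BY data_key
    LIMIT 1
  )
  FROM t
  WHERE t.data_key IS NOT NULL
)
SELECT data_key
FROM t
WHERE data_key IS NOT NULL;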

Chapter 6: Demo Time!

Chapter 7

A Few Things That I Learned

 

  • Rubyists can come up to speed on Crystal quickly!
  • Crystal is NOT Ruby!
  • I have not had any _Crystal_ problems; there is no reason to fear language immaturity at this point.
  • Many external libraries are immature, or abandoned -- due diligence is required there!

Chapter 8: Useful(?) Spin Off

SplayTreeMap.cr

 

https://github.com/wyhaines/splay_tree_map.cr

 

A Splay Tree is a self-adjusting binary search tree: each access moves the accessed node to the root, so the most accessed nodes tend to stay close to the root. That makes it useful in caches, where the most accessed items should be the fastest to reach.

My version implements leaf pruning, which tends to remove the less commonly accessed data from the tree.

Chapter 9: What is Next?

 

  1. Finish Command Handling
  2. Finish Remote Agent Management from within the UI
  3. Notifications!
  4. There are a LOT of small fixes and enhancements waiting in the GitHub issue tracker for each repository.
  5. Client libraries that can send metrics and events and logs of interest directly to an agent.
  6. So much more....

 

 

If you are interested, there are a lot of places where I could use help.

Chapter 10: Questions?

Thank You for Listening!