Building an Observability Platform
With Crystal
Kirk Haines
wyhaines@gmail.com
@wyhaines
"Building" is a continuing action
Minion is far from finished
(but you can help; it's fully open source)
But now, I have a story to tell...
Chapter 1
A New Job
Week 1; a Hackathon
Hack a POC product and pitch it.
The company might back it with resources to build it.
A New and Former Coworker
J. Austin had a product vision...
Build the tool that he always wanted when doing professional services and devops work
A Weekend With React
and Material UI
I don't have any images of the original mockup.
(but it was pretty....)
The Basic Concept
- Collect Logs
- Collect System Telemetry
- Pair Aggregation and Search of Collected Data with...
- Remote Command Execution that is Logged/Auditable/Limitable
- Easily Defined Notifications
- Keep It Light
- Make it Flexible
The Demo Was Well Received!
Next.... Go forth and build something real!
A practical use case to keep in mind -- COVID-19 testing appliances that need monitoring of custom hardware, whose data gathering must survive intermittent internet connection failures, and that allow remote management without providing full remote shell access.
Chapter 2
A Design, A Plan, And Crystal
The Moving Parts
An Agent
- Collect logs/events/telemetry and stream it back to the service.
- Durable to network failures -- attempt to minimize data loss.
- Responsible for Command Execution.
- Bidirectional Communication allows for management of agent configuration from a central UI.
- Small footprint.
- Easily Installed.
The Streamserver
- Receives data from many Agents.
- Flexible handling of data destinations -- arbitrary databases, ElasticSearch, flat files, proxied elsewhere.
- Resource efficient.
The API Server
- HTTP Interface
- Intended as a backend for both web and CLI tooling
Web UI
- Easy to Use
- Not Too Ugly, please
As coincidence would have it...
In 2007, I wrote Analogger, a fast, asynchronous logging service, in Ruby.
It is stable, and it handles a very high number of concurrent connections and messages per second. Its core capabilities overlap heavily with the core capabilities needed by Minion.
Maybe it could be repurposed?
Ruby, for the Agent, has some drawbacks:
- The Deployment Story isn't great
- The RAM usage is difficult to keep really low
Ruby, for the Streamserver, was probably fine:
- Analogger servers have been used in production for 13 years, handling billions of messages, so the Minion Streamserver, expanded from the same codebase, should be fine.
Along Comes Crystal (and Go?)
I first learned about Crystal at RubyKaigi 2015
I started using it in April. I loved it. TL;DR:
- Lovely Ruby ergonomics for the programmer
- Blazing fast execution speed
- Strong typing + type inference helped me find bugs; I liked it
- Strong deployment story because the language is compiled
I had learned Crystal by translating old Ruby Code to it.
I decided to translate Analogger, as that would give a huge head start to the project.
Austin was going to write the Agent, and he wanted to use Go, since he was learning it.
Fast Forward....
StreamServer worked, written in Crystal.
Agent lagged behind; writing it from scratch in Go just bogged down.
I had already written an Agent skeleton in Crystal for testing the StreamServer anyway.
So let's just use that!
Chapter 3: The Agent
https://github.com/joshsoftware/minion-agent
- Small static size -- 3.5M unstripped binary size
- Small dynamic size -- Running executable is generally < 10M
- Deployment just requires a binary compiled for the target architecture -- pretty typical
- Durable to network failure!
If the streamserver disappears, the agent can be set up to cache communications locally and spool them back to the streamserver once it can be contacted again, minimizing data loss. This feature was inherited from Analogger.
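The spooling idea can be sketched in a few lines of Crystal. This is a minimal illustration, not the Minion agent's code: the `SpoolingSender` class, its `send` callback, and the in-memory `Deque` are all assumptions made for the example (a real agent would presumably spool to local storage).

```crystal
# Sketch only: an in-memory spool. Every name here is hypothetical.
class SpoolingSender
  getter spool = Deque(String).new

  # `send` must return true when the server accepted the message.
  def initialize(@send : Proc(String, Bool))
  end

  # Queue the message, then try to drain the queue, oldest first.
  def push(message : String) : Nil
    @spool << message
    flush
  end

  # Stop at the first failure so nothing is lost or reordered.
  def flush : Nil
    while message = @spool.first?
      break unless @send.call(message)
      @spool.shift
    end
  end
end

# Simulated server: accepts messages only while `online` is true.
online = false
received = [] of String
sender = SpoolingSender.new(->(message : String) {
  received << message if online
  online
})

sender.push("log line 1") # server unreachable, message is spooled
sender.push("log line 2") # still spooled
online = true
sender.push("log line 3") # server is back, the whole spool drains in order
puts received.size
```

Stopping at the first failed send keeps the messages in their original order, at the cost of holding everything behind a message that cannot be delivered.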
Chapter 4: The StreamServer
https://github.com/joshsoftware/minion-streamserver
- Small static size -- 7M unstripped binary size
- Very fast -- largely limited by IO speeds
That Third Piece -- the API Server
(Chapter 5)
https://github.com/joshsoftware/minion-api
"No plan of operations reaches with any certainty beyond the first encounter with the enemy's main force." -- Helmuth von Moltke
- A Rails developer built the initial API server as a very well formed, very standard Rails API server
- When I picked up what he had done, his work was great, but I was frustrated
- Even when doing almost nothing, a single instance of the API server used > 200M of RAM
- It just felt heavy for something that was just a simple API server
Everything Else was in Crystal, so Why Not?
But...how to implement it?
@blacksmoke16 was talking about Athena on the Crystal Gitter
I checked it out
Things I liked:
- Great Docs! -- https://athenaframework.org/
- Easy to Understand; Simple to Use
- More than Fast Enough
- Seemed perfect for building APIs
module MinionAPI
  class AuthController < ART::Controller
    @[ART::Get("/api/v1/auth/")]
    def index : String
      "TODO: Return appropriate top level response."
    end

    @[ART::QueryParam("email")]
    @[ART::QueryParam("password")]
    @[ART::Get("/api/v1/auth/signin")]
    def signin(email : String = "", password : String = "") : ART::Response
      signin_impl(email, password)
    end

    @[ART::Post("/api/v1/auth/signin")]
    def signin(request : HTTP::Request) : ART::Response
      raise ART::Exceptions::BadRequest.new "Missing request body." unless body = request.body
      data = JSON.parse(body.gets_to_end)
      handle_invalid_auth_credentials unless email = data["email"]?
      handle_invalid_auth_credentials unless password = data["password"]?
      signin_impl(email.not_nil!.as_s, password.not_nil!.as_s)
    end

    # REDACTED
  end
end
(Athena Controller Snippet)
- I reimplemented the API server in Crystal
- Demand Driven Development -- API features got built as the UI features that needed them were implemented
- No ORM - it just isn't needed here
- I kept the Rails Migrations!
WITH RECURSIVE t AS (
  (
    SELECT data_key
    FROM telemetries
    WHERE server_id IN ($1, $2, $3)
    ORDER BY data_key
    LIMIT 1
  )
  UNION ALL
  SELECT (
    SELECT data_key
    FROM telemetries
    WHERE
      data_key > t.data_key AND
      server_id IN ($1, $2, $3)
    ORDER BY data_key
    LIMIT 1
  )
  FROM t
  WHERE t.data_key IS NOT NULL
)
SELECT data_key
FROM t
WHERE data_key IS NOT NULL;
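The query above returns the distinct data_key values for a set of servers by repeatedly jumping to the next larger key, a pattern often called a loose index scan, instead of reading every row. The same stepping logic can be sketched in Crystal. This is an illustration only: the sorted array stands in for an index on data_key, and `distinct_keys` is a hypothetical helper, not part of the Minion API server.

```crystal
# Sketch only: a sorted array plays the role of the index on data_key.
def distinct_keys(sorted_keys : Array(String)) : Array(String)
  result = [] of String
  current = sorted_keys.first? # like the first SELECT ... ORDER BY ... LIMIT 1
  while key = current
    result << key
    # Like the recursive step: jump to the first entry greater than `key`.
    # The binary search stands in for an index probe.
    current = sorted_keys.bsearch { |candidate| candidate > key }
  end
  result
end

keys = ["cpu", "cpu", "disk", "load", "load", "load", "mem"]
puts distinct_keys(keys).join(", ")
```

Each step costs one lookup per distinct key rather than one per row, which is why the approach pays off when a table holds many rows but few distinct keys.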
Chapter 6: Demo Time!
Chapter 7
A Few Things That I Learned
- Rubyists can come up to speed on Crystal quickly!
- Crystal is NOT Ruby!
- I have not had any _Crystal_ problems; there is no reason to fear language immaturity at this point.
- Many external libraries are immature, or abandoned -- due diligence is required there!
Chapter 8: Useful(?) Spin Off
SplayTreeMap.cr
https://github.com/wyhaines/splay_tree_map.cr
A Splay Tree is a self-adjusting binary search tree: each access moves the accessed node to the root, so the most frequently accessed nodes tend to stay close to it. That makes it useful in caches, where the most accessed items should be the fastest to reach.
My version implements leaf pruning, which tends to remove the less commonly accessed data from the tree.
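To make the "accessed nodes move toward the root" behaviour concrete, here is a bare-bones sketch of the splay operation in Crystal. It is a textbook-style recursive splay written for illustration; it is not the SplayTreeMap.cr API, it does not implement leaf pruning, and `Node`, `insert`, and `splay` are names assumed for the example.

```crystal
# Sketch only: a minimal splay operation, not the SplayTreeMap.cr implementation.
class Node
  property key : Int32
  property left : Node?
  property right : Node?

  def initialize(@key)
  end
end

def rotate_right(node : Node) : Node
  pivot = node.left.not_nil!
  node.left = pivot.right
  pivot.right = node
  pivot
end

def rotate_left(node : Node) : Node
  pivot = node.right.not_nil!
  node.right = pivot.left
  pivot.left = node
  pivot
end

# Moves `key` (or the last node visited while searching for it) to the root.
def splay(root : Node?, key : Int32) : Node?
  return nil if root.nil?
  return root if root.key == key

  if key < root.key
    left = root.left
    return root if left.nil?
    if key < left.key # zig-zig
      left.left = splay(left.left, key)
      root = rotate_right(root)
    elsif key > left.key # zig-zag
      left.right = splay(left.right, key)
      root.left = rotate_left(left) if left.right
    end
    root.left ? rotate_right(root) : root
  else
    right = root.right
    return root if right.nil?
    if key > right.key # zig-zig
      right.right = splay(right.right, key)
      root = rotate_left(root)
    elsif key < right.key # zig-zag
      right.left = splay(right.left, key)
      root.right = rotate_right(right) if right.left
    end
    root.right ? rotate_left(root) : root
  end
end

# Plain binary search tree insert, used only to build a starting tree.
def insert(root : Node?, key : Int32) : Node
  return Node.new(key) if root.nil?
  if key < root.key
    root.left = insert(root.left, key)
  elsif key > root.key
    root.right = insert(root.right, key)
  end
  root
end

root : Node? = nil
[50, 30, 70, 20, 40].each { |key| root = insert(root, key) }
root = splay(root, 20) # accessing 20 brings it to the root
puts root.try(&.key)
```

A lookup in such a tree is a splay followed by a check of the root, so repeated lookups of the same key become cheap.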
Chapter 9: What is Next?
- Finish Command Handling
- Finish Remote Agent Management from within the UI
- Notifications!
- There are a LOT of small fixes and enhancements in the GitHub issue tracker for each repository.
- Client libraries that can send metrics and events and logs of interest directly to an agent.
- So much more....
If you are interested, there are a lot of places where I could use help.
Chapter 10: Questions?
Thank You for Listening!
Building an Observability Platform With Crystal
By wyhaines
This year I built an observability platform with Crystal, having never done a project with it larger than some dice rolling simulation tools. Not including the React front end, there are about 7500 lines of Crystal between code and specs, and growing, that implement everything from the API server for the React front end to the endpoints that receive data from the remote agents. It's fast. It's stable. It's capable. It's easy to work with. It doesn't use a lot of RAM. And it has been really enjoyable to build all of this in Crystal while simultaneously learning the language.