Metric Centric
problem space
solution space
metrics
a measure
at a point in time
of a thing (with a name)
a measure is a value
value is a number
an integer or floating point value
a point in time is a timestamp
measures are aggregated to a time interval
three basic types of measures
counters
timings
gauges
different types exist because we want different aggregate functions
counter
aggregate function is SUM
timing
aggregate function is AVERAGE (aka MEAN)
also SUM, LOWER, UPPER, XPERCENTILE (just for fun)
gauge
aggregate function is LAST
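The mapping from measure type to aggregate function can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the names `AGGREGATES`, `aggregate` and `timing_summary` are hypothetical, and the percentile uses the simple nearest-rank method as an assumption.

```python
import math
from statistics import mean

# Hypothetical mapping from measure type to its aggregate function.
AGGREGATES = {
    "counter": sum,                      # SUM
    "timing": mean,                      # AVERAGE (aka MEAN)
    "gauge": lambda values: values[-1],  # LAST
}


def aggregate(measure_type, values):
    """Collapse all values seen in one time interval into a single value."""
    return AGGREGATES[measure_type](values)


def timing_summary(values, percentile=90):
    """Extra timing aggregates: SUM, LOWER, UPPER, MEAN, nearest-rank percentile."""
    ordered = sorted(values)
    # Nearest-rank: smallest value with at least `percentile` % of the data at or below it.
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return {
        "sum": sum(ordered),
        "lower": ordered[0],
        "upper": ordered[-1],
        "mean": mean(ordered),
        "percentile": ordered[rank - 1],
    }
```

The values passed in are assumed to be in arrival order, which is what makes "last" meaningful for a gauge.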
time interval
is a bucket
is a timeslice
is a step
time series
a range of time intervals
successive points in time, spaced at a uniform time interval
a sequence of buckets
a sequence of timeslices
a sequence of data points
http://en.wikipedia.org/wiki/Time_series
a time series has a resolution
sharpest resolution is your smallest time interval
dullest resolution is whatever you want to further aggregate to
time series data is stored in a time series database
http://en.wikipedia.org/wiki/Time_series_database
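Bucketing measures into time intervals, and then re-aggregating to a duller resolution, might look like the sketch below. It assumes timestamps are integer seconds and that the measure is a counter (so SUM is the aggregate); `bucket_start` and `to_series` are hypothetical names.

```python
from collections import defaultdict


def bucket_start(timestamp, interval):
    """Return the start of the time interval (bucket) a timestamp falls into."""
    return timestamp - (timestamp % interval)


def to_series(measures, interval, aggregate=sum):
    """Aggregate (timestamp, value) pairs into a list of (bucket_start, value).

    Buckets with no measures are simply absent; a real time series database
    would typically fill them so the series stays uniformly spaced.
    """
    buckets = defaultdict(list)
    for timestamp, value in measures:
        buckets[bucket_start(timestamp, interval)].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]
```

Because the output has the same shape as the input, the sharpest series can be fed back in with a larger interval, for example `to_series(to_series(raw, 10), 60)` to go from 10 second to 60 second resolution.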
measures are named
a neat convention is namespacing
datacenterA.roleB.poolC.machineD.cpuusage
a hierarchical taxonomy
or tagging
cpuusage datacenter=A role=B pool=C machine=D
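The two naming conventions can be compared with a small parsing sketch. The function names are hypothetical, and the tagged form is assumed to be whitespace-separated `key=value` pairs after the metric name, as in the example above.

```python
def parse_namespaced(name):
    """Split a dotted name into (metric, path); the last segment is the metric."""
    *path, metric = name.split(".")
    return metric, path


def parse_tagged(line):
    """Split 'metric key=value key=value ...' into (metric, tags)."""
    metric, *pairs = line.split()
    tags = dict(pair.split("=", 1) for pair in pairs)
    return metric, tags
```

The namespaced form fixes the order of the hierarchy; the tagged form leaves the dimensions unordered, so they can be queried in any combination.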
Metrics toolchain
emit
collect
filter, route
aggregate
persist
expose
emit
collect
http://collectd.org/
filter, route
aggregate
persist
expose
classical toolspace
(nagios, zabbix, zenoss, opennms, cacti)
mixed all these concerns
modern toolspace
prefers single responsibility
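One way to picture the single-responsibility toolchain is as a chain of small functions, one per stage. This is only a sketch under assumptions: measures are `(name, timestamp, value)` tuples, everything is treated as a counter, the "database" is a plain dict, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def collect(*sources):
    """Collect: merge measures emitted by several sources into one stream."""
    return chain.from_iterable(sources)


def route(measures, prefix):
    """Filter/route: keep only measures whose name starts with the prefix."""
    return (m for m in measures if m[0].startswith(prefix))


def aggregate(measures, interval):
    """Aggregate: sum values per (name, time interval)."""
    totals = defaultdict(int)
    for name, timestamp, value in measures:
        totals[(name, timestamp - timestamp % interval)] += value
    return totals


def persist(totals, store):
    """Persist: write aggregated values into the store (a dict stands in for a DB)."""
    for (name, bucket), value in totals.items():
        store.setdefault(name, {})[bucket] = value
    return store


def expose(store, name):
    """Expose: return the stored series for one name, ordered by time."""
    return sorted(store.get(name, {}).items())


def run_pipeline(sources, prefix, interval, store):
    """Wire the stages together; each stage can be swapped independently."""
    return persist(aggregate(route(collect(*sources), prefix), interval), store)
```

Each stage knows nothing about the others beyond the shape of the data it receives, which is the property the classical all-in-one tools lack.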
modelling metrics
a metric is just a count of an action
an action is
a command
a query
an operation
a state change event
in well modeled code
you get metrics for free
spaghetti code -> no discernible metrics
factored class with method for command/query -> count method calls
in well modeled domains
you get meaningful metrics for free
domain model that emits domain events -> count events
in evented systems
a metric is just a count of an event occurring
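"Count method calls" can be made concrete with a decorator. The sketch below is an assumption about how one might do it, not a reference to any library: `measured`, `CALLS`, `TIMINGS` and the `Basket` class are all hypothetical.

```python
import functools
import time
from collections import Counter, defaultdict

CALLS = Counter()            # counter: calls per action name
TIMINGS = defaultdict(list)  # timing: seconds per call, per action name


def measured(func):
    """Count every call to func and record how long it took."""
    name = func.__qualname__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even when the action raises.
            CALLS[name] += 1
            TIMINGS[name].append(time.perf_counter() - started)

    return wrapper


class Basket:
    """Hypothetical domain class with one command and one query."""

    def __init__(self):
        self.items = []

    @measured
    def add_item(self, sku):  # command
        self.items.append(sku)

    @measured
    def item_count(self):     # query
        return len(self.items)
```

Because the class is already factored into commands and queries, the metric names (`Basket.add_item`, `Basket.item_count`) fall out of the code without extra naming work.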
in well modeled systems
you can push your application down layers
to get cheaper metrics for free
domain model that sits at REST API endpoint -> count different actions (GET), events (POST,PUT,DELETE); success (HTTP status code); processing time (HTTP response time)
reuse HTTP middleware
domain model that sits at 0mq endpoint, 1 socket per action
reuse TCP middleware
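For the HTTP case, reusing middleware could look like the WSGI wrapper below. It is a sketch under assumptions: the class name and attributes are hypothetical, the wrapped app is assumed to call `start_response` before returning, and the timing covers only the call into the app, not streaming of the response body.

```python
import time
from collections import Counter


class MetricsMiddleware:
    """WSGI middleware that counts requests by (method, status) and times them."""

    def __init__(self, app):
        self.app = app
        self.requests = Counter()  # (HTTP method, status code) -> count
        self.durations = []        # processing time in seconds, one per request

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "UNKNOWN")
        seen = {}

        def recording_start_response(status, headers, exc_info=None):
            # The status line looks like "200 OK"; keep only the code.
            seen["status"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        started = time.perf_counter()
        try:
            return self.app(environ, recording_start_response)
        finally:
            self.durations.append(time.perf_counter() - started)
            # "unknown" covers apps that raise or delay start_response.
            self.requests[(method, seen.get("status", "unknown"))] += 1
```

The application itself contains no metric code; wrapping it as `MetricsMiddleware(app)` is enough to get action counts, success rates and processing times.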
sources of metrics
levels
network
machine
appliance
protocol
middleware
framework
syslog
application -> action -> syslog event -> collector -> parser -> extractor -> filter -> router -> persister
database
application -> application state in DB -> query DB for state at current point in time
application -> application state, state transitions in DB -> query DB for state at any point in time
metrics pipeline
application -> action -> metric emitted -> metric pipeline
event pipeline
application -> event -> metric
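The difference between the two database variants above is whether state transitions are kept. If they are, state at any point in time is one query away. The sketch below uses an in-memory SQLite database; the table layout, column names and function names are assumptions made for illustration.

```python
import sqlite3


def open_store():
    """Create an in-memory database holding state transitions."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE transitions (entity TEXT, at INTEGER, state TEXT)")
    return db


def record(db, entity, at, state):
    """Append a state transition instead of overwriting the current state."""
    db.execute("INSERT INTO transitions VALUES (?, ?, ?)", (entity, at, state))


def state_at(db, entity, at):
    """Return the state an entity was in at the given time, or None if unknown."""
    row = db.execute(
        "SELECT state FROM transitions "
        "WHERE entity = ? AND at <= ? ORDER BY at DESC LIMIT 1",
        (entity, at),
    ).fetchone()
    return row[0] if row else None
```

A store that only keeps the current state can answer `state_at` for "now" and nothing else, which is why metrics derived from it have to be sampled as time passes rather than computed after the fact.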
lessons learned
retrofitting metrics is crazy plumbing
prefer to build a well modeled app
otherwise you're wasting time on custom one-off health monitors with custom one-off plumbing
the lower the level, the cheaper and faster the metrics are
push your app to the lowest possible level
domain specific actions are the most meaningful metrics
focus on modelling domain commands, events, queries
technical metrics can be misleading
focus on non-technical metrics in the form of user stories that capture UX or SLA
use technical metrics only to supplement non-technical metrics
an acceptance test represents a system action
derive metrics from acceptance tests; metrics become first class citizens
correlate technical metrics to non-technical ones, in order to guide triage
metrics carry implicit requirements about data age, data distance, data size
structure your metrics as user stories to make these requirements (and probably implementation options) explicit
metrics are expensive: computational complexity, throughput, size
be aware of the costs; make these guide your decisions (cloud is not free)
you don't know you want real time until you do
aim for real time data from the start; not the other way around (because retrofitting it is guaranteed rework)
always keep the problem space in mind
align your metrics model to support future phases
metrics can be deterministic or non-deterministic
depending on when and how data is captured
real time vs non real time
pre-compute vs post-compute
run time eval vs offline eval
different components require different monitoring strategies
distinguish between component level, system of components (SOA service), system of systems (SOA services)
metrics are just a readmodel
metrics are just a subset of general BI
fixed facts and dimensions; not free form
resolution is important
all metrics have a couple of optimal resolutions
all metrics have a lot of non-optimal resolutions
metric discovery is analysis and orientation
develop a pipeline based semi automatic toolchain
alerts
triggers
later
By Martin Suchanek