Metric Centric

Metric Oriented Architecture

problem space

collect
analyze
orient
inform
visualize
predict
act

solution space

metrics
triggers
alerts

metrics

a measure

at a point in time

of a thing (with a name)

a measure is a value
value is a number
an integer or floating point value


a point in time is a timestamp

measures are aggregated to a time interval
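
a minimal sketch of the definition above: a measure is a named value at a timestamp, and aggregation starts by flooring that timestamp to its time interval. The names `Measure` and `bucket_start` are made up for illustration; timestamps are assumed to be seconds since the epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str         # the thing being measured
    value: float      # an integer or floating point value
    timestamp: float  # the point in time, in seconds since the epoch


def bucket_start(timestamp: float, step: int) -> int:
    """Floor a timestamp to the start of the time interval it falls in."""
    return int(timestamp // step) * step


m = Measure("cpuusage", 0.75, 1000007.0)
print(bucket_start(m.timestamp, 10))  # 1000000
```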

three basic types of measures


counters

timings

gauges


different types exist because we want different aggregate functions


counter


aggregate function is SUM

timing


aggregate function is AVERAGE (aka MEAN)

also SUM, LOWER, UPPER, XPERCENTILE (just for fun)


gauge


aggregate function is LAST
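
the three aggregate functions can be sketched as a lookup from measure type to function. This is an illustration, not how any particular tool does it; `aggregate` and `timing_summary` are hypothetical names, and the percentile uses the simple nearest-rank rule, which may differ from what your aggregator computes.

```python
import math
from statistics import mean

# One default aggregate function per measure type.
AGGREGATES = {
    "counter": sum,                      # SUM
    "timing": mean,                      # AVERAGE (aka MEAN)
    "gauge": lambda values: values[-1],  # LAST
}


def aggregate(kind, values):
    """Collapse all values seen in one time interval into a single value."""
    return AGGREGATES[kind](values)


def timing_summary(values, percentile):
    """The extra timing aggregates: SUM, LOWER, UPPER and a percentile."""
    ordered = sorted(values)
    rank = math.ceil(percentile / 100 * len(ordered))  # nearest-rank rule
    return {
        "sum": sum(ordered),
        "mean": mean(ordered),
        "lower": ordered[0],
        "upper": ordered[-1],
        "percentile": ordered[max(rank, 1) - 1],
    }
```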

time interval


is a bucket

is a timeslice

is a step

time series


a range of time intervals

successive points in time, spaced at a uniform time interval

a sequence of buckets

a sequence of timeslices

a sequence of data points


http://en.wikipedia.org/wiki/Time_series

a time series has a resolution


sharpest resolution is your smallest time interval


dullest resolution is whatever you want to further aggregate to
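
going from a sharp resolution to a duller one is just re-bucketing with a bigger step. A rough sketch, assuming integer timestamps in seconds and a hypothetical `downsample` helper; the aggregate function passed in should match the measure type.

```python
from collections import defaultdict


def downsample(points, step, aggregate):
    """Re-bucket (timestamp, value) points to a coarser time interval.

    Returns a list of (bucket_start, aggregated_value), ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % step].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# A counter stored at 10 second resolution, aggregated to 60 seconds.
ten_second_series = [(0, 1), (10, 2), (20, 3), (60, 4), (70, 5)]
print(downsample(ten_second_series, 60, sum))  # [(0, 6), (60, 9)]
```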

time series data is stored in a time series database


http://en.wikipedia.org/wiki/Time_series_database

measures are named


a neat convention is namespacing

		datacenterA.roleB.poolC.machineD.cpuusage

a hierarchical taxonomy


or tagging

		cpuusage datacenter=A role=B pool=C machine=D
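
the two conventions carry the same information; a tagged name can be flattened into a namespaced one once you pick an order for the hierarchy. A small sketch with made-up helper names (`parse_tagged`, `to_namespace`); the line format is assumed from the examples above, not taken from any specific tool.

```python
def parse_tagged(line):
    """Split 'name key=value key=value ...' into (name, tags)."""
    name, *pairs = line.split()
    tags = dict(pair.split("=", 1) for pair in pairs)
    return name, tags


def to_namespace(name, tags, order):
    """Flatten tags into a dotted hierarchical name, in the given key order."""
    parts = [f"{key}{tags[key]}" for key in order]
    return ".".join(parts + [name])


name, tags = parse_tagged("cpuusage datacenter=A role=B pool=C machine=D")
print(to_namespace(name, tags, ["datacenter", "role", "pool", "machine"]))
# datacenterA.roleB.poolC.machineD.cpuusage
```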
 

Metrics toolchain

emit

collect

filter, route

aggregate

persist

expose

emit


snmp
wmi
logs
(anything really)

collect


http://collectd.org/

filter, route


http://heka-docs.readthedocs.org/
http://www.logstash.net/

aggregate


https://github.com/etsy/statsd/

http://graphite.wikidot.com/ (carbon)

persist


http://opentsdb.net/ (based on HBase)
http://en.wikipedia.org/wiki/RRDtool
http://graphite.wikidot.com/ (whisper)

expose


http://graphite.wikidot.com/ (graphite-web)

classical toolspace 

(nagios, zabbix, zenoss, opennms, cacti) 

mixed all these concerns


modern toolspace 

prefers single responsibility

modelling metrics

a metric is just a count of an action

an action is


a command

a query

an operation

a state change event

in well modeled code

you get metrics for free


spaghetti code -> no discernible metrics


factored class with method for command/query -> count method calls
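
one way to "count method calls" is a decorator, so the class itself stays free of metrics plumbing. A sketch only: `counted`, `CALLS` and the `Basket` class are invented for the example, and a real setup would emit to a metrics pipeline rather than an in-process counter.

```python
from collections import Counter
from functools import wraps

CALLS = Counter()  # in-process stand-in for a metrics pipeline


def counted(func):
    """Count every call to the wrapped command or query."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        CALLS[func.__qualname__] += 1
        return func(*args, **kwargs)
    return wrapper


class Basket:
    def __init__(self):
        self.items = []

    @counted
    def add_item(self, item):  # a command
        self.items.append(item)

    @counted
    def item_count(self):      # a query
        return len(self.items)


basket = Basket()
basket.add_item("book")
basket.add_item("pen")
basket.item_count()
print(CALLS)
```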


in well modeled domains

you get meaningful metrics for free


domain model that emits domain events -> count events


in evented systems

a metric is just a count of an event occurring

in well modeled systems

you can push your application down layers

to get cheaper metrics for free


domain model that sits at REST API endpoint -> count different actions (GET), events (POST,PUT,DELETE); success (HTTP status code); processing time (HTTP response time)

reuse HTTP middleware
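
as an illustration of HTTP middleware doing the counting, here is a small WSGI wrapper that records method, status code and processing time without touching the application. `MetricsMiddleware` and the metric naming scheme are assumptions for the example; the timing covers the application call only, not streaming the response body.

```python
import time
from collections import Counter


class MetricsMiddleware:
    """WSGI middleware that counts requests by method and status code."""

    def __init__(self, app):
        self.app = app
        self.counts = Counter()
        self.timings = []  # processing time in seconds, one entry per request

    def __call__(self, environ, start_response):
        method = environ["REQUEST_METHOD"]
        started = time.perf_counter()

        def recording_start_response(status, headers, exc_info=None):
            code = status.split(" ", 1)[0]  # "200 OK" -> "200"
            self.counts[f"http.{method}.{code}"] += 1
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            self.timings.append(time.perf_counter() - started)


def hello_app(environ, start_response):
    # A trivial application standing in for the domain model.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


wrapped = MetricsMiddleware(hello_app)
```

any WSGI server can serve `wrapped` in place of `hello_app`.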

domain model that sits at 0mq endpoint, 1 socket per action

reuse TCP middleware

sources of metrics


ways to emit metrics

https://wiki.verrus.net/display/~msuchanek/Monitoring+Strategy+-+Metrics

levels


metrics can come from different levels

each level has different combination of data age (how old is the data), distance (how quickly can the data be retrieved), size (how big is the data set)

some levels are technical (e.g. plumbing: the number of HTTP requests),
others are non-technical (e.g. the number of user logins)

network


network monitoring tools
usually SNMP based

machine


machine/host monitoring tools
SNMP, WMI, or custom agents

appliance


machine/host monitoring
black box application metrics

protocol


network or machine/host monitoring tools

middleware


(e.g. HTTP server)
syslog or SNMP/WMI

framework


(e.g. ASP.NET, .NET)
Windows Event Log

syslog


application -> action -> syslog event -> collector -> parser -> extractor -> filter -> router -> persister 

database


application -> application state in DB -> query DB for state at current point in time
 

application -> application state, state transitions in DB -> query DB for state at any point in time
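
storing state transitions rather than only the current state is what makes "state at any point in time" queryable. A sketch using an in-memory SQLite database; the `transitions` table, its columns and the sample rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transitions (entity TEXT, state TEXT, at INTEGER)")
db.executemany(
    "INSERT INTO transitions VALUES (?, ?, ?)",
    [
        ("order-1", "placed", 100),
        ("order-1", "paid", 160),
        ("order-1", "shipped", 400),
    ],
)


def state_at(entity, timestamp):
    """Return the most recent state at or before the timestamp, or None."""
    row = db.execute(
        "SELECT state FROM transitions"
        " WHERE entity = ? AND at <= ?"
        " ORDER BY at DESC LIMIT 1",
        (entity, timestamp),
    ).fetchone()
    return row[0] if row else None


print(state_at("order-1", 200))  # paid
```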
 

metrics pipeline


usually fire and forget, UDP

application -> action -> metric emitted -> metric pipeline 
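
fire and forget over UDP can be sketched in a few lines. The payload uses the statsd line format (`name:value|type`, with `c` for counters, `ms` for timings, `g` for gauges); the host, the port 8125 default and the helper names `format_metric` and `emit` are assumptions for the example.

```python
import socket


def format_metric(name, value, kind):
    """Build a statsd-style line, e.g. b'app.login:1|c'."""
    return f"{name}:{value}|{kind}".encode("ascii")


def emit(name, value, kind="c", host="127.0.0.1", port=8125):
    """Send one metric over UDP and never let a failure reach the caller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, kind), (host, port))
    except OSError:
        pass  # fire and forget: losing a metric must not break the action
    finally:
        sock.close()


emit("app.login", 1)             # counter
emit("app.login.time", 320, "ms")  # timing
```

nothing needs to be listening on the other end; the application carries on either way.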

event pipeline


application -> event -> metric 
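
in an evented system the metric is a side effect of publishing: one extra subscriber counts every event. A toy in-process sketch; `EventBus`, `count_event` and the event names are hypothetical stand-ins for whatever event pipeline is in place.

```python
from collections import Counter


class EventBus:
    """Minimal publish/subscribe bus; every subscriber sees every event."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event_name, payload=None):
        for handler in self.subscribers:
            handler(event_name, payload)


event_counts = Counter()


def count_event(event_name, payload):
    # The metric is just a count of the event occurring.
    event_counts[event_name] += 1


bus = EventBus()
bus.subscribe(count_event)
bus.publish("UserLoggedIn", {"user": "alice"})
bus.publish("UserLoggedIn", {"user": "bob"})
bus.publish("OrderPlaced", {"order": 1})
print(event_counts)
```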

lessons learned

retrofitting metrics is crazy plumbing


prefer to build a well modelled app


otherwise you're wasting time on custom one-off health monitors with custom one-off plumbing


the lower the level, the cheaper and faster the metrics are


push your app to the lowest possible level

domain specific actions are the most meaningful metrics


focus on modelling domain commands, events, queries

technical metrics can be misleading


focus on non-technical metrics in the form of user stories that capture UX or SLA


use technical metrics only to supplement non-technical metrics

an acceptance test represents a system action


derive metrics from acceptance tests; metrics become first class citizens

correlate technical metrics to non-technical ones, in order to guide triage

metrics carry implicit requirements about data age, data distance, data size


structure your metrics as user stories to make these requirements (and probably implementation options) explicit

metrics are expensive: computational complexity, throughput, size


be aware of the costs; make these guide your decisions (cloud is not free)

you don't know you want real time until you do


aim for real time data from the start; going the other way around is guaranteed rework

always keep the problem space in mind


align your metrics model to support future phases

metrics can be deterministic or non-deterministic


depending on when and how data is captured


real time vs non real time

pre-compute vs post-compute

run time eval vs offline eval

different components require different monitoring strategies


distinguish between component level, system of components (SOA service), system of systems (SOA services)

metrics are just a readmodel

metrics are just a subset of general BI 


fixed facts and dimensions; not free form

resolution is important


all metrics have a couple of optimal resolutions


all metrics have a lot of non-optimal resolutions

metric discovery is analysis and orientation


develop a pipeline based semi automatic toolchain 

alerts

triggers

later


* event based time series databases
http://square.github.io/cube/

* correlation, conversation ids

Metric Centric

By Martin Suchanek
