Metric Centric

Metric Oriented Architecture

problem space

collect

analyze

orient

inform

visualize

predict

act

solution space

metrics

triggers

alerts

metrics

a measure

at a point in time

of a thing (with a name)

a measure is a value
value is a number
an integer or floating point value

a point in time is a timestamp

measures are aggregated to a time interval

three basic types of measures

counters

timings

gauges

different types exist because we want different aggregate functions

counter

aggregate function is SUM

timing

aggregate function is AVERAGE (aka MEAN)

also SUM, LOWER, UPPER, XPERCENTILE (just for fun)

gauge

aggregate function is LAST
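A minimal sketch of how the three measure types could be aggregated within one time interval. The function names (`aggregate`, `percentile`) and the choice of a nearest-rank 90th percentile are illustrative assumptions, not any particular tool's API.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(kind, values):
    """Collapse the raw values of one bucket into a single data point."""
    if kind == "counter":
        # counters add up
        return sum(values)
    if kind == "timing":
        # timings are summarised; the mean is the headline number
        return {
            "mean": mean(values),
            "sum": sum(values),
            "lower": min(values),
            "upper": max(values),
            "p90": percentile(values, 90),
        }
    if kind == "gauge":
        # gauges keep only the most recent reading
        return values[-1]
    raise ValueError(f"unknown measure type: {kind}")
```

The same raw values give different results depending on the declared type, which is why the type has to travel with the measure.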

time interval

is a bucket

is a timeslice

is a step

time series

a range of time intervals

successive points in time, spaced at a uniform time interval

a sequence of buckets

a sequence of timeslices

a sequence of data points

http://en.wikipedia.org/wiki/Time_series

a time series has a resolution

sharpest resolution is your smallest time interval

dullest resolution is whatever you want to further aggregate to
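As a rough illustration of buckets and resolution, the sketch below groups timestamped values into fixed-width intervals and then re-aggregates the result to a coarser step. The names `bucket_start` and `to_series` are hypothetical, timestamps are assumed to be integer seconds, and empty buckets are simply omitted rather than filled.

```python
from collections import defaultdict


def bucket_start(timestamp, step):
    """Start of the time interval (bucket) that a timestamp falls into."""
    return timestamp - (timestamp % step)


def to_series(points, step, aggregate=sum):
    """Aggregate (timestamp, value) pairs into buckets of width `step`."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[bucket_start(timestamp, step)].append(value)
    # sorted by bucket start; buckets with no data are left out
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1), (5, 2), (12, 3), (61, 4)]
fine = to_series(raw, step=10)     # sharpest resolution: 10 s buckets
coarse = to_series(fine, step=60)  # duller resolution: 60 s buckets
```

Re-aggregating with `sum` only makes sense for counters; a gauge would need "last" and a timing would need its raw values or a weighted mean.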

time series data is stored in a time series database

http://en.wikipedia.org/wiki/Time_series_database

measures are named

a neat convention is namespacing

		datacenterA.roleB.poolC.machineD.cpuusage

a hierarchical taxonomy

or tagging

		cpuusage datacenter=A role=B pool=C machine=D
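The two naming conventions carry the same information, so one can be mapped onto the other. A small sketch, assuming the position of each segment in the dotted name has a fixed meaning; `to_tags`, `to_dotted` and the `KEYS` list are made up for illustration.

```python
KEYS = ["datacenter", "role", "pool", "machine"]  # assumed meaning of each segment


def to_tags(dotted, keys=KEYS):
    """Split a namespaced name into (measure name, tag dictionary)."""
    *dimensions, name = dotted.split(".")
    if len(dimensions) != len(keys):
        raise ValueError(f"expected {len(keys)} segments before the measure name")
    return name, dict(zip(keys, dimensions))


def to_dotted(name, tags, keys=KEYS):
    """Rebuild the namespaced name from a measure name and its tags."""
    return ".".join([tags[key] for key in keys] + [name])


name, tags = to_tags("datacenterA.roleB.poolC.machineD.cpuusage")
```

The hierarchy fixes one order for the dimensions; tags leave the order open, which makes queries such as "all machines in role B" easier to express.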

Metrics toolchain

emit

collect

filter, route

aggregate

persist

expose

Emit

snmp

wmi

logs

(anything really)

collect

http://collectd.org/

filter, route

http://heka-docs.readthedocs.org/

http://www.logstash.net/

aggregate

https://github.com/etsy/statsd/

http://graphite.wikidot.com/ (carbon)

persist

http://opentsdb.net/ (based on HBase)

http://en.wikipedia.org/wiki/RRDtool

http://graphite.wikidot.com/ (whisper)

expose

http://graphite.wikidot.com/ (graphite-web)

classical toolspace

(nagios, zabbix, zenoss, opennms, cacti)

mixed all these concerns

modern toolspace

prefers single responsibility

modelling metrics

a metric is just a count of an action

an action is

a command

a query

an operation

a state change event

in well modeled code

you get metrics for free

spaghetti code -> no discernible metrics

factored class with method for command/query -> count method calls
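One way to count method calls on a factored class is a decorator. This is a sketch only: the `counted` decorator, the in-memory `CALLS` counter and the `Basket` class are hypothetical stand-ins for a real class and a real metrics client.

```python
from collections import Counter
from functools import wraps

CALLS = Counter()  # in-memory stand-in for a metrics client


def counted(func):
    """Count every call to the wrapped command or query."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        CALLS[func.__qualname__] += 1
        return func(*args, **kwargs)
    return wrapper


class Basket:
    def __init__(self):
        self.items = []

    @counted
    def add_item(self, sku):  # command
        self.items.append(sku)

    @counted
    def item_count(self):     # query
        return len(self.items)
```

Because each command and query is its own method, the metric names fall out of the code structure (`Basket.add_item`, `Basket.item_count`) with no extra modelling.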

in well modeled domains

you get meaningful metrics for free

domain model that emits domain events -> count events

in evented systems

a metric is just a count of an event occurring
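In an evented system the counting can live in a single subscriber. A minimal sketch, assuming a simple in-process event bus; `EventBus`, `OrderPlaced`, `OrderCancelled` and `count_event` are invented for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str


@dataclass(frozen=True)
class OrderCancelled:
    order_id: str


class EventBus:
    """Tiny synchronous publish/subscribe bus."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


event_counts = Counter()


def count_event(event):
    # the metric name is simply the event type
    event_counts[type(event).__name__] += 1


bus = EventBus()
bus.subscribe(count_event)
bus.publish(OrderPlaced("o1"))
bus.publish(OrderPlaced("o2"))
bus.publish(OrderCancelled("o1"))
```

The domain code never mentions metrics; it only publishes events it would publish anyway.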

in well modeled systems

you can push your application down layers

to get cheaper metrics for free

domain model that sits at REST API endpoint -> count different actions (GET), events (POST,PUT,DELETE); success (HTTP status code); processing time (HTTP response time)

reuse HTTP middleware
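A sketch of what reusing HTTP middleware might look like, using the standard WSGI calling convention. `MetricsMiddleware` is a hypothetical class; it counts requests per (method, status code) and records the time until the wrapped application returns, which excludes streaming of the response body.

```python
import time
from collections import Counter


class MetricsMiddleware:
    """WSGI middleware that counts requests and records handling time."""

    def __init__(self, app):
        self.app = app
        self.requests = Counter()  # (method, status code) -> count
        self.timings_ms = []       # one entry per request

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "UNKNOWN")
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # status looks like "200 OK"; keep only the numeric code
            captured["status"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        started = time.perf_counter()
        try:
            return self.app(environ, recording_start_response)
        finally:
            elapsed_ms = (time.perf_counter() - started) * 1000.0
            self.timings_ms.append(elapsed_ms)
            self.requests[(method, captured.get("status", "unknown"))] += 1
```

Wrapping an application is one line, `app = MetricsMiddleware(app)`, and every endpoint behind it gets action counts, success rates and timings without touching domain code.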

domain model that sits at 0mq endpoint, 1 socket per action

reuse TCP middleware

sources of metrics

ways to emit metrics

https://wiki.verrus.net/display/~msuchanek/Monitoring+Strategy+-+Metrics

levels

metrics can come from different levels

each level has a different combination of data age (how old is the data), distance (how quickly can the data be retrieved), size (how big is the data set)

some levels are technical (e.g. plumbing: the number of HTTP requests),

others are non technical (e.g. number of user logins)

network

network monitoring tools

usually SNMP based

machine

machine/host monitoring tools

SNMP, WMI, or custom agents

Appliance

machine/host monitoring

black box application metrics

protocol

network or machine/host monitoring tools

middleware

(e.g. HTTP server)

syslog or SNMP/WMI

framework

(e.g. ASP.NET, .NET)

Windows Event Log

syslog

application -> action -> syslog event -> collector -> parser -> extractor -> filter -> router -> persister
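The stages after the collector can be sketched as a chain of generators. The log line format, the regular expression and the function names below are assumptions made for illustration; real syslog messages would need a parser that matches their actual layout.

```python
import re
from collections import Counter

# assumed line format: "<host> <app>: action=<name> status=<result>"
LINE = re.compile(
    r"^(?P<host>\S+) (?P<app>\S+): action=(?P<action>\w+) status=(?P<status>\w+)$"
)


def parse(lines):
    """parser: turn raw lines into records, dropping lines that do not match."""
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            yield match.groupdict()


def extract(records):
    """extractor: turn each record into a (metric name, value) pair."""
    for record in records:
        yield ("{host}.{app}.{action}.{status}".format(**record), 1)


def keep(metrics, prefix):
    """filter/router: pass on only the metrics under a given namespace."""
    return (metric for metric in metrics if metric[0].startswith(prefix))


def persist(metrics, store):
    """persister: accumulate counts in a store (here a Counter)."""
    for name, value in metrics:
        store[name] += value
    return store


lines = [
    "web1 shop: action=checkout status=ok",
    "web1 shop: action=checkout status=failed",
    "web2 shop: action=login status=ok",
    "some unrelated noise",
]
store = persist(keep(extract(parse(lines)), "web1."), Counter())
```

Every stage here exists only because the metric was buried in a log line, which is the plumbing cost the pipeline above spells out.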

database

application -> application state in DB -> query DB for state at current point in time

application -> application state, state transitions in DB -> query DB for state at any point in time
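When state transitions are stored rather than only the current state, the state at any past moment can be reconstructed with a query. A sketch using an in-memory SQLite database; the `transitions` table, its columns and the sample rows are hypothetical.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding a few sample state transitions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transitions (order_id TEXT, state TEXT, ts INTEGER)")
    conn.executemany(
        "INSERT INTO transitions VALUES (?, ?, ?)",
        [
            ("o1", "placed", 10),
            ("o1", "shipped", 30),
            ("o2", "placed", 20),
            ("o2", "cancelled", 40),
        ],
    )
    return conn


def state_counts_at(conn, when):
    """Count orders per state as of time `when` (latest transition at or before it)."""
    rows = conn.execute(
        """
        SELECT t.state, COUNT(*)
        FROM transitions AS t
        JOIN (
            SELECT order_id, MAX(ts) AS ts
            FROM transitions
            WHERE ts <= ?
            GROUP BY order_id
        ) AS latest
          ON latest.order_id = t.order_id AND latest.ts = t.ts
        GROUP BY t.state
        """,
        (when,),
    ).fetchall()
    return dict(rows)
```

With only the current state stored, the same question can be answered for "now" but not for any earlier point in time.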

metrics pipeline

usually fire and forget, UDP

application -> action -> metric emitted -> metric pipeline
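A minimal fire-and-forget emitter, assuming a statsd-style listener on UDP port 8125 and the statsd line format (`name:value|c` for counters, `|ms` for timings, `|g` for gauges). The helper names `format_metric` and `emit` are illustrative.

```python
import socket

SUFFIX = {"counter": "c", "timing": "ms", "gauge": "g"}


def format_metric(name, value, kind):
    """Render one measure in the statsd line format."""
    return f"{name}:{value}|{SUFFIX[kind]}"


def emit(name, value, kind, host="127.0.0.1", port=8125):
    """Send one measure over UDP and ignore any failure (fire and forget)."""
    payload = format_metric(name, value, kind).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(payload, (host, port))
        except OSError:
            # a lost metric must never break the application
            pass


emit("shop.checkout", 1, "counter")
emit("shop.checkout.time", 320, "timing")
```

UDP means the application never waits for the metrics pipeline and never learns whether anyone was listening; that trade-off is what keeps emission cheap.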

event pipeline

application -> event -> metric

lessons learned

retrofitting metrics is crazy plumbing

prefer to build a well modelled app

otherwise you're wasting time on custom one-off health monitors with custom one-off plumbing

the lower the level, the cheaper and faster the metrics are

push your app to the lowest possible level

domain specific actions are the most meaningful metrics

focus on modelling domain commands, events, queries

technical metrics can be misleading

focus on non-technical metrics in the form of user stories that capture UX or SLA

use technical metrics only to supplement non-technical metrics

an acceptance test represents a system action

derive metrics from acceptance tests; metrics become first class citizens

correlate technical metrics to non-technical ones, in order to guide triage

metrics carry implicit requirements about data age, data distance, data size

structure your metrics as user stories to make these requirements (and probably implementation options) explicit

metrics are expensive: computational complexity, throughput, size

be aware of the costs; make these guide your decisions (cloud is not free)

you don't know you want real time until you do

aim for real time data from the start; retrofitting it later is guaranteed rework

always keep the problem space in mind

align your metrics model to support future phases

metrics can be deterministic or non-deterministic

depending on when and how data is captured

real time vs non real time

pre-compute vs post-compute

run time eval vs offline eval

different components require different monitoring strategies

distinguish between component level, system of components (SOA service), system of systems (SOA services)

metrics are just a readmodel

metrics are just a subset of general BI

fixed facts and dimensions; not free form

resolution is important

all metrics have a couple of optimal resolutions

all metrics have a lot of non-optimal resolutions

metric discovery is analysis and orientation

develop a pipeline-based, semi-automatic toolchain

alerts

triggers

later

* event based time series databases

http://square.github.io/cube/

* correlation, conversation ids

By Martin Suchanek

11 years ago