StatsCraft

Monitoring Conference

 

Agenda

  1. Understand the problem.
  2. Understand what monitoring is.
  3. Example use-case(s)
  4. A different approach
  5. Learn methodologies and tools

The Problem

Nir Cohen @ Gigaspaces

@thinkops

http://github.com/nir0s

We monitor because...

We want to satisfy the customer.

(make money?)

Still underrated...

  • Automated Resource Provisioning
  • Configuration Management
  • Automated Code Deployment
  • Continuous Whatever

Monitoring

  • Automated Resource Provisioning
  • Configuration Management
  • Automated Code Deployment
  • Continuous Whatever
  • Monitoring

PROBLEM!

Blame the tools?

Problem origin

DISCLAIMER

We're monitoring the wrong things.

_rootCauseAnalysis:

the alternative is harder.

We're considering logs a second class citizen.

_rootCauseAnalysis: 

the alternative is harder.

Our data is lacking.

_rootCauseAnalysis:

inertia. that's how it was, that's how it is.

We separate monitoring from application

_rootCauseAnalysis:

we're not used to this. (Ops problem)

We monitor reactively, not proactively

_rootCauseAnalysis:

reaction requires less initial energy than anticipation.
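To make "proactive" concrete, here is a minimal sketch of anticipating a problem instead of reacting to it: fit a straight line to recent disk usage samples and estimate how long until the disk is full. The function name `hours_until_full`, the sample data, and the linear-growth assumption are all illustrative and not taken from the talk.

```python
def hours_until_full(samples, capacity):
    """Estimate hours from the last sample until usage reaches capacity.

    samples:  list of (hour, used) pairs, oldest first (hypothetical data).
    capacity: total capacity, in the same unit as `used`.
    Returns None when usage is flat or shrinking, or when no trend can be fit.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n

    # Least-squares slope: growth in usage per hour.
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # not growing, nothing to anticipate

    _, last_used = samples[-1]
    return (capacity - last_used) / slope


# Usage grows by 2 units per hour; 44 units remain after the last sample.
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
eta = hours_until_full(usage, capacity=100.0)
print(eta)
```

An alert on "less than N hours until full" fires while there is still time to act, whereas an alert on "disk is 95% full" fires when the problem has already arrived.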

We put uptime above system and product quality

_rootCauseAnalysis:

it's much easier.

We deal with hard limits.

_rootCauseAnalysis:

arbitrary numbers are easier to set.
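As a rough illustration of the difference, the sketch below compares a hard limit with a check relative to the metric's own recent baseline. The function names, the limit of 80, and the factor `k=3.0` are assumptions chosen for the example, not values from the talk.

```python
from statistics import mean, stdev


def hard_limit_alert(value, limit=80.0):
    """Alert only when the value crosses a fixed, arbitrary number."""
    return value > limit


def baseline_alert(history, value, k=3.0):
    """Alert when the value deviates from recent history by more than
    k standard deviations. Needs at least two points in `history`."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(value - mu) > k * sigma


# Hypothetical CPU load samples for a service that normally idles near 20%.
normal_load = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 21.0, 20.0]
spike = 60.0

print(hard_limit_alert(spike))             # the hard limit stays silent
print(baseline_alert(normal_load, spike))  # the baseline check fires
```

A jump from 20% to 60% is a threefold change in behaviour, yet it never reaches the hard limit. The baseline check notices it because it asks "is this unusual for this system?" rather than "is this above a number we picked?".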

Monitoring is non-functional but resource hungry

_rootCauseAnalysis:

we just don't accept it.

Good monitoring requires the right people, not just Ops!

_rootCauseAnalysis:

delegation is natural. others have more important things to do.

Alert fatigue is common.

_rootCauseAnalysis:

solving issues is much easier than solving problems, and apparently, we are addicted to non-actionable alerts.

We're auto-scaling prematurely

_rootCauseAnalysis:

brute force is natural

We're choosing the wrong tools.

_rootCauseAnalysis:

it's easier to choose the tool than to choose what to monitor.

Good monitoring is hard

_rootCauseAnalysis:

systems become complex, so they're harder to monitor.

So, after all, why do we not monitor properly?

_rootCauseAnalysis:
  1. Simplification

  2. Delegation

  3. Rationalization

No fear,

Let's see how we can make this all better

is here!

If a service crashes and no one is around to monitor it, does it raise an alert?
