StatsCraft

Monitoring Conference

 

Agenda

  1. Understand the problem
  2. Understand what monitoring is
  3. Example use-case(s)
  4. A different approach
  5. Learn methodologies and tools

The Problem

Nir Cohen @ Gigaspaces

@thinkops

http://github.com/nir0s

We monitor because...

We want to satisfy the customer.

(make money?)

Still underrated...

  • Automated Resource Provisioning
  • Configuration Management
  • Automated Code Deployment
  • Continuous Whatever

Monitoring

  • Automated Resource Provisioning
  • Configuration Management
  • Automated Code Deployment
  • Continuous Whatever
  • Monitoring

PROBLEM!

Blame the tools?

Problem origin

DISCLAIMER

We're monitoring the wrong things.

_rootCauseAnalysis:

the alternative is harder.

We're considering logs a second class citizen.

_rootCauseAnalysis: 

the alternative is harder.

Our data is lacking.

_rootCauseAnalysis:

inertia. that's how it was, that's how it is.

We separate monitoring from application

_rootCauseAnalysis:

we're not used to this. (Ops problem)

We monitor reactively, not proactively

_rootCauseAnalysis:

reaction requires less initial energy than anticipation.

We put uptime above system and product quality

_rootCauseAnalysis:

it's much easier.

We deal with hard limits.

_rootCauseAnalysis:

arbitrary numbers are easier to set.

Monitoring is non-functional but resource hungry

_rootCauseAnalysis:

we just don't accept it.

Good monitoring requires the right people, not just Ops!

_rootCauseAnalysis:

delegation is natural. others have more important things to do.

Alert fatigue is common.

_rootCauseAnalysis:

solving issues is much easier than solving problems, and apparently, we are addicted to non-actionable alerts.

We're auto-scaling prematurely

_rootCauseAnalysis:

brute force is natural

We're choosing the wrong tools.

_rootCauseAnalysis:

it's easier to choose the tool than to choose what to monitor.

Good monitoring is hard

_rootCauseAnalysis:

systems become complex, so they're harder to monitor.

So, after all, why do we not monitor properly?

_rootCauseAnalysis:
  1. Simplification

  2. Delegation

  3. Rationalization

No fear,

Let's see how we can make this all better

is here!

If a service crashes and no one is around to monitor it, does it raise an alert?

The Problem

By Nir Cohen

StatsCraft 2015 Keynote on the current problems in monitoring
