StatsCraft

Monitoring Conference

website and agenda: http://statscraft.org.il
twitter: @statscraft (#statscraft)
facebook: https://www.facebook.com/statscraft.il
email: statscraftcon@gmail.com

Agenda

Understand the problem
Understand what monitoring is
Example use case(s)
A different approach
Learn methodologies and tools

The Problem

Nir Cohen @ Gigaspaces

@thinkops

http://github.com/nir0s

We monitor because...

We want to satisfy the customer.

(make money?)

Still underrated...

Automated Resource Provisioning
Configuration Management
Automated Code Deployment
Continuous Whatever

Monitoring

Automated Resource Provisioning
Configuration Management
Automated Code Deployment
Continuous Whatever
Monitoring

PROBLEM!

Blame the tools?

Problem origin

DISCLAIMER

We're monitoring the wrong things.

_rootCauseAnalysis:

the alternative is harder.

We're considering logs a second class citizen.

_rootCauseAnalysis: 

the alternative is harder.

Our data is lacking.

_rootCauseAnalysis:

inertia. that's how it was, that's how it is.

We separate monitoring from application

_rootCauseAnalysis:

we're not used to this. (Ops problem)
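One way to read this slide is that measurements should be emitted from inside the application code instead of being bolted on from outside. The sketch below is only an illustration under that assumption: the `timed` decorator, the in-process `METRICS` store, the `checkout` function and the metric name are all hypothetical, and a real setup would ship the values to a metrics backend instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process store: metric name -> list of observed durations (seconds).
METRICS = defaultdict(list)


def timed(name):
    """Record how long each call to the wrapped function takes under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures are measured too.
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("checkout.duration_seconds")
def checkout(cart):
    """Hypothetical piece of business logic: total the prices in a cart."""
    return sum(cart)
```

Calling `checkout([1, 2, 3])` returns the total as usual and appends one duration to `METRICS["checkout.duration_seconds"]`, so the measurement lives next to the code it describes.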

We monitor reactively, not proactively

_rootCauseAnalysis:

reaction requires less initial energy than anticipation.
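As a rough illustration of anticipation over reaction, the sketch below fits a straight line to recent samples and estimates how long until a resource reaches its limit, so a warning can go out before anything breaks. The function name, the `(hour, value)` sample format and the linear-growth assumption are all choices made for this example, not something taken from the talk.

```python
from typing import Optional, Sequence, Tuple


def hours_until_limit(
    samples: Sequence[Tuple[float, float]], limit: float
) -> Optional[float]:
    """Estimate hours from the last sample until the trend line reaches `limit`.

    `samples` are (hour, value) pairs. Returns None when there is too little
    data or the value is not trending upward.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking, nothing to anticipate
    intercept = mean_v - slope * mean_t
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - samples[-1][0])
```

With disk usage samples `[(0, 50), (1, 52), (2, 54), (3, 56)]` (percent per hour) and a limit of 100, the estimate is 22 hours, which leaves time to act instead of reacting to a full disk.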

We put uptime above system and product quality

_rootCauseAnalysis:

it's much easier.

We deal with hard limits.

_rootCauseAnalysis:

arbitrary numbers are easier to set.
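One possible alternative to an arbitrary hard limit is a threshold derived from the metric's own recent behaviour. The sketch below flags a value that deviates from the historical mean by more than `k` standard deviations. The function name, the default `k = 3.0` and the sample numbers are assumptions for illustration, and this simple rule is only one of many baselining techniques.

```python
import statistics
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Return True when `value` is more than k standard deviations from the
    mean of `history`. With fewer than two samples nothing is flagged."""
    if len(history) < 2:
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mean
    return abs(value - mean) > k * stdev
```

For a history of `[100, 102, 98, 101, 99]` requests per second, a reading of 103 is not flagged while a reading of 110 is, even though a hand-picked hard limit such as 500 would have let both pass unnoticed.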

Monitoring is non-functional but resource hungry

_rootCauseAnalysis:

we just don't accept it.

Good monitoring requires the right people, not just Ops!

_rootCauseAnalysis:

delegation is natural. others have more important things to do.

Alert fatigue is common.

_rootCauseAnalysis:

solving issues is much easier than solving problems, and apparently, we are addicted to non-actionable alerts.

We're auto-scaling prematurely

_rootCauseAnalysis:

brute force is natural

We're choosing the wrong tools.

_rootCauseAnalysis:

it's easier to choose the tool than to choose what to monitor.

Good monitoring is hard

_rootCauseAnalysis:

systems become complex, so they're harder to monitor.

So, after all, why do we not monitor properly?

_rootCauseAnalysis:

Simplification
Delegation
Rationalization

No fear, StatsCraft is here!

Let's see how we can make this all better

If a service crashes and no one is around to monitor it, does it raise an alert?
