Minor Rants and Advice on Monitoring

Ivan Topolnjak | @ivantopo

Today we talk about the immediate Future: Kamon 1.0.0

Typical Conversation

Questions I ask people when they brag about their systems:

 

What volume are we talking about?

What response times are you seeing there?

Is it like that all the time?

Do you even monitor?

The Monitoring Tripod

  • Metrics
  • Logs
  • Tracing

Advice: Create a small plan

What is important for you?

(Uptime, Latency, Error Rates)

 

Keep it simple

Express your Service Level Objectives Properly

Use 9's for availability

Use percentiles for Latency

 

 

Example SLO for Uptime

98 % => 28 minutes, 48 seconds / day

99 % => 14 minutes, 24 seconds / day

99.9 % => 1 minute, 26 seconds / day

99.99 % => 8.6 seconds / day

Example SLO for Uptime

98 % => 14 hours, 36 minutes / month

99 % => 7 hours, 18 minutes / month

99.9 % => 43 minutes, 49 seconds / month

99.99 % => 4 minutes, 23 seconds / month
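The downtime budgets above are plain arithmetic: period length times (100 - availability) / 100. A minimal sketch of that calculation, assuming an average month of 365.2425 / 12 days (an assumption that roughly matches the monthly figures); the object and method names are made up for illustration:

```scala
object DowntimeBudget {
  val SecondsPerDay: Double = 24.0 * 60 * 60
  // Assumption: an "average" Gregorian month of 365.2425 / 12 days (about 30.44 days).
  val SecondsPerMonth: Double = 365.2425 / 12 * SecondsPerDay

  /** Seconds of downtime allowed in a period for an availability target given in percent. */
  def allowedDowntimeSeconds(availabilityPercent: Double, periodSeconds: Double): Double =
    periodSeconds * (100.0 - availabilityPercent) / 100.0

  /** Renders seconds as "Xh Ym Zs", rounded to whole seconds. */
  def format(seconds: Double): String = {
    val total = math.round(seconds)
    val h = total / 3600
    val m = (total % 3600) / 60
    val s = total % 60
    s"${h}h ${m}m ${s}s"
  }

  def main(args: Array[String]): Unit =
    for (target <- Seq(98.0, 99.0, 99.9, 99.99)) {
      val perDay = format(allowedDowntimeSeconds(target, SecondsPerDay))
      val perMonth = format(allowedDowntimeSeconds(target, SecondsPerMonth))
      println(s"$target % => $perDay / day, $perMonth / month")
    }
}
```

Because the output is rounded to whole seconds, it can differ from the slide values by a second here and there.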

Example SLO for Latency

50th Percentile <= 50ms

90th Percentile <= 100ms

99th Percentile <= 300ms

Max <= 1 second.
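A latency SLO like this one can be checked mechanically. The sketch below is an illustration only: it uses the nearest-rank percentile definition (one of several in use), treats Max as the 100th percentile, and all names in it are hypothetical.

```scala
object LatencySlo {
  /** Nearest-rank percentile: the smallest sample such that at least p percent of samples are <= it. */
  def percentile(sortedSamples: IndexedSeq[Long], p: Double): Long = {
    require(sortedSamples.nonEmpty, "need at least one sample")
    require(p > 0 && p <= 100, "p must be in (0, 100]")
    val rank = math.ceil(p * sortedSamples.size / 100.0).toInt
    sortedSamples(math.max(rank, 1) - 1)
  }

  // Objectives from the slide as (percentile, limit in ms). 100.0 stands for Max.
  val objectives: Seq[(Double, Long)] =
    Seq(50.0 -> 50L, 90.0 -> 100L, 99.0 -> 300L, 100.0 -> 1000L)

  /** Returns (percentile, limit, observed) for every objective the latencies violate. */
  def violations(latenciesMs: Seq[Long]): Seq[(Double, Long, Long)] = {
    val sorted = latenciesMs.sorted.toIndexedSeq
    for {
      (p, limit) <- objectives
      observed = percentile(sorted, p)
      if observed > limit
    } yield (p, limit, observed)
  }
}
```

An empty result from `LatencySlo.violations(samples)` means every objective was met for that window of samples.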

What do they have in common?

For all of them, the average is ~50 ns
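To see how very different distributions can share one average, here is a small sketch with two made-up latency sets (the numbers are invented for illustration and are not the data from the slides):

```scala
object AveragesHide {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Nearest-rank percentile over an unsorted sample. */
  def percentile(xs: Seq[Double], p: Double): Double = {
    val sorted = xs.sorted.toIndexedSeq
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Every request takes 50 ms.
  val steady: Seq[Double] = Seq.fill(100)(50.0)
  // 95 fast requests at 10 ms, 5 slow ones at 810 ms. The mean is still 50 ms.
  val spiky: Seq[Double] = Seq.fill(95)(10.0) ++ Seq.fill(5)(810.0)

  def main(args: Array[String]): Unit =
    for ((name, data) <- Seq("steady" -> steady, "spiky" -> spiky))
      println(s"$name: mean=${mean(data)} p50=${percentile(data, 50)} p99=${percentile(data, 99)}")
}
```

Both sets have a mean of 50 ms, yet the 99th percentile is 50 ms for one and 810 ms for the other.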

Latency in Real Life

90th Percentile: ~4 ms

99th Percentile: ~8.5 ms

Advice:

Never use averages. Never. No. Nope. Nein. Negativo. You'll burn in monitoring hell if you do.

Advice:

Don't average summaries. Don't lie to yourself. This hell is even worse!
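Why averaging summaries lies: a percentile computed per host cannot be averaged into the percentile of the whole fleet, but bucket counts can be added. A minimal sketch, assuming a hypothetical fixed-bucket histogram (this is not how any particular library stores its data):

```scala
object MergeDontAverage {
  // A histogram as bucket upper bound (ms) -> number of recordings in that bucket.
  type Buckets = Map[Long, Long]

  def toBuckets(samplesMs: Seq[Long], bounds: Seq[Long]): Buckets =
    samplesMs
      .groupBy(s => bounds.find(s <= _).getOrElse(Long.MaxValue))
      .map { case (bound, group) => bound -> group.size.toLong }

  /** Merging two histograms is just adding the counts of matching buckets. */
  def merge(a: Buckets, b: Buckets): Buckets =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap

  /** Upper bound of the bucket that contains the p-th percentile (nearest rank). */
  def percentile(buckets: Buckets, p: Double): Long = {
    require(buckets.nonEmpty, "need at least one bucket")
    val total = buckets.values.sum
    val rank = math.max(math.ceil(p * total / 100.0).toLong, 1L)
    val cumulative = buckets.toSeq.sortBy(_._1).scanLeft((0L, 0L)) {
      case ((_, seen), (bound, count)) => (bound, seen + count)
    }.tail
    cumulative.find(_._2 >= rank).map(_._1).get
  }

  val bounds: Seq[Long] = Seq(10L, 100L, 1000L)
  // Host A: 1000 requests at 10 ms. Host B: 10 requests at 1000 ms.
  val hostA: Buckets = toBuckets(Seq.fill(1000)(10L), bounds)
  val hostB: Buckets = toBuckets(Seq.fill(10)(1000L), bounds)

  val averagedP99: Double = (percentile(hostA, 99) + percentile(hostB, 99)) / 2.0
  val mergedP99: Long = percentile(merge(hostA, hostB), 99)
}
```

In this made-up case the averaged 99th percentile says 505 ms, while the merged histogram puts the real 99th percentile in the 10 ms bucket, because host B served under 1 % of the traffic.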

Know your tools



Dropwizard Metric Values
  - min: 0.238 ms
  - median: 2.015 ms
  - 75th percentile: 2.785 ms
  - 95th percentile: 5.571 ms
  - 98th percentile: 7.34 ms
  - 99th percentile: 8.651 ms
  - 99.9th percentile: 10.551 ms
  - 99.99th percentile: 10.748 ms
  - Max: 10.748 ms


Kamon Metric Values
  - min: 0.234 ms
  - median: 2.04 ms
  - 75th percentile: 2.851 ms
  - 95th percentile: 5.439 ms
  - 98th percentile: 7.209 ms
  - 99th percentile: 8.651 ms
  - 99.9th percentile: 14.156 ms
  - 99.99th percentile: 25.821 ms
  - Max: 29.098 ms

Know your tools



Dropwizard Metric Values
  - min: 0.297 ms
  - median: 2.007 ms
  - 75th percentile: 2.818 ms
  - 95th percentile: 5.308 ms
  - 98th percentile: 7.078 ms
  - 99th percentile: 8.389 ms
  - 99.9th percentile: 11.534 ms
  - 99.99th percentile: 14.156 ms
  - Max: 14.156 ms


Kamon Metric Values
  - min: 0.234 ms
  - median: 2.04 ms
  - 75th percentile: 2.851 ms
  - 95th percentile: 5.439 ms
  - 98th percentile: 7.209 ms
  - 99th percentile: 8.651 ms
  - 99.9th percentile: 14.156 ms
  - 99.99th percentile: 25.821 ms
  - Max: 29.098 ms

Know your tools



Dropwizard Metric Values
  - min: 0.252 ms
  - median: 2.04 ms
  - 75th percentile: 2.867 ms
  - 95th percentile: 4.882 ms
  - 98th percentile: 6.095 ms
  - 99th percentile: 7.209 ms
  - 99.9th percentile: 9.11 ms
  - 99.99th percentile: 9.83 ms
  - Max: 9.83 ms


Kamon Metric Values
  - min: 0.234 ms
  - median: 2.04 ms
  - 75th percentile: 2.851 ms
  - 95th percentile: 5.439 ms
  - 98th percentile: 7.209 ms
  - 99th percentile: 8.651 ms
  - 99.9th percentile: 14.156 ms
  - 99.99th percentile: 25.821 ms
  - Max: 29.098 ms

Advice:

Get to know your tools and be aware of their advantages and limitations.

Make informed decisions, don't just follow the buzz.

Advice:

Accept that your performance intuition sucks. Most of the time.

Advice:

Log with context. Make sure that all logs related to a single request can be identified.

How does Kamon help?

Open Source Project (http://kamon.io)

Around for 4+ years

Metrics and Tracing API

Instrumentation for common libraries

Collection and Reporting are Separate

Instrument once, report anywhere.

Metrics

Recording Metrics


val processingTime = Kamon.histogram("app.service.processing-time")
processingTime.record(42)



val httpStatusCodes = Kamon.counter("http.response.status")
val serverErrors = httpStatusCodes.refine("code" -> "500")
val clientErrors = httpStatusCodes.refine("code" -> "400")

serverErrors.increment()
clientErrors.increment(100)


// This is the same Histogram, everywhere.
Kamon.histogram("app.service.processing-time").record(24)

Tracing

Using our OpenTracing-compatible Tracer

 


val activeSpan = Kamon.buildSpan("my-operation")
    .withTag("span.kind", "server")
    .startActive()


// Do your stuff here

activeSpan.deactivate()
// You got traces, you got metrics!

Visit http://opentracing.io

Adding Reporters


Kamon.addReporter(new PrometheusReporter())
Kamon.addReporter(new JaegerReporter())

// Create your own reporter by implementing MetricReporter or SpanReporter

sealed trait Reporter {
  def start(): Unit
  def stop(): Unit
  def reconfigure(config: Config): Unit
}

trait MetricReporter extends Reporter {
  def reportTickSnapshot(snapshot: TickSnapshot): Unit
}

trait SpanReporter extends Reporter {
  def reportSpans(spans: Seq[Span.CompletedSpan]): Unit
}

I'm lazy... do I have to write code to use Kamon? :/

No, just add dependencies and configuration!

The Play Combo

kamon-core

kamon-scala

kamon-akka-2.4

kamon-play-2.5

kamon-executors

kamon-prometheus

Add Dependencies


// build.sbt

resolvers += Resolver.bintrayRepo("kamon-io", "snapshots")

libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core" % "1.0.0-RC1-1d0548cb8281202738d8d48cbe9cdd62cf209e21",
  "io.kamon" %% "kamon-play-2.5" % "1.0.0-RC1-cf7506f2d05b7e868ab5291c73f45188b7db8069",
  "io.kamon" %% "kamon-akka-2.4" % "1.0.0-RC1-5472bca942c01bb87720263b36978cc0b243365e",
  "io.kamon" %% "kamon-prometheus" % "1.0.0-RC1-f84ea9bd12d4b8e9f4b0dd1e04a99b41c32913ec",
  "io.kamon" %% "kamon-jaeger" % "1.0.0-RC1-6cbd74406aac0bedc20746abc78f27e566de1f90"
)

Configuration

// application.conf
play.modules.enabled += "kamon.play.di.GuiceModule"

kamon {
  environment {
    service = "kamon-showcase"
  }

  util.filters {
    "akka.actor" {
      includes = ["application/user/slow*"]
    }

    "akka.dispatcher" {
      includes = ["**"]
    }

    "akka.router" {
      includes = ["**"]
    }
  }
}

SBT Plugin


// project/plugins.sbt

resolvers += Resolver.bintrayIvyRepo("kamon-io", "sbt-plugins")
addSbtPlugin("io.kamon" % "sbt-aspectj-play-runner" % "1.0.1")

Active Span Management


  def fastActor = Action.async { implicit request =>
    logger.info("In the Controller")
    (Random.shuffle(nonBlockingActors).head ? "hey")
      .mapTo[String]
      .map(s => {
        logger.info("In the future.map")
        Ok(s)
      })(ec)
  }

Without Trace ID :(

[info] [application-akka.actor.default-dispatcher-9] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-9] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-9] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-4] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the future.map

With Trace ID :))

// With this in your logging pattern:

<pattern>%coloredLevel [%X{trace_id}][%thread] %logger{15} - %message%n%xException{10}</pattern>


[info] [3cace0e8c5a4f13d][akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [7665825824d90893][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [0dd690ff92e54334][akka.actor.default-dispatcher-6] HomeController - In the Controller

How does it work?

Traditional vs Reactive Model

In the traditional world (looking at you, servlets)

Traditional vs Reactive Model

In the Reactive World

It works with Futures, Actors, Routers, even across JVMs!
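The core idea can be sketched in plain Scala: capture the context on the thread that schedules the work, then restore it on the thread that runs it. This is only an illustration of the concept, not Kamon's implementation (Kamon does this through instrumentation); `ContextPropagation`, `withTraceId` and `propagating` are hypothetical names.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ContextPropagation {
  // Hypothetical stand-in for the active trace ID; one value per thread.
  private val current = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def traceId: Option[String] = current.get()

  /** Runs `body` with the given trace ID on the calling thread, restoring the old value afterwards. */
  def withTraceId[T](id: Option[String])(body: => T): T = {
    val previous = current.get()
    current.set(id)
    try body finally current.set(previous)
  }

  /** Wraps an ExecutionContext: captures the trace ID at submission time and restores it where the task runs. */
  def propagating(underlying: ExecutionContext): ExecutionContext = new ExecutionContext {
    def execute(task: Runnable): Unit = {
      val captured = traceId // read on the submitting thread
      underlying.execute(new Runnable {
        def run(): Unit = withTraceId(captured)(task.run())
      })
    }
    def reportFailure(cause: Throwable): Unit = underlying.reportFailure(cause)
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = propagating(ExecutionContext.fromExecutor(pool))
    val seen = withTraceId(Some("3cace0e8c5a4f13d")) {
      Future(traceId).map(id => s"in the future.map: $id")
    }
    println(Await.result(seen, 5.seconds))
    pool.shutdown()
  }
}
```

The `map` callback sees the same ID because it is scheduled from a thread where the context was already restored, so the value travels along the whole chain.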

Prometheus Backend

We Recommend Prometheus

Sold on "aggregates buckets, not summaries"

Jaeger Tracing Backend

StatsD, InfluxDB, Graphite, New Relic, Datadog, Sematext SPM, JMX, Riemann, Khronus...

On the road to 1.0.0

https://kamino.io

Thanks for Coming!

Get more info at http://kamon.io/

https://github.com/kamon-io

@kamonteam

By Ivan Topolnjak

Scala Swarm 2017
