Minor Rants about

Monitoring on the JVM

Presented by Ivan Topolnjak (@ivantopo)

Do you even monitor?

The 3 Pillars of Observability

Logs

Metrics

Tracing

Distributed Tracing with Kamon

Late March - meetup.com/HRScala

The Ever Looping Plan

Start with the basics.

(Uptime, Latency, Error Rates)

 

Close the loop.

 

Refine and Repeat.

Express your Service Level Objectives Properly

Use 9's for availability

Use percentiles for latency

Use % for error rate

 

 

Example SLO for Uptime

98 % => 28 minutes, 48 seconds / day

99 % => 14 minutes, 24 seconds / day

99.9 % => 1 minute, 26 seconds / day

99.99 % => 8.6 seconds / day
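
These numbers are just arithmetic: the daily downtime budget is (1 - availability) × 86,400 seconds. A quick Scala sanity check (a sketch; the function name is made up for illustration):

// Downtime budget per day for a given availability target:
// (1 - availability) * 86,400 seconds in a day.
def dailyDowntimeSeconds(availability: Double): Double =
  (1.0 - availability) * 86400

Seq(0.98, 0.99, 0.999, 0.9999).foreach { a =>
  println(f"${a * 100}%.2f%% => ${dailyDowntimeSeconds(a)}%.1f seconds/day")
}
// 98.00% => 1728.0 seconds/day  (28 minutes, 48 seconds)
// 99.00% => 864.0 seconds/day   (14 minutes, 24 seconds)
// 99.90% => 86.4 seconds/day
// 99.99% => 8.6 seconds/day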

Example SLO for Latency

50th Percentile <= 50ms

90th Percentile <= 100ms

99th Percentile <= 300ms

Max <= 1 second.

At least read chapters 4 and 6

Never use averages. Never. No. Nope. Nein. Negativo. You'll burn in monitoring hell if you do.

What do they have in common?

For all of them, the average is ~50 ns
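
To make the point concrete, here is a tiny made-up illustration (hypothetical numbers, not the data behind the slide): two samples with exactly the same average and completely different tails.

// Two made-up latency samples, in nanoseconds. Both average exactly 50 ns.
val steady = Vector.fill(100)(50.0)                          // every request takes 50 ns
val spiky  = Vector.fill(95)(25.0) ++ Vector.fill(5)(525.0)  // mostly fast, a few disasters

def average(xs: Vector[Double]): Double = xs.sum / xs.size
def percentile(xs: Vector[Double], p: Double): Double =
  xs.sorted.apply(math.ceil(p / 100.0 * xs.size).toInt - 1)

println(average(steady))         // 50.0
println(average(spiky))          // 50.0  -- the average cannot tell them apart
println(percentile(steady, 99))  // 50.0
println(percentile(spiky, 99))   // 525.0 -- the tail the average hides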

Latency in Real Life

Don't average summaries. Don't lie to yourself. This hell is even worse!
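
Averaging summaries is worse because the average of two p99s is not the p99 of the combined traffic. A made-up two-instance example (hypothetical numbers):

// A made-up fleet: one fast instance and one slow, lightly used one.
val fastHost = Vector.fill(900)(10.0)    // 900 requests, all 10 ms; its p99 is 10 ms
val slowHost = Vector.fill(100)(1000.0)  // 100 requests, all 1000 ms; its p99 is 1000 ms

// "Averaging the summaries" across instances:
val averagedP99 = (10.0 + 1000.0) / 2    // 505 ms ... looks survivable

// The real p99 of the combined traffic, computed from the raw samples:
val all = (fastHost ++ slowHost).sorted
val combinedP99 = all(math.ceil(0.99 * all.size).toInt - 1)

println(s"averaged p99: $averagedP99 ms")  // 505.0 ms
println(s"combined p99: $combinedP99 ms")  // 1000.0 ms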

 

Don't get obsessed with throughput and latency; there is more to see in your app.

Throughput and Latency

What about Time in Mailbox?

Garbage Collection

Hiccups
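
In case "hiccups" is unfamiliar: these are pauses the platform itself introduces (GC, OS scheduling, hypervisor stalls) that hit every in-flight request at once. The measurement idea, in the spirit of jHiccup, fits in a few lines (a rough sketch, not production code):

// A toy hiccup meter: sleep for 1 ms in a loop and record how much longer
// than 1 ms each sleep really took. Large deltas expose GC pauses, OS
// scheduling stalls and other platform hiccups that per-request latency
// metrics alone will not show.
val expectedNanos = 1000000L  // 1 ms

def measureHiccups(iterations: Int): Vector[Long] =
  Vector.fill(iterations) {
    val start = System.nanoTime()
    Thread.sleep(1)
    math.max(0L, (System.nanoTime() - start) - expectedNanos)
  }

val worstMillis = measureHiccups(5000).max / 1e6
println(f"worst hiccup observed: $worstMillis%.2f ms")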

Get to know your tools and be aware of their advantages and limitations.

 

Make informed decisions; don't just follow the buzz.

Know your tools



  Percentile            Dropwizard reports    Actual (expected)
  min                   0.297 ms              0.234 ms
  median                2.007 ms              2.04 ms
  75th percentile       2.818 ms              2.851 ms
  95th percentile       5.308 ms              5.439 ms
  98th percentile       7.078 ms              7.209 ms
  99th percentile       8.389 ms              8.651 ms
  99.9th percentile     11.534 ms             14.156 ms
  99.99th percentile    14.156 ms             25.821 ms
  Max                   14.156 ms             29.098 ms

Know your tools



  Percentile            Dropwizard reports    Actual (expected)
  min                   0.252 ms              0.234 ms
  median                2.04 ms               2.04 ms
  75th percentile       2.867 ms              2.851 ms
  95th percentile       4.882 ms              5.439 ms
  98th percentile       6.095 ms              7.209 ms
  99th percentile       7.209 ms              8.651 ms
  99.9th percentile     9.11 ms               14.156 ms
  99.99th percentile    9.83 ms               25.821 ms
  Max                   9.83 ms               29.098 ms

Sampled metrics can be as misleading as averages are.
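
A back-of-the-envelope way to see why, assuming for simplicity a plain uniform reservoir of 1,024 samples (Dropwizard's default reservoir is exponentially decaying, but the intuition is the same):

// If only 1,024 of 100,000 recorded values survive into the reservoir,
// the single worst value has roughly a 1% chance of being among them,
// so the reported "max" is usually not the real max.
val recorded  = 100000
val reservoir = 1024
val pMaxSurvives = reservoir.toDouble / recorded
println(f"P(true max survives sampling) = ${pMaxSurvives * 100}%.1f%%")  // about 1.0%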

Beware of what UDP really means: you will definitely lose data, and it is NOT free.

Log with context. Make sure that all logs related to a single request can be identified.

Understand the platform you are running on, especially in the reactive world.

Traditional vs Reactive Model

In the traditional world (looking at you, servlets)

Traditional vs Reactive Model

In the Reactive World
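
The slides show this with diagrams; roughly, in the servlet-style model one thread carries a request from start to finish, so thread-local state such as SLF4J's MDC is enough to correlate everything that happens for it. In the reactive model the same request hops across threads, so that state has to be captured and restored explicitly, which is exactly the context propagation problem tools like Kamon solve. A minimal sketch of the difference, using a plain ThreadLocal as a stand-in for MDC-style request context (hypothetical request id):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-in for MDC-style per-request context: a plain ThreadLocal.
val requestId = new ThreadLocal[String]

// Traditional model: one thread handles the request end to end, so a
// thread-local value set when the request arrives is visible downstream.
requestId.set("req-42")
println(s"same thread sees:  ${requestId.get}")  // req-42

// Reactive model: the continuation runs on some other pool thread,
// where the thread-local value is simply not there.
val f = Future { s"other thread sees: ${requestId.get}" }  // null
println(Await.result(f, 1.second))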

What are my options?

Dropwizard Metrics

  • The legendary metrics library. Previously Codahale Metrics.

  • Hierarchical model
  • Lots of instrumentation and exporters
  • Seriously, don't use it; at least wait until 5.0

Micrometer

  • Vendor-neutral application metrics facade

  • Dimensional model
  • Proper Histograms support
  • Several integration options (Prometheus, Atlas, InfluxDB, JMX and more)
  • Integrated into Spring Boot 2.0
  • Helpful community

OpenCensus

  • Tracing and Metrics collection libraries

  • Dimensional metrics model
  • Several integration options (Prometheus, Zipkin, Stackdriver and more)
  • Support for non-JVM environments
  • gRPC and HTTP support right out of the box

Kamon

  • Metrics, Tracing and Context Propagation Toolkit

  • Dimensional metrics model
  • Proper Histograms support
  • Several integration options (Prometheus, Zipkin, Kamino, Datadog and more)
  • Support for Play Framework, Akka, JDBC, Hikari CP, Futures
  • APIs and Agent-based instrumentation

So, how do I Kamonize my service?

Add Dependencies


// build.sbt

resolvers += Resolver.bintrayRepo("kamon-io", "snapshots")

libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core"       % "1.0.0", // core metrics, tracing and context APIs
  "io.kamon" %% "kamon-akka-2.4"   % "1.0.0", // Akka 2.4.x instrumentation (actors, dispatchers, message tracing)
  "io.kamon" %% "kamon-prometheus" % "1.0.0", // exposes metrics in Prometheus format
  "io.kamon" %% "kamon-zipkin"     % "1.0.0", // reports spans to Zipkin
  "io.kamon" %% "kamon-jaeger"     % "1.0.0"  // reports spans to Jaeger
)

Add Configuration

// application.conf
kamon {
  environment {
    service = "kamon-showcase" // service name attached to all reported metrics and spans
  }

  util.filters {
    // actors matching these patterns get per-actor metrics
    "akka.tracked-actor" {
      includes = ["application/user/slow*"]
    }

    // dispatchers matching these patterns get executor metrics
    "akka.tracked-dispatcher" {
      includes = ["**"]
    }

    // actors matching these patterns take part in tracing
    "akka.traced-actor" {
      includes = ["**"]
    }
  }
}

Start the Reporters


import kamon.Kamon
import kamon.prometheus.PrometheusReporter
import kamon.zipkin.ZipkinReporter

Kamon.addReporter(new PrometheusReporter())
Kamon.addReporter(new ZipkinReporter())

// OR load every reporter listed in the configuration
Kamon.loadReportersFromConfig()

Start with the AspectJ Agent


// project/plugins.sbt

resolvers += Resolver.bintrayIvyRepo("kamon-io", "sbt-plugins")
addSbtPlugin("io.kamon" % "sbt-aspectj-runner" % "1.0.4")

Optional Step

java -javaagent:/path/to/weaver.jar ....

Include Trace IDs in your logs

<configuration>

    <conversionRule conversionWord="traceID" converterClass="kamon.logback.LogbackTraceIDConverter" />


    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] [%traceID] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="info">
        <appender-ref ref="STDOUT" />
    </root>
</configuration>

Optional Step

Start and Configure your Prometheus, Grafana, Zipkin/Jaeger, etc

Look at Prometheus/Grafana

Look at your logs


[info] [3cace0e8c5a4f13d][akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [7665825824d90893][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [0dd690ff92e54334][akka.actor.default-dispatcher-6] HomeController - In the Controller

Look at Zipkin

Look at Jaeger

Metrics

Recording Metrics


// Histograms capture the full distribution of recorded values
// (HdrHistogram under the hood), so percentiles are not approximations.
val processingTime = Kamon.histogram("app.service.processing-time")
processingTime.record(42)


// refine() adds tags to a metric; each tag combination is its own time series.
val httpStatusCodes = Kamon.counter("http.response.status")
val serverErrors = httpStatusCodes.refine("code" -> "500")
val clientErrors = httpStatusCodes.refine("code" -> "400")

serverErrors.increment()
clientErrors.increment(100)


// This is the same Histogram, everywhere.
Kamon.histogram("app.service.processing-time").record(42)

Tracing

True Distributed Tracing, finally.


val span = Kamon.buildSpan("my-operation")
    .withTag("span.kind", "server")
    .start()


// Do your stuff here

span.finish()
// You got traces, you got metrics!

Instrument once, report anywhere.

Creating Reporters


// Create your own reporter by implementing MetricReporter or SpanReporter

sealed trait Reporter {
  def start(): Unit
  def stop(): Unit
  def reconfigure(config: Config): Unit
}

trait MetricReporter extends Reporter {
  def reportPeriodSnapshot(snapshot: PeriodSnapshot): Unit
}

trait SpanReporter extends Reporter {
  def reportSpans(spans: Seq[FinishedSpan]): Unit
}
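
A do-nothing console reporter built against these traits could look roughly like this (a sketch assuming the Kamon 1.x API shown above; the exact fields on FinishedSpan, such as operationName, may differ slightly):

import com.typesafe.config.Config
import kamon.SpanReporter
import kamon.trace.Span

// Minimal custom reporter sketch: print the name of every finished span.
class ConsoleSpanReporter extends SpanReporter {
  override def start(): Unit = println("console reporter started")
  override def stop(): Unit = println("console reporter stopped")
  override def reconfigure(config: Config): Unit = ()

  override def reportSpans(spans: Seq[Span.FinishedSpan]): Unit =
    spans.foreach(span => println(s"finished span: ${span.operationName}"))
}

// Register it the same way as the built-in ones:
// Kamon.addReporter(new ConsoleSpanReporter())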

Available in 1.0.0

Akka, Akka HTTP, Akka Remote, Scala, Play, JDBC, System Metrics, Zipkin, Jaeger, Executors, Logback, StackDriver, Datadog, InfluxDB, Prometheus and Kamino

 

Coming soon: Spring Boot, Kafka, Cassandra

a small shameless plug...

https://kamino.io

  • Made for Kamon. OpenCensus and Micrometer support coming soon
  • Distributed Tracing
  • Opinionated Dashboards
  • Native Histogram Support
  • Early Access Program Available!

Thanks for Coming!

Get more info at http://kamon.io/

https://github.com/kamon-io

@kamonteam

Questions?

Minor Rants about Monitoring on the JVM

By Ivan Topolnjak

Javantura V5