Minor Rants and Advice on Monitoring
Ivan Topolnjak | @ivantopo
Today we talk about the immediate Future: Kamon 1.0.0
Typical Conversation
Questions I ask people when they brag about their systems:
What volume are we talking about?
What response times are you seeing there?
Is it like that all the time?
Do you even monitor?
The Monitoring Tripod
- Metrics
- Logs
- Tracing
Advice: Create a small plan
What is important for you?
(Uptime, Latency, Error Rates)
Keep it simple
Express your Service Level Objectives Properly
Use nines for availability
Use percentiles for Latency
Example SLO for Uptime (allowed downtime per day)
98 % => 28 minutes / day
99 % => 14 minutes / day
99.9 % => 1 minute, 26 seconds / day
99.99 % => 8.6 seconds / day
Example SLO for Uptime (allowed downtime per month)
98 % => 14 hours, 36 minutes / month
99 % => 7 hours, 18 minutes / month
99.9 % => 43 minutes, 49 seconds / month
99.99 % => 4 minutes, 23 seconds / month
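The downtime figures above follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal sketch of that calculation follows; the object and function names are hypothetical, and the month length of 365.25 / 12 days is an assumption that happens to reproduce the monthly numbers on the slide.

```scala
object DowntimeBudget {
  val SecondsPerDay: Double = 24 * 60 * 60

  // Assumption: an "average" month of 365.25 / 12 days.
  val SecondsPerMonth: Double = 365.25 / 12 * SecondsPerDay

  // Allowed downtime, in seconds, for an availability target over a period.
  def allowedDowntimeSeconds(availabilityPercent: Double, periodSeconds: Double): Double = {
    require(availabilityPercent >= 0 && availabilityPercent <= 100, "availability must be a percentage")
    (1.0 - availabilityPercent / 100.0) * periodSeconds
  }
}
```

For example, `DowntimeBudget.allowedDowntimeSeconds(99.9, DowntimeBudget.SecondsPerDay)` comes out at roughly 86.4 seconds, which is the "1 minute, 26 seconds / day" entry.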
Example SLO for Latency
50th Percentile <= 50ms
90th Percentile <= 100ms
99th Percentile <= 300ms
Max <= 1 second.
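A latency SLO like the one above can be checked mechanically against a batch of measurements. The sketch below is only an illustration: it uses the nearest-rank percentile definition on raw samples (real tools typically work on histograms), and the names `LatencySlo`, `percentile` and `violations` are hypothetical.

```scala
object LatencySlo {
  // Nearest-rank percentile: the smallest sample such that at least p% of
  // all samples are less than or equal to it.
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p > 0 && p <= 100, "need samples and 0 < p <= 100")
    val sorted = samples.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(rank - 1)
  }

  // The SLO from the slide as (percentile, limit in ms); the max is the 100th percentile.
  val targets: Seq[(Double, Double)] =
    Seq(50.0 -> 50.0, 90.0 -> 100.0, 99.0 -> 300.0, 100.0 -> 1000.0)

  // Returns (percentile, observed value) for every target that was exceeded.
  def violations(samplesMs: Seq[Double]): Seq[(Double, Double)] =
    targets.flatMap { case (p, limitMs) =>
      val observed = percentile(samplesMs, p)
      if (observed > limitMs) Some(p -> observed) else None
    }
}
```

With 98 requests at 10 ms and 2 requests at 500 ms, only the 99th percentile target is violated, even though the median looks perfectly healthy.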
What do they have in common?
For all of them, the average is ~50 ns
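To make the point concrete, here is a small sketch with two made-up data sets that share the same average but behave very differently in the tail. The numbers and names are invented for illustration and are not the data behind the slides; the percentile uses the nearest-rank definition.

```scala
object AveragesHideShape {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Nearest-rank percentile over raw samples.
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p > 0 && p <= 100, "need samples and 0 < p <= 100")
    val sorted = samples.sorted
    sorted(math.ceil(p * sorted.size / 100.0).toInt - 1)
  }

  // Every operation takes exactly 50 time units.
  val steady: Seq[Double] = Seq.fill(100)(50.0)

  // 95 fast operations and 5 very slow ones, chosen so the mean is also 50.
  val spiky: Seq[Double] = Seq.fill(95)(10.0) ++ Seq.fill(5)(810.0)
}
```

Both `mean(steady)` and `mean(spiky)` are 50, yet the 99th percentile is 50 for the first data set and 810 for the second. The average alone cannot tell them apart.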
Latency in Real Life
90th Percentile: ~4 ms
99th Percentile: ~8.5 ms
Advice: Never use averages. Never. No. Nope. Nein. Negativo. You'll burn in monitoring hell if you do.
Advice: Don't average summaries. Don't lie to yourself. This hell is even worse!
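Why averaging summaries lies: a percentile computed per host says nothing about how many requests that host served, so the average of two 99th percentiles is generally not the 99th percentile of the combined traffic. The sketch below uses invented numbers for two hypothetical hosts and a nearest-rank percentile to show the gap.

```scala
object AveragingSummaries {
  // Nearest-rank percentile over raw samples.
  def percentile(samples: Seq[Double], p: Double): Double = {
    require(samples.nonEmpty && p > 0 && p <= 100, "need samples and 0 < p <= 100")
    val sorted = samples.sorted
    sorted(math.ceil(p * sorted.size / 100.0).toInt - 1)
  }

  // Hypothetical host A: 100 requests, all at 10 ms.
  val hostA: Seq[Double] = Seq.fill(100)(10.0)

  // Hypothetical host B: 900 requests, 20 of them at 1000 ms.
  val hostB: Seq[Double] = Seq.fill(880)(10.0) ++ Seq.fill(20)(1000.0)

  // What a dashboard shows when it averages the per-host summaries.
  val averagedP99: Double = (percentile(hostA, 99.0) + percentile(hostB, 99.0)) / 2

  // What users actually experienced across both hosts.
  val trueP99: Double = percentile(hostA ++ hostB, 99.0)
}
```

Here the averaged value is 505 ms while the real 99th percentile across all 1000 requests is 1000 ms, so the averaged summary hides half of the problem.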
Know your tools
Dropwizard Metric Values
- min: 0.238 ms
- median: 2.015 ms
- 75th percentile: 2.785 ms
- 95th percentile: 5.571 ms
- 98th percentile: 7.34 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 10.551 ms
- 99.99th percentile: 10.748 ms
- Max: 10.748 ms
Kamon Metric Values
- min: 0.234 ms
- median: 2.04 ms
- 75th percentile: 2.851 ms
- 95th percentile: 5.439 ms
- 98th percentile: 7.209 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 14.156 ms
- 99.99th percentile: 25.821 ms
- Max: 29.098 ms
Know your tools
Dropwizard Metric Values
- min: 0.297 ms
- median: 2.007 ms
- 75th percentile: 2.818 ms
- 95th percentile: 5.308 ms
- 98th percentile: 7.078 ms
- 99th percentile: 8.389 ms
- 99.9th percentile: 11.534 ms
- 99.99th percentile: 14.156 ms
- Max: 14.156 ms
Kamon Metric Values
- min: 0.234 ms
- median: 2.04 ms
- 75th percentile: 2.851 ms
- 95th percentile: 5.439 ms
- 98th percentile: 7.209 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 14.156 ms
- 99.99th percentile: 25.821 ms
- Max: 29.098 ms
Know your tools
Dropwizard Metric Values
- min: 0.252 ms
- median: 2.04 ms
- 75th percentile: 2.867 ms
- 95th percentile: 4.882 ms
- 98th percentile: 6.095 ms
- 99th percentile: 7.209 ms
- 99.9th percentile: 9.11 ms
- 99.99th percentile: 9.83 ms
- Max: 9.83 ms
Kamon Metric Values
- min: 0.234 ms
- median: 2.04 ms
- 75th percentile: 2.851 ms
- 95th percentile: 5.439 ms
- 98th percentile: 7.209 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 14.156 ms
- 99.99th percentile: 25.821 ms
- Max: 29.098 ms
Advice: Get to know your tools and be aware of their advantages and limitations.
Make informed decisions, don't just follow the buzz.
Advice: Accept that your performance intuition sucks. Most of the time.
Advice: Log with context. Make sure that all logs related to a single request can be identified.
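One simple way to get there, sketched here without any logging library, is to pass a request context explicitly and let the log formatting include its ID. Everything in this block (`RequestContext`, `format`, `handle`) is hypothetical and only illustrates the idea; it is not how Kamon does it.

```scala
object ContextualLogging {
  // Hypothetical request context; in a real system this would carry the trace ID.
  final case class RequestContext(requestId: String)

  // Formats a log line so that every message carries the request ID.
  def format(level: String, message: String)(implicit ctx: RequestContext): String =
    s"[$level] [${ctx.requestId}] $message"

  // A pretend request handler that returns the lines it would have logged.
  def handle(orderId: Int)(implicit ctx: RequestContext): Seq[String] = Seq(
    format("info", s"received order $orderId"),
    format("info", s"stored order $orderId")
  )
}
```

Every line produced while handling a request shares the same ID, so interleaved output from concurrent requests can be separated again with a simple filter.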
How does Kamon help?
Open Source Project (http://kamon.io)
Around for 4+ years
Metrics and Tracing API
Instrumentation for common libraries
Collection and Reporting are Separate
Instrument once, report anywhere.
Metrics
Recording Metrics
val processingTime = Kamon.histogram("app.service.processing-time")
processingTime.record(42)
val httpStatusCodes = Kamon.counter("http.response.status")
val serverErrors = httpStatusCodes.refine("code" -> "500")
val clientErrors = httpStatusCodes.refine("code" -> "400")
serverErrors.increment()
clientErrors.increment(100)
// This is the same Histogram, everywhere.
Kamon.histogram("app.service.processing-time").record(24)
Tracing
Using our OpenTracing-compatible Tracer
val activeSpan = Kamon.buildSpan("my-operation")
  .withTag("span.kind", "server")
  .startActive()
// Do your stuff here
activeSpan.deactivate()
// You got traces, you got metrics!
Visit http://opentracing.io
Adding Reporters
Kamon.addReporter(new PrometheusReporter())
Kamon.addReporter(new JaegerReporter())
// Create your own reporter by implementing MetricReporter or SpanReporter
sealed trait Reporter {
  def start(): Unit
  def stop(): Unit
  def reconfigure(config: Config): Unit
}
trait MetricReporter extends Reporter {
  def reportTickSnapshot(snapshot: TickSnapshot): Unit
}
trait SpanReporter extends Reporter {
  def reportSpans(spans: Seq[Span.CompletedSpan]): Unit
}
I'm lazy... do I have to write code to use Kamon? :/
No, just add dependencies and configuration!
The Play Combo
kamon-core
kamon-scala
kamon-akka-2.5
kamon-play-2.4
kamon-executors
kamon-prometheus
Add Dependencies
// build.sbt
resolvers += Resolver.bintrayRepo("kamon-io", "snapshots")
libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core" % "1.0.0-RC1-1d0548cb8281202738d8d48cbe9cdd62cf209e21",
  "io.kamon" %% "kamon-play-2.5" % "1.0.0-RC1-cf7506f2d05b7e868ab5291c73f45188b7db8069",
  "io.kamon" %% "kamon-akka-2.4" % "1.0.0-RC1-5472bca942c01bb87720263b36978cc0b243365e",
  "io.kamon" %% "kamon-prometheus" % "1.0.0-RC1-f84ea9bd12d4b8e9f4b0dd1e04a99b41c32913ec",
  "io.kamon" %% "kamon-jaeger" % "1.0.0-RC1-6cbd74406aac0bedc20746abc78f27e566de1f90"
)
Configuration
// application.conf
play.modules.enabled += "kamon.play.di.GuiceModule"
kamon {
  environment {
    service = "kamon-showcase"
  }
  util.filters {
    "akka.actor" {
      includes = ["application/user/slow*"]
    }
    "akka.dispatcher" {
      includes = ["**"]
    }
    "akka.router" {
      includes = ["**"]
    }
  }
}
SBT Plugin
// project/plugins.sbt
resolvers += Resolver.bintrayIvyRepo("kamon-io", "sbt-plugins")
addSbtPlugin("io.kamon" % "sbt-aspectj-play-runner" % "1.0.1")
Active Span Management
def fastActor = Action.async { implicit request =>
  logger.info("In the Controller")
  (Random.shuffle(nonBlockingActors).head ? "hey")
    .mapTo[String]
    .map(s => {
      logger.info("In the future.map")
      Ok(s)
    })(ec)
}
Without Trace ID :(
[info] [application-akka.actor.default-dispatcher-9] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-9] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-9] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-4] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [application-akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [application-akka.actor.default-dispatcher-2] HomeController - In the future.map
With Trace ID :))
// With this in your logging pattern:
<pattern>%coloredLevel [%X{trace_id}][%thread] %logger{15} - %message%n%xException{10}</pattern>
[info] [3cace0e8c5a4f13d][akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [7665825824d90893][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [0dd690ff92e54334][akka.actor.default-dispatcher-6] HomeController - In the Controller
How does it work?
Traditional vs Reactive Model
In the traditional world (looking at you, servlets)
In the Reactive World
It works with Futures, Actors, Routers, even across JVMs!
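The difference between the two models can be sketched in a few lines. In a thread-per-request world a ThreadLocal is enough to carry the trace ID; once work hops between threads, the ID has to be captured when a task is scheduled and restored where it runs. The wrapper below illustrates that general technique with a plain ExecutionContext; it is an assumption-laden sketch with hypothetical names, not Kamon's actual instrumentation.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ContextPropagationSketch {
  // Hypothetical thread-local holder for the current trace ID.
  val currentTraceId: ThreadLocal[String] = new ThreadLocal[String] {
    override def initialValue(): String = "none"
  }

  // Captures the caller's trace ID when a task is submitted and restores it
  // on whichever thread ends up running the task.
  final class PropagatingEC(underlying: ExecutionContext) extends ExecutionContext {
    def execute(task: Runnable): Unit = {
      val captured = currentTraceId.get()
      underlying.execute(new Runnable {
        def run(): Unit = {
          val previous = currentTraceId.get()
          currentTraceId.set(captured)
          try task.run() finally currentTraceId.set(previous)
        }
      })
    }
    def reportFailure(cause: Throwable): Unit = underlying.reportFailure(cause)
  }

  // Sets a trace ID on the calling thread and reports what a Future body sees.
  def traceIdSeenInFuture(id: String, ec: ExecutionContext): String = {
    currentTraceId.set(id)
    try Await.result(Future(currentTraceId.get())(ec), 5.seconds)
    finally currentTraceId.remove()
  }
}
```

Running `traceIdSeenInFuture` on a plain thread pool loses the ID because the Future body runs on a different thread, while wrapping the same pool in `PropagatingEC` carries it across.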
Prometheus Backend
We Recommend Prometheus
Sold on "aggregates buckets, not summaries"
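Bucket counts can be added across instances without losing information, which is why aggregating buckets works where averaging summaries does not. The sketch below is a simplification under stated assumptions: buckets are stored as non-cumulative counts per upper bound (Prometheus itself exposes cumulative counts), the percentile is reported as the upper bound of the matching bucket rather than interpolated, and all names are hypothetical.

```scala
object BucketAggregation {
  // Upper bound in ms -> number of observations that fell into that bucket.
  type Buckets = Map[Double, Long]

  // Merging two histograms is just adding the counts bucket by bucket.
  def merge(a: Buckets, b: Buckets): Buckets =
    (a.keySet ++ b.keySet).map { le =>
      le -> (a.getOrElse(le, 0L) + b.getOrElse(le, 0L))
    }.toMap

  // Upper bound of the bucket that contains the requested percentile.
  def percentileUpperBound(buckets: Buckets, p: Double): Double = {
    require(buckets.nonEmpty && p > 0 && p <= 100, "need buckets and 0 < p <= 100")
    val rank = math.ceil(p * buckets.values.sum / 100.0).toLong
    val sorted = buckets.toSeq.sortBy(_._1)
    // Running total of observations up to and including each bucket.
    val cumulative = sorted.scanLeft(0L)((seen, bucket) => seen + bucket._2).tail
    sorted.zip(cumulative)
      .collectFirst { case ((le, _), seen) if seen >= rank => le }
      .getOrElse(sorted.last._1)
  }
}
```

Merging a host with 100 observations in the 10 ms bucket and a host with 880 observations in the 10 ms bucket plus 20 in the 1000 ms bucket gives a combined 99th percentile in the 1000 ms bucket, which matches what the combined traffic actually looked like.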
Jaeger Tracing Backend
StatsD, InfluxDB, Graphite, New Relic, Datadog, Sematext SPM, JMX, Riemann, Khronus...
On the road to 1.0.0
https://kamino.io
Thanks for Coming!
Get more info at http://kamon.io/
https://github.com/kamon-io
@kamonteam
By Ivan Topolnjak
Scala Swarm 2017