Minor Rants and Advice on Monitoring
Ivan Topolnjak | @ivantopo | Core Team @ Kamon | Co-Founder @ Kamino
Typical Conversation
Questions I ask people when they brag about their systems:
What volume are we talking about?
What response times are you seeing there?
Is it like that all the time?
Do you even monitor?
The Monitoring Tripod Trinity
- Logs
- Metrics
- Tracing
Create a small plan
What is important for you?
(Uptime, Latency, Error Rates)
Keep it simple
Express your Service Level Objectives Properly
Use 9's for availability
Use percentiles for Latency
Example SLO for Uptime
98 % => 28.8 minutes of downtime / day
99 % => 14.4 minutes of downtime / day
99.9 % => 1 minute, 26 seconds of downtime / day
99.99 % => 8.6 seconds of downtime / day
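The downtime figures above follow from simple arithmetic on the 86,400 seconds in a day. A minimal sketch of that calculation; the object and method names are hypothetical and not part of Kamon:

```scala
object DowntimeBudget {
  // Seconds in one day: 24 h * 60 min * 60 s = 86,400
  private val SecondsPerDay = 24 * 60 * 60

  /** Allowed downtime per day, in seconds, for an availability target given in percent. */
  def allowedDowntimeSeconds(availabilityPercent: Double): Double =
    SecondsPerDay * (100.0 - availabilityPercent) / 100.0

  def main(args: Array[String]): Unit =
    Seq(98.0, 99.0, 99.9, 99.99).foreach { target =>
      val seconds = allowedDowntimeSeconds(target)
      println(f"$target%6.2f %% => ${seconds / 60}%6.2f minutes / day ($seconds%8.2f seconds)")
    }
}
```

Each extra 9 divides the budget by ten, which is why the jump from 99.9 % to 99.99 % is so expensive to deliver.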
Example SLO for Latency
50th Percentile <= 50ms
90th Percentile <= 100ms
99th Percentile <= 300ms
Max <= 1 second.
At least read chapters 4 and 6
Never use averages. Never. No. Nope. Nein. Negativo. You'll burn in monitoring hell if you do.
What do they have in common?
For all of them, the average is ~50 ns
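To make the point concrete without the original charts, here is a small sketch with made-up data: two sample sets with the same average and very different tails. The numbers, the unit-less values, and the nearest-rank percentile helper are all assumptions for illustration, not the data behind the slides.

```scala
object AverageVsPercentiles {
  /** Nearest-rank percentile: the smallest value with at least p% of the samples at or below it. */
  def percentile(samples: Seq[Long], p: Double): Long = {
    val sorted = samples.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(math.max(rank, 1) - 1)
  }

  def average(samples: Seq[Long]): Double =
    samples.sum.toDouble / samples.size

  // Made-up data: 100 samples each, both averaging exactly 50.
  val steady: Seq[Long] = Seq.fill(100)(50L)
  val spiky: Seq[Long]  = Seq.fill(95)(10L) ++ Seq.fill(5)(810L)

  def main(args: Array[String]): Unit =
    Seq("steady" -> steady, "spiky" -> spiky).foreach { case (name, samples) =>
      println(s"$name: avg=${average(samples)} p50=${percentile(samples, 50)} p99=${percentile(samples, 99)}")
    }
}
```

Both sets report an average of 50, but the spiky one has a median of 10 and a 99th percentile of 810. The average alone cannot tell them apart.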
Latency in Real Life
90th Percentile: ~4 ms
99th Percentile: ~8.5 ms
Don't average summaries. Don't lie to yourself. This hell is even worse!
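A sketch of why averaging summaries misleads, using two hypothetical hosts with made-up latencies in milliseconds. Averaging the per-host 99th percentiles is compared with the 99th percentile of the merged raw samples; the helper and the data are assumptions for illustration only.

```scala
object AveragingSummaries {
  /** Nearest-rank percentile: the smallest value with at least p% of the samples at or below it. */
  def percentile(samples: Seq[Long], p: Double): Long = {
    val sorted = samples.sorted
    val rank = math.ceil(p * sorted.size / 100.0).toInt
    sorted(math.max(rank, 1) - 1)
  }

  // Made-up data: a busy, fast host and a quiet, slow host.
  val hostA: Seq[Long] = Seq.fill(990)(10L) ++ Seq.fill(10)(100L) // 1000 requests
  val hostB: Seq[Long] = Seq.fill(90)(20L) ++ Seq.fill(10)(900L)  //  100 requests

  // The tempting shortcut: average the two summaries.
  val averagedP99: Double = (percentile(hostA, 99) + percentile(hostB, 99)) / 2.0

  // What the users actually experienced: the percentile over all raw samples.
  val mergedP99: Long = percentile(hostA ++ hostB, 99)

  def main(args: Array[String]): Unit = {
    println(s"p99 host A = ${percentile(hostA, 99)} ms, p99 host B = ${percentile(hostB, 99)} ms")
    println(s"average of the p99s = $averagedP99 ms")
    println(s"p99 of merged samples = $mergedP99 ms")
  }
}
```

With this data the averaged figure is 455 ms while the merged 99th percentile is 100 ms. The average ignores how many requests each host served, so it can be wrong in either direction; merging the raw distributions (or mergeable histograms) is the safe route.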
Don't obsess over throughput and latency; there is more to see in your app.
Throughput and Latency
What about Time in Mailbox?
Garbage Collection
Hiccups
Get to know your tools and be aware of their advantages and limitations.
Make informed decisions, don't just follow the buzz.
Know your tools
Dropwizard Metric Values
- min: 0.297 ms
- median: 2.007 ms
- 75th percentile: 2.818 ms
- 95th percentile: 5.308 ms
- 98th percentile: 7.078 ms
- 99th percentile: 8.389 ms
- 99.9th percentile: 11.534 ms
- 99.99th percentile: 14.156 ms
- Max: 14.156 ms
Kamon Metric Values
- min: 0.234 ms
- median: 2.04 ms
- 75th percentile: 2.851 ms
- 95th percentile: 5.439 ms
- 98th percentile: 7.209 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 14.156 ms
- 99.99th percentile: 25.821 ms
- Max: 29.098 ms
Know your tools
Dropwizard Metric Values
- min: 0.252 ms
- median: 2.04 ms
- 75th percentile: 2.867 ms
- 95th percentile: 4.882 ms
- 98th percentile: 6.095 ms
- 99th percentile: 7.209 ms
- 99.9th percentile: 9.11 ms
- 99.99th percentile: 9.83 ms
- Max: 9.83 ms
Kamon Metric Values
- min: 0.234 ms
- median: 2.04 ms
- 75th percentile: 2.851 ms
- 95th percentile: 5.439 ms
- 98th percentile: 7.209 ms
- 99th percentile: 8.651 ms
- 99.9th percentile: 14.156 ms
- 99.99th percentile: 25.821 ms
- Max: 29.098 ms
Accept that your performance intuition sucks. Most of the time.
Log with context. Make sure that all logs related to a single request can be identified.
Understand the platform you are running on, especially in the reactive world.
Traditional vs Reactive Model
In the traditional world (looking at you, servlets)
In the Reactive World
So, How do I Kamon?
Kamon 1.0.0-RC4 is out!
Add Dependencies
// build.sbt
resolvers += Resolver.bintrayRepo("kamon-io", "snapshots")
libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core" % "1.0.0-RC4",
  "io.kamon" %% "kamon-akka-2.4" % "1.0.0-RC4",
  "io.kamon" %% "kamon-prometheus" % "1.0.0-RC4",
  "io.kamon" %% "kamon-zipkin" % "1.0.0-RC4",
  "io.kamon" %% "kamon-jaeger" % "1.0.0-RC4"
)
Add Configuration
// application.conf
kamon {
  environment {
    service = "kamon-showcase"
  }
  util.filters {
    "akka.tracked-actor" {
      includes = ["application/user/slow*"]
    }
    "akka.tracked-dispatcher" {
      includes = ["**"]
    }
    "akka.traced-actor" {
      includes = ["**"]
    }
  }
}
Start the Reporters
Kamon.addReporter(new PrometheusReporter())
Kamon.addReporter(new ZipkinReporter())
// OR
Kamon.loadReportersFromConfig()
SBT Plugin
// project/plugins.sbt
resolvers += Resolver.bintrayIvyRepo("kamon-io", "sbt-plugins")
addSbtPlugin("io.kamon" % "sbt-aspectj-play-runner" % "1.0.1")
Optional Step
Configure Logback
<configuration>
  <conversionRule conversionWord="traceID" converterClass="kamon.logback.LogbackTraceIDConverter" />
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] [%traceID] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="info">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
Optional Step
Configure your Prometheus, Grafana, Zipkin/Jaeger
Look at Prometheus
Look at your logs
[info] [3cace0e8c5a4f13d][akka.actor.default-dispatcher-5] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-5] HomeController - In the Controller
[info] [7665825824d90893][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the Controller
[info] [0fabed64a3a78776][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [26a67a44e184349b][akka.actor.default-dispatcher-3] HomeController - In the future.map
[info] [26a67a44e184349b][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7665825824d90893][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-2] HomeController - In the Controller
[info] [96c77b462855dae1][akka.actor.default-dispatcher-2] c.NonBlockingActor - done at the non-blocking actor.
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-5] HomeController - In the future.map
[info] [7d4544d88b5f90b2][akka.actor.default-dispatcher-12] c.NonBlockingActor - done at the non-blocking actor.
[info] [0dd690ff92e54334][akka.actor.default-dispatcher-6] HomeController - In the Controller
Look at Zipkin
Look at Jaeger
Metrics
Recording Metrics
val processingTime = Kamon.histogram("app.service.processing-time")
processingTime.record(42)
val httpStatusCodes = Kamon.counter("http.response.status")
val serverErrors = httpStatusCodes.refine("code" -> "500")
val clientErrors = httpStatusCodes.refine("code" -> "400")
serverErrors.increment()
clientErrors.increment(100)
// This is the same Histogram, everywhere.
Kamon.histogram("app.service.processing-time").record(42)
Tracing
True Distributed Tracing, finally.
val span = Kamon.buildSpan("my-operation")
  .withTag("span.kind", "server")
  .start()
// Do your stuff here
span.finish()
// You got traces, you got metrics!
Instrument once, report anywhere.
Creating Reporters
// Create your own reporter by implementing MetricReporter or SpanReporter
sealed trait Reporter {
  def start(): Unit
  def stop(): Unit
  def reconfigure(config: Config): Unit
}
trait MetricReporter extends Reporter {
  def reportTickSnapshot(snapshot: TickSnapshot): Unit
}
trait SpanReporter extends Reporter {
  def reportSpans(spans: Seq[FinishedSpan]): Unit
}
Available in 1.0.0-RC4
Akka, Akka HTTP, Akka Remote, Scala, Play, JDBC, System Metrics, Zipkin, Jaeger, Executors, Logback, StackDriver, Prometheus and Kamino
Coming soon
Documentation, a lot of it.
More updated modules from 0.6.x Series
a small shameless plug...
https://kamino.io
Thanks for Coming!
Get more info at http://kamon.io/
https://github.com/kamon-io
@kamonteam
BeeScala 2017