Connecting the Dots
Petra Bierleutgeb
@pbvie
with Distributed Tracing
Agenda
- Introduction to Distributed Tracing
- What, Why, How and Where
- Distributed Tracing in Scala
- transparent tracing vs. explicit tracing
- opentracing-java
- patterns for better ergonomics
- context propagation
- writing your own instrumentation
- Demo: Tracing a small system
we'll focus on this one
Motivation
- (Distributed) Tracing is a very useful tool
especially (but not only) in micro-service oriented systems - the OpenTracing Java API is not a (very) good fit for Scala
- relies on global state
- tedious to use with async code
- existing tooling often quite heavy-weighted
-
improve usability in Scala by leveraging language features
- explore feasibility of stateless tracing
Introduction to
Distributed Tracing
What, Why, How and Where
What
Distributed Tracing
- tells the story of a workflow that might involve multiple services from start to finish
- describes for a (usually non-trivial) task
- which steps are involved
- how are they connected
- how long did each step take (duration)
Helpful for
- understanding complex systems
- breaking down latency
- root cause analysis
- anomaly detection
- dependency analysis
How
A Trace
- can be thought of as a directed acyclic graph (DAG)
- nodes are spans
- edges are references
handle order request
create order event
publish order
get user
A Trace
- can be thought of as a directed acyclic graph (DAG)
- but is usually pictured as a Gantt chart
handle order request
create order event
publish order
get user
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5590765/gantt_small.png)
Terminology
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5599526/trace_example.png)
trace
span
- Trace: The complete workflow
- Span: a logical step/piece of work
Terminology
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5599526/trace_example.png)
baggage items
tags
-
Tag: key/value pairs, valid for one span
- examples: component, http.path, db.statement
-
Baggage Item: key/value pairs, propagated between spans
- examples: transactionId, userId
see: https://github.com/opentracing/specification/blob/master/semantic_conventions.md
Terminology
-
Logs: key/value pairs, attached to the span at a certain point in time
- used to indicate errors or other events of interest
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5593729/error_log.png)
see: https://github.com/opentracing/specification/blob/master/semantic_conventions.md
error log
References
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5602556/references.png)
- causal relationships between a child span and a parent span
-
childOf
- parent span depends on child span
- parent span may not finish before child span
-
followsFrom
-
parent span doesn't depend on (and wait for) result
of child span - example: publish command that triggers further action
-
parent span doesn't depend on (and wait for) result
followsFrom
childOf
...and where?
Tracing-agnostic applications vs. tracing as part of application code
Network Level
- + no changes to application code
- ~ coarse-grained compared to in-application tracing
- ~ required effort depends a lot on available infrastructure
- - not (very) useful for monoliths
- example: Tracing with Istio/Envoy
Application Level
- + detailed traces
- + works for all applications and infrastructures
- + provide more context
- - increases non-functional complexity*
- - requires support for different frameworks/tools*
- examples
- transparent: Lightbend Telemetry, Kamon
- manual: OpenTracing implementations
The OpenTracing API
...and a look at opentracing-java
OpenTracing API
- an implementation independent API for Distributed Tracing
- official API definitions available in several languages
- example: opentracing-java
- some concrete implementations
- Jaeger
- Zipkin
- DataDog
- some concrete implementations
OpenTracing
- benefits
- program against API instead of concrete implementation
=> no vendor lock-in - growing number of OpenTracing compatible tracers
- program against API instead of concrete implementation
The OpenTracing Java API
- Tracer
- GlobalTracer
- Scope
- Span
- SpanContext
- Carrier (Injector/Extractor)
OpenTracing Java - Example
Span span = tracer.buildSpan("someWork").start();
try (Scope scope = tracer.scopeManager().activate(span, false)) {
// Do things.
} catch(Exception ex) {
Tags.ERROR.set(span, true);
span.log(Map.of(Fields.EVENT, "error", Fields.ERROR_OBJECT, ex));
} finally {
span.finish();
}
Scope
- OpenTracing relies on a ScopeManager to store the currently active span
- + spans don't have to be passed around
- - introduces global state
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/4877644/grumpy.jpg)
Scopes and global state - a critical view
- OpenTracing APIs focus on state and often favor side-effects
- global/invisible state in concurrent programs is hard
- difficult to verify and understand
-
contrary to good programming practice
- immutability
- reduce side-effects
- behavior of functions depends only on input (arguments)
Scope? Who said anything about scope?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5542099/no_scope.png)
Tracing without Scopes
- the OpenTracing specification doesn't mention scopes
-
trace data does not include information about
- scopes
- span activation/deactivation
- can we have stateless tracing?
- exTra (explicit Tracing)
- experimental!
Sidenote: Tracing in FP
- explicit tracing aims to remove global state but there's still side effects all over the place
- fine for many applications but not suitable for pure FP
- projects
Distributed Tracing in Scala
Better ergonimics with
exTra
exTra API (some of it)
-
withSpan/withSpanAsync
execute a block of code with the current span, log errors and finish span on completion -
withChildSpan/withChildSpanAsync
create a new child span for the current span (provided via implicit TracingContext and execute the given block of code with it -
withFollowsFrom/withFollowsFromAsync
same as above but with followsFrom relationship instead of childOf
withSpan
def withSpan[T](span: Span)(op: Span => T): T =
try {
op(span)
} finally {
span.finish()
}
usage
withSpan(tracer.buildSpan("myOp").start()) { span =>
// do something
}
Making the span implicit
def withSpan[T](op: Span => T)(implicit span: Span): T =
try {
op(span)
} finally {
span.finish()
}
usage
implicit val span = ...
// some other code
withSpan { _ =>
// do something
}
Works great for error handling
def withSpan[T](span: Span)(op: Span => T): T =
try {
op(span)
} catch {
case t: Throwable =>
span.setTag(Tags.ERROR.getKey, true)
span.log(
Map(
Fields.EVENT -> "error",
Fields.ERROR_OBJECT -> t
).asJava
)
throw t
} finally {
span.finish()
}
Works great for error handling
implicit val span = ...
withSpan { _ =>
7 / 0
}
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5593729/error_log.png)
span is reported and marked as error
TracingContext
- in addition to the active span we often need access to the tracer
- TracingContext: wraps up tracer and span so they can be passed around more easily
final case class TracingContext(tracer: Tracer, span: Span)
TracingContext - Example
def calculateSum(a: Int, b: Int)(implicit tc: TracingContext): Int = {
val childSpan = tc.tracer.buildSpan("calculateSum").asChildOf(tc.span).start()
withSpan(childSpan) { _ =>
a + b
}
}
TracingContext - Example
def calculateSum(a: Int, b: Int)(implicit tc: TracingContext): Int = {
val childSpan = tc.tracer.buildSpan("calculateSum").asChildOf(tc.span).start()
withSpan(childSpan) { _ =>
a + b
}
}
this is a very common pattern...let's implement a more concise syntax for it
Implementing withChildSpan - 1
def withSpan[T](op: TracingContext => T)(implicit tc: TracingContext): T =
try {
op(tc.span)
} catch {
case t: Throwable =>
tc.span.setTag(Tags.ERROR.getKey, true)
tc.span.log(errorLog(t).asJava)
throw t
} finally {
tc.span.finish()
}
updating withSpan to work with TracingContext
Implementing withChildSpan - 2
def withChildSpan[T](
operationName: String
)(op: TracingContext => T)(implicit tc: TracingContext): T = {
val childSpan = tc.tracer.buildSpan(operationName).asChildOf(tc.span).start()
val childTracingContext = tc.copy(span = childSpan)
withSpan(op)(childTracingContext)
}
now we can implement withChildSpan
- create childSpan
- create new TracingContext with child span
- call previously defined withSpan function
withChildSpan - Usage
withSpan(TracingContext.withNewRootSpan("example")(tracer)) { implicit tc =>
withChildSpan("child-a") { _ =>
Thread.sleep(20)
}
withChildSpan("child-b") { _ =>
Thread.sleep(10)
}
withChildSpan("child-c") { _ =>
Thread.sleep(15)
7 / 0
}
}
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5601855/with-child-span.png)
withChildSpan - async
withSpan(TracingContext.withNewRootSpan("example")(tracer)) { implicit tc =>
withChildSpan("async-child") { _ =>
Future {
blocking(Thread.sleep(3000))
}
}
}
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5601941/with-child-async-wrong.png)
duration of 'async-child' is only 7.14ms
withChildSpan - async
def withSpan[T](op: TracingContext => T)(implicit tc: TracingContext): T =
try {
op(tc.span)
} catch {
case t: Throwable =>
tc.span.setTag(Tags.ERROR.getKey, true)
tc.span.log(errorLog(t).asJava)
throw t
} finally {
tc.span.finish()
}
- if op is returning a Future, this code doesn't work for 'childOf' relationships
withChildSpan - async
def withSpanAsync[T](op: TracingContext => Future[T])
(implicit ec: ExecutionContext, tc: TracingContext): Future[T] =
op(tc)
.recoverWith {
case t: Throwable =>
tc.span.setTag(Tags.ERROR.getKey, true)
tc.span.log(errorLog(t).asJava)
Future.failed(t)
}
.andThen {
case _ => tc.span.finish()
}
- use recover or recoverWith in combination with andThen
- makes sure span is finished after Future is completed
Cross-process Tracing
Transporting span context across service boundaries
Distributed Tracing
Service A
Service B
?
Traces usually involve multiple services
Distributed Tracing
- in order to continue a trace in a different process/service we need to pass over the span context
- propagation format depends on tracer implementation
- => the same OpenTracing implementation needs to be used on both sides
Span Context
- internal properties + user defined baggage items
- example: Jaeger
- key: uber-trace-id
value: 8c8c205d814630e:c94f0c7a2f89f95c:b41032f2168aae28:1
- key: uber-trace-id
trace id
span id
parent id
flags
Injector/Extractor
- used to inject/extract span context
- consists of a format and a carrier
- example:
- format: TEXT_MAP
- carrier: Kafka Headers, HTTP Headers
Implementing a custom extractor/injector
public interface TextMap extends Iterable<Map.Entry<String, String>> {
Iterator<Map.Entry<String,String>> iterator();
void put(String key, String value);
}
Example: Kafka
class KafkaHeaderInjector(private val headers: Headers) extends TextMap {
override def iterator(): util.Iterator[JavaMap.Entry[String, String]] =
throw new UnsupportedOperationException(
"should only be used with Tracer.inject()"
)
override def put(key: String, value: String): Unit =
headers.add(key, value.getBytes)
}
val record = new ProducerRecord(...)
tc.tracer.inject(tc.span.context(),
Format.Builtin.TEXT_MAP,
new KafkaHeaderInjector(record.headers()))
producer.send(record)
Demo System
Glühwein Ltd.
mulled wine
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
user service
order service
preparation service
client
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520353/db.png)
gRPC req/resp
Akka HTTP
Kafka
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Akka Actors
thirsty customer
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Code Example
1. Receive and handle
order request
user service
order service
preparation service
client
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520353/db.png)
gRPC req/resp
Akka HTTP
Kafka
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Akka Actors
thirsty customer
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
2. Verify user exists
user service
order service
preparation service
client
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520353/db.png)
gRPC req/resp
Akka HTTP
Kafka
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Akka Actors
thirsty customer
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
3. Publish and consume
create-order command
user service
order service
preparation service
client
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520353/db.png)
gRPC req/resp
Akka HTTP
Kafka
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Akka Actors
thirsty customer
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
4. Handle order
with Akka
user service
order service
preparation service
client
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520353/db.png)
gRPC req/resp
Akka HTTP
Kafka
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Akka Actors
thirsty customer
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5606108/final.png)
Result
The End
Thank you!
Questions?
Credits
![](https://s3.amazonaws.com/media-p.slid.es/uploads/695388/images/5520082/logo.png)
Icon made by Freepik from flaticon.com
Connecting the Dots with Distributed Tracing
By Petra Bierleutgeb
Connecting the Dots with Distributed Tracing
@Scala eXchange 2018
- 2,352