Pragmatic Tracing

Neven Miculinić

  • Debugging

  • Profiling

  • Resource attribution

Observability toolset

Logs

  • Access logs
  • Event logs
  • Application logs
  • dmesg
  • msg="Goal scored" player="Ivan Perisic" time="2018-07-11" stadium="Luzhniki" game="Croatia-England"
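A minimal sketch of how a key="value" log line like the one above could be produced. The `field` type and `formatLogfmt` helper are hypothetical names for illustration, not part of any logging library.

```go
package main

import (
	"fmt"
	"strings"
)

// field is one key/value pair of a structured log line (hypothetical type).
type field struct{ key, value string }

// formatLogfmt renders the fields as key="value" pairs separated by spaces,
// in the order given. %q quotes and escapes each value.
func formatLogfmt(fields []field) string {
	parts := make([]string, 0, len(fields))
	for _, f := range fields {
		parts = append(parts, fmt.Sprintf("%s=%q", f.key, f.value))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(formatLogfmt([]field{
		{"msg", "Goal scored"},
		{"player", "Ivan Perisic"},
		{"stadium", "Luzhniki"},
	}))
}
```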

Metrics

  • Error rate
  • Latency
  • Throughput
  • CPU utilization
  • Number of goals scored
  • Counters
  • Gauges
  • Histograms
  • Rank estimation
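The three basic metric types can be sketched in a few lines. This is an illustration only: the type names are made up here, the code is not safe for concurrent use, and a real metrics library would handle labels, locking and export.

```go
package main

import (
	"fmt"
	"sort"
)

// Counter only ever goes up, e.g. number of goals scored.
type Counter struct{ n uint64 }

func (c *Counter) Inc() { c.n++ }

// Gauge holds the latest value of something that moves up and down,
// e.g. CPU utilization.
type Gauge struct{ v float64 }

func (g *Gauge) Set(v float64) { g.v = v }

// Histogram counts observations per bucket. bounds are sorted upper
// bounds; the extra last slot collects values above every bound.
type Histogram struct {
	bounds []float64
	counts []uint64
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Observe increments the first bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
}

func main() {
	goals := &Counter{}
	goals.Inc()

	cpu := &Gauge{}
	cpu.Set(0.42)

	latencyMs := NewHistogram([]float64{10, 100})
	for _, v := range []float64{5, 20, 150} {
		latencyMs.Observe(v)
	}
	fmt.Println(goals.n, cpu.v, latencyMs.counts)
}
```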

PProf

FlameGraph

Traces

Trace:

  • Span: Ivan kicking the ball
  • Span: ball travelling
  • Span: ball entering the goal

Use cases

Overall performance overview

What percentage of the time goes to:

  • CPU work
  • GPU work
  • IO waiting
  • Lock waiting
  • DB work
  • ...

Detecting the big slow part

Detecting fanout

GET /users?tag=XXX
SELECT id FROM USERS WHERE tag=XXX

GET /user/id
SELECT * FROM USERS WHERE id=$1
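In a trace, this pattern shows up as the same operation name repeated many times under one request. A rough sketch of that check, assuming span names have already been extracted from a single trace; the `fanout` helper and the limit of 10 are arbitrary choices for illustration.

```go
package main

import "fmt"

// fanout counts spans by operation name and returns the names that
// occur more than limit times within one trace.
func fanout(spanNames []string, limit int) map[string]int {
	counts := make(map[string]int)
	for _, name := range spanNames {
		counts[name]++
	}
	suspects := make(map[string]int)
	for name, c := range counts {
		if c > limit {
			suspects[name] = c
		}
	}
	return suspects
}

func main() {
	// One list request followed by one lookup per returned id.
	spans := []string{"GET /users", "SELECT id FROM USERS"}
	for i := 0; i < 50; i++ {
		spans = append(spans, "GET /user/id", "SELECT * FROM USERS")
	}
	fmt.Println(fanout(spans, 10))
}
```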

Chrome tracing format

[
  {
    "name": "Asub",
    "cat": "PERF",
    "ph": "B",
    "pid": 22630,
    "tid": 22630,
    "ts": 829
  },
  {
    "name": "Asub",
    "cat": "PERF",
    "ph": "E",
    "pid": 22630,
    "tid": 22630,
    "ts": 833
  }
]
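Events like these are easy to emit from your own code. A minimal sketch using only the standard library; the `Event` type and helper names are hypothetical, and `ts` is taken to be in microseconds, as in the Chrome tracing format specification.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the fields used in the example above.
type Event struct {
	Name string `json:"name"`
	Cat  string `json:"cat"`
	Ph   string `json:"ph"` // "B" = begin, "E" = end
	Pid  int    `json:"pid"`
	Tid  int    `json:"tid"`
	Ts   int64  `json:"ts"` // timestamp in microseconds
}

// beginEnd returns the matching begin and end events for one operation.
func beginEnd(name, cat string, pid, tid int, startUs, endUs int64) []Event {
	return []Event{
		{Name: name, Cat: cat, Ph: "B", Pid: pid, Tid: tid, Ts: startUs},
		{Name: name, Cat: cat, Ph: "E", Pid: pid, Tid: tid, Ts: endUs},
	}
}

// encode serializes the events as a JSON array.
func encode(events []Event) (string, error) {
	b, err := json.Marshal(events)
	return string(b), err
}

func main() {
	out, err := encode(beginEnd("Asub", "PERF", 22630, 22630, 829, 833))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The resulting file can be loaded into a viewer that understands the format, such as the one from the Catapult project.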

Catapult project

Distributed tracing

Backends

  • Jaeger (Uber, now CNCF incubating project)
  • Zipkin (Twitter, now OpenZipkin)
  • Google Stackdriver
  • AWS X-Ray
  • ... 

Client-side vendor-neutral APIs

  • OpenTracing
  • OpenCensus

Commonalities

  • High GitHub star count
  • Multi-language implementation and support
  • Evolving standards
  • Various db/http/... middleware support
  • Various service mesh support

Differences

                     OT                               OC
Organization         CNCF Incubating                  Formerly Google, now different stakeholders
Feature set          Trace API only                   Trace & metrics API
Tracers              Specified in each API call       Set of global traces
Propagation format   Depends on backend               Specified by OC standard
Overall feeling      Evolving, some deprecated APIs   More polished API, nicer to work with

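OpenCensus example

// Create a Jaeger exporter and register it so finished spans get exported.
exporter, err := jaeger.NewExporter(jaeger.Options{
    Endpoint:    "http://localhost:14268",
    ServiceName: "opencensus-tracing",
})
if err != nil {
    log.Fatal(err)
}
trace.RegisterExporter(exporter)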

ctx, span := trace.StartSpan(ctx, "some/useful/name")
defer span.End()

span.AddAttribute(trace.StringAttribute("key", "value"))

span.Annotate(nil, "some useful annotation")
span.Annotate(
    []trace.Attribute{trace.BoolAttribute("key", true)},
    "some useful log data",
)

client = &http.Client{Transport: &ochttp.Transport{}}
http.ListenAndServe(addr, &ochttp.Handler{Handler: handler})

req = req.WithContext(ctx)
resp, err := client.Do(req)

func HandleXXX(w http.ResponseWriter, req *http.Request) {
    // When the handler is wrapped in ochttp.Handler, this context carries the server span.
    ctx := req.Context()
    // ...
}

Summary

Use cases

  • Reasoning about overall performance
  • Detecting big slow operations
  • Detecting fanout

Tools

  • Chrome tracing format
  • OpenTracing
  • OpenCensus
  • Various http/db/gRPC middlewares that inject OpenTracing/OpenCensus spans
  • Service mesh integrations

Backends

  • Jaeger
  • Zipkin
  • Cloud provided SaaS

Simple actionable steps

  • Pick a tracing backend; use sampling if needed
  • Add OpenTracing/OpenCensus middleware:
    • http
    • database
    • gRPC
    • ...
  • Observe the benefits
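On sampling: one common approach is to decide from the trace ID, so every service that sees the same trace makes the same keep/drop decision. The sketch below is an illustration under that assumption, not the algorithm of any particular backend; `sampled` is a hypothetical helper and it assumes trace IDs are uniformly random 64-bit values.

```go
package main

import (
	"fmt"
	"math"
)

// sampled reports whether a trace should be kept at the given rate
// (0 = keep nothing, 1 = keep everything). The decision depends only
// on the trace ID, so it is the same wherever it is evaluated.
func sampled(traceID uint64, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	// Keep the lowest `rate` fraction of the ID space.
	return float64(traceID) < rate*float64(math.MaxUint64)
}

func main() {
	ids := []uint64{42, 1 << 62, 1 << 63, math.MaxUint64}
	for _, id := range ids {
		fmt.Printf("trace %d kept at 50%%: %v\n", id, sampled(id, 0.5))
	}
}
```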

Further reading 

References

  • This presentation: https://slides.com/nmiculinic/pragmatic-tracing
  • Chrome tracing format specification: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
  • Jaeger documentation: https://www.jaegertracing.io/docs/
  • Catapult project: https://github.com/catapult-project/catapult
  • Monitoring in the time of Cloud Native: https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
  • OpenCensus: https://opencensus.io/
  • OpenTracing: http://opentracing.io/

  • GOTO 2016 • What I Wish I Had Known Before Scaling Uber to 1000 Services • Matt Ranney: https://www.youtube.com/watch?v=kb-m2fasdDY
  • "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU
  • "So, you want to trace your distributed system? Key design insights from years of practical experience": http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf

Q&A