Pragmatic Tracing
Neven Miculinić
- Debugging
- Profiling
- Resource attribution
Observability toolset
Logs
- Access logs
- Event logs
- Application logs
- dmesg
- msg="Goal scored" player="Ivan Perisic" time="2018-07-11" stadium="Luzhniki" game="Croatia-England"
Metrics
- Error rate
- Latency
- Throughput
- CPU utilization
- Number of goals scored
- Counters
- Gauges
- Histograms
- Rank estimation
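The three metric types listed above can be sketched with the standard library alone. This is a minimal illustration, not any particular metrics library: the type names `Counter`, `Gauge` and `Histogram` and the bucket bounds are assumptions made for the example, and the types are not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Counter only ever increases (e.g. number of goals scored).
type Counter struct{ n uint64 }

func (c *Counter) Inc()          { c.n++ }
func (c *Counter) Value() uint64 { return c.n }

// Gauge holds a value that can move up or down (e.g. CPU utilization).
type Gauge struct{ v float64 }

func (g *Gauge) Set(v float64)  { g.v = v }
func (g *Gauge) Value() float64 { return g.v }

// Histogram counts observations into buckets with fixed upper bounds.
type Histogram struct {
	bounds []float64 // sorted, inclusive upper bounds
	counts []uint64  // len(bounds)+1; the last bucket catches overflow
}

func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe increments the first bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
}

func (h *Histogram) Counts() []uint64 { return h.counts }

func main() {
	goals := &Counter{}
	goals.Inc()

	cpu := &Gauge{}
	cpu.Set(0.73)

	// Hypothetical latency buckets in milliseconds: <=10, <=100, >100.
	latency := NewHistogram(10, 100)
	for _, ms := range []float64{3, 10, 42, 250} {
		latency.Observe(ms)
	}
	fmt.Println(goals.Value(), cpu.Value(), latency.Counts())
}
```

A histogram keeps only bucket counts, which is why percentiles derived from it are estimates rather than exact values.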
PProf
FlameGraph
Traces
Trace:
- Span: Ivan kicking the ball
- Span: Ball travelling
- Span: Ball entering the goal
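One way to picture the trace above is as a tree of timed spans. The sketch below assumes a hypothetical `Span` type with microsecond timestamps. The timings are invented for illustration and are not taken from the talk.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is one timed operation; a trace is a tree of spans.
type Span struct {
	Name     string
	StartUs  int64 // start time in microseconds
	EndUs    int64 // end time in microseconds
	Children []*Span
}

func (s *Span) DurationUs() int64 { return s.EndUs - s.StartUs }

// Render prints the span and its children, indented by depth.
func (s *Span) Render(depth int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s%s (%dus)\n", strings.Repeat("  ", depth), s.Name, s.DurationUs())
	for _, c := range s.Children {
		b.WriteString(c.Render(depth + 1))
	}
	return b.String()
}

func main() {
	// Invented timings, for illustration only.
	trace := &Span{Name: "Goal scored", StartUs: 0, EndUs: 900, Children: []*Span{
		{Name: "Ivan kicking the ball", StartUs: 0, EndUs: 100},
		{Name: "Ball travelling", StartUs: 100, EndUs: 800},
		{Name: "Ball entering the goal", StartUs: 800, EndUs: 900},
	}}
	fmt.Print(trace.Render(0))
}
```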
Use cases
Overall performance overview
What percentage of time is spent on:
- CPU work
- GPU work
- IO waiting
- Lock waiting
- DB work
- ...
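Given span durations tagged by category, the percentage breakdown is a simple aggregation. The sketch below assumes a hypothetical `Interval` type and assumes the intervals do not overlap in time. Overlapping work (for example parallel IO) would need a more careful accounting.

```go
package main

import (
	"fmt"
	"sort"
)

// Interval is time spent in one category, in microseconds.
type Interval struct {
	Category string
	Us       int64
}

// Breakdown returns the share of total time per category, in percent.
func Breakdown(intervals []Interval) map[string]float64 {
	sums := map[string]int64{}
	var total int64
	for _, iv := range intervals {
		sums[iv.Category] += iv.Us
		total += iv.Us
	}
	out := map[string]float64{}
	if total == 0 {
		return out // nothing recorded, nothing to attribute
	}
	for category, us := range sums {
		out[category] = 100 * float64(us) / float64(total)
	}
	return out
}

func main() {
	// Invented numbers, for illustration only.
	pct := Breakdown([]Interval{
		{"CPU work", 300},
		{"IO waiting", 500},
		{"DB work", 200},
	})
	names := make([]string, 0, len(pct))
	for name := range pct {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%-12s %5.1f%%\n", name, pct[name])
	}
}
```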
Detecting big slow operations
Detecting fanout
GET /users?tag=XXX
SELECT id FROM USERS WHERE tag=XXX
GET /user/id
SELECT * FROM USERS WHERE id=$1
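The pattern above, one request that triggers many identical child calls, can be found mechanically in trace data. The sketch below assumes a hypothetical flat `Span` record with a parent ID and flags any parent with at least `threshold` children of the same name. Real backends expose this differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a flattened span record; ParentID 0 marks a root span.
type Span struct {
	ID, ParentID int
	Name         string
}

// Fanout reports that one parent has Count children with the same name.
type Fanout struct {
	ParentID int
	Name     string
	Count    int
}

// DetectFanout returns every (parent, name) pair seen at least threshold times,
// ordered by descending count.
func DetectFanout(spans []Span, threshold int) []Fanout {
	type key struct {
		parent int
		name   string
	}
	counts := map[key]int{}
	for _, s := range spans {
		counts[key{s.ParentID, s.Name}]++
	}
	var out []Fanout
	for k, n := range counts {
		if n >= threshold {
			out = append(out, Fanout{ParentID: k.parent, Name: k.name, Count: n})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	spans := []Span{
		{ID: 1, ParentID: 0, Name: "GET /users?tag=XXX"},
		{ID: 2, ParentID: 1, Name: "SELECT id FROM USERS WHERE tag=XXX"},
	}
	// One GET /user/id call per matching user: the fanout.
	for id := 3; id <= 7; id++ {
		spans = append(spans, Span{ID: id, ParentID: 1, Name: "GET /user/id"})
	}
	for _, f := range DetectFanout(spans, 3) {
		fmt.Printf("span %d calls %q %d times\n", f.ParentID, f.Name, f.Count)
	}
}
```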
Chrome tracing format
[
{
"name": "Asub",
"cat": "PERF",
"ph": "B",
"pid": 22630,
"tid": 22630,
"ts": 829
},
{
"name": "Asub",
"cat": "PERF",
"ph": "E",
"pid": 22630,
"tid": 22630,
"ts": 833
}
]
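Producing this format from a program takes little more than a struct and a JSON encoder. The sketch below emits the same begin ("B") and end ("E") event pair as the example above. The `Event` type and the helper names are assumptions for illustration, and only the fields shown in the example are modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the fields used in the JSON example; ts is in microseconds.
type Event struct {
	Name string `json:"name"`
	Cat  string `json:"cat"`
	Ph   string `json:"ph"` // "B" = begin, "E" = end
	Pid  int    `json:"pid"`
	Tid  int    `json:"tid"`
	Ts   int64  `json:"ts"`
}

// SpanEvents returns the begin/end event pair for one timed operation.
func SpanEvents(name, cat string, pid, tid int, startUs, endUs int64) []Event {
	return []Event{
		{Name: name, Cat: cat, Ph: "B", Pid: pid, Tid: tid, Ts: startUs},
		{Name: name, Cat: cat, Ph: "E", Pid: pid, Tid: tid, Ts: endUs},
	}
}

// Encode serializes events as a JSON array.
func Encode(events []Event) (string, error) {
	b, err := json.Marshal(events)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := Encode(SpanEvents("Asub", "PERF", 22630, 22630, 829, 833))
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Saving the output to a file and loading it in chrome://tracing shows the span on a timeline.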
Catapult project
Distributed tracing
Backends
- Jaeger (Uber, now CNCF incubating project)
- Zipkin (Twitter, now OpenZipkin)
- Google Stackdriver
- AWS X-Ray
- ...
Client-side vendor-neutral APIs
- OpenTracing
- OpenCensus
Commonalities
- High GitHub star count
- Multi-language implementation and support
- Evolving standards
- Various db/http/... middleware support
- Various service mesh support
Differences
OT | | OC |
---|---|---|
CNCF Incubating | Organization | Formerly Google, now different stakeholders |
Trace API only | Feature set | Trace & metrics API |
Specified in each API call | Tracers | Set of global traces |
Depends on backend | Propagation format | Specified by OC standard |
Evolving, some deprecated APIs | Overall feeling | More polished API, nicer to work with |
OpenCensus example: registering an exporter
exporter, err := jaeger.NewExporter(jaeger.Options{
	Endpoint:    "http://localhost:14268",
	ServiceName: "opencensus-tracing",
})
if err != nil {
	log.Fatal(err)
}
trace.RegisterExporter(exporter)
OpenCensus example: starting a span
ctx, span := trace.StartSpan(ctx, "some/useful/name")
defer span.End()
OpenCensus example: adding attributes
span.AddAttribute(trace.StringAttribute("key", "value"))
OpenCensus example: annotating a span
span.Annotate(nil, "some useful annotation")
span.Annotate(
[]trace.Attribute{trace.BoolAttribute("key", true)},
"some useful log data",
)
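OpenCensus example: HTTP middleware
// Outgoing requests: the transport creates a span per request and propagates the trace context.
client = &http.Client{Transport: &ochttp.Transport{}}
// Incoming requests: the handler wrapper starts a span for each request served.
http.ListenAndServe(addr, &ochttp.Handler{Handler: handler})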
OpenCensus example: passing the context to an outgoing request
req = req.WithContext(ctx)
resp, err := client.Do(req)
OpenCensus example: reading the context in a handler
func HandleXXX(w http.ResponseWriter, req *http.Request) {
	ctx := req.Context() // carries the span started by ochttp.Handler
	// ...
}
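The mechanism behind the middleware can be sketched with the standard library only. The code below is not the ochttp implementation: it is a simplified stand-in that copies a hypothetical `X-Trace-Id` header into the request context, so that any handler calling `req.Context()` can read it. The header name and helper names are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// traceIDKey is an unexported context key, so other packages cannot collide with it.
type traceIDKey struct{}

// headerName is a hypothetical header, not the wire format of any real tracer.
const headerName = "X-Trace-Id"

// WithTraceID is middleware that moves the trace ID from the header into the context.
func WithTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		ctx := context.WithValue(req.Context(), traceIDKey{}, req.Header.Get(headerName))
		next.ServeHTTP(w, req.WithContext(ctx))
	})
}

// TraceIDFrom returns the trace ID stored in ctx, or "" if there is none.
func TraceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceIDKey{}).(string)
	return id
}

// RoundTrip sends one in-memory request through the middleware and
// returns the trace ID the inner handler observed.
func RoundTrip(traceID string) string {
	handler := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		fmt.Fprint(w, TraceIDFrom(req.Context()))
	})
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set(headerName, traceID)
	rec := httptest.NewRecorder()
	WithTraceID(handler).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(RoundTrip("abc123"))
}
```

Because the context travels with the request, handlers need no extra parameters to take part in a trace.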
Summary
Use cases
- Reasoning about overall performance overview
- Detecting big slow operations
- Fan out detection
Tools
- Chrome tracing format
- OpenTracing
- OpenCensus
- Use various HTTP/DB/gRPC middlewares that inject OpenTracing/OpenCensus spans
- Service mesh integrations
Backends
- Jaeger
- Zipkin
- Cloud provided SaaS
Simple actionable steps
- Pick a tracing backend; use sampling if needed
- Add OpenTracing/OpenCensus middleware:
  - HTTP
  - database
  - gRPC
  - ...
- Observe the benefits
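Sampling, mentioned in the first step, usually means keeping only a fraction of traces. A minimal sketch of one common approach follows: hash the trace ID so that every service makes the same keep-or-drop decision for the same trace. The function name `Sampled` and the use of FNV-1a are choices made for this illustration, not the behaviour of any specific tracer.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sampled reports whether the trace with this ID should be kept,
// given a sampling rate between 0 (keep nothing) and 1 (keep everything).
// The decision depends only on the ID, so it is stable across services.
func Sampled(traceID string, rate float64) bool {
	if rate <= 0 {
		return false
	}
	if rate >= 1 {
		return true
	}
	h := fnv.New64a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	// Map the hash onto 10000 slots and keep the first rate*10000 of them.
	return float64(h.Sum64()%10000) < rate*10000
}

func main() {
	kept := 0
	for i := 0; i < 1000; i++ {
		if Sampled(fmt.Sprintf("trace-%d", i), 0.1) {
			kept++
		}
	}
	fmt.Printf("kept %d of 1000 traces\n", kept)
}
```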
Further reading
References
- This presentation: https://slides.com/nmiculinic/pragmatic-tracing
- Chrome tracing format specification: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
- Jaeger documentation: https://www.jaegertracing.io/docs/
- Catapult project: https://github.com/catapult-project/catapult
- Monitoring in the time of Cloud Native: https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
- OpenCensus: https://opencensus.io/
- OpenTracing: http://opentracing.io/
- GOTO 2016 • What I Wish I Had Known Before Scaling Uber to 1000 Services • Matt Ranney: https://www.youtube.com/watch?v=kb-m2fasdDY
- "How NOT to Measure Latency" by Gil Tene: https://www.youtube.com/watch?v=lJ8ydIuPFeU
- "So, you want to trace your distributed system? Key design insights from years of practical experience": http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf
Q&A