Almost 30 years in the sector
Mostly as Software Engineer
Web - 3D - Middleware - Mobile - Big Data
More recently as Architect
Data - SRE - Infrastructure
Community
Apache Beam contributor
OpenTelemetry Collector contributor
Collibra
Principal Systems Architect
10 years of Linux and o11y
Software engineer
Hosting provider & ISP experience
Community
Self-hosting at deuxfleurs.fr
Garage geo-distributed S3 engine
https://garagehq.deuxfleurs.fr/
Collibra
Staff Production Engineer, SRE
"That Grafana guy"
A data intelligence platform powered by active metadata
AI Governance
Data Catalog
Data Governance
Data Lineage
Data Notebook
Data Privacy
Data Quality & Observability
Protect
Belgian origin, but now a global company
$ uptime
23:13:08 up 3 days, 2:06, 2 users, load average: 0.27, 0.29, 0.33
Metrics are all around
RRDtool, released July 1999 (25 years ago)
The beginning of time series
Breaking up collection, storage and display
collectd
statsd
graphite
The graphite web interface, from the graphite kickstart
push
query
gather and broker metrics
store and serve query
Prometheus and the pull model
application
exporter
prometheus
pull
query
expose metrics
store and serve query
alert manager
dashboard tool
Metric protocols over time
Carbon
api.http.requests.get.200 <value> <epoch_timestamp>
InfluxDB line protocol
api_http_requests,method=GET,endpoint=/api,status=200 value=<value> <epoch_timestamp>
Prometheus exporter
api_http_requests_total{method="GET", endpoint="/api", status="200"} <value>
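To make the difference between the three wire formats concrete, here is a minimal sketch that renders one sample in each of them. The class and method names are hypothetical, the field name `value` is an assumption, and escaping of special characters is left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class MetricFormats {

    // Carbon (Graphite) plaintext: dimensions are baked into the dotted path.
    public static String carbon(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds;
    }

    // InfluxDB line protocol: measurement,tag=value field=value timestamp.
    // The field name "value" is an assumption made for this sketch.
    public static String influx(String measurement, Map<String, String> tags,
                                double value, long epochNanos) {
        String tagPart = tags.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return measurement + "," + tagPart + " value=" + value + " " + epochNanos;
    }

    // Prometheus text exposition: name{label="value",...} value.
    // The timestamp is normally assigned by the scraper, so it is omitted here.
    public static String prometheus(String name, Map<String, String> labels, double value) {
        String labelPart = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + labelPart + "} " + value;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "GET");
        labels.put("endpoint", "/api");
        labels.put("status", "200");

        System.out.println(carbon("api.http.requests.get.200", 42.0, 1727821093L));
        System.out.println(influx("api_http_requests", labels, 42.0, 1727821093000000000L));
        System.out.println(prometheus("api_http_requests_total", labels, 42.0));
    }
}
```

The Carbon path has no room for extra dimensions without renaming the series, whereas the other two carry them as key-value pairs.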
[278968.646837] systemd-journald[43]: Time jumped backwards, rotating.
Log structure
Plain text
Structured text
Time jumped backwards, rotating.
Exported 182 nodes from 1 roots in 0.038s
172.169.5.255 - - [01/Oct/2024:22:18:13 +0000] "GET / HTTP/1.1" 200 1608 "-" "Mozilla/5.0 zgrab/0.x"
. ____ _ __ _ _
/\\ / ___'_ __ _ _(_)_ __ __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
\\/ ___)| |_)| | | | | || (_| | ) ) ) )
' |____| .__|_| |_|_| |_\__, | / / / /
=========|_|==============|___/=/_/_/_/
:: Spring Boot :: (v2.3.2.RELEASE)
.
.
.
... : Started SpringLoggerApplication in 3.054 seconds (JVM running for 3.726)
... : Starting my application 0
ASCII art
JSON logs, the end of the scale
JSON logs
{
"timestamp": "2022-01-18T11:12:13.000Z",
"level": "INFO",
"logger": "c.c.i.a.w.a.UserAuthenticationListener",
"message": "User demo logged in for the 84th time as admin", 🧙
"app_username": "demo",
"session_id": "ea4811878fbd8780367c16fb64ad6658",
"session_timeout_ms": 86400000,
"app_product_permissions": "viewer,editor,admin",
"user_consecutive_login": 84,
"app_action": "LOGIN",
"client_ip": "127.0.0.1"
}
The log format scale
🧙 Human
🤖 Machine
ASCII art
Plain Sentences
Formatted key-value pairs
JSON format
Binary-encoded (e.g. gRPC, WAL)
Log backends
Flat files: Syslog...
Indexed storage: ELK stack, Graylog
Columnar storage: Grafana Loki, ClickHouse
The (stack)traces
The issues
...and 18 more
Traces - requirements
Trace protocols over time
Dapper (Google, 2010)
Zipkin (Twitter, 2012)
Jaeger (Uber, 2015)
OpenTracing (2016)
W3C tracing context (2019)
OpenTelemetry (2019)
https://xkcd.com/927/
OpenCensus (Google, 2018)
2014
Plan !
...and remove what you don't use!
An alert MUST NOT fire unless there is an issue
We do not alert on planned changes
(maintenance, de-provisioning...)
An alert SHOULD indicate an actual problem
70% CPU is NOT an actual problem
(unless it impacts the service)
The rest are not alerts; they are NOTIFICATIONS or REPORTS
Still important, but not worth waking someone up for
What metrics do we need?
What about the time window?
T: Alert check, disk 75%
T+10s: Application writes to disk
T+40s: Disk full, app deletes file
T+60s: Alert check, disk 75%
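The timeline above can be replayed in a few lines. This is a sketch under stated assumptions: the usage curve, the 60-second check interval and the 90% threshold are all made up to mirror the slide, and the class name is hypothetical.

```java
public class AlertWindow {

    // Hypothetical disk usage (%) at second t: steady at 75%, climbing to 100%
    // between T+10s and T+40s, then back to 75% once the app deletes its file.
    public static int diskUsageAt(int t) {
        if (t < 10 || t > 40) {
            return 75;
        }
        return 75 + (t - 10) * 25 / 30;
    }

    // Point-in-time check: only looks at the instantaneous value at each check.
    public static boolean pointInTimeCheck(int intervalSeconds, int threshold) {
        for (int t = 0; t <= 60; t += intervalSeconds) {
            if (diskUsageAt(t) >= threshold) {
                return true;
            }
        }
        return false;
    }

    // Windowed check: looks at the maximum over the whole window.
    public static boolean windowMaxCheck(int threshold) {
        int max = 0;
        for (int t = 0; t <= 60; t++) {
            max = Math.max(max, diskUsageAt(t));
        }
        return max >= threshold;
    }

    public static void main(String[] args) {
        System.out.println("point-in-time fires: " + pointInTimeCheck(60, 90));
        System.out.println("window max fires:    " + windowMaxCheck(90));
    }
}
```

The point-in-time check samples only T and T+60s and never sees the full disk. Evaluating a maximum over the window does, at the cost of collecting at a finer resolution than the alert interval.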
Service A depends on service B
Service B is down
Who should alert?
1000 instances of service A depend on service B
Service B is down
Who should alert?
The client cannot reach service A
A service is returning 4xx for 99% of its responses
Metrics are always an aggregation. You lose information.
– Me
| Observed value | SUM | # | AVG | hist bytes | raw bytes |
|---|---|---|---|---|---|
| (none) | 0 | 0 | 0 | 32 | 0 |
| 4.1234 | 4.1234 | 1 | 4.1234 | 68 | 4 |
| 3.6942 | 7.8176 | 2 | 3.9088 | 68 | 8 |
| 5.1111 | 12.9287 | 3 | 4.3096 | 72 | 12 |
| 4.6356 | 17.5643 | 4 | 4.3911 | 72 | 16 |
| 4.2564 | 21.8207 | 5 | 4.3641 | 72 | 20 |
| 3.9999 | 25.8206 | 6 | 4.3034 | 72 | 24 |
| 3.7366 | 29.5572 | 7 | 4.2225 | 72 | 28 |
| 4.5434 | 34.1006 | 8 | 4.2626 | 72 | 32 |
| 2.9345 | 37.0351 | 9 | 4.1150 | 72 | 36 |
| 5.7234 | 42.7585 | 10 | 4.2759 | 76 | 40 |
| 13.5101 | 56.2686 | 11 | 5.1153 | 96 | 44 |
| 13.5000 | 69.7686 | 12 | 5.8141 | 96 | 48 |
Exemplar
| SUM | 153.7686 |
| # | 32 |
| AVG | 4.8053 |
| hist bytes | 96 |
| raw bytes | 128 |
4.2
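A small sketch of what the aggregation keeps and what it drops, using the first twelve observations from the walkthrough above. The bucket boundaries and the class name are assumptions chosen for illustration; they are not the ones used on the slides.

```java
public class AggregationDemo {

    // The twelve observations from the walkthrough.
    public static final double[] VALUES = {
        4.1234, 3.6942, 5.1111, 4.6356, 4.2564, 3.9999,
        3.7366, 4.5434, 2.9345, 5.7234, 13.5101, 13.5000
    };

    // Hypothetical bucket upper bounds; the last bucket is implicit (+Inf).
    public static final double[] BOUNDS = {2.5, 5.0, 10.0};

    public static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Non-cumulative bucket counts: counts[i] holds values in (BOUNDS[i-1], BOUNDS[i]].
    public static long[] bucketCounts(double[] values, double[] bounds) {
        long[] counts = new long[bounds.length + 1];
        for (double v : values) {
            int i = 0;
            while (i < bounds.length && v > bounds[i]) {
                i++;
            }
            counts[i]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double total = sum(VALUES);
        System.out.printf("sum=%.4f count=%d avg=%.4f%n",
                total, VALUES.length, total / VALUES.length);
        System.out.println(java.util.Arrays.toString(bucketCounts(VALUES, BOUNDS)));
    }
}
```

The average alone hides the two observations around 13.5. The bucket counts show that two values landed above 10, but no longer say which requests they were, which is the gap an exemplar fills.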
Tags (InfluxDB), labels (Prometheus), attributes (OpenTelemetry)
series_name [attributes...] value
Attributes have a value space.
http_code :
100, 101, 102, 103, 200, 201, 202, 203, 204, 205, 206, 207, 208, 226, 300, 301, 302, 303, 304, 305, 306, 307, 308, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 421, 422, 423, 424, 425, 426, 427, 428, 429, 431, 451, 500, 501, 502, 503, 504, 505, 506, 507, 508, 510, 511
Max cardinality is 64
Attributes cardinality can combine
request_count [http_code, http_verb] value
http_verb can be HEAD, GET, POST, PUT, PATCH, OPTIONS, DELETE, LINK, UNLINK (9)
Total (theoretical) cardinality is 64 x 9 = 576
Unlikely in practice
Total (likely) cardinality: 5 x 23 = 115 (~20%)
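The arithmetic above can be sketched as follows: the theoretical cardinality is the product of the attribute value spaces, while the observed cardinality is the number of distinct combinations actually emitted. The class name and the sample data are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class Cardinality {

    // Upper bound: product of the size of each attribute's value space.
    public static long theoretical(int... valueSpaceSizes) {
        long product = 1;
        for (int size : valueSpaceSizes) {
            product *= size;
        }
        return product;
    }

    // Observed: number of distinct attribute combinations actually seen.
    public static int observed(String[][] samples) {
        Set<String> combinations = new HashSet<>();
        for (String[] attributes : samples) {
            combinations.add(String.join("|", attributes));
        }
        return combinations.size();
    }

    public static void main(String[] args) {
        System.out.println("theoretical: " + theoretical(64, 9));
        System.out.println("likely:      " + theoretical(5, 23));

        // Hypothetical stream of (http_verb, http_code) pairs.
        String[][] samples = {
            {"GET", "200"}, {"GET", "200"}, {"POST", "201"}, {"GET", "404"}
        };
        System.out.println("observed:    " + observed(samples));
    }
}
```

Every attribute added to a series multiplies the upper bound, so monitoring the observed count is what tells you whether you are approaching the limit.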
Keep series cardinality below 10k
Monitor your usage
Drop metrics you don't need
Drop metrics you don't use
Configure software for variable cardinality levels
Only export details when you need them
Depending on the framework
Pull Mode - Prometheus
Push Mode - OpenTelemetry
TSDB stores time series datapoints
Any database system can be a TSDB
Some are simply more... efficient than others!
SPECIFICS
TSDB examples (Apache 2.0 or MIT)
| Solution | Query language(s) | Clustering |
|---|---|---|
| Elasticsearch | Lucene, KQL | Yes |
| Prometheus | PromQL | No |
| InfluxDB (3.x) | SQL, InfluxQL | Yes (Enterprise) |
| TimescaleDB | SQL | No |
| VictoriaMetrics | PromQL (extended) | Yes* |
| Grafana Mimir | PromQL | Yes |
SQL and Lucene
Good for...
Log-events containing metrics
PromQL
Good for...
Metrics and histograms
*for the vast majority of backends
Application restart
Re-Aggregation - Spatial
Re-Aggregation - Temporal
Dropping Data
Manually laying breadcrumbs
This error happened here
Hiding duration and counts in log lines
JUL, Log4J, SLF4J
Appenders and Formatters : JSON, Console, TCP, ...
OpenTelemetry adapts the existing APIs
public enum StandardLevel {
    FATAL(100),
    ERROR(200),
    WARN(300),
    INFO(400),
    DEBUG(500),
    TRACE(600),
    ALL(Integer.MAX_VALUE);

    // Numeric severity: the lower the value, the more severe the level
    private final int intLevel;

    StandardLevel(int intLevel) {
        this.intLevel = intLevel;
    }
}

// Attach request context to every log line emitted on this thread
ThreadContext.put("ipAddress", request.getRemoteAddr());
ThreadContext.put("hostName", request.getServerName());
// getAttribute returns an Object; ThreadContext values are Strings
ThreadContext.put("loginId", String.valueOf(session.getAttribute("loginId")));

void performWork() {
    ThreadContext.push("performWork()");
    try {
        LOGGER.debug("Performing work");
        // Perform the work
    } finally {
        // Always unwind the stack, even when the work throws
        ThreadContext.pop();
    }
}

// Clear the context when the request is done
ThreadContext.clear();
Syslog, journald, Filebeat, OpenTelemetry
StdOut/Err, Network, File
TB of logs per month
300 developers
No accountability
No visibility
2 MB+ log lines in the wild
Entire queries, value dumps, stack traces
traceparent:
00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01
version: 00
trace-id: 0af7651916cd43dd8448eb211c80319c
parent-id: b9c7c989f97918e1
flags: 01
tracestate:
sub=urn:usr:me,ip=127.0.0.1
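As a sketch of how a receiving service could split the `traceparent` header shown above into its four fields, assuming version `00` of the W3C format (exactly four dash-separated lowercase hex fields). The class name is hypothetical; in practice the OpenTelemetry propagators do this for you.

```java
public class TraceParent {
    public final String version;
    public final String traceId;
    public final String parentId;
    public final String flags;

    private TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    // Parses a version-00 traceparent value: version-traceid-parentid-flags.
    public static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 dash-separated fields");
        }
        if (!parts[0].matches("[0-9a-f]{2}")
                || !parts[1].matches("[0-9a-f]{32}")
                || !parts[2].matches("[0-9a-f]{16}")
                || !parts[3].matches("[0-9a-f]{2}")) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        // All-zero identifiers are reserved as invalid.
        if (parts[1].matches("0+") || parts[2].matches("0+")) {
            throw new IllegalArgumentException("all-zero trace-id or parent-id");
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }

    // Bit 0 of the flags field says the caller recorded (sampled) this trace.
    public boolean isSampled() {
        return (Integer.parseInt(flags, 16) & 0x01) != 0;
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-0af7651916cd43dd8448eb211c80319c-b9c7c989f97918e1-01");
        System.out.println(tp.traceId + " sampled=" + tp.isSampled());
    }
}
```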
OpenTelemetry implementation
Warning: Baggage is not added automatically to your signals
{
"name": "/v1/sys/health",
"context": {
"trace_id": "7bba9f33312b3dbb8b2c2c62bb7abe2d",
"span_id": "086e83747d0e381e",
"flags": "01"
},
"parent_id": "",
"start_time": "2021-10-22 16:04:01.209458162 +0000 UTC",
"end_time": "2021-10-22 16:04:01.209514132 +0000 UTC",
"status_code": "STATUS_CODE_OK",
"status_message": "",
"attributes": {
"net.transport": "IP.TCP",
"net.peer.ip": "172.17.0.1",
"net.peer.port": "51820",
"net.host.ip": "10.177.2.152",
"net.host.port": "26040",
"http.method": "GET",
"http.target": "/v1/sys/health",
"http.server_name": "mortar-gateway",
"http.route": "/v1/sys/health",
"http.user_agent": "Consul Health Check",
"http.scheme": "http",
"http.host": "10.177.2.152:26040",
"http.flavor": "1.1"
}
}
When the execution flow is broken we can split the traces with a link
Request
Process
Return
Job start
Job end
link: 584ca56
trace_id: 7836bc4
trace_id: 584ca56
link: 7836bc4
async
Head Sampling
% of trace-id
Tail Sampling
based on error state
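Head sampling on a percentage of trace-ids can be made deterministic, so that every service seeing the same trace-id reaches the same decision without coordination. The sketch below assumes a 32-character hex trace-id and uses its lower 64 bits as the random input. It is an illustration of the idea with a hypothetical class name, not the OpenTelemetry SDK sampler.

```java
public class HeadSampler {
    private final long upperBound;

    // ratio is the fraction of traces to keep, between 0.0 and 1.0.
    public HeadSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
        }
        // The cast saturates at Long.MAX_VALUE when ratio is 1.0.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    public boolean shouldSample(String traceIdHex) {
        if (traceIdHex.length() != 32) {
            throw new IllegalArgumentException("expected a 32-character hex trace-id");
        }
        // Lower 64 bits of the trace-id, with the sign bit cleared
        // so the value falls in the range 0 .. Long.MAX_VALUE.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16);
        long value = low & Long.MAX_VALUE;
        return upperBound == Long.MAX_VALUE || value < upperBound;
    }

    public static void main(String[] args) {
        HeadSampler tenPercent = new HeadSampler(0.10);
        String traceId = "0af7651916cd43dd8448eb211c80319c";
        System.out.println("keep " + traceId + "? " + tenPercent.shouldSample(traceId));
    }
}
```

The decision is taken before the outcome of the request is known, which is why head sampling cannot favour failed requests. Tail sampling can, because it waits until the whole trace has been collected.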
Automatic Instrumentation
use it when it's available: Java has good support
Manual Instrumentation
automatic doesn't solve all your traces
Logs have a history of not being well-structured
OpenTelemetry Events are Log + Semantic Conventions
Traces are richer than logs, but can be sampled
Don't do it without events governance
Don't stop at operational, start thinking business-events
Make sure you are in control of your data stream, route it to BI systems
Raw Data stored
Cardinality is less of an issue
Handled by logging systems, BigQuery, Snowflake
Use Tracing for your breadcrumbs
Use Events for important operations and business events
Augment with Metrics where you think you need to alert or act in real time
Logs should be last resort
OpenTracing
OpenCensus
2019 - Two Open-Source Projects merged
Many companies realized they were doing the same
Still, carefully investigate companies that provide their own SDK version
Avoid companies/projects that say they are OTEL compatible
Kubernetes does better
message Span {
  bytes trace_id = 1;
  bytes span_id = 2;
  string trace_state = 3;
  bytes parent_span_id = 4;
  fixed32 flags = 16;
  string name = 5;
  enum SpanKind {
    SPAN_KIND_UNSPECIFIED = 0;
    SPAN_KIND_INTERNAL = 1;
    SPAN_KIND_SERVER = 2;
    SPAN_KIND_CLIENT = 3;
    SPAN_KIND_PRODUCER = 4;
    SPAN_KIND_CONSUMER = 5;
  }
  SpanKind kind = 6;
  fixed64 start_time_unix_nano = 7;
  fixed64 end_time_unix_nano = 8;
  repeated opentelemetry.proto.common.v1.KeyValue attributes = 9;
  message Event {
    fixed64 time_unix_nano = 1;
    string name = 2;
    repeated opentelemetry.proto.common.v1.KeyValue attributes = 3;
  }
  repeated Event events = 11;
  message Link {
    bytes trace_id = 1;
    bytes span_id = 2;
    string trace_state = 3;
    repeated opentelemetry.proto.common.v1.KeyValue attributes = 4;
    fixed32 flags = 6;
  }
  repeated Link links = 13;
  Status status = 15;
}
message Status {
  string message = 2;
  enum StatusCode {
    STATUS_CODE_UNSET = 0;
    STATUS_CODE_OK = 1;
    STATUS_CODE_ERROR = 2;
  }
  StatusCode code = 3;
}
message TracesData {
  repeated ResourceSpans resource_spans = 1;
}
message ResourceSpans {
  opentelemetry.proto.resource.v1.Resource resource = 1;
  repeated ScopeSpans scope_spans = 2;
  string schema_url = 3;
}
message Resource {
  repeated opentelemetry.proto.common.v1.KeyValue attributes = 1;
  uint32 dropped_attributes_count = 2;
}
message ScopeSpans {
  opentelemetry.proto.common.v1.InstrumentationScope scope = 1;
  repeated Span spans = 2;
  string schema_url = 3;
}
message InstrumentationScope {
  string name = 1;
  string version = 2;
  repeated KeyValue attributes = 3;
  uint32 dropped_attributes_count = 4;
}
Keys and values which describe commonly observed concepts, protocols, and operations
General: General Semantic Conventions.
Cloud Providers: Semantic Conventions for cloud providers libraries.
CloudEvents: Semantic Conventions for the CloudEvents specification.
Database: Semantic Conventions for database operations.
Exceptions: Semantic Conventions for exceptions.
FaaS: Semantic Conventions for Function as a Service (FaaS) operations.
Feature Flags: Semantic Conventions for feature flag evaluations.
Generative AI: Semantic Conventions for generative AI (LLM, etc.) operations.
GraphQL: Semantic Conventions for GraphQL implementations.
HTTP: Semantic Conventions for HTTP client and server operations.
Messaging: Semantic Conventions for messaging operations and systems.
Object Stores: Semantic Conventions for object stores operations.
RPC: Semantic Conventions for RPC client and server operations.
System: System Semantic Conventions.
| Attribute | Type | Description | Examples | Requirement Level | Stability |
|---|---|---|---|---|---|
| container.id | string | Container ID. Usually a UUID, as for example used to identify Docker containers. The UUID might be abbreviated. | a3bf90e006b2 | Recommended | Experimental |
| container.image.id | string | Runtime specific image identifier. Usually a hash algorithm followed by a UUID. [1] | sha256:19c92d0a00d1b66d897bceaa7319bee0dd38a10a851c60bcec9474aa3f01e50f | Recommended | Experimental |
| container.image.name | string | Name of the image the container was built on. | gcr.io/opentelemetry/operator | Recommended | Experimental |
| container.image.tags | string[] | Container image tags. An example can be found in Docker Image Inspect. Should be only the <tag> section of the full name for example from registry.example.com/my-org/my-image:<tag>. | ["v1.27.1", "3.5.7-0"] | Recommended | Experimental |
| container.label.<key> | string | Container labels, <key> being the label name, the value being the label value. | container.label.app=nginx | Recommended | Experimental |
| container.name | string | Container name used by container runtime. | opentelemetry-autoconf | Recommended | Experimental |
| container.runtime | string | The container runtime managing this container. | docker; containerd; rkt | Recommended | Experimental |
| container.command | string | The command used to run the container (i.e. the command name). [4] | otelcontribcol | Opt-In | Experimental |
| Language | Traces | Metrics | Logs |
|---|---|---|---|
| C++ | Stable | Stable | Stable |
| C#/.NET | Stable | Stable | Stable |
| Erlang/Elixir | Stable | Development | Development |
| Go | Stable | Stable | Beta |
| Java | Stable | Stable | Stable |
| JavaScript | Stable | Stable | Development |
| PHP | Stable | Stable | Stable |
| Python | Stable | Stable | Development |
| Ruby | Stable | Development | Development |
| Rust | Beta | Alpha | Alpha |
| Swift | Stable | Development | Development |
activedirectorydsreceiver
aerospikereceiver
apachereceiver
apachesparkreceiver
awscloudwatchmetricsreceiver
awsfirehosereceiver
awsxrayreceiver
azuremonitorreceiver
carbonreceiver
datadogreceiver
dockerstatsreceiver
elasticsearchreceiver
filelogreceiver
filestatsreceiver
fluentforwardreceiver
githubreceiver
googlecloudmonitoringreceiver
googlecloudpubsubreceiver
googlecloudspannerreceiver
haproxyreceiver
hostmetricsreceiver
httpcheckreceiver
iisreceiver
influxdbreceiver
jaegerreceiver
jmxreceiver
journaldreceiver
k8sclusterreceiver
k8seventsreceiver
k8sobjectsreceiver
kafkametricsreceiver
kafkareceiver
kubeletstatsreceiver
lokireceiver
memcachedreceiver
mongodbatlasreceiver
mongodbreceiver
mysqlreceiver
nginxreceiver
osqueryreceiver
otelarrowreceiver
otlpjsonfilereceiver
podmanreceiver
postgresqlreceiver
prometheusreceiver
prometheusremotewritereceiver
pulsarreceiver
rabbitmqreceiver
receivercreator
redisreceiver
snmpreceiver
statsdreceiver
syslogreceiver
tcplogreceiver
webhookeventreceiver
zipkinreceiver
zookeeperreceiver
attributesprocessor
coralogixprocessor
cumulativetodeltaprocessor
deltatocumulativeprocessor
deltatorateprocessor
filterprocessor
geoipprocessor
groupbyattrsprocessor
groupbytraceprocessor
intervalprocessor
k8sattributesprocessor
logdedupprocessor
logstransformprocessor
metricsgenerationprocessor
metricstransformprocessor
probabilisticsamplerprocessor
redactionprocessor
remotetapprocessor
resourcedetectionprocessor
resourceprocessor
routingprocessor
schemaprocessor
spanprocessor
sumologicprocessor
tailsamplingprocessor
transformprocessor
batchprocessor
memorylimiterprocessor
alertmanagerexporter
alibabacloudlogserviceexporter
awscloudwatchlogsexporter
awsemfexporter
awskinesisexporter
awss3exporter
awsxrayexporter
azuredataexplorerexporter
azuremonitorexporter
carbonexporter
cassandraexporter
clickhouseexporter
coralogixexporter
datadogexporter
datasetexporter
dorisexporter
elasticsearchexporter
fileexporter
googlecloudexporter
googlecloudpubsubexporter
googlemanagedprometheusexporter
honeycombmarkerexporter
influxdbexporter
kafkaexporter
kineticaexporter
loadbalancingexporter
logicmonitorexporter
logzioexporter
lokiexporter
mezmoexporter
opencensusexporter
opensearchexporter
otelarrowexporter
prometheusexporter
prometheusremotewriteexporter
pulsarexporter
rabbitmqexporter
sapmexporter
sentryexporter
signalfxexporter
splunkhecexporter
sumologicexporter
syslogexporter
tencentcloudlogserviceexporter
zipkinexporter
Manipulate data:
Make your own!
[Diagram: collector pipelines, each a chain of Receiver (R), Processor (P) and Exporter (E), deployed in two tiers labelled x~5000 and x~50]
OpenTelemetry is the future of instrumentation and collection
The future of transport and pipelining
It doesn't focus on querying, storing, or dashboarding:
it leaves that to vendors or other projects
A walk through
Proprietary SDKs
Spring's favorite - Micrometer
Native OpenTelemetry
OpenTelemetry Java Agent
Manual Java Setup
Spring Boot OpenTelemetry Starter
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
    # recommended value from docs: https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-otel
    send_batch_size: 200
    send_batch_max_size: 200
    timeout: 5s
  transform/reserved_attributes:
    # slide shorthand: in a real transform processor config these entries
    # sit under trace_statements / metric_statements / log_statements
    - context: resource
      statements:
        - delete_key(attributes, "process.command_args")
        - delete_key(attributes, "process.executable.path")
        - delete_key(attributes, "process.runtime.description")
exporters:
  googlecloud:
    project: treactor
    log:
      default_log_name: opentelemetry.io/collector-exported-log
  googlemanagedprometheus:
    project: treactor
    metric:
      resource_filters:
        - prefix: cloud
        - prefix: host
      extra_metrics_config:
        enable_target_info: false
        enable_scope_info: false
  debug:
    verbosity: detailed
service:
  telemetry:
    metrics:
      address: "0.0.0.0:10000"
    logs:
      level: info
      encoding: json
  pipelines:
    metrics:
      receivers: [ otlp ]
      processors: [ batch, transform/reserved_attributes ]
      exporters: [ googlemanagedprometheus, debug ]
    logs:
      receivers: [ otlp ]
      processors: [ batch, transform/reserved_attributes ]
      exporters: [ googlecloud, debug ]
    traces:
      receivers: [ otlp ]
      processors: [ batch, transform/reserved_attributes ]
      exporters: [ googlecloud, debug ]
dependencies {
implementation "io.opentelemetry:opentelemetry-api:$otelapi"
implementation "io.opentelemetry:opentelemetry-sdk:$otelsdk"
implementation "io.opentelemetry:opentelemetry-exporter-otlp:$otelsdk"
agent "io.opentelemetry.javaagent:opentelemetry-javaagent:$otelagent"
implementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:$otelagent")
implementation group: 'org.apache.logging.log4j', name: 'log4j-api', version: '2.24.1'
implementation group: 'org.apache.logging.log4j', name: 'log4j-core', version: '2.24.1'
implementation group: 'org.apache.logging.log4j', name: 'log4j-slf4j-impl', version: '2.24.1'
implementation group: 'org.apache.logging.log4j', name: 'log4j-jul', version: '2.24.1'
implementation group: 'org.apache.logging.log4j', name: 'log4j-web', version: '2.24.1'
implementation('org.springframework.boot:spring-boot-starter-web')
implementation('org.springframework.boot:spring-boot-starter-thymeleaf')
}
task copyAgentJar(type: Copy) {
from configurations.agent
into "src/main/jib/app"
rename { fileName -> "opentelemetry-javaagent.jar" }
}
jib {
to {
image = 'gcr.io/treactor/treactor-java'
credHelper = 'osxkeychain'
tags = ['0.1.5']
}
container {
jvmFlags = ['-javaagent:/app/opentelemetry-javaagent.jar',
'-Xms512m',
'-Xdebug']
mainClass = 'io.treactor.springboot.Application'
ports = ['3330']
format = 'OCI'
}
}
tasks.jib.dependsOn(copyAgentJar)
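import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.MeterProvider;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.TracerProvider;
import io.opentelemetry.instrumentation.annotations.WithSpan;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class DevoxxTask {
    private static final Logger LOGGER = LoggerFactory.getLogger(DevoxxTask.class);
    // Assumed value; the constant is not shown on the slides
    private static final String INSTRUMENTATION_SCOPE_NAME = "treactor.devoxx";

    // Assumed shape of Task: it carries the span that was active when it was queued
    record Task(Span parent) {}

    private final Tracer tracer;
    private final DoubleHistogram histogram;
    private final Queue<Task> queue = new ConcurrentLinkedQueue<>();
    private final Random random = new Random();
    private final SimpleDateFormat dateFormat = new SimpleDateFormat("HH:mm:ss");

    public DevoxxTask() {
        TracerProvider tracerProvider = GlobalOpenTelemetry.getTracerProvider();
        tracer = tracerProvider.get("treactor.devoxx", "0.1");
        MeterProvider meterProvider = GlobalOpenTelemetry.getMeterProvider();
        Meter meter = meterProvider.get(INSTRUMENTATION_SCOPE_NAME);
        histogram = meter.histogramBuilder("devoxx.tasks.duration").build();
    }

    @WithSpan("addTask")
    public void addTask() {
        // Remember the producing span so the consumer can link back to it
        queue.add(new Task(Span.current()));
    }

    @Scheduled(fixedRate = 20000)
    @WithSpan("handleTasks")
    public void handleTasks() throws InterruptedException {
        Span.current().setAttribute("foo", "bar");
        while (true) {
            Task task = queue.poll();
            if (task == null) break;
            // Links are set on the builder, before the span is started
            Span span = tracer.spanBuilder("process")
                    .addLink(task.parent().getSpanContext())
                    .startSpan();
            try {
                int sleep = random.nextInt(250);
                Thread.sleep(sleep);
                LOGGER.info("The time is now {}", dateFormat.format(new Date()));
            } finally {
                span.end();
            }
        }
    }
}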
Consider dashboards and alerts when creating application metrics
Be flexible in what you produce, conservative in what you record
Consider the best tool for the job
Do OpenTelemetry