Prometheus Instrumentation
13 Sept. 2018
What Is Prometheus?
Metrics-Based Monitoring
- Unified System for Metrics and Monitoring
- Pull-Based (Scrapes Targets)
- Integrates Seamlessly with Grafana for Graphs
- Includes a Powerful Expression Language
- Supports Multidimensional Metrics
- Alerts Based on Metrics
- Instrumentation Libraries for White Box Monitoring
Metrics Scraping
- Metrics Are Scraped Over HTTP
- Uses Service Discovery to Find Targets
- Simple Text-Based Format
chaumes@prometheus-901:~$ curl -sS localhost:9100/metrics | grep node_filesystem_avail
# HELP node_filesystem_avail Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail gauge
node_filesystem_avail{device="/dev/sda1",fstype="ext4",mountpoint="/"} 2.3620919296e+10
node_filesystem_avail{device="none",fstype="tmpfs",mountpoint="/run/lock"} 5.24288e+06
node_filesystem_avail{device="none",fstype="tmpfs",mountpoint="/run/shm"} 5.20429568e+08
node_filesystem_avail{device="none",fstype="tmpfs",mountpoint="/run/user"} 1.048576e+08
node_filesystem_avail{device="rpc_pipefs",fstype="rpc_pipefs",mountpoint="/run/rpc_pipefs"} 0
node_filesystem_avail{device="srv_salt",fstype="vboxsf",mountpoint="/srv/salt"} 4.1484754944e+11
node_filesystem_avail{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 1.03616512e+08
node_filesystem_avail{device="vagrant",fstype="vboxsf",mountpoint="/vagrant"} 4.1484754944e+11
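To make the text format concrete, here is a minimal sketch of reading one sample line into a name, a label dictionary, and a value. The function name `parse_sample` and both regular expressions are illustrative assumptions, not part of any Prometheus library, and the sketch deliberately ignores optional timestamps and escape sequences inside label values.

```python
import re

# One sample line looks like: metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one sample line into (name, labels, value).

    Returns None for blank lines and for # HELP / # TYPE comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))
```

Each distinct combination of metric name and labels in the output above is one time series; the trailing number is its current value.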
Instrumentation Libraries
- Python
- Ruby
- Java/Scala
- Go
- Bash (unofficial/3rd-party)
- Many Others (unofficial/3rd-party)
Metric Types
Counter
- Represents a Cumulative Numerical Value
- Monotonically Increases
  - i.e. the value only goes up, or resets to zero when the process restarts
- Useful for
  - number of requests served
  - tasks completed
  - number of errors
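Because a counter only goes up until the process restarts, any drop between two scrapes can be read as a reset. The sketch below computes total growth across a series of scraped values under that assumption. The helper name `increase` is hypothetical, and it omits the extrapolation that PromQL's own functions perform.

```python
def increase(samples):
    """Total increase across successive counter samples, tolerating resets.

    A drop between two samples is treated as a restart from zero, so the
    whole new value counts as growth since the reset.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted at 0 and has climbed to `current` since.
            total += current
    return total
```

For the scraped values 10, 15, 3, 8 the growth is 5 before the restart, then 3, then 5 after it.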
Gauge
- Represents a Single Numerical Value
- Can Increase or Decrease Arbitrarily
- Useful for
  - memory or CPU cycles used
  - number of threads or processes
  - number of tasks (e.g. in a queue)
  - number of objects (e.g. in a database)
Histogram
- Samples Observations in Configurable Buckets
- Cumulative Across Buckets
- Exposes Multiple Time Series
  - cumulative counters for the observation buckets
  - total sum of all observed values
  - count of events observed
- Useful for
  - Measuring Latencies/Response Times by Quantile
  - Approximating Apdex Scores
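The bucket, sum, and count series can be pictured with a toy histogram. This is a sketch for illustration only: `HistogramSketch` and `apdex` are made-up names, not a client library API, and the Apdex helper assumes both thresholds are configured bucket bounds.

```python
import bisect
import math


class HistogramSketch:
    """Toy histogram: cumulative bucket counters plus a running sum and count."""

    def __init__(self, upper_bounds):
        # Each bound is an inclusive upper limit; +Inf catches everything else.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.per_bucket = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.per_bucket[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, count) pairs; each count includes smaller buckets."""
        running, pairs = 0, []
        for bound, hits in zip(self.bounds, self.per_bucket):
            running += hits
            pairs.append((bound, running))
        return pairs


def apdex(hist, target, tolerated):
    """Approximate Apdex: satisfied requests plus half of the tolerated ones."""
    buckets = dict(hist.cumulative())
    return (buckets[target] + buckets[tolerated]) / 2 / hist.count
```

With bounds of 0.1, 0.3, and 1.2 seconds, five observations of 0.05, 0.2, 0.25, 0.9, and 3.0 give cumulative counts of 1, 3, 4, and 5, and an approximate Apdex of 0.7 for a 0.3 second target.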
Summary
- Similar to a Histogram
- Calculates Configurable Quantiles Over a Sliding Time Window
- Quantiles Cannot Be Aggregated (e.g. across multiple instances)
- Exposes Multiple Time Series
  - streaming quantiles of observed events
  - total sum of all observed values
  - count of observed events
- Useful for
  - the same kinds of measurements as histograms
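A summary's behaviour can be sketched as a sliding window of recent observations plus a sum and count that never expire. Real client libraries use streaming approximations rather than storing every sample; the class `SlidingWindowSummary` below is a hypothetical, exact version meant only to show the idea.

```python
import math
from collections import deque


class SlidingWindowSummary:
    """Toy summary: exact quantiles over recent samples, lifetime sum and count."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.sum = 0.0          # covers every observation ever made
        self.count = 0

    def observe(self, value, now):
        self.samples.append((now, value))
        self.sum += value
        self.count += 1

    def quantile(self, q, now):
        # Drop observations that have fallen out of the time window.
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
        if not self.samples:
            return math.nan
        ordered = sorted(value for _, value in self.samples)
        # Nearest-rank method: the smallest value covering a fraction q.
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
```

Because each instance reports quantiles it has already computed, there is no meaningful way to combine, say, two medians into the median of the whole fleet.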
Histogram or Summary?
- It's Complicated!
- Read Docs and Seek Guidance
- Guidelines Distilled
  - If you need to aggregate, use Histogram
  - If you have an idea of the range and distribution of values that will be observed, use Histogram
  - If you need an accurate quantile, regardless of the range and distribution of values, use Summary
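The aggregation guideline can be illustrated by summing bucket counts from several instances and then estimating a quantile from the merged buckets. This sketch uses linear interpolation within a bucket, in the spirit of PromQL's `histogram_quantile`; the function names are hypothetical, and it assumes every instance uses identical bucket bounds with a lowest bound above zero.

```python
import math


def merge_buckets(*instances):
    """Sum cumulative bucket counts from instances that share the same bounds."""
    merged = {}
    for buckets in instances:
        for upper_bound, count in buckets.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return merged


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound in bounds:
        count = buckets[upper_bound]
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing is known beyond the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
```

The accuracy of the estimate depends entirely on how well the bucket bounds match the observed values, which is why a rough idea of the range and distribution matters when choosing a histogram.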
Service Types
Online
- Human or System Expects an Immediate Response
- White Box Instrumentation Helps Diagnose Where a Problem Lies
- Key Metrics
  - number of performed queries (counter)
  - number of errors/exceptions (counter)
  - latency (histogram or summary)
- Pro Tip: Count Queries When They *END*
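One way the three key metrics might be wired up with the official Python client (`prometheus_client`, assumed to be installed) is sketched below. The metric names and the `handle` wrapper are invented for illustration, and a private registry is used so the example does not touch the process-wide default.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()

REQUESTS = Counter(
    "demo_requests_total", "Requests handled, counted on completion.",
    registry=registry,
)
ERRORS = Counter(
    "demo_request_errors_total", "Requests that raised an exception.",
    registry=registry,
)
LATENCY = Histogram(
    "demo_request_duration_seconds", "Request latency in seconds.",
    registry=registry,
)


def handle(work):
    """Run one request; `work` is any zero-argument callable."""
    start = time.perf_counter()
    try:
        return work()
    except Exception:
        ERRORS.inc()
        raise
    finally:
        # Counted when the request ENDS, so the request count always lines up
        # with the error and latency statistics for the same requests.
        LATENCY.observe(time.perf_counter() - start)
        REQUESTS.inc()
```

Counting at the end means a request still in flight never appears in the total without a matching latency observation.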
Offline
- Continually Running, but Nothing Awaits Response
- Key Metrics
  - Items In (counter)
  - Items in Progress (gauge)
  - Items Out (counter)
  - Items Sent (counter)
- Pro Tip: Use a Heartbeat to Expose Processing Time
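A single processing stage could expose these metrics roughly as follows, using the official Python client (`prometheus_client`, assumed to be installed). The metric names, the `process` function, and the `heartbeat_inserted_at` key are all assumptions made for this sketch: a heartbeat is modelled as a dummy item that carries the time it entered the pipeline.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()

ITEMS_IN = Counter(
    "demo_items_in_total", "Items received by this stage.", registry=registry
)
ITEMS_OUT = Counter(
    "demo_items_out_total", "Items handed to the next stage.", registry=registry
)
IN_PROGRESS = Gauge(
    "demo_items_in_progress", "Items currently being processed.", registry=registry
)
LAST_HEARTBEAT = Gauge(
    "demo_last_heartbeat_timestamp_seconds",
    "Insertion time (UNIX seconds) of the newest heartbeat seen by this stage.",
    registry=registry,
)


def process(item, transform):
    """Process one item (a dict); heartbeats pass through untouched."""
    ITEMS_IN.inc()
    with IN_PROGRESS.track_inprogress():
        if "heartbeat_inserted_at" in item:
            # Export the timestamp itself; the delay is computed at query time.
            LAST_HEARTBEAT.set(item["heartbeat_inserted_at"])
            result = item
        else:
            result = transform(item)
    ITEMS_OUT.inc()
    return result
```

Subtracting the exported heartbeat timestamp from the current time at query time shows how long items take to reach this stage.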
Batch
- Like an Offline Service, but Not Continually Running
- Cannot Be Scraped (Must Use the Pushgateway)
- Key Metrics
  - UNIX Timestamp of Last Successful Run (gauge)
  - UNIX Timestamp of Last Failed Run (gauge)
  - Duration of Each Processing Stage (gauge)
  - Overall Runtime (gauge)
  - Number of Records Processed (counter)
  - Number of Records Failed (counter)
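A batch job might collect these metrics in its own registry and push them once at exit. The sketch below uses the official Python client (`prometheus_client`, assumed to be installed); the metric names, `run_batch`, `push`, the gateway address `localhost:9091`, and the job name are placeholders, and per-stage durations are left out for brevity.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Gauge, pushadd_to_gateway


def run_batch(records, process):
    """Process every record and return a registry describing this run."""
    registry = CollectorRegistry()
    runtime = Gauge(
        "demo_batch_runtime_seconds", "Overall runtime of this run.",
        registry=registry,
    )
    processed = Counter(
        "demo_batch_records_processed_total", "Records processed.",
        registry=registry,
    )
    failed = Counter(
        "demo_batch_records_failed_total", "Records that failed.",
        registry=registry,
    )

    start = time.time()
    failures = 0
    for record in records:
        try:
            process(record)
            processed.inc()
        except Exception:
            failures += 1
            failed.inc()
    runtime.set(time.time() - start)

    # Register only the timestamp that applies to this run, so a failed run
    # does not overwrite the last success time with zero (and vice versa).
    outcome = "failure" if failures else "success"
    Gauge(
        f"demo_batch_last_{outcome}_timestamp_seconds",
        f"UNIX time of the last {outcome} run.",
        registry=registry,
    ).set_to_current_time()
    return registry


def push(registry, gateway="localhost:9091", job="demo_batch"):
    """Push the run's metrics; only metrics with the same names are replaced."""
    pushadd_to_gateway(gateway, job=job, registry=registry)
```

A run would end with something like `push(run_batch(records, process))`; `pushadd_to_gateway` is chosen over `push_to_gateway` here so that the timestamp from the other outcome survives on the gateway.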
Examples
Best Practices
Metric Names and Labels
- Use consistent names
- Group metrics together using labels
- But don't go too crazy with labels
- https://prometheus.io/docs/practices/naming/
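Why restraint with labels matters can be shown with simple arithmetic: every distinct combination of label values is its own time series. The helper `max_series` below is a hypothetical illustration that computes the worst-case series count for one metric.

```python
import math


def max_series(label_values):
    """Upper bound on time series for one metric.

    `label_values` maps each label name to the distinct values it can take;
    the bound is the product of the number of values per label.
    """
    return math.prod(len(values) for values in label_values.values())
```

Two methods and three status codes give at most 6 series; adding a label with 10,000 possible user IDs raises that ceiling to 60,000 for a single metric.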
General Instrumentation
- Watch out for inner loops (can generate LOTS of metrics)
- Use UNIX timestamps instead of measuring time since an event occurred
- Many more tips in the docs!
- https://prometheus.io/docs/practices/instrumentation/
By wryfi