Prometheus

- How we ditched our legacy monitoring -

 

 

@DambrineF

Florian Dambrine - Senior DevOps Engineer - @ GumGum

 > Whoami

  • Florian Dambrine

  • DevOps Engineer @ GumGum

  • Joined GumGum 5+ years ago

  • The one who set up the legacy monitoring...

 > Agenda

  • Monitoring the modern world
  • Legacy monitoring system matrix
  • Prometheus
    • Facts & Infrastructure overview
    • Dynamic registration and service discovery
    • Alerting with Prometheus
    • Dashboarding Prometheus metrics

 > Monitoring the modern world

 # Static host monitoring

 # Autoscaled host monitoring


 # Container monitoring

 # External services monitoring

 > Legacy Monitoring System Matrix

# Icinga2

 > Legacy Monitoring System Matrix

# Nagios NRPE

# NOTE:
# The following commands might be overridden in conf.d/*.cfg files
#
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_root_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_mnt_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -R '^/mnt'
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z

# Defines total_procs and load values based on the instance type:
#---- Load ----|------------- Tot. Procs ------------
# Crit    Warn | Crit  Warn
#   +2   1 + 4 |  +20   150 = 30 *  1 + 120 =  1 core
#   +2   2 + 4 |  +20   150 = 30 *  2 + 90  =  2 cores
#   +2   4 + 4 |  +20   210 = 30 *  4 + 90  =  4 cores
#   +2   8 + 4 |  +20   240 = 30 *  8       =  8 cores
#   +2  16 + 4 |  +20   480 = 30 * 16       = 16 cores
#   +2  32 + 4 |  +20   960 = 30 * 32       = 32 cores
#   +2  36 + 4 |  +20  1080 = 30 * 36       = 36 cores

# 2-core instance
command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 300 -c 400
command[check_load]=/usr/lib/nagios/plugins/check_load -w 10,8,6 -c 14,12,10

 > Legacy Monitoring System Matrix

# Ganglia / Gmond

 > Legacy Monitoring System Matrix

# Monitoring

  • No data collection at the container level
  • No custom app metrics collection

 > Legacy Monitoring System Matrix

# Data collection / aggregation

  • Blind in a container world!
  • Lack of control over groups of hosts

 > Legacy Monitoring System Matrix

 # Data visualisation

  • Lack of readability
  • Lack of freedom
  • Not a pleasure to browse

 > Legacy Monitoring System Matrix

# Alerting

  • Alert/email fatigue!
  • Cascading emails
  • No alerting at the cluster level

 > Legacy Monitoring System Matrix

# Heatmap

[Heatmap: Targets / Tooling matrix showing how Icinga 2, Nagios and Ganglia cover Static Hosts, Autoscaled Hosts, Containers, External Services / APIs, and Clusters / Groups of hosts]

 > Prometheus - Facts

  • Open source project (CNCF Graduated)
  • First commit 7 years ago
  • Time series collection happens via a pull model over HTTP (though not exclusively; see the sketch after this list)
  • Autonomous single-node servers
  • Targets are discovered via service discovery or static configuration
  • PromQL metrics query language
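
As a quick illustration of the pull model and PromQL (a sketch, assuming a node_exporter listening on its default port 9100): each target exposes its metrics over HTTP in the Prometheus text format, and the server scrapes them on the configured interval.

$ curl -s http://localhost:9100/metrics | grep '^node_load1 '
node_load1 0.42

Every scraped sample becomes a time series that can then be queried with PromQL, e.g. avg by (instance) (node_load1) in the expression browser.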

 > Prometheus - Infrastructure Walkthrough

 > Prometheus - Dynamic discovery

### Global configs
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

### Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: http
      consul_sd_configs:
        - server: 169.254.1.1:8500
          datacenter: us-east-1
          services: [ 'alertmanager' ]

rule_files:
  - 'rules/*'

### Target Scrape Configs
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: prometheus
    consul_sd_configs:
      - server: 169.254.1.1:8500
        datacenter: us-east-1
        services: [ 'prometheus' ]
    relabel_configs:
      - source_labels: [ __meta_consul_tags ]
        regex: ',(?:[^,]+,){0}([^=]+)=([^,]+),.*'
        replacement: '${2}'
        target_label: '${1}'
      - ...
      - ...
$ vim prometheus.yml
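
The relabel rule above turns the first key=value entry of __meta_consul_tags into a Prometheus label, so a target only needs to register itself in Consul with the right tags to be discovered and labeled. A minimal sketch of such a registration (the file path, port and tag values are illustrative assumptions):

$ cat /etc/consul.d/prometheus.json
{
  "service": {
    "name": "prometheus",
    "port": 9090,
    "tags": ["env=prod", "role=monitoring"]
  }
}
$ consul reload

With this registration, env=prod surfaces as the label env="prod" on every series scraped from that target, without ever editing prometheus.yml by hand.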

 > Prometheus - Alerting with Prometheus 1/2

groups:
  - name: verity
    rules:
    - alert: ProdVerityHigh5XX
      expr: sum(
        increase(
            traefik_backend_requests_total{cluster="va-verity-ecs--prod", 
                                           backend=~"backend-verity-api__.*",
                                           code=~"5.*"}[1m])) 
        > 20
      for: 1m
      labels:
        severity: critical
        service: verity-api-prod
      annotations:
        summary: "High 5XX for {{ $labels.service }} for more than 1 minute."
$ vim rules/app.yml
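
Rule files like this one can be validated before Prometheus loads them, for example with promtool, which ships with Prometheus (the reload call below is a sketch and assumes the server was started with --web.enable-lifecycle):

$ promtool check rules rules/app.yml
$ curl -X POST http://localhost:9090/-/reload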

# Alertmanager


 > Prometheus - Alerting with Prometheus 2/2

# The root route on which 
# each incoming alert enters.
route:
  group_by: ['alertname', 'service', 'backend']
  # Wait before sending first alert.
  group_wait: 30s
  # Once the first alert is sent, wait before sending a batch
  # (OF NEW ALERTS IN THE SAME GROUP)
  group_interval: 15m
  # If alert sent, wait before resending
  repeat_interval: 30m

  # Child route trees.
  routes:
    - match:
        service: verity
      receiver: verity
receivers:
  - name: 'ops'
    # SLACK INTEGRATION
    slack_configs:
      - send_resolved: true
        api_url: https://hooks.slack.com/services/...
        channel: '@CHANNEL_NAME'
  
    # PAGER DUTY INTEGRATION 
    # (Escalation policy service key):
    pagerduty_configs:
      - send_resolved: true
        service_key: ...

  - name: 'verity'
    # EMAIL INTEGRATION
    email_configs:
      - send_resolved: true
        to: ops@gumgum.com
        from: noreply@gumgum.com
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if these labels match:
    equal: ['alertname', 
            'service', 
            'backend']
$ vim alertmanager.yml
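
The routing tree can be exercised without waiting for a real incident, for instance with amtool (a sketch; it assumes Alertmanager listens on its default port 9093 and reuses the labels matched by the child route above):

$ amtool check-config alertmanager.yml
$ amtool alert add alertname=ProdVerityHigh5XX service=verity severity=critical \
    --alertmanager.url=http://localhost:9093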

 > Prometheus - Dashboarding metrics

# Grafana

WLAD Tip #0003

$ wget \
    -qO wlad-0003.pdf \
    https://bit.ly/2wP3GeZ \
    && open wlad-0003.pdf

Pssssst: It's about the USE & RED methods for building relevant Grafana dashboards

USE: Utilization, Saturation and Errors
RED: Rate, Errors and Duration

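Applied to the Traefik metrics used in the alerting example, the three RED signals map to three PromQL panels. A sketch (traefik_backend_requests_total is the counter from the alert rule above; the duration histogram name is an assumption about the Traefik exporter):

# Rate: requests per second per backend
sum by (backend) (rate(traefik_backend_requests_total{backend=~"backend-verity-api__.*"}[5m]))

# Errors: share of 5XX responses
sum(rate(traefik_backend_requests_total{backend=~"backend-verity-api__.*", code=~"5.*"}[5m]))
  / sum(rate(traefik_backend_requests_total{backend=~"backend-verity-api__.*"}[5m]))

# Duration: 95th percentile latency
histogram_quantile(0.95,
  sum by (le) (rate(traefik_backend_request_duration_seconds_bucket{backend=~"backend-verity-api__.*"}[5m])))
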
 > Thanks!

🚀 We are hiring!

Prometheus - How we ditched our legacy monitoring systems

By Florian Dambrine


Learn how we ditched our legacy monitoring systems at @GumGum and how we transitioned to a modern, flexible monitoring solution based on the Prometheus ecosystem. We will spend time going over today's monitoring challenges and see how a Prometheus / Consul / Grafana ecosystem can help you monitor a modern stack.
