A story of Ops recovery

or

The 5 steps of Ops grief

Avishai Ish-Shalom (@nukemberg)

A long long time ago,

In a country far far away...

We inherited a system,

And it was a mess

  • > 500 servers
  • In a crappy DC
  • "Managed" by Puppet, Salt, Glu, shell & Ruby scripts
  • + a bunch of Windows servers
  • 3 Hadoop clusters
  • Rundeck, Docker, HBase, Couchbase, MySQL, Memcached, Redis, Java, Node.js, .NET...
    If you read a blog about it, we had it

 

And the guy who built the system had just quit...

(Denial)

We took a look at the Puppet code

  • ~40k LoC
  • No tests
  • No community modules
  • Spaghetti code
  • No Hiera

 

(Severe case of NIH syndrome)

(Denial)

We went to work

Act I

"Puppet is crap, let's use Chef!"

(Anger)

When in doubt, go with the flow

  • People revert to familiar things
  • "Blame it on the tool"

 

But...

  • Incumbents have an advantage
  • Tools are rarely the problem

Consolidate

  • Ditched Chef, Glu, Salt
  • Introduced Fabric
  • Puppet for all node management

Act II

"Maybe it's not so bad"

(Bargaining)

How many servers do we have?

  • DC lease bill
  • Zabbix
  • Puppet
  • Active Directory

Numbers didn't add up

We created an inventory

ELK-based inventory

  • No PuppetDB
  • We had ELK
  • Lucene queries
  • Aggregations
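
A minimal sketch of what "Lucene queries + aggregations" can look like against such an inventory: filter hosts with a query string, then count them per fact value. The function name and bucket name are made up for illustration; `kernel` and `operatingsystem` are standard Facter facts, but whether they aggregate cleanly depends on how the index maps them.

```python
import json


def inventory_query(lucene_query, group_by, size=20):
    """Build a search body: filter hosts with a Lucene query, then count them per fact value."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the documents
        "query": {"query_string": {"query": lucene_query}},
        "aggs": {
            # "hosts_by_fact" is an arbitrary bucket name chosen for this sketch
            "hosts_by_fact": {"terms": {"field": group_by, "size": size}},
        },
    }


# Example: how many Linux hosts per distribution?
body = inventory_query("kernel:Linux", "operatingsystem")
print(json.dumps(body, indent=2))
```

The printed body would be sent to the `_search` endpoint of the inventory index, for example with curl.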

 

Elasticsearch index template

facter --json | curl -XPUT -d @- "http://elasticsearch:9200/inventory/host/$(hostname -f)"

Updated by Puppet

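# $es_uri is assumed to be set by the enclosing class; the 'twice-a-day'
# schedule resource is assumed to be declared elsewhere.
case $::kernel {
  'Linux': {
    ensure_resource('package', 'curl', {'ensure' => 'present'})

    exec { 'inventory-facter':
      # "|| true" keeps a failed upload from failing the Puppet run
      command  => "facter --json -p | curl -XPUT ${es_uri} -d @- || true",
      provider => shell,
      schedule => 'twice-a-day',
    }
  }
  'windows': {
    exec { 'inventory-facter':
      command  => template('sg_base/inventory-command.ps1.erb'),
      provider => powershell,
      schedule => 'twice-a-day',
    }
  }
  default: {
    # Log and skip kernels the inventory class does not handle
    warning("Facter inventory class does not support ${::kernel}")
  }
}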

Why not write a facts terminus?

  • Simpler
  • Read only
  • PuppetDB co-existence
  • First version was based on Ohai

Act III

No alerts doesn't mean things are working 

(Depression)

Puppet reports dashboard

  • PE (Puppet Enterprise) - too expensive, takes time
  • PD (Puppet Dashboard) - deprecated
  • Puppet Explorer - nice, but simple
  • PuppetDB required
  • No ad-hoc aggregations, not flexible

Puppet ES reporter

  • Puppet module on GitHub
  • ES index template
  • Time-based index
  • Supports master/agent/standalone

ELK Dashboard

  • Send reports to ES
  • Use Kibana 4 dashboard
  • ES API

Act IV

Weaponize your data

(Acceptance)

Compare data-sources

  • DC billing statement
  • Inventory
  • Puppet reports
  • Zabbix
  • Switch configurations 
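
Comparing sources can be as simple as set arithmetic over hostnames. A minimal sketch, assuming each source has already been exported to a list of hostnames; the source names and hosts below are made up for illustration.

```python
def normalize(hostname):
    """Lower-case and trim so the same host compares equal across sources."""
    return hostname.strip().lower().rstrip(".")


def compare_sources(sources):
    """Return, per source, the hosts known to some other source but missing from it."""
    cleaned = {name: {normalize(h) for h in hosts} for name, hosts in sources.items()}
    everything = set().union(*cleaned.values())
    return {name: sorted(everything - hosts) for name, hosts in cleaned.items()}


# Hypothetical exports; real lists would come from the bill, ES, Zabbix, etc.
sources = {
    "inventory": ["web01.example.com", "db01.example.com"],
    "zabbix": ["WEB01.example.com", "cache01.example.com"],
    "puppet_reports": ["web01.example.com", "db01.example.com", "cache01.example.com"],
}

for name, missing in compare_sources(sources).items():
    print(f"{name}: missing {missing}")
```

Every non-empty "missing" list is a question to answer: an unmonitored host, a node Puppet never ran on, or a server that is billed but no longer exists.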

Find problems

  • ES aggregations on reports
  • Chronically restarted resources
  • Failed runs
  • Slow runs (relative to peers)
  • Aggregate by OS, DC, Role
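
"Slow relative to peers" can be made concrete by comparing each host with the median of its role. The sketch below does this in plain Python as a stand-in for an ES aggregation; the field names (`host`, `role`, `run_seconds`) and the factor of 2 are assumptions for illustration, not the reporter's actual schema.

```python
from collections import defaultdict
from statistics import median


def slow_runs(reports, factor=2.0):
    """Flag hosts whose run time exceeds `factor` times the median of their role."""
    by_role = defaultdict(list)
    for report in reports:
        by_role[report["role"]].append(report["run_seconds"])
    medians = {role: median(times) for role, times in by_role.items()}
    return sorted(
        r["host"] for r in reports
        if r["run_seconds"] > factor * medians[r["role"]]
    )


# Hypothetical report summaries, one per host
reports = [
    {"host": "web01", "role": "web", "run_seconds": 40},
    {"host": "web02", "role": "web", "run_seconds": 45},
    {"host": "web03", "role": "web", "run_seconds": 180},
    {"host": "db01", "role": "db", "run_seconds": 300},
    {"host": "db02", "role": "db", "run_seconds": 320},
]

print(slow_runs(reports))
```

Grouping by role matters: a 300 second run is normal for the db hosts here, while 180 seconds on a web host stands out.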

Opsole - operator context

motd - operator context

Act V

Taming a wild Puppet

(Acceptance)

Roles

  • Not a Puppet-native concept
  • Central module (dependencies!!)
  • Puppet-based node classifier
  • Everything else - Hiera
  • Simple logic

node default {
  include sg_base

  if $::role != undef {
    include "roles::${::role}"
  }
}

Alternatives

  • hiera_include node classifier
  • Multiple roles
  • Roles in modules

Hiera

  • Simplifies modules
  • Flexible
  • Separation of concerns
  • Consul integration

:hierarchy:
  - "node/%{::clientcert}"
  - "dc/%{::dc}"
  - "environment-role/%{::environment}/%{::role}"
  - "role/%{::role}"
  - "environment/%{::environment}"
  - "kernel/%{::kernel}"
  - common
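
How such a hierarchy resolves a key can be modeled in a few lines: interpolate the facts into each level, then take the first (most specific) level that defines the key. This is a simplified model of Hiera's priority lookup, not Hiera itself. It skips levels whose facts are missing, and the data and facts are made up for illustration.

```python
import re

HIERARCHY = [
    "node/%{::clientcert}",
    "dc/%{::dc}",
    "environment-role/%{::environment}/%{::role}",
    "role/%{::role}",
    "environment/%{::environment}",
    "kernel/%{::kernel}",
    "common",
]

TOKEN = re.compile(r"%\{::(\w+)\}")


def resolve(hierarchy, facts):
    """Interpolate %{::fact} tokens; skip levels that reference a missing fact."""
    paths = []
    for level in hierarchy:
        if all(facts.get(name) for name in TOKEN.findall(level)):
            paths.append(TOKEN.sub(lambda m: facts[m.group(1)], level))
    return paths


def lookup(key, hierarchy, facts, data):
    """Return the value from the most specific level that defines the key."""
    for path in resolve(hierarchy, facts):
        if key in data.get(path, {}):
            return data[path][key]
    raise KeyError(key)


# Hypothetical data files (path -> contents) and facts; note there is no "dc" fact
data = {
    "common": {"ntp_server": "ntp.example.com", "max_clients": 100},
    "role/web": {"max_clients": 500},
    "node/web01.example.com": {"max_clients": 50},
}
facts = {
    "clientcert": "web02.example.com",
    "role": "web",
    "environment": "production",
    "kernel": "Linux",
}

print(lookup("max_clients", HIERARCHY, facts, data))  # role-level override
print(lookup("ntp_server", HIERARCHY, facts, data))   # falls through to common
```

The ordering is the design decision: a node override beats the role, and the role beats the environment-wide default.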

Scaling the Puppetmaster

  • Ruby 1.9 + unicorn
  • Files served by nginx
  • Do we even need a master?

Untangling Spaghetti

  • Virtual resources != include
  • Separation of concerns/modeling
  • Functions
  • Use Hiera
  • Puppet code as system documentation
  • Design tips
    • contain > anchor
    • include > require
    • define > file{"${conf_d}/my-file.conf": }
    • avoid exec like the plague

Community modules

  • Fast ramp-up
  • Slow to iterate on
  • Featureful
  • Usually (battle) tested
  • Generic
  • Complex

Self-authored modules

  • Slow ramp-up
  • Easy to iterate on
  • Few features
  • Fits your internal patterns
  • Simpler

NIH syndrome

Tests

Tools

  • rspec-puppet/puppetlabs_spec_helper
  • ServerSpec
  • test-kitchen/beaker
  • Cultural barrier
  • Ssssoooo worth it
  • Don't overdo it
  • Unit tests for logic
  • Integration tests

Thank you for listening

(Applaud like your life depends on it)
