A story of Ops recovery
or
The 5 steps of Ops grief
Avishai Ish-Shalom (@nukemberg)
A long long time ago,
In a country far far away...
We inherited a system,
And it was a mess
- > 500 servers
- In a crappy DC
- "Managed" by puppet, salt, glu, shell & ruby scripts
- + bunch of Windows servers
- 3 Hadoop clusters
- Rundeck, Docker, HBase, CouchBase, MySQL, Memcached, Redis, Java, Node.js, .NET...
If you read a blog about it, we had it
And the guy who built the system had just quit...
(Denial)
We took a look at the Puppet code
- ~40k LoC
- No tests
- No community modules
- Spaghetti code
- No Hiera
(Severe case of NIH syndrome)
(Denial)
We went to work
Act I
"Puppet is crap, let's use Chef!"
(Anger)
When in doubt, go with the flow
- People revert to familiar things
- "Blame it on the tool"
But...
- Incumbents have an advantage
- Tools are rarely the problem
Consolidate
- Ditched Chef, Glu, Salt
- Introduced Fabric
- Puppet for all node management
Act II
"Maybe it's not so bad"
(Bargaining)
How many servers do we have?
- DC lease bill
- Zabbix
- Puppet
- Active Directory
Numbers didn't add up
We created an inventory
ELK based inventory
facter --json | curl -XPUT -d @- http://elasticsearch:9200/inventory/host/`hostname -f`
Updated by Puppet
case $::kernel {
  'Linux': {
    # curl is needed to push the facts to Elasticsearch
    ensure_resource('package', 'curl', {'ensure' => 'present'})
    exec { 'inventory-facter':
      # "|| true" keeps a failed upload from failing the Puppet run
      command  => "facter --json -p | curl -XPUT ${es_uri} -d @- || true",
      provider => shell,
      schedule => 'twice-a-day',
    }
  }
  'windows': {
    exec { 'inventory-facter':
      command  => template('sg_base/inventory-command.ps1.erb'),
      provider => powershell,
      schedule => 'twice-a-day',
    }
  }
  default: {
    warning("Facter inventory class does not support ${::kernel}")
  }
}
Why not write a facts terminus?
- Simpler
- Read only
- PuppetDB co-existence
- First version was based on Ohai
Act III
No alerts doesn't mean things are working
(Depression)
Puppet reports dashboard
- Puppet Enterprise (PE) - too expensive, takes time
- Puppet Dashboard (PD) - deprecated
- Puppet Explorer - nice, but simple
- PuppetDB required
- No ad-hoc aggregations, not flexible
Puppet ES reporter
- Puppet module in GitHub
- ES index template
- Time based index
- Supports master/agent/standalone
ELK Dashboard
- Send reports to ES
- Use Kibana 4 dashboard
- ES API
Act IV
Weaponize your data
(Acceptance)
Compare data-sources
- DC billing statement
- Inventory
- Puppet reports
- Zabbix
- Switch configurations
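The comparison above boils down to set reconciliation: collect the hostnames each source knows about and list the hosts that at least one source is missing. A minimal sketch, assuming each source can be exported as a list of hostnames; the source names and sample hosts below are hypothetical, not data from the talk.

```python
def reconcile(sources):
    """Compare host sets from several data sources.

    sources: dict mapping a source name to an iterable of hostnames.
    Returns a dict mapping each hostname that is missing from at least
    one source to the sorted list of sources that do not know it.
    """
    # Normalize names so "Web01 " and "web01" count as the same host.
    normalized = {
        name: {h.strip().lower() for h in hosts}
        for name, hosts in sources.items()
    }
    all_hosts = set().union(*normalized.values())
    gaps = {}
    for host in sorted(all_hosts):
        missing = sorted(n for n, hosts in normalized.items() if host not in hosts)
        if missing:
            gaps[host] = missing
    return gaps


# Hypothetical sample data for illustration only.
sources = {
    "inventory": ["web01.example.com", "web02.example.com", "db01.example.com"],
    "puppet_reports": ["web01.example.com", "db01.example.com"],
    "zabbix": ["web01.example.com", "web02.example.com", "old99.example.com"],
}

gaps = reconcile(sources)
for host, missing in gaps.items():
    print(f"{host}: missing from {', '.join(missing)}")
```

A host that shows up in monitoring but not in the inventory is a candidate for a forgotten server; one in the inventory without Puppet reports is a candidate for a broken agent.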
Find problems
- ES aggregations on reports
- Chronically restarted resources
- Failed runs
- Slow runs (relative to peers)
- Aggregate by O/S, DC, Role
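"Slow runs (relative to peers)" can be illustrated in memory: take each host's median run time and flag hosts that are far above the median of the hosts sharing their role. This is only a sketch of the idea, not the Elasticsearch aggregation used in the talk; the field names (`host`, `role`, `run_seconds`), the threshold factor, and the sample reports are assumptions.

```python
from collections import defaultdict
from statistics import median


def slow_hosts(reports, factor=2.0):
    """Return hosts whose median run time exceeds factor x their role's median."""
    per_host = defaultdict(list)
    role_of = {}
    for r in reports:
        per_host[r["host"]].append(r["run_seconds"])
        role_of[r["host"]] = r["role"]

    # One representative number per host, robust to a single odd run.
    host_median = {h: median(times) for h, times in per_host.items()}

    # Peer baseline: median of the host medians within each role.
    per_role = defaultdict(list)
    for host, m in host_median.items():
        per_role[role_of[host]].append(m)
    role_median = {role: median(ms) for role, ms in per_role.items()}

    return sorted(
        h for h, m in host_median.items()
        if m > factor * role_median[role_of[h]]
    )


# Hypothetical report records for illustration only.
reports = [
    {"host": "web01", "role": "web", "run_seconds": 30},
    {"host": "web01", "role": "web", "run_seconds": 32},
    {"host": "web02", "role": "web", "run_seconds": 28},
    {"host": "web02", "role": "web", "run_seconds": 31},
    {"host": "web03", "role": "web", "run_seconds": 120},
    {"host": "web03", "role": "web", "run_seconds": 150},
    {"host": "db01", "role": "db", "run_seconds": 200},
    {"host": "db01", "role": "db", "run_seconds": 210},
    {"host": "db02", "role": "db", "run_seconds": 190},
    {"host": "db02", "role": "db", "run_seconds": 205},
]

print(slow_hosts(reports))
```

Comparing against role peers rather than a global threshold keeps naturally slow roles (the database hosts here) from drowning out the real outlier.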
Opsole - operator context
motd - operator context
Act V
Taming a wild Puppet
(Acceptance)
Roles
- Not a puppet-native concept
- Central module (dependencies!!)
- Puppet based node classifier
- Everything else - hiera
- Simple logic
node default {
  include sg_base
  if $::role != undef {
    include "roles::${::role}"
  }
}
Alternatives
- hiera_include node classifier
- Multiple roles
- Roles in modules
Hiera
- Simplifies modules
- Flexible
- Separation of concerns
- Consul integration
:hierarchy:
  - "node/%{::clientcert}"
  - "dc/%{::dc}"
  - "environment-role/%{::environment}/%{::role}"
  - "role/%{::role}"
  - "environment/%{::environment}"
  - "kernel/%{::kernel}"
  - common
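To show how such a hierarchy resolves a key, here is a simplified Python model of a priority (first match wins) lookup: each level is interpolated with the node's facts and the most specific level that defines the key supplies the value. It is a sketch of the concept, not Hiera's implementation; in particular, skipping levels whose facts are missing is a simplification, and the facts and data values are hypothetical.

```python
import re

HIERARCHY = [
    "node/%{::clientcert}",
    "dc/%{::dc}",
    "environment-role/%{::environment}/%{::role}",
    "role/%{::role}",
    "environment/%{::environment}",
    "kernel/%{::kernel}",
    "common",
]

_VAR = re.compile(r"%\{::(\w+)\}")


def resolve_levels(hierarchy, facts):
    """Interpolate facts into each level; skip levels that need a missing fact."""
    levels = []
    for template in hierarchy:
        names = _VAR.findall(template)
        if all(n in facts for n in names):
            levels.append(_VAR.sub(lambda m: facts[m.group(1)], template))
    return levels


def lookup(key, hierarchy, facts, data):
    """Return the value from the most specific level that defines key."""
    for level in resolve_levels(hierarchy, facts):
        if key in data.get(level, {}):
            return data[level][key]
    raise KeyError(key)


# Hypothetical facts and data, keyed by resolved level name.
facts = {
    "clientcert": "web01.example.com",
    "dc": "dc1",
    "environment": "production",
    "role": "web",
    "kernel": "Linux",
}
data = {
    "common": {"ntp_server": "ntp.example.com", "max_clients": 100},
    "role/web": {"max_clients": 500},
    "environment-role/production/web": {"max_clients": 2000},
}

print(lookup("max_clients", HIERARCHY, facts, data))
print(lookup("ntp_server", HIERARCHY, facts, data))
```

Because "environment-role" sits above "role" in the list, a production web server picks up the production override while the role default still applies everywhere else.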
Scaling the Puppetmaster
- Ruby 1.9 + unicorn
- Files served by nginx
- Do we even need a master?
Untangling Spaghetti
- Virtual resources != include
- Separation of concerns/modeling
- functions
- Use hiera
- Puppet code as system documentation
- Design tips
- contain > anchor
- include > require
- define > file{"${conf_d}/my-file.conf": }
- avoid exec like the plague
Community modules
- Fast ramp-up
- Slow to iterate on
- Featureful
- Usually (battle) tested
- Generic
- Complex
Self-authored modules
- Slow ramp-up
- Easy to iterate on
- Few features
- Fits your internal patterns
- Simpler
NIH syndrome
Tests
Tools
- rspec-puppet/puppetlabs_spec_helper
- ServerSpec
- test-kitchen/beaker
- Cultural barrier
- Ssssoooo worth it
- Don't overdo it
- Unit tests for logic
- Integration tests
Thank you for listening
(Applaud like your life depends on it)
Opsole: A story of Ops recovery
By Avishai Ish-Shalom
Sydney puppet user group meetup talk