SPLUNK
Sales pitch: "Splunk collects, indexes and harnesses all of the fast-moving machine data generated by your applications, servers and devices—physical, virtual and in the cloud."
Splunk = all data in one place
= A Sysop's happy pill
Wait a second
"All data" sounds obscure.
Ehhh... !?
-
Splunk: The sound of getting exactly what you want?
-
Splunker: Someone who splunk'd another by taking their private info and exposing it for the world to see?
-
Splunk nut: a severe spelunker who likes to go very deep into caves? (a "moist hole"..?)
Nice metaphors, but seriously...
What, actually
- For any and all log files
- Interactive search results (Charts, zooming etc)
- Save searches and reports
- Pattern/event monitoring and alert notifications
- Almost faster than real time
(Seriously, it's really fast)
Splunk = A great tool for
quality aware
developers!
I expect everyone to use
it regularly.
So far we've learned
-
* (asterisk) for wildcard search is useful
-
Source ("/var/log/httpd/direkte.vg.no-access.log")
-
Host ("buffalobill.int.vgnett.no")
-
Sourcetype (e.g. "apache_access")
-
Sourcetype is a great first argument!
- Most effective limiter
- Confines search results to comparable data
Apache logs
sourcetype="apache_access"
sourcetype="apache_error"
[Note: Cron jobs run as CLI PHP log to other files!]
These are the two most valuable log files for a developer.
But, before we continue...
[Warning: An audience wide *SIGH* in...]
Three
Two
One
HTTP CODES 101
[Disclaimer: *SIGHING* allowed. ]
[But only if you know this shit inside out.]
I only have ONE expectation:
Never return an HTTP code without an intent!
Normal operations
201 Created
Synchronous: Creates the resource and then
returns the location of the created resource
(like a redirect). (Example)
vs.
202 Accepted
Asynchronous: Returns before the
processing is completed
(e.g. it's in a queue waiting to be processed).
New door or just a road block?
301 Moved Permanently
Example: www.vg.no/a/123456 -> The actual, permanent article URL.
Preferred. Smart clients update their stored copy of the original URL.
(Example)
vs.
302 Found (temporary redirect)
Fact: Too many 302s should really be 301s.
Clients are not supposed to update their stored URL; they should reuse the original next time.
304 Not modified
Typically Varnish or Apache, automatically.
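What "updating the local URL" means for 301 versus 302 can be sketched from the client side. A minimal Python sketch: `Bookmark` and the `fetch` callback (returning a status and a redirect target) are hypothetical helpers, not a real HTTP client.

```python
class Bookmark:
    """A stored URL that a client follows redirects from (hypothetical helper)."""

    def __init__(self, url):
        self.url = url

    def resolve(self, fetch):
        """Follow up to 10 redirects; `fetch(url)` returns (status, location)."""
        current = self.url
        only_permanent_hops = True
        for _ in range(10):
            status, location = fetch(current)
            if status not in (301, 302):
                return current
            if status == 302:
                # Temporary: keep the original URL for next time.
                only_permanent_hops = False
            elif only_permanent_hops:
                # Permanent: remember the new address instead of the old one.
                self.url = location
            current = location
        raise RuntimeError("too many redirects")
```

With a 301 the stored URL changes and the old one is never requested again. With a 302 every future visit goes through the redirect first, which is why a 302 that should have been a 301 costs an extra round trip each time.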
When your visitor is tripping
The client is doing something wrong,
e.g. invalid query parameter.
The client should never retry the request
without changing it first.
401 Unauthorized
The client has not authenticated (or its credentials were rejected), so it does not get access.
(Example)
404 Not found
Common. But Splunk reveals it's mostly OUR fault, not the visitor's.
(Example)
The man in the mirror, get a grip!
500 Internal server error
vs.
503 Service unavailable
What does it MEAN?!
An example would be in order.
Let's look at some amazing code.
$plan = $this->makeAwesomePickupPlan($randomIdeas);
$hotChick = $this->getBar()->pickupChick($plan);
$success = $hotChick->makeRomanticExplosionsInsideOf();
return $success;
WHAT COULD POSSIBLY GO WRONG?
now, -IF- something goes wrong..
Let me illustrate the difference between a 500 and a 503.
500: When poor error handling causes unexpected results and your code crashes,
Apache reacts with a "WTF?!" and ends up in... the fetal position.
(500 = Apache does NOT know what went wrong)
however, with better error handling
Your code can handle unexpected events soberly, professionally and with an intended exit plan.
You control what is logged, how it should be handled and what the feedback to the visitor should be
(you got it, a 503!).
Basically, you say..
"Everybody chill the fuck out, I got this."
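The "intended exit plan" can look roughly like this. A minimal sketch in Python rather than PHP, under assumptions: `DependencyDown`, `handle` and `fetch_plan` are made-up names, and the 30 second `Retry-After` is an arbitrary value.

```python
import logging

log = logging.getLogger("pickup")


class DependencyDown(Exception):
    """Raised by a hypothetical backend client when the backend is unreachable."""


def handle(request, fetch_plan):
    """Return (status, headers, body); `fetch_plan` is a hypothetical backend call."""
    try:
        plan = fetch_plan(request)
    except DependencyDown as exc:
        # Intended exit: we know what failed, we log it, we tell the client to retry.
        log.warning("backend unavailable: %s", exc)
        return 503, {"Retry-After": "30"}, "Temporarily unavailable, try again shortly."
    return 200, {}, plan
```

Anything the handler does not catch still bubbles up and ends as a 500 from the server. The point is that the failures you can foresee get a deliberate 503 and a log line you chose yourself.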
To conclude...
So.. If your name is in the "daily blamegame 500 errors" email, this may be how your colleagues imagine you..:
CREATIVE ways to use Splunk
Now let's look at how you can use Splunk to level up in the Quality Awareness Skill.
#1 Know your access log
sourcetype="apache_access" status=404 url="*my/Funky/Path/script.php*"
sourcetype="apache_access" host="vgproduct-web-*" status=500 | top url
sourcetype="apache_access" host="vgproduct-web-*" status="401" | top url, referer
sourcetype="apache_access" status=5* |top host, url, status limit=50
and so on. Be creative.
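If it is unclear what a search like `status=5* | top host, url, status` hands back, here is a rough Python equivalent over events that are already parsed into dicts. It is only a sketch of the counting, not how Splunk works internally; the function `top` and the event keys are assumptions for illustration.

```python
from collections import Counter


def top(events, fields, status_prefix="5", limit=50):
    """Count events whose status starts with `status_prefix`, grouped by `fields`."""
    counts = Counter(
        tuple(event[field] for field in fields)
        for event in events
        if str(event["status"]).startswith(status_prefix)
    )
    # Most frequent combinations first, at most `limit` rows.
    return counts.most_common(limit)
```

Each row is one combination of field values plus how often it occurred, which is exactly the list you want to read from the top when hunting for the worst offender.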
#2 Know your error log
sourcetype="apache_error" (Example)
sourcetype="apache_error" | top errortype (Example)
sourcetype="apache_error" errortype="PHP Warning" |top host, PHPError
Advice: Always enable E_ALL for error logging, in all environments
(with display on dev environments)
#3 Timespan
sourcetype="apache_error" host="dinepenger*" (Example)
- Zoom
- Switch to realtime "30 minutes"
= Extremely valuable when deploying
#4 Realtime
sourcetype="apache_error" host="dinepenger*" (Example)
- Switch to realtime "30 minutes"
= Extremely valuable when deploying new code!
Should we test live..?
#5 Saving your searches
sourcetype="apache_error" host="dinepenger*" (Example)
- Save ("Deployment Watch Dine Penger")
- Bookmark, share.
#6 Creating alerts
sourcetype="apache_error" "exit signal Segmentation fault"
Diagnosis: Typically caused by PHP 5.4 + APC
Importance: Should ideally never occur.
Cure: Replace APC (sysops)
Alert schedules
- Trigger on real time occurrences
- Run on a fixed schedule
- Monitor rolling window
Let's create a Real Time alert
- Real time
- Send email
- Include Throttling
Alert for rolling windows
Example: You have an external dependency you cannot control. It typically goes down a couple of times a day, causing curl to fail.
Challenge: Only get alerted when the problem persists.
Solution: Rolling windows.
sourcetype="apache_error" fetch errortype="[Ipad_Log|ERROR]" (Example)
- Monitor X minutes
- Alert when Y matches occur
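The "monitor X minutes, alert when Y matches occur" rule boils down to a sliding window over match timestamps. A minimal Python sketch of that logic, not Splunk's implementation: the class name `RollingWindowAlert` is made up, and timestamps are assumed to be seconds arriving in non-decreasing order.

```python
from collections import deque


class RollingWindowAlert:
    """Fire once `threshold` matches fall inside the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = deque()

    def record(self, timestamp):
        """Register one matching event; return True if the alert should fire."""
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the window.
        while self._hits[0] <= timestamp - self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

A single curl failure never reaches the threshold, so the short daily hiccups stay quiet. Only a burst of failures inside the same window triggers the alert.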
Alerts on a Schedule
- Every X timeunit (mins, hours, days etc)
- Supports crontab with time range
- Supports threshold (>Y matches)
Examples:
- Daily 500 Fetal Positions
- Weekly top PHP errors / host (last 6 days)
- QA: Daily summary of errors for stage/beta servers
#7 stats sparkline
sourcetype="apache_access" status="500" |stats sparkline count by host (Example)
sourcetype="apache_access" host="red.*" |stats sparkline count by username (Example)
#8 charts
sourcetype="apache_error" deprecated | chart count by host (Example)
sourcetype="apache_error" "deprecated" | timechart count by host (Example)
#9 Dashboards
Allows you to combine multiple searches in one screen.
A good case could be a "Deployment monitoring dashboard"
what else could we do?
- Start using the Splunk IRC bot to push alerts to our IRC channels
- Experiment with executing PHP scripts triggered by Alerts
- Create meaningful dashboards
- Organize best practice on all active projects/webapps