SPLUNK

Salespitch: "Splunk collects, indexes and harnesses all of the fast-moving machine data generated by your applications, servers and devices—physical, virtual and in the cloud."

Splunk = all data in one place
= A Sysop's happy pill


Wait a second


"All data" sounds obscure.
Hmm, let's google "splunk" (Feeling lucky?)

Ehhh... !?
    1. Splunk: The sound of getting exactly what you want?
    2. Splunker: Someone who splunk'd another by taking their private info and exposing it for the world to see?
    3. Splunk nut: a severe spelunker who likes to go very deep into caves? (a "moist hole"..?)

Nice metaphors, but seriously..

What, actually

  • For any and all logfiles 
  • Interactive search results (Charts, zooming etc) 
  • Save searches and reports 
  • Pattern/event monitoring and alert notifications
  • Almost faster than real time  
    (Seriously, it's really fast) 




Splunk = A great tool for  
quality aware developers! 

I expect everyone to use 
it regularly.

Testdrive


Let's take it for a spin
(use LDAP to log in)

So far we've learned


  • * (asterisk) for wildcard search is useful
  • Source ("/var/log/httpd/direkte.vg.no-access.log")
  • Host ("buffalobill.int.vgnett.no")
  • Sourcetype (e.g. "apache_access")
  • Sourcetype is a great first argument!
      - Most effective limiter
      - Confines search results to comparable data

Apache logs

sourcetype="apache_access"
sourcetype="apache_error"
[Note: Cron jobs run as CLI PHP log to other files!]

These are the two most valuable log files for a developer.

But, before we continue...
[Warning: An audience wide *SIGH* in...]
Three
Two
One

HTTP CODES 101

[Disclaimer: *SIGHING* allowed. ]
[But only if you know this shit inside out.]

I only have ONE expectation: 
Never return an HTTP code without intent!


Normal operations

200 OK (example)
 

Normal operations

201 Created
Synchronous: Creates the resource and then 
returns the location of the created resource 
(like a redirect). (Example)

vs.

202 Accepted
Asynchronous: Returns before the 
processing is completed 
(e.g. it's in a queue waiting to be processed).
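
A minimal sketch of the 201 vs. 202 difference, in Python for brevity. The Response class, the ARTICLES store, the QUEUE and the /articles/ path are all made up for illustration; only the status codes and the Location header come from HTTP itself.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)


ARTICLES = {}  # hypothetical in-memory store
QUEUE = []     # hypothetical work queue


def create_article_sync(article_id, body):
    """Do the work now, then point the client at the new resource."""
    ARTICLES[article_id] = body
    return Response(201, {"Location": f"/articles/{article_id}"})


def create_article_async(article_id, body):
    """Only enqueue the work; nothing has been created yet."""
    QUEUE.append((article_id, body))
    return Response(202)
```

The 202 response carries no Location here because the resource does not exist yet; the client only knows the request was accepted.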


New door or just a road block?

301 Moved Permanently
Example: www.vg.no/a/123456 -> The actual, permanent article URL.
Preferred. Smart clients update their stored copy of the original URL.
(Example)
vs.

302 Found (temporary redirect)
Fact: Too many 302s should really be 301s.
Clients are not supposed to update their stored URL; they should reuse the original next time.

304 Not modified
Typically handled automatically by Varnish or Apache.
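
How a "smart client" might treat 301 vs. 302, as a rough Python sketch. The bookmarks dict, the function name and the target URLs in the usage note are hypothetical; the point is only that a 301 updates the stored URL and a 302 does not.

```python
def follow_redirect(bookmarks, name, status, location):
    """Return the URL to request now; update the bookmark only on a 301."""
    if status == 301:
        # Permanent: remember the new URL for next time.
        bookmarks[name] = location
    if status in (301, 302):
        return location
    # Not a redirect: keep using the stored URL.
    return bookmarks[name]
```

With bookmarks = {"article": "/a/123456"}, a 302 to "/tmp" sends the client to "/tmp" once but leaves the bookmark alone, while a 301 rewrites the bookmark.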

When your visitor is tripping

400 Bad request ( Example )
The client is doing something wrong,
e.g. invalid query parameter.

The client should never retry the request
without changing it first.

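401 Unauthorized
The client is not authenticated (missing or invalid credentials).
An authenticated client without permission should get a 403 Forbidden instead.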
(Example)

404 Not found
Common. But Splunk reveals it's mostly OUR fault, not the visitor's.
(Example)

The man in the mirror, get a grip!

500 Internal server error

vs.

503 Service unavailable

What does it MEAN?!
An example would be in order.

Let's look at some amazing code.


$plan = $this->makeAwesomePickupPlan($randomIdeas);
$hotChick = $this->getBar()->pickupChick($plan);
$success = $hotChick->makeRomanticExplosionsInsideOf();
return $success;

WHAT COULD POSSIBLY GO WRONG?

now, -IF- something goes wrong..

Let me illustrate the difference between a 500 and a 503.

500: When poor error handling causes unexpected results and your code crashes, Apache reacts with a "WTF?!" and ends up in... the fetal position.

(500 = Apache does NOT know what went wrong)

however, with better error handling

Your code can handle unexpected events soberly, professionally and with an intended exit plan. You control what is logged, how it should be handled and what the feedback to the visitor should be (you got it, a 503!). 
Basically, you say..
"Everybody chill the fuck out, I got this."

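The 500 vs. 503 difference as a small Python sketch (not the PHP from the slide). Everything here is an assumption for illustration: UpstreamDown, fetch_prices, the backend callable and serve(), which stands in for the web server turning any uncaught error into a 500. Retry-After is a real HTTP header commonly sent with a 503.

```python
import logging

log = logging.getLogger("webapp")


class UpstreamDown(Exception):
    """Hypothetical error raised when a dependency cannot be reached."""


def fetch_prices(backend):
    try:
        return backend()
    except (ConnectionError, TimeoutError) as exc:
        raise UpstreamDown(str(exc)) from exc


def handle_request(backend):
    """Return (status, headers, body) for one request."""
    try:
        prices = fetch_prices(backend)
    except UpstreamDown as exc:
        # Expected failure: log it on our terms, tell the client to retry later.
        log.warning("price backend unavailable: %s", exc)
        return 503, {"Retry-After": "30"}, "Temporarily unavailable"
    return 200, {}, prices


def serve(handler, backend):
    """Stand-in for the web server: anything uncaught becomes a 500."""
    try:
        return handler(backend)
    except Exception:
        log.exception("unhandled error")
        return 500, {}, "Internal Server Error"
```

A backend that raises ConnectionError ends in the planned 503; a backend that raises something nobody anticipated (say, a KeyError) falls through to the 500.
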
To conclude...

So.. If your name is in the "daily blamegame 500 errors" email, this may be how your colleagues imagine you..:

CREATIVE ways to use splunk

Now let's look at how you can use Splunk to level up in the Quality Awareness Skill.

#1 know your access log

sourcetype="apache_access" status=404 url="*my/Funky/Path/script.php*"

sourcetype="apache_access" host="vgproduct-web-*" status=500 | top url

sourcetype="apache_access" host="vgproduct-web-*" status="401" | top url, referer

sourcetype="apache_access" status=5* |top host, url, status limit=50 
(Example)

and so on. Be creative.
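
For intuition, roughly what "status=5* | top host, url, status" computes, sketched in Python over made-up, already-parsed events. The field names mirror the searches above; the sample data is invented, and Splunk's top also reports a percentage, which this sketch skips.

```python
from collections import Counter

# Hypothetical, already-parsed access log events.
events = [
    {"host": "web-1", "url": "/a", "status": 500},
    {"host": "web-1", "url": "/a", "status": 500},
    {"host": "web-2", "url": "/b", "status": 503},
    {"host": "web-1", "url": "/a", "status": 200},
    {"host": "web-1", "url": "/c", "status": 404},
]


def top_5xx(events, limit=50):
    """Count 5xx events per (host, url, status), most frequent first."""
    counts = Counter(
        (e["host"], e["url"], e["status"])
        for e in events
        if 500 <= e["status"] <= 599
    )
    return counts.most_common(limit)
```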

#2 Know your error log

sourcetype="apache_error" (Example)

sourcetype="apache_error" | top errortype (Example)

sourcetype="apache_error" errortype="PHP Warning" |top host, PHPError
(Example)

Advice: Always enable E_ALL for error logging, in all environments 
(with display on dev environments)

#3 Timespan

sourcetype="apache_error" host="dinepenger*" ( Example )
- Zoom


#4 Realtime

sourcetype="apache_error" host="dinepenger*" ( Example ) 

- Switch to realtime "30 minutes"

= Extremely valuable when deploying new code!

Should we test live..? 

#5 Saving your searches

 sourcetype="apache_error" host="dinepenger*" (Example)

- Save ("Deployment Watch Dine Penger") 
- Bookmark, share.

#6 Creating alerts

sourcetype="apache_error" "exit signal Segmentation fault"
(Example)

Diagnosis: Typically caused by PHP 5.4 + APC 
Importance: Should ideally never occur. 
Cure: Replace APC (sysops)


Alert schedules

  1. Trigger on real time occurrences
  2. Run on a fixed schedule
  3. Monitor rolling window

Let's create a Real Time alert

We'll use the Seg Faults (Example)

- Real time
- Send email
- Include Throttling
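
Why throttling matters: without it, every single seg fault sends a mail. A minimal Python sketch of the idea, assuming a simple "stay quiet for N seconds after sending" rule; the class and its names are illustrative, not Splunk's implementation.

```python
class ThrottledAlert:
    def __init__(self, suppress_seconds):
        self.suppress_seconds = suppress_seconds
        self.last_sent = None

    def notify(self, now):
        """Return True if an email should go out for a match seen at `now`."""
        recently_sent = (
            self.last_sent is not None
            and now - self.last_sent < self.suppress_seconds
        )
        if recently_sent:
            return False
        self.last_sent = now
        return True
```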

alert for rolling windows

Example: You have an external dependency you cannot control. It typically goes down a couple of times a day, causing curl to fail.

Challenge: Only get an alert when the problem persists.

Solution: Rolling windows.

sourcetype="apache_error" fetch errortype="[Ipad_Log|ERROR]" (Example)
- Monitor X minutes
- Alert when Y matches occur
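
The rolling-window idea (X minutes, Y matches) sketched in Python. The class is hypothetical and timestamps are plain seconds; it only illustrates why a couple of isolated failures stay silent while a sustained outage fires.

```python
from collections import deque


class RollingWindowAlert:
    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.matches = deque()

    def record(self, timestamp):
        """Record one match; return True when the alert should fire."""
        self.matches.append(timestamp)
        # Drop matches that have fallen out of the window.
        while self.matches and self.matches[0] <= timestamp - self.window_seconds:
            self.matches.popleft()
        return len(self.matches) >= self.threshold
```

With a 300 second window and a threshold of 3, failures at 0, 100 and 200 fire on the third one; failures at 0, 1000 and 2000 never do.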

Alerts on a Schedule

- Every X time units (minutes, hours, days etc.)
- Supports cron expressions with a time range
- Supports threshold (>Y matches)

Examples: 
- Daily 500 Fetal Positions
- Weekly top PHP errors / host (last 6 days)
- QA: Daily summary of errors for stage/beta servers

#7 stats sparkline

sourcetype="apache_access" status="500"  |stats sparkline count by host (Example)

sourcetype="apache_access" host="red.*" |stats sparkline count by username (Example)

#8 charts

sourcetype="apache_error" deprecated | chart count by host (Example)
sourcetype="apache_error" "deprecated" | timechart  count by host (Example)
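
The difference between the two, roughly: chart gives one count per host, timechart gives one count per host per time bucket. A Python sketch over invented events; the field names, the one-hour span and the epoch-second timestamps are assumptions.

```python
from collections import Counter, defaultdict


def chart_count_by_host(events):
    """One total per host, like 'chart count by host'."""
    return Counter(e["host"] for e in events)


def timechart_count_by_host(events, span_seconds=3600):
    """One total per host per time bucket, like 'timechart count by host'."""
    buckets = defaultdict(Counter)
    for e in events:
        bucket_start = e["time"] - e["time"] % span_seconds
        buckets[bucket_start][e["host"]] += 1
    return {start: dict(counts) for start, counts in sorted(buckets.items())}
```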

#9 Dashboards

Allows you to combine multiple searches in one screen.
(Example)

A good case could be a "Deployment monitoring dashboard"

what else could we do?

  • Start using the Splunk IRC bot to push alerts to our IRC channels
  • Experiment with executing PHP scripts triggered by Alerts
  • Create meaningful dashboards
  • Organize best practice on all active projects/webapps