Wikimedia Analytics

Graphite

access with NDA

Graphite

  • PHP Profiling
  • Server / Machine stats
  • JobQueue / JobRunner
  • ResourceLoader
  • Gerrit
  • EventLogging
  • etc...
  • Simple metrics: just numbers
  • Added using statsd
  • Normally retained for 1 year
  • Resolution decreases with time
  • 1m:7d,5m:14d,15m:30d,1h:1y (1-minute points for 7 days, 5-minute for 14 days, 15-minute for 30 days, hourly for 1 year)
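
To make the retention string concrete, here is a minimal Python sketch that parses it and reports the finest resolution still available for data of a given age. The function names are hypothetical, and treating a year as 365 days is an assumption; this is not Graphite's own code.

```python
import re

# Seconds per unit; a year is assumed to be 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def to_seconds(token):
    """Convert a token such as '5m' or '14d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdy])", token)
    if match is None:
        raise ValueError(f"unrecognised duration: {token!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_retention(spec):
    """Turn '1m:7d,5m:14d' into [(step_seconds, keep_seconds), ...]."""
    archives = []
    for part in spec.split(","):
        step, keep = part.strip().split(":")
        archives.append((to_seconds(step), to_seconds(keep)))
    return archives


def resolution_for_age(archives, age_seconds):
    """Return the step of the first archive that still covers this age."""
    for step, keep in archives:
        if age_seconds <= keep:
            return step
    return None  # older than every archive, so no data is kept


ARCHIVES = parse_retention("1m:7d,5m:14d,15m:30d,1h:1y")
```

For example, `resolution_for_age(ARCHIVES, 10 * 86400)` gives the step, in seconds, for data that is ten days old.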
echo "metricname:10|c" | nc -w 1 -u statsd.eqiad.wmnet 8125
$statsd = RequestContext::getMain()->getStats();

$statsd->timing('usageTime', 100);
$statsd->increment('visitor');
$statsd->decrement('click');
$statsd->gauge('gaugor', 333);
$statsd->set('uniques', 765);
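
As a rough illustration of what those calls put on the wire, the Python sketch below builds the same five metrics in the statsd text format (`name:value|type`) and shows how one could be sent over UDP, like the `nc` one-liner. The helper names are hypothetical, and the mapping of `decrement` to a counter of -1 is an assumption about the client library.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric; kind is 'c' (counter), 'ms' (timer), 'g' (gauge) or 's' (set)."""
    return f"{name}:{value}|{kind}"


def send_metric(line, host, port=8125):
    """Send a single metric line as one UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# The five calls above, expressed as wire-format lines.
lines = [
    statsd_line("usageTime", 100, "ms"),  # timing
    statsd_line("visitor", 1, "c"),       # increment
    statsd_line("click", -1, "c"),        # decrement (assumed to be a counter of -1)
    statsd_line("gaugor", 333, "g"),      # gauge
    statsd_line("uniques", 765, "s"),     # set
]
```

Calling `send_metric(lines[1], "statsd.eqiad.wmnet")` would mirror the shell example; nothing is sent unless `send_metric` is called.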

Graphite

  • Metrics aggregator
  • Flushes data to Graphite every 60 seconds
  • statsd adds suffixes such as .lower, .upper and .p95 to the metric names
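
The suffixes can be pictured as a per-flush summary of the samples received for one timer. This is a minimal Python sketch of that idea, not the aggregator's actual code: the function name is hypothetical and the nearest-rank percentile method is an assumption.

```python
def summarize_timer(samples, percentile=95):
    """Summarise the timing samples collected during one flush interval."""
    ordered = sorted(samples)
    # Nearest-rank percentile using integer ceiling division (assumed method).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return {
        "lower": ordered[0],                   # smallest sample
        "upper": ordered[-1],                  # largest sample
        "mean": sum(ordered) / len(ordered),
        f"p{percentile}": ordered[rank - 1],
        "count": len(ordered),
    }


# One flush interval's worth of hypothetical timings, in milliseconds.
summary = summarize_timer([40, 10, 30, 20])
```

Each key in the returned dict would become a suffix on the metric name when the interval is flushed.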

statsd

Supports

  • Counting
  • Sampling
  • Timing
  • Gauges
  • Sets

See: github.com/etsy/statsd/blob/master/docs/metric_types.md
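
Sampling is the least obvious item in that list, so here is a small Python sketch of the idea: the client sends only a fraction of events and tags each with `|@rate`, and the receiver scales the value back up. The function names are hypothetical, and the parsing only covers the simple counter case.

```python
import random


def maybe_count(name, sample_rate, rng=random.random):
    """Client side: emit a counter line for roughly sample_rate of all events."""
    if rng() < sample_rate:
        return f"{name}:1|c|@{sample_rate}"
    return None  # this event is not reported


def scaled_count(line):
    """Receiver side: a counter tagged with |@rate stands for value / rate events."""
    payload, _kind, *rest = line.split("|")
    value = float(payload.split(":")[1])
    rate = float(rest[0][1:]) if rest else 1.0
    return value / rate
```

With a rate of 0.1, each line that does arrive is counted as about ten events, so totals stay roughly right while traffic drops by about 90%.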

Graphite

api

  • target - match 1 or more metrics
    • jobrunner.pop.wikibase-addUsagesForPage.ok.*.count

  • from/until - specify a time range (relative or absolute)
    • from=-8d until=-7d

  • format - specify an output format
    • raw / csv / json

Other graph rendering options too (width etc.)
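
Putting the three parameters together, a request to Graphite's `/render` endpoint can be built as in this Python sketch. The base URL is a placeholder and the function name is hypothetical; only `target`, `from`, `until` and `format` come from the slide.

```python
from urllib.parse import urlencode


def render_url(base, target, from_="-8d", until="-7d", fmt="json"):
    """Build a render URL for one target over a time range."""
    query = urlencode({
        "target": target,
        "from": from_,
        "until": until,
        "format": fmt,
    })
    return f"{base}/render?{query}"


# Placeholder host; substitute the real Graphite base URL.
url = render_url(
    "https://graphite.example.org",
    "jobrunner.pop.wikibase-addUsagesForPage.ok.*.count",
)
```

The wildcard in the target is percent-encoded by `urlencode`, which the server decodes before matching metric names.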

Hadoop, Hive & Kafka

  • Kafka gets sent all of the logs (distributed logging buffer)

  • Hadoop processes and stores a lot of data (all of the logs).
    • Also has a scratch space!
    • Stores data for roughly 1 month

  • Hive lets you query all of the things!

access from stat1002

What is stored?

  • Webrequests (raw & refined)
  • Pageviews
  • Projectviews
  • pagecounts_all_sites
  • mediacounts

Web request fields

Sample query

SELECT
  count(*) as count, user_agent
FROM
  webrequest
WHERE
  year = 2015
  AND month = 10
  AND day = 12
  AND hour = 1
  AND uri_host = "www.wikidata.org"
  AND http_status = 200
  AND http_method = "GET"
  AND uri_path LIKE "/wiki/Special:EntityData%"
GROUP BY user_agent
ORDER BY count
LIMIT 999999;

See how many times each user agent accesses Special:EntityData and subpages in a given hour
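
The year/month/day/hour predicates are the part of that query that changes from run to run. As a sketch, a small Python helper (hypothetical name) can generate them from a timestamp; it assumes those four columns are what bounds the amount of data Hive has to read.

```python
from datetime import datetime


def hour_predicate(ts):
    """Build the WHERE fragment that selects a single hour of webrequest data."""
    return (
        f"year = {ts.year} AND month = {ts.month} "
        f"AND day = {ts.day} AND hour = {ts.hour}"
    )


# The hour used in the sample query above.
predicate = hour_predicate(datetime(2015, 10, 12, 1))
```

The resulting string can be dropped into the WHERE clause in place of the four hand-written conditions.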

More data!

Structured API logging coming soon

API Logs

Generated by MediaWiki (rather than from request logs)

2014-08-19 10:12:33 mw1198 wikidatawiki api 
INFO: 
API GET 2.1.0.0 2.1.0.0 T=52ms 
action=opensearch 
format=json 
search=some%20search%20string 
limit=50 
namespace=0 
suggest=true

Long json strings are truncated with [...]
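
A log entry like the one above can be picked apart with a few lines of Python. This is a minimal sketch that assumes each entry is a single line of whitespace-separated tokens; the field names in the result are hypothetical, and truncated values are left as they are.

```python
from urllib.parse import unquote

SAMPLE = (
    "2014-08-19 10:12:33 mw1198 wikidatawiki api INFO: "
    "API GET 2.1.0.0 2.1.0.0 T=52ms action=opensearch format=json "
    "search=some%20search%20string limit=50 namespace=0 suggest=true"
)


def parse_api_log(line):
    """Split one API log line into its header fields and request parameters."""
    tokens = line.split()
    record = {
        "date": tokens[0],
        "time": tokens[1],
        "host": tokens[2],
        "wiki": tokens[3],
        "params": {},
    }
    for token in tokens[4:]:
        if token.startswith("T=") and token.endswith("ms"):
            record["duration_ms"] = int(token[2:-2])  # request time
        elif "=" in token:
            key, _, value = token.partition("=")
            record["params"][key] = unquote(value)    # undo %20 etc.
    return record


record = parse_api_log(SAMPLE)
```

Counting `record["params"]["action"]` across an archive file would then show which API modules are used most.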

30 days of archives are kept on the analytics cluster

/a/mw-log/archive/api.log-$dateStamp.gz

access from fluorine or stat1002

Slave dbs

  • All of the things
  • Scratch space

analytics-store from stat1002

Graphing & Dashing

Name               State   Data From
Limn               Dying   TSV files
Grafana            Alive   Graphite
Dashiki            Alive   Who knows?
Shiny              Own     All of the places?
The other one....  Alive   Graphite
Graphite           Alive   Graphite

Wikimedia Analytics

By Addshore
