Effective HTCondor-based monitoring system
Justas Balcas, James Letts,
Farrukh Aftab Khan, Brian Bockelman
- 2 Collectors (FNAL, CERN)
- 2 Frontends (FNAL, CERN)
- 3 Negotiators (FNAL, CERN)
- 4 Factories (CERN, FNAL, UCSD, GOC)
- CMS Drivers/Submission tools (Schedds):
  - Production (21)
  - Analysis (CRAB3) (7)
  - Analysis (CRAB2) (1)
  - CMS Connect (planned in 2016) (1)
-
One Pool To Rule Them All
Justas Balcas (CALTECH)
2016-03-01
| | T1s | T2s | T3s |
|---|---|---|---|
| Num of Sites | 7 | 56 | 87 |
| Max Running | ~40k | ~167k | ~6k |
Why is this needed?
- How much production/analysis is running on 'Site X' now, and how much was running in the last hour, day, week?
- Why are my jobs not running?
- Why is 'Site X' saturated and not running anything?
- Why is there a job imbalance between T2_US sites, e.g. T2_US_SiteX, and why is this happening?
- Which sites are multicore ready?
- Etc.
Where can I find all of the information?
- All of this information is spread across the dashboard, wmstats, the schedds, the factory pages, etc.
- It takes time to investigate, to know where to look, and to load the information
- Check the logs of each component, on several machines at once
- condor_status -wide -af Name TotalRunningJobs
- condor_q -const 'DESIRED_Sites=?="T2_US_SiteX"'
- do 'grep … | cat … | sort | uniq -c'
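As a sketch, the ad-hoc `grep … | sort | uniq -c` step can be done in a few lines of Python. The ClassAds below are stand-in dictionaries; in practice they would come from `condor_q` or a `schedd.query(...)` call via the HTCondor Python bindings:

```python
from collections import Counter

def count_jobs_per_site(job_ads):
    """Count jobs per DESIRED_Sites value, like `... | sort | uniq -c`."""
    return Counter(ad.get("DESIRED_Sites", "unknown") for ad in job_ads)

# Hypothetical sample ads; real ones would come from condor_q.
ads = [
    {"DESIRED_Sites": "T2_US_SiteX"},
    {"DESIRED_Sites": "T2_US_SiteX"},
    {"DESIRED_Sites": "T1_US_FNAL"},
]
print(count_jobs_per_site(ads))
```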
Can you debug my jobs which ran two days ago...?
Monitoring timeline
- Mar 14 - Initial prototype of production view
- Mar 16 - Finished prototype
- May 08 - Added CRAB3 view (CRAB2 the next day)
- Aug 03 - Totalview, ScheddView, FactoryView
- Aug 31 - Resource utilization
- Sep 03 - Pilot (multicore/single-core) usage in Totalview per site
- Nov 10 - Bootstrap 3 (for better mobile support)
- Jan 05 - Unique pressure per site in all views
- Production priority
- Support multiple collectors and reuse code in all views
- Python 2.7 + PEP8 friendly
Implementation details
- Each view has an independent cronjob which runs every 3 minutes
- Running on VM: 4 VCPUs, 8GB RAM, 2 x High IO disks (500 IOPS, 200GB), 2 x Normal disks (120 IOPS, 200GB)
- Each view prepares RRDs and JSON output for the website
- HTCondor + HTCondor python bindings
- python-genshi
- httpd + mod_wsgi
- rrdtool-python
- Monitoring code: https://github.com/juztas/prodview/
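As a rough sketch of what one such view cronjob does (the names here are illustrative, not the actual prodview code): query the pool, aggregate per site, and dump a JSON summary for the website. JobStatus 1 = Idle and 2 = Running are the standard HTCondor codes:

```python
import json
from collections import defaultdict

def summarize_pool(job_ads):
    """Aggregate running/idle job counts per matched CMS site.

    job_ads: iterable of ClassAd-like dicts; in the real tool these
    come from the HTCondor Python bindings.
    """
    summary = defaultdict(lambda: {"Running": 0, "Idle": 0})
    for ad in job_ads:
        site = ad.get("MATCH_EXP_JOBGLIDEIN_CMSSite", "unmatched")
        if ad.get("JobStatus") == 2:
            summary[site]["Running"] += 1
        elif ad.get("JobStatus") == 1:
            summary[site]["Idle"] += 1
    return dict(summary)

# Mock input; a running job matched to a site and an idle, unmatched job.
ads = [
    {"JobStatus": 2, "MATCH_EXP_JOBGLIDEIN_CMSSite": "T2_US_SiteX"},
    {"JobStatus": 1},
]
print(json.dumps(summarize_pool(ads), sort_keys=True))
```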
- Queried job attributes: RequestMemory, MaxWallTimeMins, JobStatus, DESIRED_Sites, MATCH_EXP_JOBGLIDEIN_CMSSite, QDate, JobPrio
- 3 different view canvases (for workflows & sub-workflows):
  - Main view overview (Running, Idle, graphs, workflow count, last update time)
  - Workflow overview
  - Site overview
  - Debug information (running & idle jobs grouped by equal requirements)
- Each view has different links for operators to get more information
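The "grouped by equal requirements" debug information can be sketched as keying each job on a tuple of its matching requirements. The attribute names follow the projection list above; the grouping code itself is illustrative, not the prodview implementation:

```python
from collections import Counter

REQ_ATTRS = ("RequestMemory", "MaxWallTimeMins", "DESIRED_Sites")

def group_by_requirements(job_ads):
    """Count jobs sharing identical matching requirements."""
    groups = Counter()
    for ad in job_ads:
        key = tuple(ad.get(a) for a in REQ_ATTRS)
        groups[key] += 1
    return groups

# Mock ads: two jobs with identical requirements, one different.
ads = [
    {"RequestMemory": 2000, "MaxWallTimeMins": 1440, "DESIRED_Sites": "T2_US_SiteX"},
    {"RequestMemory": 2000, "MaxWallTimeMins": 1440, "DESIRED_Sites": "T2_US_SiteX"},
    {"RequestMemory": 4000, "MaxWallTimeMins": 2880, "DESIRED_Sites": "T1_US_FNAL"},
]
groups = group_by_requirements(ads)
print(groups[(2000, 1440, "T2_US_SiteX")])
```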
Defaults for all (Prod, CRAB3, CRAB2)
Each task might have different subtasks with different matching requirements (Memory, WallTime, CPUs, DESIRED_Sites) and different priorities.
Production Overview
Analysis Overview (CRAB3/2)
- Since January, the number of CRAB3 jobs has increased from 10-15K to 30-40K parallel running jobs.
- This placed increased load on the pool central managers:
- More collector ClassAd updates
- Increased Negotiator cycle times (matchmaking between jobs and pilots)
- It has also strained ASO recently.
CRAB3 dashboard (Kibana)
CRAB3 schedds are monitored via a Lemon sensor which fetches schedd statistics and publishes them to Elasticsearch.
We also grep the shadow logs for GLExec-related errors.
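The shadow-log grep can be sketched in Python as a simple filter. The log lines and the error pattern below are made up for illustration; the real GLExec error strings in the shadow logs may differ:

```python
import re

# Illustrative pattern; actual GLExec failure messages may vary.
GLEXEC_ERROR = re.compile(r"glexec", re.IGNORECASE)

def find_glexec_errors(log_lines):
    """Return shadow-log lines that mention GLExec errors."""
    return [line for line in log_lines
            if "ERROR" in line and GLEXEC_ERROR.search(line)]

log = [
    "03/01/16 12:00:01 (1.0) (123): ERROR: glexec failed with status 203",
    "03/01/16 12:00:02 (2.0) (124): Job terminated normally",
]
print(find_glexec_errors(log))
```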
Total Overview
Pool Overview
- 4 factories in different continents
- Querying all 4 every 3 minutes, plus parsing the 2 XMLs they provide
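Parsing a factory status XML can be sketched with the standard library. The element and attribute names below are assumptions for illustration, not the actual glideinWMS factory monitoring schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical factory status XML; the real glideinWMS monitoring
# XML uses a different, richer schema.
FACTORY_XML = """
<factory>
  <entry name="CMS_T2_US_SiteX"><pilots running="120" idle="30"/></entry>
  <entry name="CMS_T1_US_FNAL"><pilots running="500" idle="10"/></entry>
</factory>
"""

def parse_factory_status(xml_text):
    """Map factory entry name -> (running, idle) pilot counts."""
    root = ET.fromstring(xml_text)
    status = {}
    for entry in root.findall("entry"):
        pilots = entry.find("pilots")
        status[entry.get("name")] = (int(pilots.get("running")),
                                     int(pilots.get("idle")))
    return status

print(parse_factory_status(FACTORY_XML))
```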
Factory Overview
Usage (2015 Dec 10 - yesterday)
What Next?
- Show & plot negotiation time for the 3 negotiators
- Priority of production jobs per schedd
- Number of DAGMans per user/schedd (HTCondor ticket 5519)
- Archiving the data (RRDs look like Web 1.0, but this website is mostly used for near real-time monitoring). The answer is no for now, but there are plans.
Plan for archiving the data:
- Use PER_JOB_HISTORY_DIR to take all job ClassAds and publish them to Elasticsearch (not foreseeable)
- Do condor_history remotely and publish to ES (work in progress by Brian B.)
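The PER_JOB_HISTORY_DIR approach could look roughly like this: each per-job history file is a flat `Attribute = Value` ClassAd dump, which can be turned into a JSON document ready for indexing. The parsing is a simplified sketch (real ClassAds also contain expressions this does not handle), and the actual publishing step to Elasticsearch is omitted:

```python
import json

def history_ad_to_doc(history_text):
    """Parse a flat 'Attr = Value' ClassAd dump into a dict.

    Simplified sketch: real ClassAds can contain expressions and
    nested values that this naive split does not handle.
    """
    doc = {}
    for line in history_text.splitlines():
        if "=" not in line:
            continue
        attr, _, value = line.partition("=")
        doc[attr.strip()] = value.strip().strip('"')
    return doc

# Mock per-job history file contents.
history = '''\
ClusterId = 12345
JobStatus = 4
DESIRED_Sites = "T2_US_SiteX"
'''
print(json.dumps(history_ad_to_doc(history), sort_keys=True))
```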
Submission infrastructure group leaders:
Antonio Perez-Calero Yzquierdo (PIC), David Alexander Mason (FNAL) and James Letts (UCSD)
GlideinWMS operations and development team at FNAL:
Anthony Tiradani, Burt Holzman, Krista Larson, Marco Mambelli and Parag Mhashilkar
HTCondor developers:
Todd Tannenbaum, Jaime Frey, Tim Theisen and others working behind the scenes
OSG factory operations team:
Brendan Denis, Jeffrey Dost, Martin Kandes and Vassil Verguilov
CRAB3 and WMAgent Operations team:
Alan Malta, Diego Ciangottini, Emilis Rupeika, Jadir Silva, Marco Mascheroni and Stefano Belforte
And many many others!
Thanks to Everyone (and Brian)