Effective HTCondor-based monitoring system
Justas Balcas, James Letts, Farrukh Aftab Khan, Brian Bockelman

One Pool To Rule Them All
- 2 Collectors (FNAL, CERN)
- 2 Frontends (FNAL, CERN)
- 3 Negotiators (FNAL, CERN)
- 4 Factories (CERN, FNAL, UCSD, GOC)
- CMS Drivers/Submission tools (Schedds):
  - Production (21)
  - Analysis (CRAB3) (7)
  - Analysis (CRAB2) (1)
  - CMS Connect (planned in 2016) (1)
Justas Balcas (CALTECH)
2016-03-01
| | T1s | T2s | T3s |
|---|---|---|---|
| Num. of Sites | 7 | 56 | 87 |
| Max Running | ~40k | ~167k | ~6k |
Why is this needed?
- How much prod/analysis is running on 'Site X' now, and how much was running in the last hour, day, week?
- Why are my jobs not running?
- Why is 'Site X' saturated and not running anything?
- Why is there a job imbalance between T2_US and T2_US_SiteX?
- Which sites are multicore ready?
- Etc...
Where can I find all of the information?
- All of this information is scattered across the dashboard, WMStats, the schedds, the factory pages, etc.
- It takes time to investigate, to know where to look and to load the information
- Check the logs of each component, on several machines at once
- condor_status -wide -af Name TotalRunningJobs
- condor_q -const 'DESIRED_Sites=?="T2_US_SiteX"'
- do 'grep … | cat … | sort | uniq -c'
- The same queries can also be scripted with the HTCondor Python bindings (see the sketch below)
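For illustration, a minimal sketch reproducing the two condor commands above with the HTCondor Python bindings. The collector hostname `collector.example.com` is a placeholder, and the site constraint reuses the `T2_US_SiteX` example from the slide.

```python
from __future__ import print_function
import htcondor

# Placeholder collector hostname; the real global-pool collectors are at FNAL/CERN.
coll = htcondor.Collector("collector.example.com")

# Equivalent of: condor_status -wide -af Name TotalRunningJobs (one line per schedd)
for ad in coll.query(htcondor.AdTypes.Schedd, "true", ["Name", "TotalRunningJobs"]):
    print(ad.get("Name"), ad.get("TotalRunningJobs", 0))

# Equivalent of: condor_q -const 'DESIRED_Sites=?="T2_US_SiteX"', run against every schedd
for schedd_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
    schedd = htcondor.Schedd(schedd_ad)
    jobs = schedd.query('DESIRED_Sites =?= "T2_US_SiteX"',
                        ["ClusterId", "ProcId", "JobStatus"])
    print(schedd_ad["Name"], "matching jobs:", len(jobs))
```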
Can you debug my jobs which ran two days ago...?

Monitoring timeline
- Mar 14 - Initial prototype of production view
- Mar 16 - Finished prototype
- May 08 - Added CRAB3 view (CRAB2 the next day)
- Aug 03 - TotalView, ScheddView, FactoryView
- Aug 31 - Resource utilization
- Sep 03 - Pilot (multicore/single-core) usage in TotalView per site
- Nov 10 - Bootstrap 3 (for better mobile support)
- Jan 05 - Unique pressure per site in all views
Production view had priority
Support multiple collectors and reuse code in all views
Python 2.7 + PEP8 friendly
Implementation details
- Each view has an independent cron job which runs every 3 minutes (see the sketch below)
- Running on a VM: 4 vCPUs, 8 GB RAM, 2 x high-IO disks (500 IOPS, 200 GB), 2 x normal disks (120 IOPS, 200 GB)
- Each view prepares RRDs and JSON output for the website
- HTCondor + the HTCondor Python bindings
- python-genshi
- httpd + mod_wsgi
- rrdtool-python
- Monitoring code: https://github.com/juztas/prodview/
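As an illustration of the pattern described in the list above (cron job, pool query, RRD plus JSON output), here is a minimal sketch. The collector hostname, output paths and RRD layout are assumptions for this example, not the production configuration from the repository.

```python
from __future__ import print_function
import json
import os
import time

import htcondor
import rrdtool

COLLECTOR = "collector.example.com"            # placeholder pool collector
OUT_JSON = "/var/www/prodview/summary.json"    # placeholder output paths
OUT_RRD = "/var/www/prodview/summary.rrd"


def query_pool():
    """Return (running, idle) totals over all schedds in the pool."""
    coll = htcondor.Collector(COLLECTOR)
    running = idle = 0
    for schedd_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
        schedd = htcondor.Schedd(schedd_ad)
        for job in schedd.query("true", ["JobStatus"]):
            if job.get("JobStatus") == 2:
                running += 1
            elif job.get("JobStatus") == 1:
                idle += 1
    return running, idle


def main():
    running, idle = query_pool()
    # JSON snapshot consumed by the website
    with open(OUT_JSON, "w") as fd:
        json.dump({"running": running, "idle": idle,
                   "updated": int(time.time())}, fd)
    # Time series kept in an RRD (two GAUGE data sources, 5-minute step)
    if not os.path.exists(OUT_RRD):
        rrdtool.create(OUT_RRD, "--step", "300",
                       "DS:running:GAUGE:600:0:U",
                       "DS:idle:GAUGE:600:0:U",
                       "RRA:AVERAGE:0.5:1:2016")
    rrdtool.update(OUT_RRD, "N:%d:%d" % (running, idle))


if __name__ == "__main__":
    main()
```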
- Queried job ClassAd attributes: RequestMemory, MaxWallTimeMins, JobStatus, DESIRED_Sites, MATCH_EXP_JOBGLIDEIN_CMSSite, QDate, JobPrio
- 3 different view canvases (for workflows & sub-workflows):
  - Main view overview (Running, Idle, Graphs, Workflow Count, Last Update Time)
  - Workflow overview
  - Site overview
- Debug information (running & idle jobs grouped by equal requirements; see the sketch below)
- Each view has different links for operators to get more information
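To make the "grouped by equal requirements" debug idea concrete, a sketch that buckets jobs by the matching-relevant attributes listed above and counts running vs. idle jobs per bucket. The collector hostname is a placeholder and the bucketing key is a simplification of what the real views do.

```python
from __future__ import print_function
from collections import defaultdict

import htcondor

ATTRS = ["RequestMemory", "MaxWallTimeMins", "DESIRED_Sites",
         "JobStatus", "JobPrio"]


def group_by_requirements(pool="collector.example.com"):
    """Count running/idle jobs per (memory, walltime, sites, priority) bucket."""
    counts = defaultdict(lambda: {"Running": 0, "Idle": 0})
    coll = htcondor.Collector(pool)
    for schedd_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
        for job in htcondor.Schedd(schedd_ad).query("true", ATTRS):
            key = (str(job.get("RequestMemory")), str(job.get("MaxWallTimeMins")),
                   str(job.get("DESIRED_Sites")), str(job.get("JobPrio")))
            if job.get("JobStatus") == 2:
                counts[key]["Running"] += 1
            elif job.get("JobStatus") == 1:
                counts[key]["Idle"] += 1
    return counts


if __name__ == "__main__":
    for key, cnt in sorted(group_by_requirements().items()):
        print(key, cnt)
```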
Defaults for all (Prod, CRAB3, CRAB2)
Each task might have different subtasks with different matching requirements (Memory, WallTime, CPUs, DESIRED_Sites) and different priorities.
Production Overview
Analysis Overview (CRAB3/2)
- Since January, the number of CRAB3 jobs has increased from 10-15k to 30-40k jobs running in parallel.
- This placed increased load on the pool central managers:
  - More collector ClassAd updates
  - Increased negotiator cycle times (matchmaking between jobs and pilots; see the sketch below)
- It has also strained ASO recently.
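One way to keep an eye on the negotiator side of this load is to read the matchmaking-cycle statistics the negotiators publish in their ads. A rough sketch: the collector hostname is a placeholder, and the LastNegotiationCycle* attribute names assume they are published by the pool's HTCondor version.

```python
from __future__ import print_function
import htcondor

coll = htcondor.Collector("collector.example.com")   # placeholder collector
ads = coll.query(htcondor.AdTypes.Negotiator, "true",
                 ["Name", "LastNegotiationCycleDuration0",
                  "LastNegotiationCycleMatches0"])
for ad in ads:
    # Duration and match count of the most recent negotiation cycle, per negotiator
    print(ad.get("Name"),
          "last cycle duration [s]:", ad.get("LastNegotiationCycleDuration0", "n/a"),
          "matches:", ad.get("LastNegotiationCycleMatches0", "n/a"))
```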
CRAB3 dashboard (Kibana)
- CRAB3 schedds are monitored via a Lemon sensor which fetches schedd statistics and publishes them to ElasticSearch
- We also grep the shadow logs for gLExec-related errors (see the sketch below)
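A rough sketch of the shadow-log part: scan a schedd's ShadowLog and count gLExec-related error lines. Both the log path and the matching pattern are assumptions and would have to be adapted to the actual schedd setup and log format.

```python
from __future__ import print_function
import re

SHADOW_LOG = "/var/log/condor/ShadowLog"      # placeholder path on a CRAB3 schedd
GLEXEC_PATTERN = re.compile(r"glexec", re.IGNORECASE)

errors = 0
with open(SHADOW_LOG) as log:
    for line in log:
        # Count lines that mention gLExec together with an error; the exact
        # pattern depends on the shadow log format and is only illustrative.
        if GLEXEC_PATTERN.search(line) and "error" in line.lower():
            errors += 1
print("gLExec-related error lines:", errors)
```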
Total Overview
Pool Overview


Factory Overview
- 4 factories on different continents
- Querying all 4 every 3 minutes and parsing the 2 XML files each of them provides (see the sketch below)
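A sketch of the periodic factory scrape described above: fetch two status XML files from each factory and walk through their records. The factory URLs, file names and XML element names below are placeholders, not the real glideinWMS factory monitoring schema.

```python
from __future__ import print_function
import urllib2
import xml.etree.ElementTree as ET

# Placeholder factory monitoring URLs; the real factories publish a different layout.
FACTORIES = {
    "CERN": "http://factory.example.cern.ch/factory/monitor/",
    "FNAL": "http://factory.example.fnal.gov/factory/monitor/",
    "UCSD": "http://factory.example.ucsd.edu/factory/monitor/",
    "GOC": "http://factory.example.opensciencegrid.org/factory/monitor/",
}
XML_FILES = ["status1.xml", "status2.xml"]   # placeholder names for the "2 XMLs"


def fetch_factory(base_url):
    """Download and parse both XML files published by one factory."""
    for name in XML_FILES:
        data = urllib2.urlopen(base_url + name, timeout=30).read()
        yield ET.fromstring(data)


for factory, url in sorted(FACTORIES.items()):
    for tree in fetch_factory(url):
        # "entry" is a placeholder element name for the per-entry records.
        for entry in tree.iter("entry"):
            print(factory, entry.get("name"), entry.attrib)
```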
Usage (2015 Dec 10 - yesterday)
What Next?
- Show & plot negotiation time for the 3 negotiators
- Priority of production jobs per schedd
- Number of DAGMans per user/schedd (HTCondor ticket #5519)
- Archiving the data? (RRDs look like Web 1.0, but this website is mostly used for near-real-time monitoring.) The answer is no for now, but there are plans.
Plan for archiving the data:
- Use PER_JOB_HISTORY_DIR to collect all job ClassAds and publish them to ElasticSearch (not foreseeable)
- Run condor_history remotely and publish to ES (work in progress by Brian B.; see the sketch below)
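To illustrate the second archiving option, a hedged sketch of remote condor_history harvesting pushed into ElasticSearch. This is not the work-in-progress implementation, only the general shape: the collector and ES endpoints, the index name and the attribute list are all placeholders.

```python
from __future__ import print_function

import htcondor
from elasticsearch import Elasticsearch   # elasticsearch-py client

ATTRS = ["GlobalJobId", "Owner", "JobStatus", "RequestMemory",
         "MaxWallTimeMins", "MATCH_EXP_JOBGLIDEIN_CMSSite",
         "QDate", "CompletionDate"]

es = Elasticsearch(["es.example.com:9200"])           # placeholder ES endpoint
coll = htcondor.Collector("collector.example.com")    # placeholder collector

for schedd_ad in coll.locateAll(htcondor.DaemonTypes.Schedd):
    schedd = htcondor.Schedd(schedd_ad)
    # Last 100 completed jobs per schedd; a real archiver would keep a
    # per-schedd checkpoint instead of a fixed match count.
    for job in schedd.history("JobStatus == 4", ATTRS, 100):
        doc = {}
        for attr in ATTRS:
            val = job.get(attr)
            if val is None:
                continue
            # Flatten ClassAd expressions into strings so the document is JSON-safe.
            doc[attr] = val if isinstance(val, (int, float, str)) else str(val)
        # elasticsearch-py 2.x style call (doc_type was still required back then).
        es.index(index="condor_history", doc_type="job",
                 id=doc.get("GlobalJobId"), body=doc)
```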
Thanks to Everyone (and Brian)
Submission infrastructure group leaders:
Antonio Perez-Calero Yzquierdo (PIC), David Alexander Mason (FNAL) and James Letts (UCSD)
GlideinWMS operations and development team at FNAL:
Anthony Tiradani, Burt Holzman, Krista Larson, Marco Mambelli and Parag Mhashilkar
HTCondor developers:
Todd Tannenbaum, Jaime Frey, Tim Theisen and others working behind the scenes
OSG factory operations team:
Brendan Denis, Jeffrey Dost, Martin Kandes and Vassil Verguilov
CRAB3 and WMAgent operations team:
Alan Malta, Diego Ciangottini, Emilis Rupeika, Jadir Silva, Marco Mascheroni and Stefano Belforte
And many, many others!