Justas Balcas, James Letts,
Farrukh Aftab Khan, Brian Bockelman
2 Collectors (FNAL, CERN)
2 Frontends (FNAL, CERN)
3 Negotiators (FNAL, CERN)
4 Factories (CERN, FNAL, UCSD, GOC)
CMS Drivers/Submission tools (Schedds):
Production (21)
Analysis (CRAB3) (7)
Analysis (CRAB2) (1)
CMS Connect (planned for 2016) (1)
| | T1s | T2s | T3s |
|---|---|---|---|
| Number of sites | 7 | 56 | 87 |
| Max running | ~40k | ~167k | ~6k |
How much production/analysis is running on 'Site X' now, and how much was running over the last hour, day, or week?
Why are my jobs not running?
Why is 'Site X' saturated and not running anything?
Why is there a job imbalance between T2_US and T2_US_SiteX?
Which sites are multicore ready?
Etc...
RequestMemory, MaxWallTimeMins, JobStatus, DESIRED_Sites, MATCH_EXP_JOBGLIDEIN_CMSSite, Qdate, JobPrio
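As a rough illustration of how the attributes above can answer the per-site questions, here is a minimal sketch in plain Python. It assumes the job ClassAds have already been fetched and converted to dictionaries, and that DESIRED_Sites is a comma-separated string; the function names are hypothetical and not taken from the monitoring code.

```python
from collections import Counter

# Standard HTCondor JobStatus codes
IDLE = 1
RUNNING = 2


def running_per_site(job_ads):
    """Count running jobs by the site they matched to."""
    counts = Counter()
    for ad in job_ads:
        if ad.get("JobStatus") == RUNNING:
            counts[ad.get("MATCH_EXP_JOBGLIDEIN_CMSSite", "Unknown")] += 1
    return counts


def idle_per_desired_site(job_ads):
    """Count idle jobs against every site they could run at.

    Assumption: DESIRED_Sites is a comma-separated string of site names.
    An idle job that lists several sites is counted once per site.
    """
    counts = Counter()
    for ad in job_ads:
        if ad.get("JobStatus") == IDLE:
            for site in ad.get("DESIRED_Sites", "").split(","):
                site = site.strip()
                if site:
                    counts[site] += 1
    return counts
```

Comparing the two counters for a given site gives a first hint at whether idle work is waiting for it while nothing is running there.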
3 different view canvases (For workflows & sub-workflows):
Main View Overview (Running, Idle, Graphs, Workflow Count, Last Update Time)
Workflow overview
Site overview
Debug information (Grouped Running&Idle with Equal Requirements)
Each task might have different subtasks, matching requirements (Memory, WallTime, CPUs, DESIRED_Sites), different priorities.
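The debug view's grouping of running and idle jobs with equal requirements could be sketched as follows. This is an assumption-laden illustration, not the actual implementation: job ads are plain dictionaries, `RequestCpus` stands in for the "CPUs" requirement, DESIRED_Sites is treated as a comma-separated string, and both function names are hypothetical.

```python
from collections import defaultdict

# Standard HTCondor JobStatus codes
IDLE = 1
RUNNING = 2


def requirement_key(ad):
    """Build a hashable key from the matching requirements of one job ad."""
    # Sort the site list so that "A,B" and "B, A" land in the same group.
    sites = tuple(sorted(
        s.strip() for s in ad.get("DESIRED_Sites", "").split(",") if s.strip()
    ))
    return (
        ad.get("RequestMemory"),
        ad.get("MaxWallTimeMins"),
        ad.get("RequestCpus", 1),  # assumed attribute for the CPU requirement
        sites,
        ad.get("JobPrio", 0),
    )


def group_by_requirements(job_ads):
    """Return {requirement_key: {"Running": n, "Idle": m}} for one workflow."""
    groups = defaultdict(lambda: {"Running": 0, "Idle": 0})
    for ad in job_ads:
        status = ad.get("JobStatus")
        if status == RUNNING:
            groups[requirement_key(ad)]["Running"] += 1
        elif status == IDLE:
            groups[requirement_key(ad)]["Idle"] += 1
        # Jobs in other states (held, completed, ...) are ignored here.
    return dict(groups)
```

A group with many idle jobs and no running ones points at the specific requirement combination that is failing to match.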
CRAB3 schedds are monitored via a lemon sensor which fetches schedd statistics and publishes them to ElasticSearch
We also grep the shadow logs for GLExec-related errors
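The shadow-log scan can be approximated with a case-insensitive filter like the one below. This is only a sketch: the exact pattern used in production is not given here, so matching on the bare string "glexec" is an assumption, and `glexec_lines` is a hypothetical helper name.

```python
import re

# Assumption: any line mentioning "glexec" (in any letter case) is of interest.
GLEXEC_PATTERN = re.compile(r"glexec", re.IGNORECASE)


def glexec_lines(lines):
    """Return the lines that mention GLExec, with trailing newlines removed."""
    return [line.rstrip("\n") for line in lines if GLEXEC_PATTERN.search(line)]
```

Passing an open log file object as `lines` works, since iterating over a file yields one line at a time.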
Plan for Archiving the data:
Submission infrastructure group leaders:
Antonio Perez-Calero Yzquierdo (PIC), David Alexander Mason (FNAL) and James Letts (UCSD)
GlideinWMS operations and development team at FNAL:
Anthony Tiradani, Burt Holzman, Krista Larson, Marco Mambelli and Parag Mhashilkar
HTCondor developers:
Todd Tannenbaum, Jaime Frey, Tim Theisen and others working behind the scenes
OSG factory operations team:
Brendan Denis, Jeffrey Dost, Martin Kandes and Vassil Verguilov
CRAB3 and WMAgent Operations team:
Alan Malta, Diego Ciangottini, Emilis Rupeika, Jadir Silva, Marco Mascheroni and Stefano Belforte
And many many others!