Effective HTCondor-based monitoring system

Justas Balcas, James Letts, Farrukh Aftab Khan, Brian Bockelman

One Pool To Rule Them All

  • 2 Collectors (FNAL, CERN)
  • 2 Frontends (FNAL, CERN)
  • 3 Negotiators (FNAL, CERN)
  • 4 Factories (CERN, FNAL, UCSD, GOC)
  • CMS drivers/submission tools (schedds):
    • Production (21)
    • Analysis (CRAB3) (7)
    • Analysis (CRAB2) (1)
    • CMS Connect (planned for 2016) (1)

Justas Balcas (CALTECH)

2016-03-01

                  T1s     T2s     T3s
Number of sites   7       56      87
Max running       ~40k    ~167k   ~6k

Why is this needed?

  • How much production/analysis is running on 'Site X' now, and how much was running over the last hour, day, or week?

  • Why are my jobs not running?

  • Why is 'Site X' saturated and not running anything?

  • Why is there a job imbalance between T2_US and T2_US_SiteX?

  • Which sites are multicore ready?

  • Etc...

Where can I find all of the information?

  • All of this information is available from the dashboard, wmstats, the schedd, the factory page, etc.
    • It takes time to investigate, to know where to look, and to load the information
    • Check the logs of each component, on several machines at once
    • condor_status -wide -af Name TotalRunningJobs
    • condor_q -const 'DESIRED_Sites=?="T2_US_SiteX"'
    • do 'grep … | cat … | sort | uniq -c'
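
The `sort | uniq -c` step above amounts to counting how often each value appears. A minimal Python sketch of that counting, assuming the input is text shaped like the output of `condor_q -af DESIRED_Sites JobStatus` with a comma-separated site list and no spaces inside it. The helper name `count_by_site` and the sample lines are illustrative and not taken from the monitoring code.

```python
from collections import Counter


def count_by_site(lines):
    """Count jobs per (site, status), like `sort | uniq -c` on those columns."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sites, status = fields[0], fields[1]
        # A job may list several desired sites; count it once for each.
        for site in sites.split(","):
            counts[(site, status)] += 1
    return counts


# Made-up sample lines: "<DESIRED_Sites> <JobStatus>"
sample = [
    "T2_US_SiteX,T2_US_SiteY 1",
    "T2_US_SiteX 2",
    "T2_US_SiteX 2",
]
counts = count_by_site(sample)
for (site, status), number in sorted(counts.items()):
    print(number, site, status)
```

Doing the counting in one place avoids repeating the same shell pipeline on several machines, which is the cost the slide describes.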

Can you debug my jobs which ran two days ago...?

Monitoring timeline

  • Mar 14 - Initial prototype of production view
  • Mar 16 - Finished prototype
  • ----------
  • May 08 - Added CRAB3 view (CRAB2 next day)
  • Aug 03 - Totalview, ScheddView, Factoryview
  • Aug 31 - Resource utilization
  • Sep 03 - Pilot (Multicore/Single) usage in totalview per site
  • Nov 10 - Bootstrap3 (for better mobile support)
  • Jan 05:
    • Unique pressure per site in all views
    • Production priority
    • Support for multiple collectors and code reuse in all views
    • Python 2.7 + PEP8 friendly

Implementation details

 

  • Each view has an independent cron job which runs every 3 minutes
  • Running on a VM: 4 VCPUs, 8GB RAM, 2 x High IO disks (500 IOPS, 200GB), 2 x Normal disks (120 IOPS, 200GB)
  • Each view prepares RRDs and JSON output for the website
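
The JSON half of that output step could look roughly like the sketch below. It is an assumption-laden illustration, not the actual view code: the function name `write_view_json`, the `last_update` field, and the file name are hypothetical, and it is written for Python 3 (`os.replace`), whereas the timeline mentions Python 2.7. Writing to a temporary file and renaming it means the website never reads a half-written file while the cron job is running.

```python
import json
import os
import tempfile
import time


def write_view_json(path, summary):
    """Write a view summary as JSON, replacing the old file atomically."""
    # Add a timestamp so the page can show when the view was last updated.
    payload = dict(summary, last_update=int(time.time()))
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
    return payload


if __name__ == "__main__":
    # Hypothetical output file and numbers, for illustration only.
    write_view_json("prodview.json", {"running": 10, "idle": 5})
```

A crontab schedule of `*/3 * * * *` would match the 3-minute interval mentioned above.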

 

Defaults for all (Prod, CRAB3, CRAB2)

  • RequestMemory, MaxWallTimeMins, JobStatus, DESIRED_Sites, MATCH_EXP_JOBGLIDEIN_CMSSite, Qdate, JobPrio
  • 3 different view canvases (for workflows & sub-workflows):
    • Main view overview (Running, Idle, Graphs, Workflow Count, Last Update Time)
    • Workflow overview
    • Site overview
    • Debug information (Running & Idle grouped by equal requirements)
  • Each view has different links for operators to get more information
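
The "Debug information" grouping (running and idle jobs with equal requirements) can be sketched as below. This is a minimal illustration that assumes each job is available as a plain dict of the ClassAd attributes listed on this slide. The function name `group_by_requirements`, the choice of grouping key, and the sample values are assumptions. The JobStatus codes (1 = Idle, 2 = Running) are the standard HTCondor ones.

```python
from collections import defaultdict

# Standard HTCondor JobStatus codes.
IDLE, RUNNING = 1, 2


def group_by_requirements(jobs):
    """Count running and idle jobs for each distinct set of requirements."""
    groups = defaultdict(lambda: {"running": 0, "idle": 0})
    for job in jobs:
        # Assumed grouping key: jobs with equal values share one row.
        key = (
            job.get("RequestMemory"),
            job.get("MaxWallTimeMins"),
            job.get("DESIRED_Sites"),
        )
        status = job.get("JobStatus")
        if status == RUNNING:
            groups[key]["running"] += 1
        elif status == IDLE:
            groups[key]["idle"] += 1
        # Other states (held, completed, ...) are ignored in this sketch.
    return dict(groups)


# Made-up sample jobs, for illustration only.
jobs = [
    {"RequestMemory": 2000, "MaxWallTimeMins": 480,
     "DESIRED_Sites": "T2_US_SiteX", "JobStatus": 2},
    {"RequestMemory": 2000, "MaxWallTimeMins": 480,
     "DESIRED_Sites": "T2_US_SiteX", "JobStatus": 1},
    {"RequestMemory": 4000, "MaxWallTimeMins": 1440,
     "DESIRED_Sites": "T2_US_SiteX", "JobStatus": 1},
]
groups = group_by_requirements(jobs)
for key, numbers in sorted(groups.items()):
    print(key, numbers)
```

A group with many idle and no running jobs points an operator directly at the requirements that are failing to match.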

Production Overview

Each task may have different subtasks, different matching requirements (Memory, WallTime, CPUs, DESIRED_Sites), and different priorities.

Analysis Overview (CRAB3/2)

  • Since January, the number of CRAB3 jobs running in parallel has increased from 10-15K to 30-40K.
  • This placed increased load on the pool central managers:
    • More collector ClassAd updates
    • Increased Negotiator cycle times (matchmaking between jobs and pilots)
  • It has also strained ASO recently.

CRAB3 dashboard (Kibana)

CRAB3 schedds are monitored via a lemon sensor which fetches schedd statistics and publishes them to ElasticSearch.

We also grep the shadow logs for GLExec-related errors.

Total Overview

Pool Overview

Factory Overview

  • 4 factories on different continents
  • Querying all 4 every 3 minutes and parsing the 2 XMLs which each of them provides
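
Parsing a factory XML into per-entry numbers might look like the sketch below. The XML layout shown here is made up for illustration and is NOT the real GlideinWMS factory schema. The element and attribute names (`factory`, `entry`, `running`, `idle`) and the helper `summarize_factory` are all hypothetical. Only the standard-library `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout for illustration; the real factory XML differs.
SAMPLE_XML = """
<factory name="ExampleFactory">
  <entry name="Entry_T2_US_SiteX" running="120" idle="30"/>
  <entry name="Entry_T2_US_SiteY" running="45" idle="0"/>
</factory>
"""


def summarize_factory(xml_text):
    """Return {entry name: {"running": int, "idle": int}} from one XML."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for entry in root.iter("entry"):
        summary[entry.get("name")] = {
            "running": int(entry.get("running", 0)),
            "idle": int(entry.get("idle", 0)),
        }
    return summary


summary = summarize_factory(SAMPLE_XML)
# Totals across all entries of this one factory.
totals = {
    key: sum(numbers[key] for numbers in summary.values())
    for key in ("running", "idle")
}
print(totals)
```

With four factories, the same function would be applied to each downloaded document and the per-factory summaries merged.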

Usage (2015 Dec 10 - Yesterday)

What Next?

  • Show & plot negotiation time for the 3 negotiators
  • Priority of production jobs per schedd
  • Number of DAGMans per user/schedd (HTCondor ticket 5519)
  • Archiving the data? Not yet, but there are plans. (RRDs look like Web 1.0, but this website is used mostly for near real-time monitoring.)

Plan for archiving the data:

  • Use PER_JOB_HISTORY_DIR to collect all job ClassAds and publish them to ElasticSearch. (Not foreseeable)
  • Run condor_history remotely and publish to ES. (Work in progress by Brian B.)
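
For the "publish to ES" step, one possible shape is to turn job ClassAds into an ElasticSearch bulk request body: one action line plus one document line per job, each terminated by a newline. The sketch below only builds that text and does not talk to a server. It assumes each ad is already a plain dict, uses the `GlobalJobId` attribute as the document id, and uses a made-up index name. It is not the approach of the work in progress mentioned above.

```python
import json


def to_bulk_body(job_ads, index_name):
    """Build a newline-delimited bulk body from a list of job-ad dicts."""
    lines = []
    for ad in job_ads:
        # Using GlobalJobId as the id makes re-publishing the same job
        # overwrite the document instead of duplicating it.
        action = {"index": {"_index": index_name, "_id": ad["GlobalJobId"]}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(ad, sort_keys=True))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Made-up job ads, for illustration only.
ads = [
    {"GlobalJobId": "schedd.example#1.0#1456790400", "JobStatus": 4,
     "RequestMemory": 2000},
    {"GlobalJobId": "schedd.example#1.1#1456790400", "JobStatus": 4,
     "RequestMemory": 2000},
]
body = to_bulk_body(ads, "example-job-history")
print(body, end="")
```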

Submission infrastructure group leaders:
Antonio Perez-Calero Yzquierdo (PIC), David Alexander Mason (FNAL) and James Letts (UCSD)

GlideinWMS operations and development team at FNAL:
Anthony Tiradani, Burt Holzman, Krista Larson, Marco Mambelli and Parag Mhashilkar

HTCondor developers:
Todd Tannenbaum, Jaime Frey, Tim Theisen and others working behind the scenes

OSG factory operations team:
Brendan Denis, Jeffrey Dost, Martin Kandes and Vassil Verguilov

CRAB3 and WMAgent Operations team:
Alan Malta, Diego Ciangottini, Emilis Rupeika, Jadir Silva, Marco Mascheroni and Stefano Belforte

 

And many many others!

Thanks to Everyone (and Brian)
