Effective HTCondor-based monitoring system

Justas Balcas, James Letts,

Farrukh Aftab Khan, Brian Bockelman

  • 2 Collectors (FNAL, CERN)
  • 2 Frontends (FNAL, CERN)
  • 3 Negotiators (FNAL, CERN)
  • 4 factories (CERN, FNAL, UCSD, GOC)
  • CMS Drivers/Submission tools (Schedds):
    • Production (21)
    • Analysis (CRAB3) (7)
    • Analysis (CRAB2) (1)
    • CMS Connect (planned for 2016) (1)

One Pool To Rule Them All

Justas Balcas (CALTECH)

2016-03-01

               T1s     T2s    T3s
Num of Sites     7      56     87
Max Running   ~40k   ~167k    ~6k


Why is this needed?

  • How much prod/analysis is running on 'Site X' and was running last hour, day, week?

  • Why are my jobs not running?

  • Why is 'Site X' saturated yet not running anything?

  • Why is there a job imbalance between T2_US sites, e.g. T2_US_SiteX?

  • Which sites are multicore ready?

  • Etc...

Where can I find all of this information?

  • All of this information is available from: dashboard, wmstats, schedds, factory pages, etc.
    • It takes time to investigate, to know where to look, and to load the information
    • Check the logs of each component, on several machines at once
    • condor_status -wide -af Name TotalRunningJobs
    • condor_q -const 'DESIRED_Sites=?="T2_US_SiteX"'
    • do 'grep … | cat … | sort | uniq -c'
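The grep/sort/uniq pipeline in the last bullet can be reproduced in a few lines of Python. The sample condor_q output below is hypothetical, as if printed by `condor_q -af JobStatus MATCH_EXP_JOBGLIDEIN_CMSSite`:

```python
from collections import Counter

# Hypothetical condor_q output: one line per job
# (JobStatus 1 = Idle, 2 = Running), followed by the site name.
condor_q_output = """\
2 T2_US_SiteX
1 T2_US_SiteX
2 T2_DE_SiteY
2 T2_US_SiteX"""

STATUS = {1: "Idle", 2: "Running"}

# Python equivalent of `... | sort | uniq -c`:
# count identical (status, site) pairs.
counts = Counter()
for line in condor_q_output.splitlines():
    status, site = line.split()
    counts[(STATUS[int(status)], site)] += 1

for (status, site), n in sorted(counts.items()):
    print("%4d %-8s %s" % (n, status, site))
```

The same counting loop works unchanged whether the input comes from a file, a pipe, or a remote query.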


Can you debug my jobs which ran two days ago...?


Monitoring timeline

  • Mar 14 - Initial prototype of the production view
  • Mar 16 - Finished prototype
  • ----------
  • May 08 - Added CRAB3 view (CRAB2 the next day)
  • Aug 03 - Totalview, ScheddView, Factoryview
  • Aug 31 - Resource utilization
  • Sep 03 - Pilot (multicore/single-core) usage per site in Totalview
  • Nov 10 - Bootstrap 3 (for better mobile support)
  • Jan 05 - Unique pressure per site in all views;
             production priority;
             support for multiple collectors and code reuse across all views;
             Python 2.7 + PEP8 friendly

Implementation details

 

  • Each view has an independent cronjob which runs every 3 minutes
  • Running on a VM: 4 VCPUs, 8 GB RAM, 2 x high-IO disks (500 IOPS, 200 GB), 2 x normal disks (120 IOPS, 200 GB)
  • Each view prepares RRDs and JSON output for the website
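A minimal sketch of what such a per-view cronjob could produce, assuming illustrative function and field names (this is not the production code):

```python
import json
import time

def write_snapshot(view_name, job_ads, out_dir="/tmp"):
    """Aggregate the latest job ads into a JSON snapshot for the website.

    Hypothetical sketch: each 3-minute cron run would call this with the
    freshly queried job ClassAds for its view.
    """
    summary = {
        "view": view_name,
        "updated": int(time.time()),  # last-update time shown on the page
        "running": sum(1 for ad in job_ads if ad["JobStatus"] == 2),
        "idle": sum(1 for ad in job_ads if ad["JobStatus"] == 1),
    }
    with open("%s/%s.json" % (out_dir, view_name), "w") as fd:
        json.dump(summary, fd)
    return summary

snapshot = write_snapshot(
    "production",
    [{"JobStatus": 2}, {"JobStatus": 1}, {"JobStatus": 2}],
)
```

The RRD updates would happen alongside the JSON dump in the same cron run, so graphs and counters stay in sync.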

 


Defaults for all (Prod, CRAB3, CRAB2)

  • Job ClassAd attributes used: RequestMemory, MaxWallTimeMins, JobStatus, DESIRED_Sites, MATCH_EXP_JOBGLIDEIN_CMSSite, QDate, JobPrio

  • 3 different view canvases (for workflows & sub-workflows):

    • Main View Overview (Running, Idle, Graphs, Workflow Count, Last Update Time)

    • Workflow overview

    • Site overview

    • Debug information (grouped Running & Idle jobs with equal requirements)

  • Each view has different links for operators to get more information

Each task might have different subtasks, matching requirements (Memory, WallTime, CPUs, DESIRED_Sites) and different priorities.
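The "equal requirements" grouping behind the debug view can be sketched as follows; the job ads below are hypothetical samples, not real pool data:

```python
from collections import defaultdict

# Hypothetical job ClassAds with the matchmaking attributes listed above.
job_ads = [
    {"JobStatus": 2, "RequestMemory": 2000, "MaxWallTimeMins": 1400,
     "DESIRED_Sites": "T2_US_SiteX"},
    {"JobStatus": 1, "RequestMemory": 2000, "MaxWallTimeMins": 1400,
     "DESIRED_Sites": "T2_US_SiteX"},
    {"JobStatus": 1, "RequestMemory": 4000, "MaxWallTimeMins": 2800,
     "DESIRED_Sites": "T2_DE_SiteY"},
]

# Group Running/Idle counts for jobs sharing identical requirements.
groups = defaultdict(lambda: {"Running": 0, "Idle": 0})
for ad in job_ads:
    key = (ad["RequestMemory"], ad["MaxWallTimeMins"], ad["DESIRED_Sites"])
    groups[key]["Running" if ad["JobStatus"] == 2 else "Idle"] += 1

for (mem, wall, sites), counts in sorted(groups.items()):
    print("mem=%sMB wall=%smin sites=%s -> %s" % (mem, wall, sites, counts))
```

Jobs with identical requirements either all match or all fail to match a given pilot, which is why grouping them makes debugging idle pressure much faster.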


Production Overview


Analysis Overview (CRAB3/2)

  • Since January, the number of CRAB3 jobs has increased from 10-15K to 30-40K parallel running jobs.
  • This placed increased load on the pool central managers:
    • More collector ClassAd updates
    • Increased Negotiator cycle times (matchmaking between jobs and pilots)
  • This has also strained ASO (the CRAB3 Asynchronous Stage-Out service) recently.


CRAB3 dashboard (Kibana)

CRAB3 schedds are monitored via a Lemon sensor which fetches schedd statistics and publishes them to Elasticsearch.

We also grep the shadow logs for GLExec-related errors.


Total Overview


Pool Overview

  • 4 factories on different continents
  • Querying all 4 every 3 minutes and parsing the 2 XML files each provides
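A rough sketch of that XML aggregation step, assuming an illustrative XML layout (the real factory status schema differs in detail):

```python
import xml.etree.ElementTree as ET

# Hypothetical factory status XML: the real factories expose similar
# per-entry pilot counts, but tag and attribute names here are
# illustrative only.
xml_text = """
<factoryStatus>
  <entry name="CMS_T2_US_SiteX"><status Running="120" Idle="15"/></entry>
  <entry name="CMS_T2_DE_SiteY"><status Running="40" Idle="3"/></entry>
</factoryStatus>
"""

root = ET.fromstring(xml_text)

# Sum the pilot counts over all factory entries.
totals = {"Running": 0, "Idle": 0}
for entry in root.findall("entry"):
    status = entry.find("status")
    for key in totals:
        totals[key] += int(status.get(key))

print(totals)
```

Running the same parse over all 4 factories' XML files and merging the per-entry counts gives the numbers shown in the Factory Overview.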


Factory Overview

Usage (2015 Dec 10 - yesterday)


What Next?

  • Show & plot negotiation time for the 3 negotiators
  • Priority of production jobs per schedd
  • Number of DAGMans per user/schedd (HTCondor ticket 5519)
  • Archiving the data (RRDs look like Web 1.0, but this website is mostly used for near-real-time monitoring). The answer for now is no, but there are plans.

Plan for archiving the data:

  • Use PER_JOB_HISTORY_DIR to take all job ClassAds and publish them to Elasticsearch. (Not foreseeable)
  • Run condor_history remotely and publish to ES. (Work in progress by Brian B.)
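The condor_history-to-Elasticsearch path could look roughly like this; the helper name, index name, and attribute set are hypothetical, and the real work in progress may differ:

```python
import json

def to_es_bulk(job_ad, index="cms-job-history"):
    """Turn one job ClassAd (as a dict) into an Elasticsearch
    bulk-index payload.

    The ES bulk API expects newline-delimited JSON: an action line,
    then the document itself. GlobalJobId is unique per job, so it
    makes a natural document id.
    """
    action = {"index": {"_index": index, "_id": job_ad["GlobalJobId"]}}
    return json.dumps(action) + "\n" + json.dumps(job_ad) + "\n"

# Hypothetical finished-job ad, as condor_history could return it.
ad = {"GlobalJobId": "schedd01#123.0#1456789012",
      "JobStatus": 4, "RequestMemory": 2000}
payload = to_es_bulk(ad)
```

Concatenating the payloads of many jobs into one request is what makes bulk indexing cheap enough to archive a whole schedd's history.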

Thanks to Everyone (and Brian)

Submission infrastructure group leaders:
Antonio Perez-Calero Yzquierdo (PIC), David Alexander Mason (FNAL) and James Letts (UCSD)

GlideinWMS operations and development team at FNAL:
Anthony Tiradani, Burt Holzman, Krista Larson, Marco Mambelli and Parag Mhashilkar

HTCondor developers:
Todd Tannenbaum, Jaime Frey, Tim Theisen and others working behind the scenes

OSG factory operations team:
Brendan Denis, Jeffrey Dost, Martin Kandes and Vassil Verguilov

CRAB3 and WMAgent Operations team:
Alan Malta, Diego Ciangottini, Emilis Rupeika, Jadir Silva, Marco Mascheroni and Stefano Belforte

And many many others!
