Introduction
What's cooking in HTCondor
Docker
HTCondor and DAGMan
Python bindings
Data caching
Other
Testing the limits of condor
HTCondor monitoring
Hands on tutorial
Throughput: the quantity of work done by an electronic computer in a given period of time (Dictionary.com)
Job
The HTCondor representation of a piece of work, like a Unix process; a job can be an element of a workflow
ClassAd
HTCondor’s internal data representation (see the example ad after this glossary)
Machine or Resource
Computers that can do the processing
Matchmaking
associating a job with a machine resource
Central Manager
the central repository of information for the whole pool; performs matchmaking
Submit Host
the computer from which jobs are submitted to HTCondor
Execute Host
the computer that runs a job
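To make ClassAds concrete, here is a sketch of a (much abbreviated) machine ad; the attribute values are illustrative:

MyType = "Machine"
Name = "slot1@host.example.edu"
State = "Unclaimed"
Cpus = 8
Memory = 15960
Start = true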
Docker
Docker manages Linux containers.
Containers give Linux processes a private file system, process space, and network namespace.
Example: an “ubuntu” container running on a Fedora host. Processes in other containers on the same machine can NOT see what is going on inside the “ubuntu” container.
# On the execute host, HTCondor needs to know where Docker lives:
DOCKER = /usr/bin/docker

# Submit file for a docker universe job:
universe = docker
executable = /bin/my_executable
arguments = arg1
docker_image = deb7_and_HEP_stack
transfer_input_files = some_input
output = out
error = err
log = log
queue
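Submission then works like any other job; assuming the submit file above is saved as docker.sub (name illustrative):

$ condor_submit docker.sub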
The way it works
# Each Queue statement queues one job; $(Process) expands to 0, 1, ...
# so the outputs land in meal0.out and meal1.out
Universe = Vanilla
Executable = cook
Output = meal$(Process).out
Args = -i pasta
Queue
Args = -i chicken
Queue
# The same two jobs, written with the newer Queue ... in syntax
Universe = Vanilla
Executable = cook
Output = meal$(Process).out
Args = -i $(Item)
Queue Item in (pasta, chicken)
Queue <N> <var> in (<item-list>)
Queue <N> <var> matching (<glob-list>)
Queue <N> <vars> from <filename>
Queue <N> <vars> from <script>
Queue <N> <vars> from (
<multiline-list>
)
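For instance, a sketch of the from <filename> form; the file name args.txt, its contents, and the -t option are illustrative:

# args.txt holds one comma-separated entry per job, e.g.:
#   pasta, 10
#   chicken, 20
Universe = Vanilla
Executable = cook
Output = meal$(Process).out
Args = -i $(Item) -t $(Time)
Queue Item,Time from args.txt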
If you can do it with the command-line tools, you should be able to do it with Python.
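For example, a minimal condor_q-style query through the bindings (a sketch, assuming a schedd on the local host):

import htcondor

schedd = htcondor.Schedd()  # the local schedd
# Roughly what "condor_q" shows: one line per job in the queue
for ad in schedd.query("true", ["ClusterId", "ProcId", "Owner", "JobStatus"]):
    print("%d.%d %s status=%d" % (ad["ClusterId"], ad["ProcId"],
                                  ad["Owner"], ad["JobStatus"]))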
Testing the limits of condor
HTCondor monitoring
To try it out, you can just parse the output of “condor_q” into the desired format, then use netcat to send it to the Graphite server:
#!/bin/bash
# Crude running-job count: greps condor_q output for lines containing "R"
metric="htcondor.running"
value=$(condor_q | grep R | wc -l)
timestamp=$(date +%s)
echo "$metric $value $timestamp" | nc \
  graphite.yourdomain.edu 2003
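A slightly more precise variant would count only jobs whose JobStatus is 2 (running), instead of grepping for the letter R; a sketch using condor_q's -constraint and -af options:

#!/bin/bash
metric="htcondor.running"
# JobStatus == 2 means "running"; -af prints one attribute per matching job
value=$(condor_q -constraint 'JobStatus == 2' -af ClusterId | wc -l)
timestamp=$(date +%s)
echo "$metric $value $timestamp" | nc graphite.yourdomain.edu 2003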
import time
import classad, htcondor

coll = htcondor.Collector("htcondor.domain.edu")
slotState = coll.query(htcondor.AdTypes.Startd, "true",
    ['Name', 'JobId', 'State', 'RemoteOwner', 'COLLECTOR_HOST_STRING'])

# Count the slots that are currently claimed
slot_claimed = 0
for slot in slotState:
    if slot.get('State') == "Claimed":
        slot_claimed += 1

timestamp = int(time.time())
print("condor.claimed " + str(slot_claimed) + " " + str(timestamp))
A Python script polls the history logs periodically for new entries and publishes them to a Redis channel.
The ClassAds published to the Redis channel are then read by Logstash.
Because of the size of the ClassAds, and because Elasticsearch only works on data in memory, the data goes into a new index each month.
A Python script is run every minute by a cron job and collects the ClassAds of all jobs.
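A minimal sketch of the publishing side, assuming a local Redis server, the redis Python package, and an illustrative channel name htcondor-history:

import json
import htcondor
import redis

r = redis.Redis(host="localhost", port=6379)
schedd = htcondor.Schedd()

# Publish recent history entries (completed jobs) to a Redis channel;
# the projection and the match count of 100 are illustrative
for ad in schedd.history("true", ["ClusterId", "ProcId", "Owner", "JobStatus"], 100):
    r.publish("htcondor-history", json.dumps(dict(ad)))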
Default values:
NUM_CPUS = $(DETECTED_CPUS)
MEMORY = $(DETECTED_MEMORY)
$ condor_config_val -dump | grep DETECTED
DETECTED_CORES = 8
DETECTED_CPUS = 8
DETECTED_MEMORY = 15960
DETECTED_PHYSICAL_CPUS = 4
$ condor_config_val -dump | grep NUM_CPUS
NUM_CPUS = 8
$ condor_status -totals
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 8 0 0 8 0 0 0
Total 8 0 0 8 0 0 0
Increase the number of CPUs and the memory:
NUM_CPUS = 32
MEMORY = $(DETECTED_MEMORY)*32
With the new configuration, the detected values are unchanged but NUM_CPUS is overridden:
$ condor_config_val -dump | grep DETECTED
DETECTED_CORES = 8
DETECTED_CPUS = 8
DETECTED_MEMORY = 15960
DETECTED_PHYSICAL_CPUS = 4
$ condor_config_val -dump | grep NUM_CPUS
NUM_CPUS = 32
$ condor_status -totals
Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX 32 0 0 32 0 0 0
Total 32 0 0 32 0 0 0
$ condor_status -long slot1@jbalcas | grep -i gpus
TotalGPUs = 2
DetectedGPUs = 2
AssignedGPUs = "CUDA0"
MachineResources = "Cpus Memory Disk Swap GPUs"
GPUs = 1
TotalSlotGPUs = 1
$ condor_config_val -dump gpus
ENVIRONMENT_FOR_AssignedGPUs = GPU_NAME GPU_ID=/CUDA//
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
MACHINE_RESOURCE_GPUs = CUDA0, CUDA1
If your graphics card is from NVIDIA and it is listed in http://developer.nvidia.com/cuda-gpus, your GPU is CUDA-capable.
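To run on such a machine, a job requests GPUs in its submit file; a minimal sketch (the executable name is illustrative):

universe = vanilla
executable = my_gpu_job
request_GPUs = 1
# optionally insist on a minimum CUDA capability
requirements = (CUDACapability >= 3.0)
queue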
Requirements:
MyNonSuspendableSlotIsIdle = \
    (NonSuspendableSlotState =!= "Claimed" && \
     NonSuspendableSlotState =!= "Preempting")

# NonSuspendable slots are always willing to start jobs.
# Suspendable slots are only willing to start if the NonSuspendable slot is idle.
START = \
    IsSuspendableSlot =!= True && IsSuspendableJob =!= True || \
    IsSuspendableSlot && IsSuspendableJob == True && $(MyNonSuspendableSlotIsIdle)
# Suspend the suspendable slot if the other slot is busy.
SUSPEND = \
    IsSuspendableSlot && $(MyNonSuspendableSlotIsIdle) != True

CONTINUE = ($(SUSPEND)) != True
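On the submit side, a job opts in by advertising the attribute that the START expression above looks for; the relevant submit-file line would be:

# mark this job as suspendable so it can match the suspendable slot
+IsSuspendableJob = True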