IceCube has an HTCondor compute cluster at UW-Madison, colloquially called NPX:
dschultz@pub1 ~ $ ssh submitter
================ This is the submit node for the NPX cluster ===============
Policies, Condor documentation, howtos, best practices, etc. can be found
at https://wiki.icecube.wisc.edu/index.php/Condor. Here are highlights:
* By default, maximum job runtime is 12 hours
* By default, jobs are allocated 1 CPU core, 1GB of memory, 1GB of disk
dschultz@submitter ~ $
Use this machine for submitting cluster jobs.
You should already have a script or program that you run to create or analyze data and simulation. To run it on a cluster, you need to eliminate anything interactive: jobs run unattended, with no keyboard input and no display.
The basic job framework is strictly file-based input/output. Note that this is excellent for IceTray processing, since we have an I3Reader and I3Writer with modules in between.
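As a sketch of what such a file-in, file-out job can look like in IceTray (the reader and writer modules are standard; the argument handling here is just illustrative):
#!/usr/bin/env python
# minimal file-based IceTray chain: read an .i3 file, run modules, write an .i3 file
import sys
from I3Tray import I3Tray
from icecube import icetray, dataio
tray = I3Tray()
tray.AddModule('I3Reader', 'reader', Filename=sys.argv[1])
# ... your analysis modules would go here ...
tray.AddModule('I3Writer', 'writer', Filename=sys.argv[2])
tray.Execute()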
HTCondor has lots of options, but we'll focus on only a few. Let's start with the most basic thing we can: hello world.
First, we need a script to run:
#!/bin/sh
echo Hello World!
Always verify that the script does what you want before running it on the cluster:
dschultz@submitter ~/test/helloworld $ chmod +x hello.sh
dschultz@submitter ~/test/helloworld $ ./hello.sh
Hello World!
dschultz@submitter ~/test/helloworld $
Good, that does what we want.
Now, let's make the HTCondor submit file:
# this is the script we want to run
executable = hello.sh
# some logging in case we have to debug things
log = hello.log
output = hello.out
error = hello.err
# don't send me any emails about job status
notification = never
# add the job to the queue
queue
And finally we tell HTCondor to run our job:
dschultz@submitter ~/test/helloworld $ condor_submit hello.submit
Submitting job(s).
1 job(s) submitted to cluster 20556929.
dschultz@submitter ~/test/helloworld $
If the cluster is not very busy, this will run immediately and make the output:
dschultz@submitter ~/test/helloworld $ cat hello.out
Hello World!
dschultz@submitter ~/test/helloworld $
For longer jobs, which may take several hours, we can view the status of the job via condor_q:
dschultz@submitter ~/test/helloworld $ condor_q dschultz
-- Submitter: submit.icecube.wisc.edu : <10.128.12.110:34298> : submit.icecube.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
20556929.0 dschultz 5/21 10:30 0+00:00:00 I 0 0.0 hello.sh
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
dschultz@submitter ~/test/helloworld $
Look, we have an idle job (ST column I); it's waiting for a free slot in the cluster.
The most common HTCondor job states are:
* I - idle: waiting in the queue for a free slot
* R - running
* H - held: stopped by HTCondor (or by you) and will not run again until released
* C - completed
* X - removed
Say we have a high-memory job that requires 8 GB of memory to run. Even if it happens to run with the default 1 GB request, it is polite to ask for what you actually use. Modify the submit file to:
executable = hello.sh
log = hello.log
output = hello.out
error = hello.err
notification = never
# request more memory (request_memory is in MB)
request_memory = 8000
queue
Other resources can be requested in the same way: disk space (request_disk, in KB), CPUs (request_cpus), and GPUs (request_gpus).
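For example, a sketch of a submit file fragment that asks for several resources at once (the numbers are purely illustrative):
# illustrative resource requests (memory in MB, disk in KB)
request_cpus = 4
request_memory = 4000
request_disk = 2000000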
GPU jobs are similarly defined with a gpu request in the submit file:
executable = hello.sh
log = hello.log
output = hello.out
error = hello.err
notification = never
# gpu job
request_gpus = 1
# uncomment the next line to run only on CUDA-capable GPUs
#requirements = CUDACapability
queue
Note the CUDACapability requirement if your program needs an NVIDIA/CUDA GPU.
We also have accounting groups to denote special conditions. The main one is the long group.
The default group sets a job time limit of 12 hours, after which the job is put on Hold. The long group increases this time to 48 hours, at the expense of fewer jobs running at one time. Add a job to the long group with the following submit file:
executable = hello.sh
log = hello.log
output = hello.out
error = hello.err
notification = never
# long job
+AccountingGroup="long.$ENV(USER)"
queue
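To confirm that a job picked up the accounting group, you can query its classad; this assumes condor_q's -af (autoformat) option and your own job ID:
condor_q -af AccountingGroup <jobid>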
HTCondor submission has recently gained new queue statement features that make it more useful:
executable = hello.sh
log = hello.log
output = hello.out
error = hello.err
notification = never
# set arguments to executable
arguments = $(Item)
queue 1 in (1,2,3,4,5,6,7,8,9,10)
This will make 10 jobs, one for each argument.
If you wanted to run on input files, you can do:
queue 1 Item matching (*.i3.gz)
This will start one job per input file, passing the filename as an argument to the executable.
You can also set up a separate file to hold the arguments:
queue 1 Item from arguments.txt
Each line in arguments.txt will create a separate job with that line passed as an argument to the executable.
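For example, a file named arguments.txt could look like this (the filenames are illustrative); each line becomes $(Item) for one job:
input1.i3.bz2 output1.i3.bz2
input2.i3.bz2 output2.i3.bz2
input3.i3.bz2 output3.i3.bz2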
DAGMan is a tool that comes bundled with HTCondor. It can do two useful things for us: throttle how many jobs are queued or running at once, and enforce dependencies so that a job only starts after its parents have finished.
Also, DAGMan = Directed Acyclic Graph Manager.
Let's make a basic DAG submit file:
# file name: dagman.submit
JOB job1 job.condor
VARS job1 Filenum="001"
JOB job2 job.condor
VARS job2 Filenum="002"
JOB job3 job.condor
VARS job3 Filenum="003"
JOB job4 job.condor
VARS job4 Filenum="004"
And a regular condor submit file to run:
# file name: job.condor
# special variables:
# Filenum = Filenum var defined in dagman.submit
Executable = job.sh
Arguments = $(Filenum)
output = job.$(Cluster).out
error = job.$(Cluster).err
log = job.log
notification = never
queue
And a script to run:
#!/bin/sh
# file name: job.sh
echo $@
And submit it, limiting to 2 active jobs:
dschultz@submitter ~/test/dagman $ chmod +x job.sh
dschultz@submitter ~/test/dagman $ condor_submit_dag -maxjobs 2 dagman.submit
-----------------------------------------------------------------------
File for submitting this DAG to Condor : dagman.submit.condor.sub
Log of DAGMan debugging messages : dagman.submit.dagman.out
Log of Condor library output : dagman.submit.lib.out
Log of Condor library error messages : dagman.submit.lib.err
Log of the life of condor_dagman itself : dagman.submit.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 21135967.
-----------------------------------------------------------------------
dschultz@submitter ~/test/dagman $
We can see the DAGMan job running:
dschultz@submitter ~/test/dagman $ condor_q dschultz
-- Submitter: submit.icecube.wisc.edu : <10.128.12.110:34298> : submit.icecube.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
21135967.0 dschultz 6/2 14:31 0+00:00:38 R 0 0.3 condor_dagman -f -
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
dschultz@submitter ~/test/dagman $
Looking at the dagman output messages, the job limit can be seen:
...
06/02/14 14:40:49 Of 4 nodes total:
06/02/14 14:40:49 Done Pre Queued Post Ready Un-Ready Failed
06/02/14 14:40:49 === === === === === === ===
06/02/14 14:40:49 0 0 2 0 2 0 0
06/02/14 14:40:49 0 job proc(s) currently held
06/02/14 14:40:49 Note: 2 total job deferrals because of -MaxJobs limit (2)
...
Now let us look at an example with dependencies, where one job must run before another one.
Let's make a dag with 3 parents and one child (maybe processing and cleanup?):
# file name: dagman.submit
JOB job1 job.condor
VARS job1 Filenum="001"
JOB job2 job.condor
VARS job2 Filenum="002"
JOB job3 job.condor
VARS job3 Filenum="003"
JOB job4 job.condor
VARS job4 Filenum="004"
# define the DAG relationship
PARENT job1 job2 job3 CHILD job4
And submit it, limiting to 2 active jobs:
dschultz@submitter ~/test/dagman $ condor_submit_dag -maxjobs 2 dagman.submit
-----------------------------------------------------------------------
File for submitting this DAG to Condor : dagman.submit.condor.sub
Log of DAGMan debugging messages : dagman.submit.dagman.out
Log of Condor library output : dagman.submit.lib.out
Log of Condor library error messages : dagman.submit.lib.err
Log of the life of condor_dagman itself : dagman.submit.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 21139114.
-----------------------------------------------------------------------
dschultz@submitter ~/test/dagman $
Looking at the dagman output messages, we can see the child job is un-ready until the three parents have finished:
...
06/02/14 15:45:31 Of 4 nodes total:
06/02/14 15:45:31 Done Pre Queued Post Ready Un-Ready Failed
06/02/14 15:45:31 === === === === === === ===
06/02/14 15:45:31 0 0 2 0 1 1 0
06/02/14 15:45:31 0 job proc(s) currently held
06/02/14 15:45:31 Note: 1 total job deferrals because of -MaxJobs limit (2)
...
06/02/14 15:45:56 Of 4 nodes total:
06/02/14 15:45:56 Done Pre Queued Post Ready Un-Ready Failed
06/02/14 15:45:56 === === === === === === ===
06/02/14 15:45:56 3 0 0 0 1 0 0
06/02/14 15:45:56 0 job proc(s) currently held
...
First, let's get a basic IceTray script:
dschultz@submitter ~/test/icetray $ wget http://code.icecube.wisc.edu/svn/sandbox/bootcamp_madison_2014/my_first_icetray_script.py
dschultz@submitter ~/test/icetray $ chmod +x my_first_icetray_script.py
dschultz@submitter ~/test/icetray $
Now for some examples of how to submit IceTray jobs.
When you only have a few jobs to run, basic HTCondor submission is fine:
# file name: job.condor
Executable = my_first_icetray_script.py
output = job.$(Cluster).out
error = job.$(Cluster).err
log = job.log
notification = never
# use the current metaproject environment
getenv = True
# run on input
Arguments = input.i3.bz2 output.i3.bz2
queue
# run on input2
Arguments = input2.i3.bz2 output2.i3.bz2
queue
For larger numbers of jobs, DAGMan should be used. Here is the dagman submit file:
# file name: dagman.submit
JOB job1 job.condor
VARS job1 gcd="gcd.i3.gz"
VARS job1 infilename="input.i3.bz2"
VARS job1 outfilename="output.i3.bz2"
JOB job2 job.condor
VARS job2 gcd="gcd.i3.gz"
VARS job2 infilename="input2.i3.bz2"
VARS job2 outfilename="output2.i3.bz2"
And the condor submit file:
# file name: job.condor
Executable = my_first_icetray_script.py
output = job.$(Cluster).out
error = job.$(Cluster).err
log = job.log
notification = never
# use the current metaproject environment
getenv = True
Arguments = $(gcd) $(infilename) $(outfilename)
queue
Many wrapper scripts exist to do some of the tedious work of submission, such as modifying the basic template to match valid filenames for the current run(s). Here are some examples.
This shell script will build the dag submit file for you, based on the input directory you give it. Some customization of the script may be necessary each time it is used.
Shell Script:
#!/bin/sh
for i in $1/*[0123456789].i3.bz2; do
    JOBID=job.`basename $i`
    echo JOB $JOBID job.condor
    gcdfile=`echo $i | sed s/Part.*.i3.bz2/GCD.i3.gz/g`
    echo VARS $JOBID gcd=\"$gcdfile\"
    echo VARS $JOBID infilename=\"$i\"
    echo VARS $JOBID outfilename=\"data/`basename $i`\"
done
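A sketch of how it might be used, assuming it is saved as something like builddag.sh and made executable; it prints the DAG to stdout, so redirect the output into a file and submit that:
./builddag.sh /path/to/input/directory > dagman.submit
condor_submit_dag dagman.submit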
This series of scripts can be used to submit one or multiple jobs. It consists of a single job submit file, the DAG builder, and a shell script to submit and begin monitoring in one step. There is some documentation both in a README and in comments in each file.
condorDAGManExamples is located on svn:
dschultz@cobalt01 ~ $ svn ls http://code.icecube.wisc.edu/svn/sandbox/gladstone/condorDAGManExamples/
OneJob.submit
README
SubmitMyJobs.sh
builddag.sh
dagman.config
The #1 piece of advice: get on Slack and ask questions. We promise to be nice.
If the queue is full, you may need to wait up to an hour for your job to start. Also, if you have been running lots of other jobs, your priority may be lower than that of other users.
You can check your priority with condor_userprio.
If you think your job should be running and it isn't, then debugging can start. First, find the ID of the job. Then run condor_q -better-analyze on that ID.
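For example, using the job ID from the hello world example above:
condor_q -better-analyze 20556929.0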
Try running condor_q -hold $USER. This will tell you if your jobs have been stopped, and hopefully why. If nothing appears, either your jobs failed or they will restart automatically.
A common error message is:
-- Submitter: submitter.icecube.wisc.edu : <10.128.12.110:46424> : submitter.icecube.wisc.edu
ID OWNER HELD_SINCE HOLD_REASON
21482446.0 briedel 6/9 21:55 Maximum excution time exceeded: 12:01:24 > 12:00:00
The problem here is that this job was in the default group and exceeded the 12-hour time limit. It should be resubmitted in the long group.
If condor_q or other condor commands fail with an error like:
-- Failed to fetch ads from: <10.128.12.110:40381> : submitter.icecube.wisc.edu
CEDAR:6001:Failed to connect to <10.128.12.110:40381>
HTCondor is likely overloaded, so stop trying to ask it things. Wait 5 minutes and try again.
If it fails to work for 30 minutes, then there might be a real problem. Email help@icecube.wisc.edu with the error message.
To practice submitting to condor and doing real work, let's find all events that pass the min bias filter from the first 100 Level2 files in this directory:
/data/sim/IceCube/2011/filtered/level2/CORSIKA-in-ice/10668/01000-01999/
Some hints:
A magic shebang is:
#!/bin/sh /cvmfs/icecube.opensciencegrid.org/standard/icetray-start
#METAPROJECT: offline-software/trunk
Be sure to check that the prescale passed too:
frame['FilterMask'][filter].condition_passed and
frame['FilterMask'][filter].prescale_passed
A script to process each file:
#!/bin/sh /cvmfs/icecube.opensciencegrid.org/standard/icetray-start
#METAPROJECT: offline-software/trunk
import sys, os
input = sys.argv[1]
output = os.path.join(sys.argv[2], os.path.basename(input))
from icecube import dataclasses, dataio
outfile = dataio.I3File(output, 'w')
try:
    for frame in dataio.I3File(input):
        # keep only frames where the min bias filter fired and passed its prescale
        if 'FilterMask' in frame:
            result = frame['FilterMask']['FilterMinBias_11']
            if result.condition_passed and result.prescale_passed:
                outfile.push(frame)
finally:
    outfile.close()
The condor submit file:
executable = script.py
output = out.$(Process)
error = err.$(Process)
log = log
notification = never
arguments = $(Item) /data/user/dschultz/bootcamp_output
queue 1 Item matching (/data/sim/IceCube/2011/filtered/level2/CORSIKA-in-ice/10668/01000-01999/Level2_IC86.2011_corsika.010668.0010*.i3.bz2)
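Before submitting, change the output directory in arguments to one you own under /data/user and make sure it exists (the script above will not create it). Then, assuming the submit file is saved as minbias.submit:
mkdir -p /data/user/$USER/bootcamp_output
condor_submit minbias.submit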
More tutorials can be found at:
IceCube-specific information: