Condor Job Scheduling Environment

Starting in Fall 2003, you will be able to run your SimpleScalar simulation jobs (or any other jobs) on the machines in Moore 207 in batch cycle-stealing mode: you submit the jobs to a central scheduler, which runs them on any available machines it can find.  The job scheduler we will use is Condor, which was developed at the University of Wisconsin-Madison. Condor is a very general "grid computing" utility, but we will use only its most basic features.  Condor is available on the machines in Moore 207 and on halfdome.cis.upenn.edu.
 

Compiling Jobs for Condor

Condor can run normal executables, but it is better to link with the Condor checkpointing library.  That way, Condor can migrate and restart interrupted jobs.  Compiling for Condor is very simple: just preface the compilation/linking line in your makefile with the command condor_compile.  Notice that in my SimpleScalar Makefile there are two rules for each simulator: one makes a "vanilla" executable and the other makes a "condor" executable.

sim-func: sysprobe sim-func.$(OEXT) $(FUNC_OBJS) $(EXOLIB)
	$(CC) -o sim-func $(CFLAGS) sim-func.$(OEXT) $(FUNC_OBJS) $(EXOLIB) $(ZLIB) $(MLIBS)

sim-func.condor: sysprobe sim-func.$(OEXT) $(FUNC_OBJS) $(EXOLIB)
	condor_compile $(CC) -o sim-func.condor $(CFLAGS) sim-func.$(OEXT) $(FUNC_OBJS) $(EXOLIB) $(ZLIB) $(MLIBS)
 

Creating a Condor Job Script

Now that you have a Condor executable, you need a Condor job script. Making a job script is very easy: all you need to do is create a file in which you define some variables.  The basic variables are:

INITIALDIR: the directory in which you want the job to execute
EXECUTABLE: the program you want to run (one of the simulators)
ARGUMENTS: the arguments to the program
INPUT: a file which will be used as the program's standard input
OUTPUT: a file which will be used as the program's standard output
ERROR: a file which will be used as the program's standard error

An important keyword in a Condor job script is QUEUE. Whenever condor_submit encounters the word QUEUE, it evaluates the variables as currently defined and creates a Condor job to that specification.  That's handy, because you can issue QUEUE repeatedly while changing only the variables you want. You can also define your own variables and reference them as $(NAME).  Here is the file jobs.condor, which is in the directory /home8/a/amir/cis501/simplescalar/condor_example/:

## MYROOT is my own variable
## PROG is my own variable, too
MYROOT=/home8/a/amir/cis501/simplescalar

INITIALDIR=$(MYROOT)/condor_example
EXECUTABLE=$(MYROOT)/simulators/sim-func.condor
ARGUMENTS=-insn:limit 10000000 $(MYROOT)/traces/specint2000/test/$(PROG).eio
INPUT=
OUTPUT=
ERROR=$(PROG).ssout

PROG=gcc
QUEUE

PROG=eon.kajiya
QUEUE

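This file specifies two Condor jobs: one runs gcc, the other eon.kajiya.  Because $(PROG) is expanded at each QUEUE, setting PROG just before QUEUE is enough to select the benchmark.  Notice that, because we are using eio traces, the simulations have neither input nor output, so INPUT and OUTPUT are left empty.  The SimpleScalar output (i.e., the statistics) is written to standard error.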
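
With more than a handful of benchmarks, writing the PROG/QUEUE pairs by hand gets tedious. The following is only a sketch of one way to generate such a script from a shell loop. It reuses the variables from jobs.condor above; the output file name jobs.generated.condor and the shell variable names are made up for illustration, and the benchmark list should be replaced with the names of the .eio traces you actually have.

```shell
#!/bin/sh
# Sketch: generate a Condor job script with one QUEUE per benchmark.
# ASSUMPTIONS: the output file name and the benchmark list below are
# illustrative; adjust them to your own directory and traces.

out=jobs.generated.condor
benchmarks="gcc eon.kajiya"

# Write the part that is the same for every job. The quoted 'EOF'
# delimiter keeps $(MYROOT) and $(PROG) literal, so Condor (not the
# shell) expands them when the script is submitted.
cat > "$out" <<'EOF'
MYROOT=/home8/a/amir/cis501/simplescalar

INITIALDIR=$(MYROOT)/condor_example
EXECUTABLE=$(MYROOT)/simulators/sim-func.condor
ARGUMENTS=-insn:limit 10000000 $(MYROOT)/traces/specint2000/test/$(PROG).eio
INPUT=
OUTPUT=
ERROR=$(PROG).ssout
EOF

# Append one PROG/QUEUE pair per benchmark.
for prog in $benchmarks; do
    printf '\nPROG=%s\nQUEUE\n' "$prog" >> "$out"
done

echo "wrote $out"
```

The generated file can then be passed to condor_submit like any hand-written job script.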
 

Managing Condor Jobs

Once you have your script, you are ready to go. Here are some useful commands.

condor_submit <script_name>

This submits your jobs to Condor.  All jobs submitted by a single script are grouped into a new "cluster", which is identified by a number.  You should remember that number.

bash-2.05$ condor_submit jobs.condor
Submitting job(s)..
2 job(s) submitted to cluster 12.

Condor will send you email when your jobs are done.  If you want to check on your jobs in the meantime, use:

condor_q

For example:

bash-2.05$ condor_q

-- Submitter: canfield.cis.upenn.edu : <158.130.68.19:55913> : canfield.cis.upenn.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  12.0   amir            9/15 10:49   0+00:00:22 R  0   3.8  sim-func.condor -i
  12.1   amir            9/15 10:49   0+00:00:00 I  0   3.8  sim-func.condor -i

2 jobs; 1 idle, 1 running, 0 held

This tells you that you have two jobs in the queue, 12.0 and 12.1. Job 12.0 has been running for 22 seconds. Job 12.1 is idle; it will start running when Condor finds a machine for it.
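
If you have many jobs queued, you may only want a count of how many are in a given state. Here is a rough sketch of a shell helper that does this by reading condor_q-style text on standard input. The function name count_jobs_in_state is hypothetical, and the sketch assumes the column layout shown in the sample output above (ST is the sixth whitespace-separated field); if your condor_q prints different columns, adjust the field number.

```shell
#!/bin/sh
# Sketch: count jobs in a given state (the ST column) from
# condor_q-style output read on standard input.
# ASSUMPTION: columns are laid out as in the sample output on this
# page, i.e. ID, OWNER, date, time, RUN_TIME, ST, ...
count_jobs_in_state() {
    awk -v st="$1" '
        # Job lines begin with an ID of the form cluster.job, e.g. 12.0
        $1 ~ /^[0-9]+\.[0-9]+$/ && $6 == st { n++ }
        END { print n + 0 }
    '
}

# Demonstration on the sample output from this page (no Condor needed).
sample=' ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  12.0   amir            9/15 10:49   0+00:00:22 R  0   3.8  sim-func.condor -i
  12.1   amir            9/15 10:49   0+00:00:00 I  0   3.8  sim-func.condor -i

2 jobs; 1 idle, 1 running, 0 held'

running=$(printf '%s\n' "$sample" | count_jobs_in_state R)
idle=$(printf '%s\n' "$sample" | count_jobs_in_state I)
echo "running=$running idle=$idle"
```

On a machine with Condor installed, the same function could be fed live data, for example: condor_q | count_jobs_in_state I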

If you want to remove a job, use:

condor_rm cluster_number[.job_number]

You can remove individual jobs or entire clusters of jobs.

bash-2.05$ condor_rm 12
Cluster 12 has been marked for removal.

That's about it. If you have any questions, ask the TAs or look on the Condor homepage for documentation.