Condor Manual

Carbon Research Group edited this page Apr 11, 2014 · 6 revisions

This page contains the instructions for running Graphite with the Condor scheduler. The instructions are directly applicable only to MIT CSAIL users. External users should modify the file tools/condor_config.py to suit their Condor environment.

Individual Simulations

Start Simulation

First, set the machine pool to be used for Condor in tools/condor_config.py.

There are a total of 10 fosnode machines and 22 draco machines. To use the fosnode machines, set machine_pool to fos. To use the draco machines, set machine_pool to draco.
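As a sanity check before submitting jobs, the pool choice can be validated with a few lines of Python. This is only a sketch: the names MACHINE_POOLS and pool_size are hypothetical and are not taken from tools/condor_config.py; only the pool names and machine counts come from the text above.

```python
# Hypothetical helper, not part of tools/condor_config.py.
# Pool names and machine counts are the ones quoted on this page.
MACHINE_POOLS = {"fos": 10, "draco": 22}


def pool_size(machine_pool):
    """Return the number of machines in the pool, or raise on an unknown name."""
    if machine_pool not in MACHINE_POOLS:
        valid = ", ".join(sorted(MACHINE_POOLS))
        raise ValueError(f"unknown machine_pool {machine_pool!r}; expected one of: {valid}")
    return MACHINE_POOLS[machine_pool]
```

Calling pool_size("draco") gives the number of machines a batch could occupy at most, and a typo in the pool name fails early instead of at submission time.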

Next, set the Makefile variable SCHEDULER to condor to use the Condor scheduler. For example,

$ make radix_bench_test SCHEDULER=condor

will run the benchmark radix using the Condor scheduler.

Note: The default is the basic scheduler, so if the SCHEDULER Makefile variable is not explicitly specified on the command line, the spawning script will use the basic Graphite scheduler. The basic scheduler spawns the simulation on the machines specified in the [process_map] section of the configuration file.

The spawning script waits until the simulation passes or fails. This wait includes the time Condor takes to schedule and run the simulation on one of the available machines. Once the simulation completes, its output is printed to the screen.

Kill Simulation

If you want to kill the simulation before it ends, press Ctrl+C on the keyboard. This action automatically removes the simulation from the Condor queue.

Output Files

In addition to the regular output files produced by Graphite (see Simulation Outputs), a simulation run using the Condor scheduler also produces the following three files:

  • condor_job.output: The output produced by the Condor job (this includes the output produced by the application as well as any output produced by Graphite during the running of the application).
  • condor_job.submit: The .submit script that was passed to the Condor scheduler (this includes the requirements and configuration parameters of the Condor job).
  • condor_job.sh: The script to be executed by the Condor job.
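A quick way to confirm that a Condor run left all three files behind is sketched below. The function name missing_condor_outputs is hypothetical, and the sketch assumes the three files sit directly in the directory you pass in; this page does not say where they are written, so adjust the path to your setup.

```python
from pathlib import Path

# File names taken from the list above.
CONDOR_OUTPUT_FILES = ("condor_job.output", "condor_job.submit", "condor_job.sh")


def missing_condor_outputs(results_dir):
    """Return the Condor-specific files that are absent from results_dir.

    Assumption: the files are written directly into results_dir.
    """
    results = Path(results_dir)
    return [name for name in CONDOR_OUTPUT_FILES if not (results / name).is_file()]
```

An empty list means all three files are present. A non-empty list names the files to look for elsewhere or indicates that the job never reached the corresponding stage.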

Batch Simulations

Start Simulations

  1. Use the file tools/run_tests.py as a template for writing scripts to run automated simulations. The file is well-documented and easy to understand. Some examples of parameters used in the automated simulations script are:
     • Scheduler to use (condor or basic)
     • Directory to place results in
     • Configuration file name
     • Benchmarks to run
  2. Select either fos or draco as the machine pool to use in the Condor configuration file (tools/condor_config.py).
  3. Run the simulations using
$ python tools/run_tests.py
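For orientation, the idea behind a batch script can be sketched in a few lines of Python. This is not how tools/run_tests.py is implemented; make_command and run_batch are hypothetical names, and the target pattern <benchmark>_bench_test is inferred from the single radix_bench_test example on this page. The sketch also runs benchmarks one after another, which the real script may not do.

```python
import subprocess


def make_command(benchmark, scheduler="basic"):
    """Build the make invocation for one benchmark test.

    Assumption: every benchmark has a target named <benchmark>_bench_test,
    as radix does in the example above.
    """
    if scheduler not in ("basic", "condor"):
        raise ValueError("scheduler must be 'basic' or 'condor'")
    return ["make", f"{benchmark}_bench_test", f"SCHEDULER={scheduler}"]


def run_batch(benchmarks, scheduler="condor", runner=subprocess.call):
    """Run each benchmark in turn and map its name to the make exit code."""
    return {name: runner(make_command(name, scheduler)) for name in benchmarks}
```

For example, run_batch(["radix"]) invokes make radix_bench_test SCHEDULER=condor from the current directory and records its exit code. The runner parameter exists only so the command construction can be tested without launching make.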

Kill Simulations

To kill simulations, press Ctrl+C on the keyboard. This automatically removes all the queued simulations from the Condor queue.


Limitations

  1. The Graphite directory should be on the NFS file system. Condor does not play well with AFS, and no attempt has been made to copy the files needed for a simulation over into the NFS space.
  2. Only simulations with a single process can be run with Condor. To run multi-process/multi-machine simulations, use the basic scheduler.
  3. A batch simulation can use either the fos or the draco machine pool, but not both. We are working on a Graphite fix that will allow both pools to be used simultaneously in one batch simulation. For now, you can start one batch simulation with fos and another with draco.

Condor Cheatlist

If you follow the instructions above, you will normally not need any explicit Condor commands. In case you need to manually inspect or remove jobs on the Condor queue, the following commands are useful.

To check the status of the Condor queue,

$ condor_q -global -submitter ${USER}

To check which machines the simulations submitted by ${USER} are currently occupying,

$ condor_q -global -submitter ${USER} -long | grep RemoteHost
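If you want the host names in a script rather than on the terminal, the same filtering can be done in Python on the text printed by the -long form of the command. The sketch below assumes the usual attribute layout, one `Name = "value"` pair per line; remote_hosts is a hypothetical helper name, and the slot prefix handling assumes values of the form slot@host.

```python
import re

# Matches lines such as: RemoteHost = "slot1@some.host.name"
# Anchored at the start, so attributes like LastRemoteHost are not picked up
# (unlike a plain grep for RemoteHost).
_REMOTE_HOST = re.compile(r'^RemoteHost\s*=\s*"([^"]*)"\s*$')


def remote_hosts(condor_q_long_output):
    """Extract host names from the text printed by condor_q ... -long."""
    hosts = []
    for line in condor_q_long_output.splitlines():
        match = _REMOTE_HOST.match(line.strip())
        if match:
            # Drop a leading "slotN@" prefix if one is present.
            hosts.append(match.group(1).split("@", 1)[-1])
    return hosts
```

The input can be captured with subprocess.check_output using the command shown above, decoded to text, and passed to remote_hosts.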

To remove all Condor jobs submitted by ${USER},

$ condor_rm ${USER}