Running jobs (TORQUE/Maui)
IMPORTANT NOTES FOR TORQUE:
NOTE: PBS/TORQUE cannot deliver batch output files back to your submission directory when your username is 16 characters or longer. If this applies to you, please add:
#PBS -k oe
to your batch script, and your batch output files will be delivered to your home directory instead.
NOTE: The maximum walltime for Bora and Hima nodes is 72 hours. Currently there is no warning if you submit a job to bora or hima with a walltime greater than 72 hours; the job will simply sit in the queue. This will be resolved once bora and hima are switched over to the Slurm batch system.
NOTE: For the batch system to deliver output back to you correctly, your start-up login shell must not produce any output. If you change your .cshrc or .cshrc.XXX and it prints anything to the screen when you log in, you will not receive your batch output files.
NOTE: Spaces in Linux file or directory names will break the delivery of batch output. Never use spaces in file or directory names on Linux.
To ensure that users' calculations do not interfere with each other and that computational resources are allocated fairly and efficiently, William & Mary HPC systems employ the TORQUE resource manager in conjunction with the Maui cluster scheduler. With few exceptions, any computation on W&M HPC systems must be submitted and run via TORQUE/Maui -- collectively and more generically the "job scheduler" or "batch system" -- as a job. To schedule and assign resources to your job, the system needs to know what resources your job requires, so you must answer the following questions and provide those answers to TORQUE's qsub command.
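Resource requests can be given either as options on the qsub command line or, more commonly, as #PBS directives at the top of your batch script. A minimal sketch, assuming your script is called myjob.pbs (the script name and resource values are placeholders):

qsub -l nodes=1:ppn=1,walltime=01:00:00 myjob.pbs

You can check the status of your submitted jobs with qstat -u your_username.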
What type of computer?
W&M HPC systems are composed of several different types of computers, each with different available hardware and software. If your job can run on any computer ("node") in the cluster, that's excellent (your job is easier to fit in and will start sooner)! However, if it needs a specific type of computer, you must select which ones are acceptable using node properties/features.
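Node properties are appended to the node request with a colon. For example, the following directive asks for one node carrying the bora property (the property name is illustrative; consult the cluster documentation for the properties valid on each subcluster):

#PBS -l nodes=1:bora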
How many of them?
Merely allocating extra nodes will not increase performance, but if you know that your application can use multiple nodes simultaneously (distributed memory parallelism, e.g. with MPI), you can request a particular number of nodes be allocated to your job.
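For example, an MPI job that can use four whole nodes might request (the count is illustrative):

#PBS -l nodes=4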
How many processors per node?
Every node in the cluster has more than one processor. Again, merely allocating extra processors will not increase performance, but if you know that your application can use multiple processors simultaneously (shared-memory parallelism, e.g. with OpenMP or threads), you can request a particular number of processors per node.
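For example, an OpenMP or threaded job running on one node with eight processors might request (the count is illustrative and must not exceed the number of cores actually present on the node type you request):

#PBS -l nodes=1:ppn=8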
How long?
You must give the job scheduler an upper bound on how long your job will run, called walltime ("wall" as in the real time you'd see on a clock on the wall, to distinguish it from "CPU" time spent actively using a processor). The maximum is usually either 180 hours (vortex, hurricane, whirlwind, wind, ice) or 72 hours (cyclops, femto, bora, hima). Computations that cannot complete within this time limit should be broken up into multiple jobs by incorporating a checkpoint/restart capability (which is advisable regardless, in case of equipment failures).
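The walltime limit is specified as hours:minutes:seconds. For example, a 12-hour request (the value is illustrative) would be:

#PBS -l walltime=12:00:00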
Because actual runtimes are not known until a job completes, the job scheduler can only schedule jobs based on their declared walltime limits. Both turnaround time for individual jobs and overall system utilization improve when walltime limits are reasonably accurate. Excessive time limits will lower your job's priority and reduce opportunities to fit your job into holes in the schedule, meaning that you may have to wait longer for your results when the system is busy. On the other hand, be sure that walltime limits provide enough cushion that your jobs will not be terminated prematurely if they take a little longer than expected.
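Putting these pieces together, a complete batch script might look like the sketch below. Everything here is a placeholder -- the shell, job name, node property, processor count, walltime, and program name should all be replaced with values appropriate to your own job and cluster:

#!/bin/tcsh
# Name the job (placeholder)
#PBS -N myjob
# One node with 8 processors and the bora property (illustrative)
#PBS -l nodes=1:ppn=8:bora
# Upper bound on run time; must not exceed the limit for the nodes requested
#PBS -l walltime=24:00:00
# Merge standard output and standard error into a single output file
#PBS -j oe

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program

Save this as, for example, myjob.pbs and submit it with qsub myjob.pbs.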