Advanced Job Examples#
These are provided to give you a starting point, and give you ideas on what can be done.
Submitting Mixed Mode Jobs#
The ‘mixed mode’ (MPI+OpenMP) programming model typically involves MPI processes running across nodes and OpenMP threads upon each node with the total number of processes (MPI*OpenMP) equalling the number of physical processor cores.
Your code will need to call MPI_Init and make use of OpenMP directives. You will compile your code using an MPI wrapper and enabling OpenMP support, for example:
$ mpif90 -openmp example.f90 -o mixed.exe
You will need to determine
ppn, the number of MPI processes per node, and
tpp, the number of OpenMP threads per MPI process.
Additionally, you can either ask for a given number of nodes nodes or for the
total number of MPI processes
np. Note that
ppn is related to
Your submission script would then need to contain:
#$ -V #$ -l h_rt=01:00:00 #$ -l nodes=<nodes>, ppn=<ppn>, tpp=<tpp> mpirun ./a.out
#$ -V #$ -l h_rt=01:00:00 #$ -l np=<np>, ppn=<ppn>, tpp=<tpp> mpirun ./a.out
There are 24 cores per node on ARC3, so you would typically ensure
There are 40 cores per node on ARC4, so you would typically ensure
Using ARC4, To run an MPI+OpenMP executable mixed.exe with 80 MPI processes each launching 4 OpenMP threads, the following submission script would be needed:
#$ -V #$ -cwd #$ -l h_rt=01:00:00 #$ -l np=80, ppn=10, tpp=4 mpirun ./mixed.exe
This will allocate 8 nodes (=8*40=320 cores). Each node will have 10 MPI processes, each of which will have 4 OpenMP threads (so 10*4=40 processes per node in total, and 8*140=320 (=80MPI*4OpenMP) processes in total.
Alternatively, the same effect can be achieved by:
#$ -V #$ -cwd #$ -l h_rt=01:00:00 #$ -l nodes=8, ppn=10, tpp=4 mpirun ./mixed.exe
Note that the
OMP_NUM_THREADS environment variable is automatically set by
the batch system and so you do not need to set this in your environment.
The SGE scheduler allows you to submit a job but then hold them in the queue until certain job dependencies are met, ie. it will hold the job until a previous job (or number of jobs have completed).
To do this, submit the first job as normal:
$ qsub job1.sh
As usual, the scheduler will give you a job ID
If you then want to submit another job that is dependent on the first one
completing, use the
-hold_jid option for qsub:
$ qsub -hold_jid <jobid> job2.sh
Job 2 will remain in the queue until job 1 has completed.
Alternatively, if you have given the job a name (using the
-N option), then
you can put a hold on subsequent jobs using the name instead:
$ qsub -hold_jid <jobname> job2.sh
If you give several jobs the same name, then the job will be held until all the
named jobs have been completed.
See Batch jobs for more details on
SGE supports jobs that exit telling the scheduler that they’re not finished, and so should be rescheduled. This is very useful when you have a long running job that ultimately will take longer than 48 hours to complete, but you can easily save the current progress as you go along. Here is a dummy example to hopefully give you ideas on how this could be used:
#!/bin/bash #$ -l h_rt=0:10:0 # Which subtask we're currently processing. This script assumes we're only # completing one subtask per SGE job, but there's no reason you couldn't # process more. counter=1 # Read the restartfile, if present, to continue from where we got up to if [ -f restartfile ];then echo loading restart file . restartfile fi # If we detected we've been restarted, we may need to run differently, so # here's how you could detect it. if [[ $RESTARTED == 1 ]];then echo Auto-submitted job continuing fi # Complete a unit of work echo Doing work - counter: $counter # Increment the counter once we've done a unit of work let counter++ # Write the next task number to the restartfile echo counter=$counter > restartfile # If we've not yet completed all our tasks, tell the scheduler that we'll need # to run again if [ $counter -le 10 ];then exit 99 fi # We've completed all tasks, so don't restart. echo Complete!
Non integer task array sequences#
This shows how you can run a task array with two sets of parameters where one of them is non-integer.
# Run a total of 72 tasks #$ -t 1-72 # Put the logs into a separate directory #$ -o logs #$ -e logs #$ -cwd -V #$ -l h_rt=0:10:0 # Define array dimensions, lining up with the 72 tasks we're running rows=18 columns=4 # Calculate row and column indexes row_index=$(( ($SGE_TASK_ID - 1) / $columns + 1 )) col_index=$(( ($SGE_TASK_ID - 1) % $columns + 1 )) # Map the integer value into the floating point value we actually wanted calculated_float=$(printf %.4f\\n "$(($col_index))e-4") # Run the experiment passing those values in ./run_code $calculated_float $row_index