Parallel Processing
In parallel processing, a process is split into parts that can be executed simultaneously on different processors. Common approaches to organizing and running parallel jobs include:
- Multi-process program: tasks are split up and run simultaneously on multiple processors with different input.
- Multi-threaded program: written with OpenMP or pthreads.
- Embarrassingly parallel: many instances of a single-threaded program (aka job array).
- Master/slave: a master program orchestrates multiple slave programs.
Note: Jukka Suomela from Aalto University has produced excellent materials and courses on parallel processing.
Why would you do things in parallel, besides the fact that it's fun?
Mostly because when you have more processes working on a project, it will be completed faster. We'll introduce some analogies now, so skip this chapter if you don't like them.
You can have one well-paid labourer build a pyramid, but it takes quite some time, unless the pyramid is very small. For a grander design, you'd probably need a few more well-paid labourers. Ten thousand ought to do it, so you hand each of them a set of instructions to build a pyramid. All goes well, everyone works in silence, and shortly you have ten thousand little pyramids.
Well, your well-paid labourers are not exactly the smartest bunch on the planet, and nobody told them to build a single pyramid.
You think that the workers should probably be allowed to talk after all.
What are MPI and Slurm?
The Message Passing Interface (MPI) is a standardized library for exchanging messages between multiple parallel processes running on multiple computers with distributed memory.
It is important to know whether your application scales well: that is, whether adding more processing resources reduces the run time, or lets you process proportionally larger inputs without an increase in processing time.
Slurm, a cluster management and parallel job scheduling system, facilitates resource allocation for running parallel processes. Key concepts include:
- Task: the total number of instances to be executed. In Slurm syntax: --ntasks or -n.
- CPU: the number of cores to use per task. In Slurm syntax: --cpus-per-task or -c.
- Node: the total number of computational units (a unit may include multiple CPUs that share the same memory space). In Slurm syntax: --nodes or -N.
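To see how these map onto a batch script, here is a minimal sketch (the values are illustrative only, and my_program is a hypothetical executable):

#SBATCH -n 4    ## 4 tasks (instances) in total
#SBATCH -c 2    ## 2 CPU cores per task
#SBATCH -N 2    ## spread the tasks over 2 nodes

srun ./my_program    ## 4 x 2 = 8 cores in total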
From here on, the terms CPU and core are used interchangeably.
When a program is run in parallel, multiple copies of the program (instances) are created; we refer to each such copy as a parallel process.
MPI Implementation Considerations
Turso has several different MPI libraries available for use. Generally, OpenMPI is the most universal, while Intel MPI and MVAPICH provide better overall performance. Which one to use depends on your code and personal preference. Note that Slurm handles different MPI implementations differently.
OpenMPI 2.x does not require specific options for Infiniband, since it is used by default. Other MPI implementations, however, do require some specific settings. See below:
Intel MPI requires users to point explicitly to the PMI library with
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
For details of Slurm + impi see https://software.intel.com/en-us/articles/how-to-use-slurm-pmi-with-the-intel-mpi-library-for-linux
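Put together, a minimal Intel MPI launch might look like the following sketch (foobar.mpi is a hypothetical executable):

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun -n 4 ./foobar.mpi    ## hypothetical Intel MPI binary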
OpenMPI 3.x or later requires users to explicitly set pmix for srun:
--mpi=pmix
MVAPICH currently requires users to explicitly set pmi2 when using srun:
--mpi=pmi2
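In both cases the option is passed directly to srun, for example (foobar.mpi is a hypothetical executable):

srun --mpi=pmix -n 4 ./foobar.mpi    ## OpenMPI 3.x or later
srun --mpi=pmi2 -n 4 ./foobar.mpi    ## MVAPICH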
Message Passing with OpenMPI
You can think of the above this way: you decide that you want 2 instances (tasks) of a command to be executed, and then allocate each instance (task) one CPU core to run on. This makes a message passing MPI job (parameters: -n 2, -c 1). Each task (instance) of the command could run on the same node or on different nodes; tasks (instances) on separate nodes communicate with each other over the interconnect network. An example follows.
Note! If you are using OpenMPI 4.x or newer, Infiniband is no longer assumed as the interconnect. You will need to enable it explicitly with the --mca parameter, e.g.
--mca btl_openib_allow_ib 1
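Note that --mca is an option of mpirun. When launching with srun, the same MCA parameter can instead be passed through OpenMPI's environment variable mechanism. A sketch of both styles (foobar.mpi is a hypothetical executable):

mpirun --mca btl_openib_allow_ib 1 ./foobar.mpi    ## mpirun style

export OMPI_MCA_btl_openib_allow_ib=1              ## srun style, via the
srun --mpi=pmix -n 4 ./foobar.mpi                  ## equivalent environment variable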
You can find an example of an MPI program from Wikipedia: https://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program
On the Turso login node, load the default MPI module and compile the source code with:
module load OpenMPI    <------ case sensitive
mpicc wiki_mpi_example.c -o foobar.mpi
To find the exact version of the default OpenMPI, look for the (D) symbol in the listing of related modules:
----------------------- /appl/easybuild/modulefiles/all ------------------------
OpenMPI/3.1.1-GCC-7.3.0-2.30 OpenMPI/4.0.5-gcccuda-2020b
OpenMPI/3.1.3-GCC-8.2.0-2.31.1 OpenMPI/4.1.1-GCC-11.2.0
OpenMPI/3.1.4-GCC-8.3.0 OpenMPI/4.1.4-GCC-11.3.0
OpenMPI/4.0.3-GCC-9.3.0 OpenMPI/4.1.5-GCC-12.3.0 (D) <------ default
OpenMPI/4.0.5-GCC-10.2.0 xthi/1.0.0-OpenMPI-4.1.4-GCC-11.3.0 (L)
Where:
D: Default Module
L: Module is loaded
Example of a job requesting four tasks with one core for each task:
#!/bin/bash
#
#SBATCH --job-name=foobar_mpi
#SBATCH --output=mpitest.out
#SBATCH -c 1                  ## CPU cores per task
#SBATCH -n 4                  ## number of tasks
#SBATCH -N 4                  ## number of nodes
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100M    ## memory request

module purge
module load OpenMPI

srun ./foobar.mpi
MPI and Hyperthreaded Cores
If you use nodes with Hyperthreading enabled (i.e. anything but kaXX nodes), you probably do not want to use the HT cores, for performance reasons. To avoid the HT performance impact and the utilization of HT cores on nodes where HT is enabled, you can use the following:
#SBATCH --hint=nomultithread
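For instance, added to the MPI batch script above, the relevant lines would look like this sketch:

#SBATCH -n 4
#SBATCH -c 1
#SBATCH --hint=nomultithread    ## place tasks on physical cores only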
MPI and Network Filesystem Performance Warning
With certain versions of MPI, you may see a warning message like the following concerning performance and network filesystems/Lustre:
WARNING: Open MPI will create a shared memory backing file in a
directory that appears to be mounted on a network filesystem.
Creating the shared memory backup file on a network file system, such
as NFS or Lustre is not recommended -- it may cause excessive network
traffic to your file servers and/or cause shared memory traffic in
Open MPI to be much slower than expected.
...
In reality, we have found this warning to be somewhat overcautious: if you use $WRKDIR, the performance drop, if any, is negligible unless the filesystem is exceedingly busy.
MPI Performance Options
The following option can be given to srun or sbatch to inform Slurm that multiple nodes are used and that job placement is logically contiguous:
#SBATCH --contiguous
srun --contiguous <rest of the arguments>
Shared Memory OpenMP
If, however, you have a command to run as a single instance (task) with 4 CPUs, using local memory space for communication, it would be a shared memory OpenMP job (parameters: -n 1, -c 4). All requested CPUs have to be from the same node, and the maximum size of the job equals the number of CPUs sharing the same memory space, e.g. if a node has 24 cores, that is the maximum size for the job.
First, you would need to compile the program:
gcc -fopenmp wiki_omp_example.c -o foobar.omp
An OpenMP example requesting one task and 4 CPUs for the task:
#!/bin/bash
#
#SBATCH --job-name=foobar_omp
#SBATCH -o foobar-omp.out
#
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100M

export OMP_PROC_BIND=TRUE
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./foobar.omp
MPI - Consumable resources
Below are some of the most common options to specify when submitting an MPI job.
Job Wall Time limit:
#SBATCH -t <Wall Time limit>
Number of tasks in the MPI job:
#SBATCH -n <number of tasks>
Number of CPU cores per task:
#SBATCH -c <Number of CPU cores per task>
Job memory limit:
#SBATCH --mem-per-cpu=<MB>
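Combined, a typical MPI resource request might look like the following sketch (all values are illustrative, and foobar.mpi is the executable compiled earlier):

#!/bin/bash
#SBATCH -t 1:00:00            ## wall time limit
#SBATCH -n 8                  ## number of tasks
#SBATCH -c 1                  ## CPU cores per task
#SBATCH --mem-per-cpu=500M    ## memory per core

srun ./foobar.mpi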
Job Arrays - Embarrassingly Parallel Jobs
Slurm supports job arrays, where multiple jobs can be executed with identical parameters. Each job in the array is considered for execution separately, like any other ordinary batch job, and is therefore also subject to the fairness calculations. Ukko2 supports a practically unlimited array size. However, please pay attention to the special nature of arrays when constructing your own.
If you are using array jobs and the job requirements change during the array run, it is advisable to divide the array into chunks with identical resource requirements. Oversubscribing prevents other jobs from using the reserved resources.
There is no practical limit on the number of array jobs that can be submitted or executed at any one time; however, the system has a default limit of 10001 tasks for a single array.
Example for array job script:
#!/bin/bash
#
#SBATCH --job-name=emb_arr
#SBATCH --output=res_emb_arr.txt
#SBATCH -c 1
#SBATCH -n 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=1-8

srun ./example.emb $SLURM_ARRAY_TASK_ID
While arrays are an extremely efficient way to submit large quantities of jobs, it is not advisable to submit an array with entirely different task requirements and base the resource request on the most demanding array task (e.g. 2000 one-core tasks using 15 MB of memory and 1000 one-core tasks using 250 GB of memory, with the limit set to the latter for the whole 3000-task array). Instead, it is far more efficient to create separate 2000- and 1000-task arrays with appropriate resource requests, as sketched below.
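A minimal sketch of this chunked approach, assuming two hypothetical batch scripts that each set their own --array range and --mem-per-cpu request:

sbatch small_tasks.sh    ## e.g. --array=1-2000 with --mem-per-cpu=15M
sbatch large_tasks.sh    ## e.g. --array=1-1000 with --mem-per-cpu=250G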
If the running time of your program is short, creating a job array will incur a lot of overhead, and you should consider packing your jobs. For this we recommend the somewhat obscurely named option "--exclusive", as in the example below. To submit 1000 jobs in packs of 8 (allocating one task for a job, and one core for each task):
#!/bin/bash
#
#SBATCH --ntasks=8

for i in {1..1000}
do
    srun -c 1 -n 1 --exclusive ./example.emb $i &
done
wait
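Here --exclusive makes each srun job step wait until one of the eight allocated task slots is free, so at most eight copies of example.emb run at any one time; the final wait keeps the batch job alive until all background steps have finished.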
Array Job e-mail notifications
Additionally, if array jobs are used, you may wish to receive a single notification once the whole array is done (the default behaviour), or, by setting the ARRAY_TASKS option, receive a mail notification for every task of the array. The latter may be useful for debugging, but may be inconvenient if the array has thousands of tasks.
#SBATCH --mail-type=ARRAY_TASKS,<other options>
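For example, to be notified about every array task (the address is a placeholder; --mail-user is the standard Slurm option for setting the recipient):

#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=ARRAY_TASKS,END,FAIL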
Master/Slave
This is typically used in a producer/consumer setup, where one program creates computing tasks for the other programs to perform (for example, a Hadoop cluster).
Below we create a reservation for a master/slave type job, requesting four tasks and one core for each task. The example assumes that task placement is not relevant: tasks can end up on the same node or on different nodes. See below for more details about placement:
#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=foobar.out
#SBATCH -n 4
#SBATCH -c 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog master.cfg
File "master.cfg" would then state the intended configuration, for example:
0    echo Master
1-3  echo Slave %t
This instructs Slurm to create four tasks: one running "Master" and the other three running "Slave".
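srun substitutes %t with the task number, so the output of the example above would be along these lines (the ordering of lines may vary):

Master
Slave 1
Slave 2
Slave 3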
Task placement to multiple nodes
At times, there may be a need to place each task on a separate node (for performance, scalability testing, I/O, or other reasons). Example of submitting a four-task job, with one core for each task and each task placed on a separate node:
#!/bin/bash
#SBATCH --tasks-per-node=1
#SBATCH -n 4
#SBATCH -c 1
#SBATCH -o out.put

srun ./foobar
Advanced Parallel Processing
There are several special MPI hybrid cases that may be very useful in specific circumstances. Below are a few examples showing how they can be done in Slurm.
MPI and multithreading
First, it is possible to mix MPI and OpenMP multithreading in a single job (the job foobar below assumes a reservation of 8 tasks, with 4 cores each):
#!/bin/bash
#
#SBATCH -n 8
#SBATCH -c 4

module load OpenMPI

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./foobar
MPI and Array Jobs
Besides mixing MPI and multithreading, you can also mix in array jobs. Below is a brief example:
#!/bin/bash
#
#SBATCH --array=1-10
#SBATCH -n 8
#SBATCH -c 4

module load OpenMPI

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./foobar $SLURM_ARRAY_TASK_ID
GPU Jobs
An example GPU batch script could look like the following:
#!/bin/bash
#SBATCH --job-name=GPUbar
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH -p gpu
#SBATCH --gres=gpu:1

module load <application/version>

executable <input.dat>
If you need a GPU, the queue needs to be specified with the -p parameter, as well as the requested GPU resource:
#SBATCH -p gpu
#SBATCH --gres=gpu:1
Note that it is always a good idea to request at least as many CPU cores as GPUs.
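For example, a request for two GPUs with a matching number of cores might look like this sketch:

#SBATCH -p gpu
#SBATCH --gres=gpu:2
#SBATCH -c 2    ## at least one CPU core per GPU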
Advanced GPU optimization
Resource affinity may have a significant impact on GPU performance. For this purpose, "--accel-bind" was added to Slurm. The following options are supported:
- g → Bind each task to GPUs which are closest to the allocated CPUs.
- m → Bind each task to MICs which are closest to the allocated CPUs.
- n → Bind each task to NICs which are closest to the allocated CPUs.
- v → Verbose mode. Log how tasks are bound to GPU and NIC devices.
Here is an example srun command:
srun --gres=gpu:2 --ntasks-per-socket=2 --accel-bind=g -n 4 ./foobar-MPI