Page tree
Skip to end of metadata
Go to start of metadata

Parallel processing is a mode of operation in which a process is split into parts, and which are executed simultaneously on different processors attached to the same computer. There are several different methods to go around running parallel jobs and different methods have different benefits.

Brief descriptions of various parallel jobs:

  • Multi-process program where tasks are split up and run simultaneously on multiple processors with different input.
  • Running a multithreaded program with OpenMPI or pthreads.
  • Running several instances of a single-threaded program (embarrassingly parallel, or a job array)
  • Running one master program controlling several slave programs (master/slave)

What is MPI?

The message passing interface (MPI is) is a standardized means of exchanging messages between multiple computers running a parallel program across distributed memory.

From the Slurm and job placement perspective, it is important to know if your application scales well.  Understanding of the basic differences of nodes, tasks and cpu's is essential (for all intents and purposes, cpu is indistinguishable from core). In short therefore,

  • Task describes how many instances of your command are executed. In Slurm language, Tasks are referred to as "-–ntasks" or "-n".
  • CPU describes how many cpu's your command can use. In Slurm language, CPU's are referred to as "--cpus-per-tasks", or "-c".
  • Node describes computational unit which may contain many CPU's sharing same memory space. In Slurm language nodes are referred to as "–node" or "-N".

Message Passing OpenMPI

You can think of above this way: You decide that you want to have 2 instances (tasks) of the command to be executed, and then allocate each instance (task) 1 cpu to run on. This would make a Message Passing MPI job (parameters: -n 2, -c 1). Each task (instance) of the command could run in same, or different node. Tasks (instances) in separate nodes would communicate with each other over the interconnect network. Example follows:

You can find example MPI program from Wikipedia: https://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program

First, you will need to compile the program:

module load openmpi
mpicc wiki_mpi_example.c -o foobar.mpi

Example of job requesting four tasks with 1 core for each task:

#!/bin/bash
#
#SBATCH --job-name=foobar_mpi
#SBATCH --output=<mpitest.out>
#SBATCH -c 1                     // CPU's
#SBATCH -n 4                     // tasks per CPU
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100        // Memory request per core

module load OpenMPI
srun foobar.mpi
MPI Performance options

Following option can be given to srun or sbatch to inform Slurm that multiple nodes are used and job placement is logically contiguous.

#BATCH --contiguous
srun --contiguous <rest of the arguments>

Shared memory OpenMPI

If however, you have command to run on a single instance (task), with 4 cpu's, and use local memory space for communications, it would be Shared Memory OpenMPI job (parameters: -n 1, -c 4). All requested CPU's have to be from the same node and maximum size for the job equals to the number of CPU's sharing the same memory space. Eg. if a node has 24 cores, that would be the maximum size for the job.

First, you would need to compile the program:

gcc -fopenmp wiki_omp_example.c -o fobar.omp

 OpenMPI example requesting one task and 4 cpus for the task:

#!/bin/bash
#
#SBATCH --job-name=foobar_omp
#SBATCH -o foobar-omp.out
#
#SBATCH -n 1
#SBATCH -c 4
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./foobar.omp

MPI - Consumable resources

Below are some of the most common options to specify when submitting MPI job.

Job Wall Time limit:

#SBATCH -t <Wall Time limit>

Number of Tasks in the MPI job:

#SBATCH -n <number of tasks>

Job CPU count equals to the cores:

#SBATCH -c <Number of CPU's per task>

Job memory limit  (Please Note: Memory request is set per core/CPU):

#SBATCH --men-per-cpu=<MB>

 Job Arrays - Embarrassingly Parallel Jobs

Slurm supports job arrays where multiple jobs can be executed with identical parameters. Each job in the array is considered for execution separately, like any other ordinary batch job and is therefore also subject to the fairness calculations.  Ukko2 supports practically unlimited array size. However, please do pay attention to the special nature of the arrays when constructing your own.

Array Job Usage

If using array jobs, and the job requirements change during array run, it is advisable to divide the array to chunks with identical resource  requirements. Oversubscribing prohibits other jobs from using reserved resources.

Example for array job script:

#!/bin/bash
#
#SBATCH --job-name=emb_arr
#SBATCH --output=res_emb_arr.txt
#SBACTH -c 1
#SBATCH -n 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100
#
#SBATCH --array=1-8

srun ./example.emb $SLURM_ARRAY_TASK_ID

While arrays are extremely efficient way to submit large quantity of jobs, it is not advisable to submit array with entirely different task requirements and base the resource request on the highest array task. (Eg. 2000 1 core tasks using 15Mb of memory and 1000 1 core tasks using 250Gb of memory, and setting the limit to latter for whole 3000 task array. Instead, it is far more efficient to create separate 2000 and 1000 task arrays with appropriate resource requests).


Performance Impact of Small Array Jobs

If the running time of your program is short, creating a job array will incur a lot of overhead and you should consider packing your jobs. For this we can recommend use of somewhat obscurely named option "--exclusive" as in example below. To submit 1000 jobs in packs of 8 (allocating one task for a job, and one core for each task):

#!/bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
   srun -c 1 -n 1 --exclusive ./example.emb $i &
done
wait

Array Job e-mail notifications

Additionally if array jobs are used, you may wish to receive single notification once the whole array is done (default behaviour). Or by setting ARRAY_TASKS -option, receive mail notification for every task of the array. Latter may be useful for debugging but may be inconvenient if array has thousands of tasks.

#SBATCH --mail-type=ARRAY_TASKS,<other options>

Master/Slave

This is typically used in a producer/consumer setup where one program creates computing tasks for the other programs to perform (for example Hadoop cluster).

Creating reservation for Master/Slave type jobs, requesting four tasks and one core for each task. Example assumes that task placement is not relevant, and they can end up on same node, or different nodes. See below more details about placement:

#!/bin/bash
#
#SBATCH --job-name=test_ms
#SBATCH --output=foobar.out
#SBATCH -n 4
#SBATCH -c 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100

srun --multi-prog master.cfg

File "master.cfg" would then state the intended configuration, for example:

0      echo Master
1-3    echo Slave %t

This instructs Slurm to create four tasks. One of them running "Master" and other three "Slave". 

Task placement to multiple nodes

At times, there may be need to place each task to a separate node (performance, scalability testing, I/O, or other reasons). Example to submit a four task job, one core for each task and each task placement to a separate node:

#!/bin/bash
#SBATCH --tasks-per-node=1
#SBATCH - n 4
#SBATCH - c 1
#SBATCH -o out.put
srun foobar

Advanced Parallel Processing

There are several special MPI hybrid cases that may be very useful in specific circumstances. Below a few examples to show how they can be done in Slurm.

MPI and multithreading

First, it is possible to mix MPI and OpenMPI multithreading in a simple job (job called foobar assumes reservation of 8 tasks, 4 cores each):

#!/bin/bash
#
#SBATCH -n8
#SBATCH -c 4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./foobar

MPI and Array Jobs

Besides of mixing MPI and multithreading, you can also mix array jobs, below a brief example:

#!/bin/bash
#
#SBATCH --array=1-10
#SBATCH -n 8
#SBATCH -c 4
module load OpenMPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./fobar $SLURM_ARRAY_TASK_ID

GPU Jobs

Example GPU batch script could look like following:

#!/bin/bash
#SBATCH --job-name=GPUbar
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1000
#SBATCH -p gpu
#SBATCH --gres=gpu:1

module load <application/version>

executable <input.dat>


GPU Usage

If you need GPU the queue needs to be specified with -p parameter:

#SBATCH -p gpu

Note that it is always best to request at least as many CPU cores as GPUs.

Advanced GPU optimization

Resource affinity may have significant impact on the GPU performance. For this purpose "--accel-bind" was added to Slurm. Following options are supported:

  • g → Bind each task to GPUs which are closest to the allocated CPUs.
  • m → Bind each task to MICs which are closest to the allocated CPUs.
  • n → Bind each task to NICs which are closest to the allocated CPUs.
  • v → Verbose mode. Log how tasks are bound to GPU and NIC devices.

Here is an example srun command:

srun --gres=gpu:2 --ntasks-per-socket=2 --accel-bind=g -n 4 ./foobar-MPI 


  • No labels