The new HPC cluster ukko2 is now ready for production. To use the cluster, you need to either
- be a member of CS staff or
- belong to the IDM group grp-cs-ukko2
Please note that it might take up to 2 hours after you have been added to either group before your home directory gets created.
Connections are only allowed from the helsinki.fi domain; using VPN or Eduroam is not sufficient. To access ukko2 from outside the domain (e.g. from home), add the following to your ~/.ssh/config:
ProxyCommand ssh email@example.com -W %h:%p
Note: with OpenSSH 7.3 and later, you can replace the ProxyCommand with a more readable line: ProxyJump melkinpaasi.cs.helsinki.fi
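As a minimal sketch of the ProxyJump variant, the snippet below writes a complete stanza to a scratch file so that it can be reviewed before being copied into ~/.ssh/config. The username `yourusername` and the `User` line are placeholders added for illustration; only the two host names come from this page.

```shell
#!/bin/bash
# Sketch: build an ssh config stanza for ukko2 in a scratch file for review.
# "yourusername" is a placeholder, not a real account name.
config_file="$(mktemp)"

cat > "$config_file" <<'EOF'
Host ukko2.cs.helsinki.fi
    User yourusername
    ProxyJump yourusername@melkinpaasi.cs.helsinki.fi
EOF

# Show the stanza; append it to ~/.ssh/config by hand once it looks right.
cat "$config_file"
```

Writing to a scratch file first avoids accidentally overwriting an existing ~/.ssh/config.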
The login node is meant for batch job management and for the compiler/development environment. Do not execute any jobs on it.
To access the login node:
ssh yourusername@ukko2.cs.helsinki.fi
The batch scheduling system is the most prominent difference between ukko and ukko2. Instead of logging into a compute node and executing jobs there interactively, users now log in to a login node and submit jobs via the batch scheduling system. The batch system improves the usability of the cluster by considering job resource requests and scheduling jobs onto available resources.
Another change is the module system. Software packages, compiler environments etc. are handled by modules. This allows users to load, unload or switch between environments flexibly.
Ukko 1 Nodes
- At this time there is a single Ukko 1 Cubbli Linux node available, see instructions. The node has 4 cores and 32 GB of RAM.
Ukko 2 Nodes
- Single login node ukko2 serves logins, compiler environments and all batch scheduling functions
- 31 regular compute nodes (ukko-02 - ukko-32): 28 cores, 2 threads and 256 GB RAM.
- 2 big memory nodes (ukko2-pekka, ukko2-paavo): 96 cores, 2 threads and 3 TB RAM.
- 2 GPU nodes (ukko2-g01, ukko2-g02): 28 cores, 2 threads, 512 GB RAM and 4 Tesla P100 GPU cards.
I/O, disks and filesystems
The home directory is the same as on the old ukko: all user files are available on the new cluster. Users can also access their ukko files from the department computers using the path /cs/work/home/username.
Jobs are managed with the SLURM batch scheduling system, and the runtime environment is managed with modules, which allow selection of the correct compilers, libraries and tools. Jobs are started from the ukko2 login node (ukko2.cs.helsinki.fi).
There are six production queues and one short queue for testing jobs. Some queues overlap others for better system utilisation:
|Queue||Wall Time limit||Cores||Memory per core||Notes|
|short||24h||1032||8GB - 32GB||-|
|extralong||60 days||28||8GB||Single node|
|bigmem||7 days||192||32GB||Two nodes|
|gpu||7 days||56||18GB||Have to reserve with #SBATCH -p gpu and --gres=gpu:<number of GPUs>|
|test||1h||112||8GB||Have to reserve with #SBATCH -p test|
|cubbli||1 day||4||8GB||Have to reserve with -p cubbli|
Creating a Batch job
The simplest way to submit a job to the system is as a serial job. The following example requests 1 core and 100M of memory for 10 minutes, with placement in the test queue. At the end of the script, the srun command is used to start the program.
Note that the batch script needs to start with a shebang (#!/bin/bash), and the batch parameters have to be set in the script before the actual program.
The sbatch syntax for time may not be obvious. Times are normally given as DD-hh:mm:ss, where DD=days, hh=hours, mm=minutes and ss=seconds. However, you can also use 2-0 to represent 2 days, while 10:00 indicates 10 minutes.
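To make the time syntax concrete, here is a small sketch that converts a time limit string into seconds. The function name `slurm_time_to_seconds` is made up for this illustration and is not part of Slurm. The accepted forms follow the sbatch man page: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds.

```shell
#!/bin/bash
# Convert a Slurm time limit string to seconds (illustrative helper only).
# Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
slurm_time_to_seconds() {
    local spec="$1" days=0 hours=0 minutes=0 seconds=0 rest
    local -a parts

    if [[ "$spec" == *-* ]]; then
        # A dash separates days from the hours[:minutes[:seconds]] part.
        days="${spec%%-*}"
        rest="${spec#*-}"
        IFS=':' read -r -a parts <<< "$rest"
        hours="${parts[0]:-0}"
        minutes="${parts[1]:-0}"
        seconds="${parts[2]:-0}"
    else
        # Without a dash, a single number means minutes.
        IFS=':' read -r -a parts <<< "$spec"
        case "${#parts[@]}" in
            1) minutes="${parts[0]}" ;;
            2) minutes="${parts[0]}"; seconds="${parts[1]}" ;;
            3) hours="${parts[0]}"; minutes="${parts[1]}"; seconds="${parts[2]}" ;;
            *) echo "unrecognised time format: $spec" >&2; return 1 ;;
        esac
    fi

    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

slurm_time_to_seconds 10:00        # 10 minutes -> 600
slurm_time_to_seconds 2-0          # 2 days     -> 172800
slurm_time_to_seconds 1-12:30:00   # 1 day 12 h 30 min -> 131400
```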
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -o result.txt
#SBATCH -p test
#SBATCH -c 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60
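The two layout rules for a batch script (shebang on the first line, all #SBATCH lines before the first command) can be checked mechanically before submitting. The following is only a sketch: `check_batch_script` is a hypothetical helper written for this page, not a Slurm tool, and it checks nothing beyond those two rules.

```shell
#!/bin/bash
# Check two layout rules for a Slurm batch script:
#   1. the first line is a shebang
#   2. every #SBATCH line comes before the first command
check_batch_script() {
    local script="$1" line trimmed first_line seen_command=0

    IFS= read -r first_line < "$script"
    if [[ "$first_line" != '#!'* ]]; then
        echo "error: first line is not a shebang" >&2
        return 1
    fi

    while IFS= read -r line || [[ -n "$line" ]]; do
        # Strip leading whitespace before classifying the line.
        trimmed="${line#"${line%%[![:space:]]*}"}"
        if [[ "$trimmed" == '#SBATCH'* ]]; then
            if (( seen_command )); then
                echo "error: directive after first command: $trimmed" >&2
                return 1
            fi
        elif [[ -n "$trimmed" && "$trimmed" != '#'* ]]; then
            # Neither blank nor a comment: this is the first real command.
            seen_command=1
        fi
    done < "$script"

    echo "ok"
}

# Demonstration on a scratch copy of a short serial job.
demo_job="$(mktemp)"
cat > "$demo_job" <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -p test
#SBATCH -c 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=100
srun hostname
EOF

check_batch_script "$demo_job"
```

The second rule matters because sbatch stops reading #SBATCH directives at the first line that is neither blank nor a comment, so a directive placed after a command is silently ignored.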
The following command submits test.job to the system, and the scheduler takes care of job placement (sbatch accepts additional options on the command line):
sbatch test.job
If you need only some environment variables to be propagated, or none, you can use the export option (note that the default is ALL; see the special case for Cubbli Linux nodes):
--export=<environment variables | ALL | NONE>
Also note that Slurm sets a number of environment variables which can be used for job control. See this page for a full compendium of the available variables.
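As an illustration, the sketch below prints a handful of the variables Slurm sets inside a job, with a fallback text so that the same snippet also runs outside the batch system. The function name `describe_slurm_env` is invented for this example; the variable names are standard Slurm output variables, but which ones are set depends on the options the job was submitted with.

```shell
#!/bin/bash
# Print a few Slurm-provided variables, with a fallback when run outside a job.
describe_slurm_env() {
    echo "job id:        ${SLURM_JOB_ID:-not set}"
    echo "job name:      ${SLURM_JOB_NAME:-not set}"
    echo "node list:     ${SLURM_JOB_NODELIST:-not set}"
    echo "cpus per task: ${SLURM_CPUS_PER_TASK:-not set}"
    echo "submit dir:    ${SLURM_SUBMIT_DIR:-not set}"
}

describe_slurm_env
```

Calling the function near the top of a batch script records in the job output where and how the job actually ran.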
Serial - Consumable resources
Below are some of the most common batch job options for serial jobs. If no values are given, system defaults are used. These values are used to determine the job priority.
Job Wall Time limit:
#SBATCH -t <Wall Time limit>
Job CPU count (equal to the number of cores):
#SBATCH -c <CPU count>
Job memory limit (please note: memory reservation is per core):
#SBATCH --mem-per-cpu=<memory in MB>
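Because memory is reserved per core, the total reservation grows with the core count. A minimal sketch of the arithmetic, with `total_job_memory_mb` as a made-up helper name and the values chosen purely for illustration:

```shell
#!/bin/bash
# Total memory reserved = cores (-c) x memory per core (--mem-per-cpu, in MB).
total_job_memory_mb() {
    local cores="$1" mem_per_cpu_mb="$2"
    echo $(( cores * mem_per_cpu_mb ))
}

# Example request:
#   #SBATCH -c 4
#   #SBATCH --mem-per-cpu=2000
total_job_memory_mb 4 2000   # prints 8000
```

It is worth comparing the result against the memory per core of the target queue before submitting.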
Further job control
Show queue information:
sinfo
If you want to cancel your job:
scancel <jobID>
To check the status of your jobs that are in the queue:
squeue -l -u yourusername
For information about a job that is running:
scontrol show jobid -dd <jobID>
For information about a completed job's efficiency (the output of seff is automatically included at the end of job mail notifications, if notifications are enabled in the batch script):
seff <completed job ID>
Job control summary of less common commands:
|sacct||Displays accounting data for all jobs.|
|scontrol||View SLURM configuration and state.|
|sjstat||Display statistics of jobs under control of SLURM (combines data from sinfo, squeue and scontrol).|
|sprio||Display the priorities of the pending jobs. Jobs with higher priorities are launched first.|
|smap||Graphically view information about SLURM jobs, partitions, and set configuration parameters.|
Comprehensive Slurm Quick Reference & Cheat Sheet which can be printed out.
Serial or Parallel Job?
A serial job is any program that runs on a single machine. In the case of Ukko2, it means a program running on a single core.
A parallel job is composed of multiple processes which run on multiple machines. The simplest case is a job that uses two CPUs and a set of related processes. The processes talk to each other through a medium shared between the CPUs, such as a local memory space.
Testing and Development
Before running jobs on the production queues, resource requirements and, in the case of MPI jobs, scalability should be tested. A 1h test queue is available for this purpose. The "-p" parameter is mandatory for test jobs.
#SBATCH -p test
When submitting a job to the test queue, the following parameters can be set in the batch file. The mail parameters can be used with any job; they are not limited to the test queue:
#!/bin/bash
# Job name to be displayed in the queue
#SBATCH --job-name=test
# Job output at completion
#SBATCH --output=foobar.out
# Request the test partition
#SBATCH -p test
# Request a single core
#SBATCH -c 1
# Define the error file
#SBATCH -e foobar.err
# Send a mail notification at the END of the job
#SBATCH --mail-type=END
# Mail recipient
#SBATCH --mail-user=firstname.lastname@example.org
# Commands to be run
srun hostname
srun sleep 60
Below is an example script for GPU usage, assuming a single GPU (note that for this you need to use --gres=gpu:1), two cores and 100M of memory per core, with the default time limit:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH -o result.txt
#SBATCH -p gpu
#SBATCH -c 2
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=100
srun hostname
srun sleep 60
Running as a daemon
You should not run batch jobs as daemons, for example when deploying Spark.
Setting up batch job e-mail notifications
You can set up e-mail notifications for a batch job. If set, changes in the job status are sent to the specified user. The default is the user who submits the job.
The most commonly chosen mail options are NONE, BEGIN, END, FAIL and ALL. To set the option, the following line is needed in the batch script. Multiple options can be given as a comma-separated list:
#SBATCH --mail-type=<option>
A mail address other than the default may also be specified:
#SBATCH --mail-user=<e-mail address>
There is no need for direct access to a node to start an interactive session. Slurm allows interactive sessions to be started with srun. After you enter the srun command, an interactive job request is sent to the normal queue to wait for resources to become available. Once resources are available, the session starts on a compute node, and you are placed in the directory from which you launched the session. You can then run commands.
The environment you get on the compute node is determined by:
- The environment as set in your session from which you launch the srun command.
- Any extra variables set by Slurm
- Settings from your .bashrc file
Example of starting a 1 core, 1 task interactive session with a bash shell. If values are not set, they are inherited from the system or queue defaults.
srun -c 1 --ntasks-per-node=1 --pty bash
To show the Slurm variables when the session starts:
export | grep SLURM
Interactive node availability
Slurm supports Advance Reservations. You or your group may ask for specific resources for a dedicated time slot. However, because Advance Reservations are not an option that ordinary users can select themselves, a specific request needs to be submitted to email@example.com to enable the reservation. Advance Reservations are disruptive to system operation (jobs need to be drained from the system to free an empty slot at the given time), and the resource requirements have to be justified.
Actual Resource Utilisation
Slurm features a simple utility that provides utilisation details for any job that has completed. Using this utility helps to determine the actual resource needs of the job for future reference:
seff <completed job ID>
Job Accounting Data
Slurm has a powerful accounting feature with a myriad of options to choose from. The sacct commands further below select some of the more useful details and provide easy-to-read, list-formatted output, where the fields are:
JobID: Job identification number
JobName: Job name given in the Slurm batch script
ExitCode: Exit code once job was terminated
NNodes: Node count
NCPUS: CPUs (cores) reserved by the job
MaxRSS: Peak memory usage during job execution; the value is returned once the job has finished. This value can be used to adjust the requested memory value in the batch script accordingly.
Elapsed: Time the batch job was in execution
End: End time of batch job
Lists details when JobID is known:
sacct -j <jobID> -oJobID,JobName,ExitCode,NNodes,NCPUS,MaxRSS,Elapsed,End
Jobs listed by UserID:
sacct -u <userID> -oJobID,JobName,ExitCode,NNodes,NCPUS,MaxRSS,Elapsed,End
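The MaxRSS figure can be turned into a memory request for the next run. The sketch below is one possible way to do this, under assumptions that are not from this page: the helper name `suggest_mem_per_cpu_mb` is invented, MaxRSS is expected as a whole number with a K suffix, the job is a single task, and the 20 % headroom is an arbitrary choice.

```shell
#!/bin/bash
# Suggest a --mem-per-cpu value (in MB) from a MaxRSS reading.
# Assumptions (illustrative, not cluster policy):
#   - MaxRSS is an integer with a K suffix, e.g. "819200K"
#   - the job ran as a single task on the given number of cores
#   - 20 % headroom is added; the figure is arbitrary
suggest_mem_per_cpu_mb() {
    local maxrss="$1" cores="$2" kb padded_kb total_mb

    if [[ ! "$maxrss" =~ ^[0-9]+K$ ]]; then
        echo "expected a value such as 819200K, got: $maxrss" >&2
        return 1
    fi
    kb="${maxrss%K}"

    padded_kb=$(( kb * 12 / 10 ))                # add 20 % headroom
    total_mb=$(( (padded_kb + 1023) / 1024 ))    # round up to whole MB
    echo $(( (total_mb + cores - 1) / cores ))   # round up per core
}

suggest_mem_per_cpu_mb 819200K 2   # prints 480
```

The result would go into the batch script as `#SBATCH --mem-per-cpu=480` for a job that peaked at 800 MB on two cores.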
sacct man page
Job execution priorities depend on user resource requests. If no resource limits are requested in the batch script, the system and queue defaults are used. Job priority and scheduling decisions are based on available system resources. Fair Share is applied to allocate everyone a near-equal share of the system. Below is a list of the resources considered, with the most "expensive" resource at the top:
- GPU requested
- Memory requested
- Wall Time requested
- CPUs requested
CSC's Taito cluster's documentation may be useful