Lustre User Guide

Last modified by Javier Nunez-Fontarnau on 2024/02/14 08:48

1.0 Introduction

Lustre is a de facto standard high-performance parallel filesystem found at nearly every HPC installation. It is POSIX compliant and highly scalable, supporting thousands of clients, petabytes of storage and throughput measured in gigabytes or even terabytes per second. The key components of Lustre are Object Storage Targets (OSTs), which are the actual disks, RAID arrays, etc.; Object Storage Servers (OSSs), which group the storage targets together; and Metadata Servers (MDSs), which hold the whole together. Metadata Targets (MDTs) hold the file names, permissions, types, etc. The Lustre software then presents the whole to the user as a single storage space. Lustre has some special features and useful commands, along with certain not-to-do's, which are discussed below. But first, let's introduce some basic concepts.

1.1 Lustre Mount

Lustre is mounted on Ukko2 over InfiniBand and is available at $WRKDIR on both Ukko2 and Kale. If you actively use both systems, you can keep working data in both work directories in sync. $WRKDIR is the work directory from which users run their codes. You can have group folders for your IDM group: just let us know the folder name, owner and group, and it will be created.

1.1.1 Vakka $WRKDIR access from Kale

You can access Ukko2 $WRKDIR files directly from Kale. This is not meant to be a working directory but to ease data movement between the systems. The connection is rather slow at the moment, limited by Gigabit Ethernet. We are aware of this and are working to improve the throughput.

Ukko2 $WRKDIR contents are in:

/vakka/users/$USERNAME
/vakka/group/

1.1.2 Kappa $WRKDIR access from Ukko2

This is yet to be done

1.1.3 Vakka Samba share

Ukko2 Lustre $WRKDIR is also available to workstations and other servers through Samba. Use ukko2-smb.cs.helsinki.fi to access the filesystem from your desktop. In this case the throughput is limited by the Gigabit Ethernet connection, which gives a theoretical peak of about 116 MB/s.

1.2 Quota

We are now enforcing quotas on the Lustre filesystem to protect users from possible runaway processes. Each user has a 10 TB soft limit and a 50 TB hard limit on disk space, and inode quotas of 5 million (soft) and 10 million (hard); the inode limit restrains the issues caused by millions of files. If you require more working space, let us know, and if you need substantially more on a more permanent basis on the cluster, we have some options.

See your quota:

 lfs quota -hu $USER /wrk
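
The output lists your disk and inode usage against the soft and hard limits. It looks roughly like the following; the values are purely illustrative and the exact formatting depends on the Lustre version:

Disk quotas for usr alice (uid 12345):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
           /wrk   1.24T     10T     50T       -  804211      5M     10M       -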

1.3 Periodical Cleanup

Users are obliged to remove unused files from $WRKDIR. Users leaving a research group or the university are obliged to secure access to the relevant research data for the other group members; failure to do so may cause irrevocable loss of data. We reserve the right to run scripts that remove files whose modification date is older than 60 days, to make sure that the filesystem is not clogged up with old files. We will give a couple of days' advance notice before a cleanup is run.

Cleanup priorities:

  • Users whose accounts expired more than 6 months ago
  • Users whose accounts expired more than 1 month ago
  • Current users, resident data older than 60 days

Users can reduce the need for global cleanups by removing old data when it is no longer in use. Should a user use any artificial mechanism to keep files current in order to avoid cleanup, that user is subject to repercussions, including but not limited to removal of all their data. If you need an exemption, please contact us at helpdesk@helsinki.fi.

Storage investors are exempt from the cleanup process up to the invested capacity.

2.0 Striping

We have dynamic file striping in place: the stripe count is progressively increased as a file is written. Files under 4 MB are automatically given a stripe count of 1, files from 4 MB to 4 GB a stripe count of 4, and larger files are striped across all available OSTs. Under normal use, you have no need to concern yourself with the stripe count.

File striping is a powerful feature of Lustre, but if abused it can also be dangerous. Striping allows a file to be divided into a number of pieces, each written in parallel to a separate Object Storage Target (a physical disk or storage array, OST for short). This provides two benefits: load balancing, and a substantial performance gain in situations where parts of the file are written simultaneously to multiple OSTs. Additionally, it allows a user to write a single file that is larger than a single OST.

It is important to note that striping does not automatically or always provide a benefit. Striping a very small file (less than 4 MB) increases overhead due to the number of objects and the extra network traffic; in this case, increasing the stripe count will hurt performance.

You can set the stripe count for a file or a directory. For a file, you need to create an empty file with the desired stripe count and then start writing data into it. For a directory, each file written into it and each subdirectory created in it inherits the stripe count from the parent. This enables a directory structure with different stripe settings for different file sizes.
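
A minimal sketch of both cases (the file and directory names are hypothetical):

# create an empty file with a stripe count of 4, then write data into it
lfs setstripe -c 4 bigfile.dat
dd if=/dev/zero of=bigfile.dat bs=1M count=512

# set a default stripe count on a directory; new files and subdirectories inherit it
mkdir hugefiles
lfs setstripe -c 4 hugefiles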

Note! If you are parallelising I/O, you will be interested in striping and in stripe-aligning your MPI I/O patterns.

2.1 How do I know the current Stripe Count of a File or Directory?

A small utility called lls can be used instead of ls -l to see the stripe counts of files and directories on the Lustre filesystem. (Feature deferred.)

 

lls <Lustre file|Lustre directory>

It will produce output like the following, where the first column indicates the stripe count of a file or directory:

lls ior/
1 total 420
7 -rwxr-xr-x 1 user hyad-all  34532 Sep 21 18:42 config.status
7 -rwxr-xr-x 1 user hyad-all 192262 Sep 21 18:42 configure
7 -rw-r--r-- 1 user hyad-all   3346 Sep 21 18:39 configure.ac
1 drwxr-xr-x 3 user hyad-all   4096 Sep 21 18:43 contrib
1 drwxr-xr-x 2 user hyad-all   4096 Sep 21 18:42 doc
1 drwxr-xr-x 2 user hyad-all   4096 Sep 21 18:39 scripts
1 drwxr-xr-x 4 user hyad-all   4096 Sep 21 18:43 src
1 drwxr-xr-x 2 user hyad-all   4096 Sep 21 18:39 testing

 

2.2 Choosing the Right Stripe Count

We use Progressive File Layout (PFL), which means that files are striped across storage targets dynamically as their size grows. If you use mkdir to create directories and do not touch the stripe settings, the PFL settings are inherited by every subdirectory. That said, it is still possible to alter directory and file striping manually, and in some rare cases you may wish to do so.

2.3 Advanced: Setting and Optimizing the Stripe Count

You can find out the current stripe settings:

 

lfs getstripe -d directory|filename

You can set the stripe count manually by first creating the directory and then tuning the stripe count. Note that all subdirectories will inherit the stripe count of the parent. For example, some very specific HDF5 write patterns may benefit from targeted stripe settings.

 

lfs setstripe -c stripe_count directory|filename

Note: You cannot change the striping of an existing file, because the stripe settings are applied when the file is created and written. Should you need to do that, first create a new file with the proper settings and then copy the existing file into it. If you do change the stripe count of an existing file, it will only take effect if the file is recreated.
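
A minimal sketch of re-striping by copying (the file names are hypothetical):

lfs setstripe -c 8 data.new    # create an empty file with the desired layout
cp data.old data.new           # copy the contents into the newly striped file
mv data.new data.old           # replace the original; the rename keeps the new layout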

2.4 Advanced: Setting a Stripe Size

You can also change the stripe size, but you should not normally need to. However, there may be specific situations where it is useful, as it provides finer-grained control over the I/O operations.

The option -s allows you to set the stripe size in bytes. You can use the suffixes k, m and g to specify larger sizes.
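
For example, a sketch setting a 4 MB stripe size together with a stripe count of 8 on a hypothetical directory (on newer Lustre releases the stripe size option may be spelled -S instead of -s):

lfs setstripe -s 4m -c 8 output-dir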

2.5 Finding Things

The regular find command is fairly taxing on the metadata servers; you should use the lfs utility instead:

lfs find <normal find syntax>
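
For example, to list regular files under a hypothetical project directory that have not been modified in the last 60 days (the threshold used by the periodic cleanup):

lfs find $WRKDIR/myproject -type f -mtime +60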

3.0 Lustre Commands

The most commonly used commands a user will encounter are related to checking the OSTs and their current state or available space, and to setting stripe counts for files and directories as discussed in the previous chapter.

To list the usage per OST:

 

lfs df -h <optional directory>

Inode usage, either for the whole filesystem or for a given directory:

 

lfs df -i <optional directory>

To list all OSTs available on the system:

 

lfs osts <optional directory>

Should Lustre become unresponsive, or should you experience any issues, however trivial, please contact us at helpdesk(at)helsinki.fi. By contacting us in a timely manner, you help us tune the filesystem, provide better instructions and tutorials, and catch possible trouble before it escalates.

3.1 Command Synopsis

Generally it is a good idea to avoid the regular find (use lfs find instead) and long-format ls (ls -la, for example) due to the metadata load they generate on the MDS. Lustre has a utility called lfs. A man page is available for details; a short synopsis of the available commands is given below.

lfs

lfs check <mds|osts|servers>

lfs df [-i] [-h] [--pool|-p <fsname>[.<pool>]] [path]

lfs find [[!] --atime|-A [-+]N] [[!] --mtime|-M [-+]N] [[!] --ctime|-C [-+]N]
    [--maxdepth|-D N] [--name|-n pattern] [--print|-p] [--print0|-P]
    [[!] --obd|-O <uuid[s]>] [[!] --size|-S [-+]N[kMGTPE]] [--type|-t {bcdflpsD}]
    [[!] --gid|-g|--group|-G <gname>|<gid>] [[!] --uid|-u|--user|-U <uname>|<uid>]
    <directory|filename>

lfs osts [path]

lfs getstripe [--obd|-O <uuid>] [--quiet|-q] [--verbose|-v]
    [--count|-c] [--index|-i|--offset|-o] [--size|-s] [--pool|-p]
    [--directory|-d] [--recursive|-r] <dirname|filename> ...

lfs setstripe [--size|-s stripe-size] [--count|-c stripe-cnt]
    [--index|-i|--offset|-o start_ost_index] [--pool|-p <pool>]
    <directory|filename>

lfs setstripe -d <dir>

lfs poollist <filesystem>[.<pool>] | <pathname>

lfs quota [-q] [-v] [-o obd_uuid] [<-u|-g> <uname>|<uid>|<gname>|<gid>] <filesystem>

lfs quota -t <-u|-g> <filesystem>

lfs quotacheck [-ug] <filesystem>

lfs quotachown [-i] <filesystem>

lfs quotaon [-ugf] <filesystem>

lfs quotaoff [-ug] <filesystem>

lfs quotainv [-ug] [-f] <filesystem>

lfs setquota <-u|--user|-g|--group> <uname|uid|gname|gid>
    [--block-softlimit <block-softlimit>] [--block-hardlimit <block-hardlimit>]
    [--inode-softlimit <inode-softlimit>] [--inode-hardlimit <inode-hardlimit>]
    <filesystem>

lfs setquota <-u|--user|-g|--group> <uname|uid|gname|gid>
    [-b <block-softlimit>] [-B <block-hardlimit>]
    [-i <inode-softlimit>] [-I <inode-hardlimit>]
    <filesystem>

lfs setquota -t <-u|-g>
    [--block-grace <block-grace>] [--inode-grace <inode-grace>]
    <filesystem>

lfs setquota -t <-u|-g>
    [-b <block-grace>] [-i <inode-grace>]
    <filesystem>

lfs help

4.0 Best Practices

Lustre is somewhat different from other filesystems, and there are several established practices that help all users. Please note that these guidelines also apply to affiliated installations, such as Aalto or CSC, when running jobs there. Keep in mind that on a parallel filesystem, concurrent file access from multiple hosts can lead to heavy lock contention.

4.1 File Locking in Lustre is not an advisory to be ignored

The filesystem does all it can to maintain the integrity of files and cache coherency, but badly planned concurrent file operations can cause significant harm. For example, writing, deleting and reading the same file(s) at the same time from multiple nodes, and/or in parallel, would cause data corruption without enforced locking. Instead, the Lustre lock manager is put in an irresolvable situation, which is then visible to all users as extremely slow metadata operations on /wrk. PLEASE consider the I/O side of your workload accordingly. Cache coherency across every component in the cluster (clients, MDTs, OSTs) is managed by the Lustre Distributed Lock Manager (LDLM). This ensures system-wide integrity, but it also means that users are responsible for doing things right.

  • Do not perform concurrent operations, e.g. writing and removing the same files at the same time. If you fully understand how Lustre handles concurrent file operations, e.g. reading from or writing to the same file concurrently, then you know what to do. If you don't understand, don't do it. Oak Ridge National Laboratory, for example, has numerous tutorials.
  • Do not manipulate files and directories of running batch jobs from the login node (for example by removing "unused" directories and files).
  • The move command (mv) is actually (cp + rm). Consider this carefully when using mv.
  • Always include checks in your scripts to make sure that the previous command (rm, cp, mv, cd) executed successfully before moving on. If a command fails, make the entire script/job fail (see the sketch after this list).
  • If you include cleanup in batch jobs, place the cleanup portion at the end of the job, or make sure that the cleanup does not touch file structures used by running batch jobs.
  • Do not mix runtime directories in such a way that many batch jobs use the same files/directories on the assumption that the jobs will execute one by one.
  • Never assume that a job that runs on your laptop using a local disk will automatically run fine if you start hundreds of jobs simultaneously (a common case being all of them using the same files and directories).
  • Please understand that every file and every directory on the cluster is accessible from tens of thousands of processes simultaneously. I/O may scale out of hand unexpectedly.
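
A minimal bash sketch of such checks (the paths and names are hypothetical); set -e makes the script abort on the first failing command, and explicit tests work as well:

#!/bin/bash
set -e                              # abort the script if any command fails

cd "$WRKDIR/myrun"                  # set -e stops here if the directory is missing
cp input.dat input.work

# an explicit check, if you prefer control over the error message
if ! mv results.tmp results.dat; then
    echo "mv failed, aborting" >&2
    exit 1
fi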

4.1.1 Examples

You can use SLURM environment variables to control the naming of directories and files. For example, within your batch script you could create a directory for temporary files:

mkdir $WRKDIR/temporary-files.$SLURM_JOB_ID
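
A minimal sketch of a batch script that creates a per-job directory and cleans it up at the end (the job name, time limit and program are hypothetical):

#!/bin/bash
#SBATCH --job-name=myrun
#SBATCH --time=01:00:00

TMPDIR_JOB="$WRKDIR/temporary-files.$SLURM_JOB_ID"
mkdir -p "$TMPDIR_JOB" || exit 1    # fail early if the directory cannot be created

./my_program --scratch "$TMPDIR_JOB"

rm -rf "$TMPDIR_JOB"                # cleanup only at the very end of the job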

You can also use the SLURM filename patterns to control your file and directory structure:

4.1.1.1 SLURM Filename Pattern

sbatch allows for a filename pattern to contain one or more replacement symbols, which are 
a percent sign "%" followed by a letter (e.g. %j).

\\  Do not process any of the replacement symbols.
%% The character "%".
%A Job array's master job allocation number.
%a Job array ID (index) number.
%J jobid.stepid of the running job. (e.g. "128.0")
%j jobid of the running job.
%N short hostname. This will create a separate IO file per node.
%n Node identifier relative to current job (e.g. "0" is the first node of the running job)    
    This will create a separate IO file per node.
%s stepid of the running job.
%t task identifier (rank) relative to current job. This will create a separate IO file per task. 
%u User name.
%x Job name.

A number placed between the percent character and format specifier may be used to zero-pad 
the result in the IO filename. This number is ignored if the format specifier corresponds 
to non-numeric data (%N for example). Some examples of how the format string may be used 
for a 4 task job step with a Job ID of 128 and step id of 0 are included below:

job%J.out -> job128.0.out
job%4j.out -> job0128.out
job%j-%2t.out -> job128-00.out, job128-01.out, ...

See sbatch man page for more details.
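
For example, the patterns are typically used in the --output and --error directives of a batch script (the job name here is hypothetical):

#SBATCH --job-name=analysis
#SBATCH --output=%x-%j.out     # e.g. analysis-123456.out
#SBATCH --error=%x-%j.err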

4.2 Do not perform append operations from multiple processes to a single file

Unless you know exactly what you are doing, do not perform append operations to the same file at the same time from multiple hosts or processes.

4.3 Number of Files

Opening a file locks its parent directory. If thousands of files are opened simultaneously, this creates contention. It is more efficient to spread very large quantities of files over multiple subdirectories.

4.4 Small Files

Accessing small files on Lustre, or in fact on any filesystem, is not efficient. If you need to use tens of thousands, or millions, of files of just a few kilobytes each, it is much more efficient to store the data with mechanisms such as HDF5, FUSE archiveFS, FUSE encFS, or a database. If you do need to use very small files on Lustre, use a stripe count of 1. Files smaller than 64 kB reside on the MDT only.

We understand that small files are used because they are convenient and easy, until the moment they become really painful to handle. However, any filesystem will struggle with Big Data composed of small files.

4.5 Metadata heavy I/O operations

Users should avoid metadata-heavy I/O operations such as 'ls -l', 'ls -lR', and other recursive file operations over very large numbers of files. In particular, batch job scripts doing these kinds of operations will hurt overall application performance without much added benefit.

Never use find. Use lfs find instead.

4.6 Backup

We do not have practical means to back up the Lustre $WRKDIR. It is therefore important that you keep copies of your source code and other valuables in the project directory /proj/$USER ($PROJ defaults to /proj/$USER). It is always advisable to take care of backing up unique data.

4.7 Executables

Although executables do work from Lustre, we advise you to keep your executables in /proj/$USER ($USERAPPL and $PROJ), because /wrk is not backed up.

4.8 Repetitive Reads and Writes

If you write to a file multiple times during a job, open the file once at the beginning of the job and close it once at the end.

4.9 Checking the existence of a file constantly

If you do excessive "stat" calls on Lustre, you will hurt the performance of your own application because of the load this induces on the metadata service. If you really need to poll for a file, add a sleep to the logic to slow down the testing.
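
A minimal sketch of such a polling loop (the file name is hypothetical):

# check for the result file at most once per minute instead of in a tight loop
while [ ! -f "$WRKDIR/myrun/results.done" ]; do
    sleep 60
done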

4.10 Multiple processes accessing the same file simultaneously

If you have multiple processes attempting to open the same file or files, you may end up in a situation where the application fails because it cannot find the file. In this case you should use a sleep between I/O operations to avoid lock contention. In an extreme case of lock contention, where multiple processes on multiple nodes attempt to write, rename, read and remove the same files simultaneously, the issue may become global. Therefore, make sure that your program performs its I/O in a sensible manner regardless of the filesystem you use.

4.11 Databases

Do not use Lustre for databases at this time. Because of the way Lustre handles locks, databases will not function as expected; use /proj/$USER for databases instead. We may eventually offer ways to use Lustre as a database platform, but we still have to investigate the possible options.

5.0 Parallel I/O

Lustre users should be aware that MVAPICH (which is rather excellent) and standard OpenMPI have built-in Lustre support for parallel I/O. By using MPI I/O hints, an application may get a substantial performance boost, but bear in mind that the hints are advisory and may not be honored. You can set the stripe count, stripe size and number of writers in the MPI code (MVAPICH is available through the module system).

In case of Fortran:

call mpi_info_set(myinfo, "striping_factor", stripe_count, mpierr)
call mpi_info_set(myinfo, "striping_unit", stripe_size, mpierr)
call mpi_info_set(myinfo, "cb_nodes", num_writers, mpierr)

And in case of C:

MPI_Info_set(myinfo, "striping_factor", stripe_count);
MPI_Info_set(myinfo, "striping_unit", stripe_size);
MPI_Info_set(myinfo, "cb_nodes", num_writers);

 

By default, the number of writers equals the number of Lustre stripes. This is also referred to as stripe alignment.

5.1 Intel MPI and Collective Buffering

The Intel MPI library performs collective buffering when the I_MPI_EXTRA_FILESYSTEM and I_MPI_EXTRA_FILESYSTEM_LIST variables are set as below:

mpiexec -env I_MPI_EXTRA_FILESYSTEM on \
        -env I_MPI_EXTRA_FILESYSTEM_LIST lustre \
        -np xx a.out
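
Alternatively, a sketch of the same settings exported as environment variables before launching (assuming a shell environment from which mpiexec picks them up):

export I_MPI_EXTRA_FILESYSTEM=on
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
mpiexec -np xx a.out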

 

That said, please note that, for example, replacing MPI_File_write_all() with MPI_File_write() could provide the same functionality but with disastrous consequences for performance. When optimizing Lustre I/O performance, think of collective I/O.

5.2 HDF/Hadoop/Spark

Most Big Data tools, for example HDFS, Hadoop and Spark, are Lustre aware. You should seriously consider spending some effort to optimize their I/O performance because of the substantial gains you can make. We would be more than glad to hear about your contributions in this field.

6.0 Advanced I/O Profiling

You can use strace to get an idea of the I/O profile of your application. This information can then be used to analyse the behaviour and to help you optimise your application's reads and writes. strace traces the execution of system calls, including read, write, open, etc.

6.1 Basic case of strace

For example:

strace -f -tt -T -e trace=open,close,read,write,lseek -o <output> <executable>

In this case, options are as follows:

-f        will trace child processes from fork()
-tt       tells strace to record time in microseconds
-T        shows time spent in system call
-e trace= will trace only these system calls

Do not write the strace output to the filesystem you are testing. Since you cannot always launch the application under strace, you can use the -p parameter to attach it to an existing process ID:

-p <pid>
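
For example, a sketch of attaching to an already running process (the PID and output path are hypothetical):

strace -f -tt -T -e trace=open,close,read,write,lseek -p 12345 -o /tmp/io-trace.out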

 

6.2 MPI Case of Strace

Stracing MPI-IO is somewhat more complicated, and you can find a pretty good example here. However, since you likely have more important things to do, the University of Stuttgart has made life easier by publishing a strace I/O analyzer on GitHub. Essentially it is a wrapper that makes a somewhat complicated matter a whole lot easier. You can also use it to analyze the I/O of serial tasks.

7.0 Lustre File Locking

Lustre file locking is managed by the Lustre Distributed Lock Manager (LDLM), which provides lock coherency across the filesystem. Oak Ridge National Laboratory has an excellent overview tutorial of Lustre file locking. Creating a file locking scheme is non-trivial on any parallel filesystem, so if you do not have to, do not create your own; use existing, proven concepts instead.

8.0 Additional Reading

MPI-IO Course material from University of Illinois

Lustre Troubleshooting Guide

https://www.nas.nasa.gov/hecc/support/kb/lustre-best-practices_226.html

https://www.nas.nasa.gov/hecc/support/kb/lustre-basics_224.html 

Version 1.3