DON'T PANIC

1.0 Lustre Components

To understand Lustre operation and troubleshooting, several key components need to be explained. The purpose of this document is to give basic highlights of the components, the situations where things may start going wrong, and how to recover from them without data loss.

1.1 MDS

MDS stands for Metadata Server. The metadata server manages the metadata operations: file naming, and the locations of file data on the OST's. Without DNE (see 1.9), a filesystem has only one active MDS. MDS does not generate a huge amount of network traffic, but MDS CPUs can be loaded heavily by small file I/O operations.

1.2 MDT

MDT stands for Metadata Target. It is the storage which contains the information the metadata server manages. MDT should be as well protected as possible, preferably RAID10, since nothing can be recovered from the OST's should the MDT fail or become permanently corrupted. Metadata disk usage is small compared to the OST's, but fast read/write speeds help performance, especially with small files. Hence SSD RAID 10.

1.3 MGS

MGS stands for Management Service and it is used as a global resource to support multiple filesystems. It stores the configuration information and provides the information to other Lustre hosts. MGS does not require significant resources.

1.4 MGT

MGT stands for Management Target which stores the management data. MGT is very modest and data stored is usually very little compared to MDT.

1.5 OSS

OSS stands for Object Storage Server. These are the storage servers which run the Lustre software stack and provide the I/O services to the end users. OSS handles the network operations and coordinates the file locking with MDS. If you need to know the system bandwidth, you can sum up the bandwidths of the OSS's to get a good approximation. A reliable network with good bandwidth matters for OSS, and hence Infiniband is a good choice. OSS's can be added to, or removed from, a running system without interruption or data loss.

1.6 OST

OST stands for Object Storage Target. These are the disks, RAID arrays, or other storage media where all the file data is located. OST's do not need to be identical, nor do they need to be the same size. OST's can be added to, or removed from, a running system without interruption or data loss.

1.7 Clients

Each server, host or cluster node where the Lustre filesystem is mounted is considered a client. Lustre can support thousands of clients, and even multiple supercomputer systems simultaneously. Clients can be connected to Lustre by a shared interconnect, e.g. an Infiniband connection shared between the cluster components and the OSS's, or, with the help of an LNET router, through other networking protocols such as Ethernet.

1.8 LNET

LNET stands for Lustre Networking, a concept which in its basic form includes the networking between Lustre MDS, OSS's and clients. Elaborate LNET configurations can grow very complex to build and troubleshoot. It is advised to keep it simple and understandable. The most prominent features of elaborate LNET configurations are various redundant configurations and routing possibilities.

To view and manipulate the Lustre network, and whatever Lustre sees on the network, you can use the following command:

lctl

Or, more advanced, and much more powerful one:

lnetctl

LNET is a separate networking concept from what the system sees as networking. Therefore one should not assume that the system network is automatically seen by Lustre, or that traffic is routed in the expected fashion. If Lustre fails, LNET or the underlying network layer is usually a very good suspect.

LNET global settings (note that it is recommended to increase the transaction timeout to a more conservative value of 50, and to lower the retry count to 2):

lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 3
    transaction_timeout: 10
    health_sensitivity: 100
    recovery_interval: 1
    router_sensitivity: 100

To change the values:

lnetctl set transaction_timeout 50
lnetctl set retry_count 2
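
As a rough sketch, the two recommended values can be checked automatically before changing anything. The function name check_lnet_globals is hypothetical; it only reads text in the format of the "lnetctl global show" output above and prints the lnetctl commands that would be needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lnetctl global show` output on stdin and
# prints the lnetctl command for each setting that differs from the
# values recommended in this document (transaction_timeout 50,
# retry_count 2). It changes nothing by itself.
check_lnet_globals() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }   # strip the YAML indentation
        $1 == "transaction_timeout" && $2 != 50 { print "lnetctl set transaction_timeout 50" }
        $1 == "retry_count"         && $2 != 2  { print "lnetctl set retry_count 2" }
    '
}

# On a Lustre host: lnetctl global show | check_lnet_globals
```

If nothing is printed, both settings already have the recommended values.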

1.9 DNE - Distributed Namespace

Distributed namespace is a property of metadata. In short, DNE allows metadata to be expanded in similar fashion to the OSS/OST's without practical upper limit. Besides housing multiple MDT's under a single MDS, one can also house multiple MDT's under multiple MDS's. Whether these can be arranged in failover fashion remains an open question. Striping over up to 8 MDT's can significantly improve the metadata performance, but when striping beyond that you could experience a drop.

1.10 DoM - Data on Metadata

One of the drawbacks of Lustre has always been its weak ability to handle large quantities of small file I/O operations. Reasons for this are rather obvious. An at least partial solution is to house the small files (under 1MB) directly on the Metadata Targets. This is very handy especially if you have SSD's on the MDT's. Depending on the MDT size and the stripe sizes, you can limit the size of the files that are written only on the MDT's. A low value is 64k.

1.11 PFL - Progressive File Layout

Progressive File Layout is a feature that allows a file to be written on MDT and OST's in progressive fashion without Lustre needing to know the size of the file ahead of time. Stripe size increases in a predetermined fashion as the file keeps growing until EOF. PFL can be adjusted for best possible performance from the system, depending on the capabilities of MDS and OSS/OST's.

2.0 Lustre Resiliency

Lustre supports various levels of failover. The most common resiliency features are listed below.

2.1 OST RAID Level

The most obvious way to provide resiliency is to have OST's on either RAID6 or RAID10. The latter is more expensive as far as storage is concerned, but also provides important resiliency and performance benefits, especially in cheaper, commodity arrangements. RAID10 performance does not drop as radically in case of a failure as RAID6 performance does, which may contribute to the overall stability of the system.

2.2 OSS Failover

OSS's can be configured for failover. In practical terms this means that one OSS is configured to take the place of one which is failing, without interruption of service. OSS failover can be configured at a later time, given sufficient care.

2.3 MDT & MGT RAID Level

MDT and MGT can, and often do, reside in the same volume. However, a true failover configuration requires these to be separate volumes. RAID level for both should be 10.

2.4 MDS & MGS Failover

Two MDS's can be configured as a failover pair. In this case one of them will take over should the other fail. This does however require that both MDS nodes share the same storage.

2.5 File Level Replication - FLR

File level replication is another feature of Lustre providing fault tolerance. Replication takes Lustre closer to Ceph.

2.6 Nodemap

Nodemapping allows users and groups to be mapped to different UID/GID's. Most commonly it is used to squash the root user on all but named nodes. In Lustre nodemapping is altered with lctl tools, and the node mapping information is written on the MGT.

2.7 Posix ACL - Access Lists

Lustre does have support for regular POSIX Access Lists.

2.8 PCC - Persistent Client Cache

PCC is a feature scheduled to be included in 2.13GA in 2Q2019. It allows node local SSD's to be used to accelerate the I/O operations by acting as a cache between Lustre and the compute node. In some ways PCC has been in use in several instances, such as Tianhe, and in a way on Cray DataWarp.

3.0 Lustre Startup & Shutdown

Refer to the Lustre installation documentation for the actual build of Lustre. Starting Lustre is a fairly straightforward business. By mounting an MGT or OST you also start the MGS or OSS service, and so on. Sequence is important. MGS needs to be started first, then MDS needs to be running before you attempt to mount OST's and start OSS's. mkfs.lustre creates the connection to the MDS which is required at mount time. The usual catch here is either a wrong startup order, a network issue that prevents communication between the MGS and MDS or OST, or a start attempted before the service is responsive. It is possible that it takes a while for MGS/MDS to start, and a grace period (sleep N) may be needed between the start of the management and metadata services and the start of the OSS services.

3.1 Kernel Modules

Lustre kernel modules can be loaded the following way, which will also load lnet as a dependency. However the behaviour is somewhat different than when loading lnet separately.

modprobe -a lustre

Should the need arise, kernel modules need to be removed with the Lustre-specific tool to do the job gracefully.

lustre_rmmod

Note that lustre_rmmod assumes that if lnet is the only module left, it is performing a routing function, and it will not be unloaded. To unload it (the error says that the module is busy), you will need to explicitly unconfigure lnet.

lnetctl lnet unconfigure 

3.2 Startup

Order of startup is:

mount -t lustre /dev/vg00/mgs1 /mnt/mgs
mount -t lustre /dev/vg00/mdt1 /mnt/mdt
sleep 180
mount -t lustre /dev/vg00/ost1 /mnt/ost1
mount -t lustre /dev/vg00/ost2 /mnt/ost2
mount -t lustre /dev/vg00/ost3 /mnt/ost3
mount -t lustre /dev/vg00/... /mnt/...

Theoretically you can opt to start MGS, then OSS, MDS and finally clients, however this expects that the configuration is static. In this case you cannot add new storage to Lustre.
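
Instead of a fixed sleep, the grace period can also be handled by retrying a mount until it succeeds. The sketch below is only an illustration: the retry function is hypothetical, and it assumes that mount exits with a non-zero status while the MGS/MDS is not yet responsive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds, at most $1 times,
# sleeping $2 seconds between attempts.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    local attempts=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Example (device and mount point taken from the startup order above):
#   retry 18 10 mount -t lustre /dev/vg00/ost1 /mnt/ost1
```

With 18 attempts and a 10 second delay the worst case wait is roughly the same as the sleep 180 used above, but the OST's are mounted as soon as the services respond.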

3.2.1 Startup scripts

If OSS or MDS nodes are booted, the following script needs to be executed on each host: first on the mds node, then on oss01 and finally on oss2. It may produce some errors (ZFS related) that are not relevant to the start of Lustre.

/root/lustre-start.sh

After starting Lustre, you should see mds and mdt mounted on the MDS node, and ost01...ost07 on the OSS's.

3.3 Shutdown

To shut down Lustre cleanly, you need to first unmount all the clients, then the MDT(s) (unmounting the MDT(s) will shut down the service), then the OST(s) on the OSS(es), and finally the MGS. It is important to recognise that MDS is the root of all I/O operations, and shutting it down first prevents any further I/O. New files cannot be created or removed if the MDT is offline.

On the client host:

umount /lustre/<fsname>

Stop the Metadata Services (MDS host(s)):

umount /lustre/<fsname>/mdt<n>

Stop the OSS Services (OSS hosts):

umount /lustre/<fsname>/ost<n>

Stop the Management Service (MGS host) if separated from the MDS:

umount /lustre/mgt

3.3.1 Stopping scripts

If OSS or MDS nodes need to be shut down cleanly, the following script needs to be executed on each host: first on the OSS nodes, then on the MDS node.

/root/lustre-stop.sh

4.0 Lustre Logging

Lustre can be very, very verbose and it is not uncommon to see the Lustre start with error messages - usually a lot of them. This however is not an indication of failed start, or that there is anything particularly wrong with the system. Most often the useful clues are found from

dmesg -T

And alternatively from system logs

/var/log/messages

Sometimes error messages are difficult to decipher. A recent example of a dmesg log with an error message about Infiniband connectivity and rejection of a client:

 LNetError: 159121:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.2.xxx.xxx@o2ib rejected: consumer defined fatal error

and from MDS node:

LNetError: 19847:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) 10.2.xxx.xxx@o2ib: REJECTED 28

These actually indicated a fatal bug in the Linux kernel Infiniband layer (RDMA), not in Lustre itself. In these kinds of circumstances it is important to dig deeper and not assume Lustre or LNET to be the root cause.

4.1 Changelog reader

You can enable the Lustre changelog reader. Note that once a reader is registered, changelogs keep accumulating even if the service using them has been stopped. You have to register the changelog reader on all metadata servers. Note that a reader ID only exists once, e.g. you cannot create reader cl1, then remove cl1 and recreate it:

lctl --device lustre-MDT0000 changelog_register

And disable it with the following. Disabling the changelog reader will clean the stored logs. You have to deregister the changelog reader on all metadata servers.

lctl --device lustre-MDT0000 changelog_deregister cl1

A clear indication that you are starting to run into a problem with accumulated changelogs can be found in dmesg (rc = -28 is ENOSPC, no space left on device):

LustreError: 5144:0:(mdd_dir.c:1061:mdd_changelog_ns_store()) lustre-MDD0001: cannot store changelog record: type = 1, name = 'sh-thd-1849266489385', t = [0x24000faf3:0x1d:0x0], p = [0x240000400:0xf:0x0]: rc = -28

4.1.1 Changelog reader and orphan objects

Orphan objects are cleaned at a rate of 1 billion entries per two hours, so Lustre startup may appear hung while this is going on. Just get another coffee, and wait.

5.0 Load Balancing

Unbalanced load among OST's can become a problem when usage is heavy and many users use a stripe count of 1. While Lustre attempts to distribute load evenly, manual intervention may be required to even out the OST usage. Lustre is more concerned about balancing write performance than about keeping an eye on the OST usage. Load balancing problems are not common, but they may occur for example when a user unintentionally creates very large files with a stripe count of 1. Lustre data can be levelled with the lfs_migrate command. Without options the command uses the default stripe count, hence it is important to be aware of the number of small files (under 4MB), because they may cause contention if striped over more than 1 OST.

lfs_migrate /lustre/

Of course, some of the common load balancing and contention issues also depend on the number of clients.
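
To keep an eye on the OST usage, the output of lfs df can be filtered for OST's above a chosen fill level. This is a minimal sketch: the function name full_osts and the threshold are hypothetical, and it assumes the column layout that lfs df prints (UUID, 1K-blocks, Used, Available, Use%, Mounted on).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lfs df` output on stdin and prints every OST
# whose Use% is at least the threshold given as $1 (in percent).
full_osts() {
    awk -v limit="$1" '
        $1 ~ /-OST[0-9a-fA-F]+_UUID$/ {
            pct = $5
            sub(/%$/, "", pct)          # "25%" -> "25"
            if (pct + 0 >= limit + 0) print $1, pct "%"
        }
    '
}

# On a client: lfs df /lustre | full_osts 20
```

OST's listed by such a check are candidates for levelling with lfs_migrate.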

6.0 Removing OST or OSS

There are a number of occasions where removal of an entire OST or OSS is necessary, whether it be a complete overhaul or mitigation of a data loss risk. Removing an OSS or OST is never an entirely risk free operation, however it can be done relatively safely.

6.1 Deactivation

The first step in OSS or OST removal is deactivation of the chosen component. Please do pay attention: deactivation is done at the MDT (on the MDS node). Deactivation does not prevent read operations, but does prevent further writes to the OSS or OST. To find out which OST's are active at this time, you can use

lfs osts 

And the following to find out the device id:

lctl dl | grep osc

To deactivate the device you would do the following. In this case we are deactivating OST device 15:

[root@mds0-test ~]# lctl --device 15 deactivate
[root@mds0-test ~]# lctl dl | grep osc
10 UP osp lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
11 UP osp lustre-OST0002-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
12 UP osp lustre-OST0003-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
13 UP osp lustre-OST0004-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
14 UP osp lustre-OST0005-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
15 IN osp lustre-OST0006-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5

Now that the device has been deactivated, you can proceed to data migration.
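
Before starting the migration it is worth confirming which devices really are inactive. A small sketch, assuming the lctl dl column layout seen in the listing above (device id, state, type, name, ...); the function name inactive_devices is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `lctl dl` output on stdin and prints the
# device id and name of every device in state IN (inactive).
inactive_devices() {
    awk '$2 == "IN" { print $1, $4 }'
}

# On the MDS: lctl dl | grep osc | inactive_devices
```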

6.2 Data Migration

To keep the data safe and sound, the deactivated OSS/OST needs to be emptied and its data migrated to other OST's. Please do check that you have sufficient capacity available to perform the migration before proceeding. As before, the lfs_migrate command can be used to move data away from the chosen OST or OSS. First you need to find out which files are striped to OST6:

[root@oss1-test lustre]# lfs find --ost 6 /lustre

Migrate files away from ost6:

time lfs find --obd lustre-OST0006_UUID /lustre | lfs_migrate -sy

Check again which files have content on ost6:

lfs find --ost 6 /lustre

If migration has been successful, OST should be now empty. Ordinarily, there should be no issues with the migration.

Although lfs find is fast, performing recursive operations and migration can be very, very time consuming, especially when a lot of small files are involved. We are looking for ways to do it more efficiently.

6.3 Removal

Once the OSS/OST is deactivated, and the contents migrated out, the component is ready to be removed from the configuration, and finally physically from the system if so desired. 

7.0 Adding OST or OSS

In principle the process of adding new Storage Targets or Storage Servers is that they are built, mounted and started, and then data is migrated over to the new layout. This can be done during normal operation without any problems. Below is an example of newly added OST's 5 & 6 and data migration:

[root@oss1-test lustre]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 588032 4036 533176 1% /lustre[MDT:0]
lustre-OST0001_UUID 1933276 105568 1701708 6% /lustre[OST:1]
lustre-OST0002_UUID 1933276 133224 1670828 7% /lustre[OST:2]
lustre-OST0003_UUID 1933276 448320 1354712 25% /lustre[OST:3]
lustre-OST0004_UUID 1933276 121900 1683428 7% /lustre[OST:4]
lustre-OST0005_UUID 1933276 25772 1786264 1% /lustre[OST:5]
lustre-OST0006_UUID 1933276 25776 1786260 1% /lustre[OST:6]

Migrating the data to new OST's with default stripecount (1):

[root@oss1-test lustre]# lfs_migrate /lustre/
[root@oss1-test lustre]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 588032 4036 533176 1% /lustre[MDT:0]
lustre-OST0001_UUID 1933276 113852 1698184 6% /lustre[OST:1]
lustre-OST0002_UUID 1933276 84064 1727972 5% /lustre[OST:2]
lustre-OST0003_UUID 1933276 36920 1775116 2% /lustre[OST:3]
lustre-OST0004_UUID 1933276 80340 1731696 4% /lustre[OST:4]
lustre-OST0005_UUID 1933276 467600 1340316 26% /lustre[OST:5]
lustre-OST0006_UUID 1933276 87144 1720772 5% /lustre[OST:6]

8.0 Migrating MGT or MDT

Migrating MGT/MDT to another device can be done with dd if the devices are the same size. However if their size differs, you may be able to do a resize, even though it has not been thoroughly tested. Since you do nothing besides reading from the old device, it will remain as a useful backup in case dd fails.

9.0 A Risk of Data Loss

Under normal operating conditions, if Lustre detects a fault, whether it is related to network connectivity or loss of servers or targets, it will cease the operations which involve the failed components. See detailed descriptions in the Lustre Recovery section (11.0). This happens even if failover has not been configured.

9.1 Permanent Loss

In the unfortunate event that an entire OST is lost due to some permanent cause and becomes unrecoverable, it does not mean that everything in Lustre is gone. Depending on the stripe counts, it is possible that at least some files are recoverable.

Loss of MDT will cause the entire Lustre to fail and cause irretrievable data loss.

9.2 Temporary Loss

Temporary loss, say a power failure, will not cause permanent data loss, since Lustre will stop access to files which are striped to unavailable OST's. Files which have striped parts that are not available because the OST or OSS is out of action are represented by ? when listing files.

Temporary loss of MDS or MGS is unlikely to cause irretrievable data loss.

9.3 Loss of OSS or MDS

Loss of OSS or MDS/MGS does not cause data loss on the OST's even if no failover has been configured. The only result is that replacing the OSS, MDS or MGS will cause a longer than normal service break.

9.4 Loss of MGS or MGT

Loss of MGS or MGT does not constitute a fatal event. The system can lose MGS/MGT during normal operation and the only things affected would be the lctl tools and the possibility to mount new clients. All mounted services would continue operating normally. MGT can be rebuilt when all services and clients have been shut down by issuing mkfs.lustre with appropriate parameters. It is important to note that you will need to run tunefs.lustre on all targets, both OST and MDT:

 tunefs.lustre --writeconf <target>

Using tunefs.lustre is inherently dangerous and could lead to data corruption, especially with the --writeconf/--erase-params flags. Do not run it against a started service. Executing it does not prompt or give any warning. Recovery can be complicated.

10.0 RAID Volumes

It is an unlikely event that the volume mapping information becomes scrambled in the RAID controller, however there are at least a few recorded events where this has occurred. As long as the volumes are kept intact and only the volume mapping is scrambled, it is possible to correct the mapping and bring the system back without data loss. Therefore, it is important that the volume mappings are documented correctly at all times.

Once volume mappings are corrected, Lustre will be able to start the OST's. There is a good chance that OST's fail to start without writing anything if the mappings are wrong.

11.0 Lustre Recovery

Lustre crash recovery depends on the failure type. Root cause analysis needs to be performed, else it may not be possible to take the correct action to recover system functionality. In this section I will go through the analytical steps and failure recovery processes. The most important aspect of Lustre failures and recovery is to keep calm and avoid hasty conclusions about the cause.

11.1 Possible failures

Lustre failures may be classed in following groups.

    • Network failures
      • This may constitute packet loss
      • Can be request, or reply
    • Lustre Server or Client down
      • Possible due to SW or HW crash, power outage etc...
    • Distributed State Inconsistencies
      • This implies that multiple nodes are out of sync

11.2 Transactions

MDS and OSS execute transactions in parallel in multiple threads. Operations have two stages: first a commit to memory and then to disk. The disk commit happens in the same order, but at a later time. Transactions are batched.

11.2.1 Request and Response

The client sends a request which is allocated a transaction number at the server side. MDS responds with the transaction number back to the client. Clients then keep their requests and replies until MDS confirms the disk commit.

Each ptlrpc request is mapped to a unique Xid, which is used for an import/export pair, and the Xid is assigned to the client making the request. Under normal circumstances the client makes a request and gets a unique Xid assigned. MDS responds, allocates a transaction number and commits. The client gets a reply and checks the transaction number off its list.

11.2.2 Resending

If the client has not received a reply from MDS, either the request or the reply was lost, and the request may or may not have been executed. The client does not know which. MDS checks whether the request has already been processed and if not, MDS will execute the request. If however the request was processed, then (and this is the clever bit) MDS (and only MDS) reconstructs the reply.

Curious minds wonder how MDS knows if a request has already been processed. This is where the Xid comes into play. Since MDS processes Xid's in order, and keeps a record of the last one committed, it knows whether a request has been processed or not. MDS stores this information in the last_rcvd file.
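
The decision the MDS makes on a resent request can be illustrated with a toy model. This is not Lustre code: it assumes a single client and keeps only the last processed Xid in a shell variable, and the names handle_request, last_xid and reply are made up for the illustration.

```shell
#!/usr/bin/env bash
# Toy model of the resend check, NOT Lustre code. It assumes a single
# client and keeps only the last processed Xid in a shell variable, where
# the real MDS keeps its records in the last_rcvd file.
last_xid=0
reply=""

handle_request() {
    local xid=$1
    if [ "$xid" -le "$last_xid" ]; then
        # Already processed: do not execute again, rebuild the reply.
        reply="reconstructed"
    else
        # New request: execute it and remember its Xid.
        last_xid=$xid
        reply="executed"
    fi
}

handle_request 7; echo "xid 7: $reply"   # first delivery
handle_request 7; echo "xid 7: $reply"   # resend after a lost reply
```

The first call executes the request, the second one only reconstructs the reply, which is why a resend is safe even when the client cannot tell whether the request or the reply was lost.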

11.3 Persistent State Recovery Explained

In case of an unclean shutdown (SW or HW crash for example), servers restart and filesystems need a recovery (ldiskfs). During the crash, MDS and OSS may lose some incomplete transactions, and they rely heavily on the clients to rebuild the state before the crash. Clients therefore keep their requests and replies until servers confirm the commit.

11.3.1 Client Recovery

During the client recovery process, clients that have been disconnected connect back to the server, and the server responds with the last transaction number it has committed. After this step, requests are replayed and every request which has a reply is resent. The server then merges and sorts the requests to rebuild the correct sequence and re-executes the requests. After this the server replays locks, and should all go as intended, there should be only a few requests left.

Client eviction messages in the logs are the result of clients that have a version mismatch after recovery times out.

11.3.2 Transaction Gaps

Gaps may occur during replay of transactions. Clients may have missed transactions, or failed to offer one. During the restart, the server waits for the clients to join, but no new clients are allowed to do so before replay is complete. Correct replay requires all the clients to connect during the recovery window, and recovery starts when the first client has connected.

11.4 Version Recovery Explained

Version based recovery (VBR) improves reliability where client requests (RPC) fail to replay during the recovery window. VBR tracks the changes of inode versions. It means that each inode stores a version (practically a transaction number) and when the inode is about to change, the pre-operation version of the inode is saved in the client's data. The client is aware of the pre- and post-transaction version numbers and sends them in case of failure.

12.0 Event Logging

Mind that Lustre can be very, very verbose. During startup each of the servers provides an informational message in the logs about the successful mount of an MDT or OST. Messages start with the string "Now serving <object> on <device>". An additional string to look for is "Lustre: Server"; look around it for any messages that would indicate a mount failure. Should any server fail mounting, barring the obvious case of the server being hung/offline, you should attempt to mount (start) Lustre manually by issuing

mount -t lustre <device> <mount point>

During startup MDS and OSS try to connect to each other, and it is common to see error messages originating from the fact that servers are starting at a different pace. The most common startup timeout problems are related to the delayed start of a service. Eventually, however, every OSS should report: "Received MDS connection from...".

After the servers are up and running, then every client should report "...client has started" message.

If neither happens, do check if any of the servers are down.

12.1 Quotas

Set quota up by setting it on each MDT and each OST individually.

MDS[1-2]: tunefs.lustre --param=mdt.quota_type=ug /dev/mapper/vg00-mdt[X-Y]
OSS[1-2]: tunefs.lustre --param=ost.quota_type=ug /dev/sd[X-Y]

Adjust quota with the following, for example (this assumes that every entry under /wrk is named after the user who owns it):

for d in /wrk/*; do lfs setquota -u "$(basename "$d")" -b 10T -B 50T -i 5M -I 10M /wrk; done

If quotas are enabled, it may be possible that filesystem starts, but quotas fail to enable. Should this happen, logs should indicate the error by following entry: "abort quota recovery".

12.2 Software Issues

You will find dmesg, as well as the console logs, to be a very useful tool to see what is going on in case of failures. When looking at the events, pay attention to the following, which indicate a more or less fatal software bug:

    • LBUG
    • oops
    • ASSERT
    • Call Trace

Seeing these, additional debugging would be needed.
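
The four signatures listed above can be searched for in one pass. A minimal sketch; the function name fatal_signatures is hypothetical, and the pattern also matches "Oops" with a capital O, which is an assumption about how the kernel spells it in the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and prints only the lines
# that contain one of the fatal signatures listed above.
fatal_signatures() {
    grep -E 'LBUG|[Oo]ops|ASSERT|Call Trace'
}

# On a server:
#   dmesg -T | fatal_signatures
#   fatal_signatures < /var/log/messages
```

The exit status follows grep: it is non-zero when none of the signatures is found, which makes the function usable in monitoring scripts.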

12.3 Client Eviction

Client eviction messages are fairly common with Lustre and there are many reasons that may cause such a message. Below is an example where a client was evicted because it timed out due to a networking issue:

[879449.823151] Lustre: lustre-OST0005: haven't heard from client 4417cd67-88cf-19da-57b5-0dac13825982 (at xxx.xxx.xxx.xxx@tcp) in 231 seconds. I think it's dead, and I am evicting it.

This happens as a result of the client failing to report back in a timely manner (ldlm_timeout). Client eviction may happen on all, some, or only one of the servers.

    • If client eviction occurs on all servers (MDS, OSS), there is very good reason to suspect that the client has some real issues.
    • If eviction occurs only on one, or some of the servers, then it could indicate that there is a problem with the communications path.
      • If the communication path has been ruled out, and only one server is involved, then there is a good possibility that there is something wrong on the server itself.
    • Note that eviction may also occur on MDS but not on OSS, meaning that files can be committed, but not listed.
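
To apply the rules above one needs to know on how many targets each client was evicted. The following sketch counts that from collected server logs. The function name evictions_by_client is hypothetical, and the pattern assumes the message wording of the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads server log lines on stdin and prints, for each
# evicted client UUID, the number of distinct targets that evicted it.
evictions_by_client() {
    sed -n "s/.*Lustre: \([^:]*\): haven't heard from client \([^ ]*\) .*evicting it.*/\2 \1/p" |
        sort -u |                                   # one line per client/target pair
        awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' |
        sort
}

# With the logs of all servers collected into one file (file name is an example):
#   evictions_by_client < all-servers-dmesg.log
```

A client that shows up with a count close to the total number of targets is likely to have a problem of its own; a low count points towards the communication path or a single server.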

Client eviction may occur because of an Out Of Memory condition, in which the Linux kernel attempts to write out pages while Lustre needs to allocate memory for RDMA. This rare situation can be mitigated by limiting /tmp and the node memory usage with Slurm to terminate runaway jobs.

12.3.1 Recovery 

If a client is evicted due to networking issues, the recovery from the problem has the following signatures:

[879498.335859] Lustre: lustre-OST0006: Client 4417cd67-88cf-19da-57b5-0dac13825982 (at xxx.xxx.xxx.xxx@tcp) reconnecting


[879498.335922] Lustre: lustre-OST0006: Connection restored to 4417cd67-88cf-19da-57b5-0dac13825982 (at xxx.xxx.xxx.xxx@tcp)

12.4 Lost Connection

Error messages with a signature of "-107" (ENOTCONN) and "-108" (ESHUTDOWN) indicate a communication problem somewhere in the Lustre stack. It can lie anywhere, but there will always be information from the Lustre Network Driver, for Infiniband: "o2iblnd". If the LND does not directly point to the problem, look for a downed node. Note that one cause can be an Out Of Memory condition causing dropped connections and reconnect cycles. It may be complex to figure out the root cause of network related problems.

12.5 RPC Debug Messages

One very common message in the log files refers to some activity taking longer than expected and the related possibility of a timeout, such as the example below. The key parts of the message, explained after the example, are the most common indicators for debugging the root cause:

Lustre: 10763:0:(service.c:1393:ptlrpc_server_handle_request()) @@@ Request x1367071000625897 took longer than estimated (888+12s); client may timeout. req@ffff880068217400 x1367071000625897/t133143988007 o101->316a078c-99d7-fda8-5d6a-e357a4eba5a9@NET_0x40000000000c7_UUID:0/0 lens 680/680 e 2 to 0 dl 1303746736 ref 1 fl Complete:/0/0 rc 301/301

To decipher the message, note the eye-catcher (@@@), and req@, the memory address of the ptlrpc_request, which is followed by the Xid and transaction number (x.../t...). o101 is an LDLM (Lustre Distributed Lock Manager) enqueue request, which is very common. o400 is indicative of an obd ping. The last part is the request/reply status, which is normally an errno, but higher values are Lustre specific. Status rc 301 refers to lock aborted.

Lustre manual has all of the codes explained. 
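
When there are many such messages, the key fields can be pulled out mechanically. A sketch under the assumption that the req@ section has the layout of the example in this section; the function name parse_req and the output format are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log lines on stdin and, for every line with a
# req@ section, prints the Xid, transaction number, opcode and status.
parse_req() {
    sed -n 's|.*req@[^ ]* x\([0-9]*\)/t\([0-9]*\) o\([0-9]*\).* rc \([0-9-]*\)/.*|xid=\1 transno=\2 opcode=\3 rc=\4|p'
}

# On a server: dmesg -T | grep 'took longer than estimated' | parse_req
```

Sorting the result by opcode or rc gives a quick overview of which kind of request is slow.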

12.6 Lustre Log Eye-catcher 

It is common to find the strings ### (LDLM_ERROR) and @@@ (DEBUG_REQ) in the logs acting as eye-catchers. When looking for error messages potentially related to the event at hand, it is useful to search for these strings.

13.0 Experiences

  • Users who create tens of thousands of very small files in hundreds or thousands of small serial jobs may cause performance issues. It is advisable to guide users to use local disk drives such as SSD for this kind of workload. Even better, use memory to handle these kinds of operations.
  • Lustre can stress the backup system significantly, which is the reason to think of storage in tiers and migrate unused data in stages instead of backing up everything.
  • Performance monitoring is important with Lustre. When the PM schedule allows, one can use dd to verify. Additionally, it is important to keep an eye on the OST usage rates and migrate when necessary.
  • Finding heavy hitting users is also beneficial, since it allows targeted education on stripe count usage etc.
  • The most common Lustre related occurrence is MDS stalling because of a lost heartbeat or other similar cause. Failover does mitigate these types of errors readily. MDS loss does not indicate data loss (see the shutdown sequence for the reason why).
  • The most common Lustre issues are related to the storage end and its performance problems, inability to handle the I/O load, or failure prone hardware. Mitigating mid level storage issues with RAID10 and other similar approaches is important (for example a RAID5 or RAID6 reconstruct can kill the performance quite badly, just like in any other use).
  • It is likely that a critical Lustre service failure does not cause clients to hang permanently, especially if the interruption is short and the load is modest (see recovery above).

13.1 Network layer

We have encountered an issue where an Infiniband card was stuck in the "Initializing" state immediately after negotiating speed with the switch. This occurred when the system was rebooted by a possible power outage or spike. Initially it looks a lot like firmware gone bad, or even a card malfunction. However in the end it was discovered that there are two different ways to initialize an IB card, and this one resolved the issue:

yum install opensm && service opensm start
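
A port whose physical link is up but whose logical state stays at Initializing is typically still waiting for a subnet manager, which is what opensm provides. A minimal sketch for spotting that condition, assuming the usual ibstat layout of a "Port N:" line followed by a "State:" line (the function name ib_port_states is hypothetical):

```shell
# ib_port_states
# Reads `ibstat` output on stdin and prints "<port number> <logical state>"
# for each port. The "Physical state:" line is deliberately ignored.
ib_port_states() {
    awk '
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
        /^[ \t]*State:/       { print port, $2 }'
}
```

Typical use would be: ibstat | ib_port_states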

14.0 Scalability and Tuning

Lustre is quite versatile, and there are a number of ways to improve both reliability and performance. Some of the features below are quite essential.

14.1 DNE

Above there was a brief discussion about building MDS redundancy by having two MDS servers share a single MDT. However, not everyone prefers redundancy at the cost of performance. DNE answers the question of how to scale up metadata performance by creating a structure not very dissimilar to OSS/OSTs, but for metadata. In very basic terms, you can create a directory on the Lustre filesystem and specify that the metadata of its contents is written to a specific MDT. That said, the structure of DNE dictates that MDT0000 is the root, and other MDTs are then attached to that MDT. For example, you could create an example directory whose metadata is written to the MDT with index number 1:

lfs mkdir -i1 /mnt/lustre/example

That could have its uses, but it would be far better to automate things in such a way that files are striped between the two MDSs and MDTs (/mnt/lustre being the Lustre client-side mount point):

lfs setdirstripe -c 2 -i 1 -H all_char /mnt/lustre/example

Now we can see that the contents of /mnt/lustre/example are striped across both MDTs:

[root@vk-oss02 dir1]# lfs getdirstripe /mnt/lustre/example
lmv_stripe_count: 2 lmv_stripe_offset: 1 lmv_hash_type: all_char
mdtidx           FID[seq:oid:ver]
       1           [0x240000400:0x4:0x0]  
       0           [0x200000401:0x4:0x0]
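
When checking many directories, the MDT indices can be pulled out of this output mechanically. A minimal sketch, assuming the table layout shown above with an "mdtidx" header row followed by one row per stripe (the function name dirstripe_mdts is hypothetical):

```shell
# dirstripe_mdts
# Reads `lfs getdirstripe` output on stdin and prints one MDT index per line,
# taken from the first column of the rows that follow the "mdtidx" header.
dirstripe_mdts() {
    awk '
        $1 == "mdtidx"               { in_table = 1; next }
        in_table && $1 ~ /^[0-9]+$/  { print $1 }'
}
```

For example: lfs getdirstripe /mnt/lustre/example | dirstripe_mdts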

If you write files to Lustre, the metadata of the files is distributed evenly across both MDTs and you gain a near-linear performance boost from both MDSs. However, now that you can write effectively to both MDTs, how could one improve the small file I/O that can bring any filesystem to its knees?

14.1.1 Removing DNE

Like all the other components, a DNE MDT can be removed in much the same way as an OST is removed from a live system. First you have to move the metadata (and possibly payload, if DoM is in use) away from the MDT that is not MDT0000. One cannot remove MDT0000 without destroying the whole filesystem. So, if you need to remove MDT0001 from a structure with MDTs 0000 and 0001, you have to move the data from MDT0001 to MDT0000 before unmounting MDT0001 and thereby stopping the service. Unmounting is not a permanent operation, however; to make the removal permanent you need to use

tunefs.lustre --writeconf <target> 

against all the targets (MGS, MDT and OST).

Note that you would also need to make sure that no new data is being added to the MDT you are about to remove.
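
Moving directories off the MDT that is to be removed can be sketched with lfs migrate -m, which migrates a directory to the MDT with the given index. The helper below is a hypothetical dry-run wrapper (the name drain_mdt_plan is an assumption for illustration). It only prints the commands so they can be reviewed before anything is run, and it assumes directory paths without whitespace.

```shell
# drain_mdt_plan TARGET_MDT_INDEX DIR...
# Prints (does not execute) one `lfs migrate -m` command per directory.
# Review the plan, then run the printed commands from a Lustre client.
drain_mdt_plan() {
    _target="$1"
    shift
    for _dir in "$@"; do
        printf 'lfs migrate -m %s %s\n' "$_target" "$_dir"
    done
}
```

For the example above, drain_mdt_plan 0 /mnt/lustre/example prints the command that would move that directory to MDT0000.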

14.2 Data on MDT - DoM

DoM is a concept where the MDT is used to store files that are very small, comparable in size to the metadata entries themselves. Storing data directly on the MDT reduces unnecessary I/O requests to the storage targets, which can be helpful if there are large amounts of small file I/O. If DoM is used in conjunction with PFL and a file grows larger than the DoM policy allows, the extent of the data continues on the OSTs as per the PFL policy. See DoM explained.

14.3 Progressive File Layouts - PFL

Oak Ridge has evaluated several methods for dynamic file striping, or progressive file layouts. Essentially this means that in regular use no user input is needed to set file stripe counts; instead the system dynamically increases the stripe count as the file size increases. Documentation of Phase 2 PFL is available, and PFL instructions are included in the Lustre version 2.12 documentation.

Additional striping can then be applied, effectively implementing PFL and DoM together:

0-1MB on DoM MDT
1-64MB on 1 stripe of 1MB stripesize
64MB-8GB on 4 OSTs at default stripesize
8GB-EOF across all OSTs at 4MB stripesize

lfs setstripe -E 1M -L mdt -E 64M -c1 -S 1M -E 8G -c4 -E -1 -c -1 -S 4M /mnt/lustre/example/

If we look at the directory stripe structure, we can see that files smaller than 1MB are written to the MDTs (based on the MDT striping above) and that once a file grows larger, it is progressively written to more and more OSTs. Minimum file size for MDT file content 

oss02]# lfs getstripe .
  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   4
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0       stripe_size:   1048576       pattern:       mdt       stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   67108864
      stripe_count:  1       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
 
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 67108864
    lcme_extent.e_end:   8589934592
      stripe_count:  4       stripe_size:   1048576       pattern:       raid0       stripe_offset: -1
 
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 8589934592
    lcme_extent.e_end:   EOF
      stripe_count:  -1       stripe_size:   4194304       pattern:       raid0       stripe_offset: -1
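
To make the extents above concrete, the following sketch maps a file size to the component that holds the end of the file. The boundaries are copied from the lcme_extent values in the output, and each e_end is treated as exclusive. The function name pfl_component and the labels it prints are illustrative only. Note that a large file still keeps its first 1 MiB on the MDT; the function reports only where the last byte lands.

```shell
# pfl_component SIZE_BYTES
# Prints the layout component that contains the last byte of a file of the
# given size, using the extent boundaries from the getstripe output above.
pfl_component() {
    _size="$1"
    if   [ "$_size" -le 1048576 ];    then echo "mdt (DoM)"
    elif [ "$_size" -le 67108864 ];   then echo "1 OST, 1 MiB stripes"
    elif [ "$_size" -le 8589934592 ]; then echo "4 OSTs"
    else                                   echo "all OSTs, 4 MiB stripes"
    fi
}
```

For example, pfl_component 4096 reports the MDT component, while pfl_component 10737418240 (10 GiB) reports the final component that spans all OSTs.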

14.4 Policy and FS management - Robin Hood

Once you are satisfied that your Lustre operates as desired, it is a good idea to keep an eye on the filesystem and perhaps create policies to control the usage. The answer for this is Robin Hood, which could live on the same client node that is used to share Samba and NFS out to the clients. The database itself, however, may have to reside on an outside NFS server due to size constraints.

14.5 Lustre Performance Tuning

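MDS nodes should be set up without a kernel I/O scheduler when using SSDs, since the scheduler only slows things down. The default is likely noop, but check: the active scheduler is the one shown in square brackets, and in the example below it is deadline rather than noop: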

[root@mds01 ~]# cat /sys/block/sdb/queue/scheduler 
noop [deadline] cfq

OSS nodes should have the I/O scheduler parameter set to "deadline":

echo deadline > /sys/block/<blk_dev>/queue/scheduler
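
The kernel marks the scheduler currently in use with square brackets, so it can be extracted for scripted checks across many block devices. A minimal sketch (the function name active_scheduler is hypothetical):

```shell
# active_scheduler
# Reads the contents of /sys/block/<dev>/queue/scheduler on stdin and prints
# the active scheduler, i.e. the name enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}
```

For example: active_scheduler < /sys/block/sdb/queue/scheduler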

Additional filesystem metadata performance tuning is required:

tune2fs -O dirdata /dev/vg00/<volume>

Optimized locking (lockahead) can be considered for performance tuning and, done properly, could give a comfortable performance edge. See the CUG evaluation paper for details.

Note that using DoM for I/O optimization may provide up to four times the performance for small file I/O.

Resiliency Tuning

Some more conservative tuning can be done as follows to optimise resiliency:

Setting params (-P sets permanent values and must be executed on the MGS):

lctl set_param -P at_min=40     # default 0
lctl set_param -P at_max=400    # default 600
lctl set_param -P timeout=60    # default 100

Setting LNet params:

lnetctl set transaction_timeout 50    # default 10
lnetctl set retry_count 2             # default 3

Robinhood tuning

The Robinhood client does not benefit from the xattr cache:

lctl set_param llite.*.xattr_cache=0

14.6 OST Pools

OST pools provide a powerful tool for managing filesystem space by virtualising Lustre storage. Possibilities include, but are not limited to, remote OSTs.

Creating one global pool at the start helps to avoid a situation where a newly added OSS/OST is automatically included in the existing filesystem.
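
A global pool of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the filesystem name, pool name, OST names and mount point are supplied by the caller and are examples only, and the helper pool_plan is hypothetical. It prints the lctl pool_new, lctl pool_add and lfs setstripe -p commands for review rather than running them. The pool commands are meant for the MGS node, and the setstripe command for a client.

```shell
# pool_plan FSNAME POOL MOUNTPOINT OST...
# Prints (does not execute) the commands that create POOL, add each listed
# OST to it, and make the pool the default layout at MOUNTPOINT.
pool_plan() {
    _fs="$1"; _pool="$2"; _mnt="$3"
    shift 3
    printf 'lctl pool_new %s.%s\n' "$_fs" "$_pool"
    for _ost in "$@"; do
        printf 'lctl pool_add %s.%s %s\n' "$_fs" "$_pool" "$_ost"
    done
    printf 'lfs setstripe -p %s %s\n' "$_pool" "$_mnt"
}
```

With the default layout pointing at the pool, an OST added later should stay out of regular allocations until it is added to the pool explicitly.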


15.0 Failure Isolation

Correct root cause analysis is not always trivial with Lustre. Thankfully, we do know the most likely causes. In the event of a non-responsive Lustre filesystem, consider the following steps. In most cases there is something dramatically wrong with the network or storage hardware, so excessive diagnostics and wizardry are not required. Do not rush, and do not attempt to restore the system before you have a good inkling of the cause of the outage.

  • Check whether the filesystem is available and that all OSSs and OSTs are responsive. Not all error messages point you in the right direction.
  • Check that the components (each OSS, disk shelves, etc.) are powered on and accessible.
  • Check that the filesystems and devices are present (a power outage, for example, may drop the SAS devices offline, in which case your OSS would not see the devices).
  • Use dmesg -T heavily.
  • Check that the Lustre modules are loaded in the right places.
  • Check that you can both ping and lctl ping between the storage servers.
  • Check that NIDs are present on the servers.
  • Attempt to shut down the remaining services gracefully by unmounting them: first the metadata, then the remaining storage targets.
  • Do not touch the clients at this stage; see recovery above. Most likely you cannot unmount them, and even if you could, you would compromise recovery.
  • Attempt to bring back the services, starting with the metadata and then each OST.
  • If you are able to start the services, there is a good chance that your clients will recover once the connection is restored, provided they are left untouched.
  • If you cannot mount a client, make sure that the proper LNet is being served by the MGS.
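
The lctl ping step in the list above can be scripted so that every server NID is checked in one pass. A minimal sketch, assuming lctl is on the PATH of the node where it runs; the function name check_nids and the NIDs in the usage line are placeholders:

```shell
# check_nids NID...
# Runs `lctl ping` against each NID and prints OK or FAIL per NID.
# Returns non-zero if at least one NID did not answer.
check_nids() {
    _failed=0
    for _nid in "$@"; do
        if lctl ping "$_nid" >/dev/null 2>&1; then
            echo "OK   $_nid"
        else
            echo "FAIL $_nid"
            _failed=1
        fi
    done
    return "$_failed"
}
```

For example: check_nids 10.0.0.1@o2ib 10.0.0.2@o2ib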

16.0 Checking Filesystem Integrity

Lustre lfsck has been rewritten. It can be run concurrently on a production system and could be included in a cron job that runs periodically.

To query the running status of lfsck:

lctl lfsck_query -M lustre-MDT0000

An example to start lfsck:

lctl lfsck_start -M lustre-MDT0000 -A -t all -r

17.0 ldlm issues and DNE Bug

One of the troubles with version 2.13 on ZFS has been binaries segfaulting on Lustre. There seems to be no simple way out except to wait for 2.14. By default the lock LRU maximum age (lru_max_age) is set to 65 minutes, expressed in milliseconds, and we have found that if the value is increased substantially, the segfaults become less frequent.

lctl set_param ldlm.namespaces.*.lru_max_age=3888000000

However, lru_size must be cleared before that:

lctl set_param ldlm.namespaces.*mdt*.lru_size=clear
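
Since lru_max_age is given in milliseconds, a quick conversion helps to sanity-check a value before applying it. A minimal sketch using plain shell arithmetic (the function names are illustrative):

```shell
# ms_to_minutes MILLISECONDS / ms_to_days MILLISECONDS
# Integer conversions for reading lru_max_age values.
ms_to_minutes() { echo $(( $1 / 60000 )); }       # 60 s * 1000 ms
ms_to_days()    { echo $(( $1 / 86400000 )); }    # 24 h * 3600 s * 1000 ms
```

The 65 minute default corresponds to 3900000 ms, and the value 3888000000 used above works out to 45 days.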

DNE-related issues can happen, and if users rely on rename(), it is pretty tough to disable remote_rename once it has been enabled for a long time. Leaving it enabled, however, seems to open up an entire can of worms, so there may be a need to disable it. To see the value:

lctl get_param mdt.*.enable_remote_rename

To set the value:

lctl set_param mdt.*.enable_remote_rename=0

We have also noticed that if one of the MDTs has remote_rename on and another does not (this is why it is a better idea to use the -P flag and run the command on the MGS node), the directories get into an interesting state where every old directory becomes owned by user 99 (nobody). Once touched, the directory ownerships are restored recursively.
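
A mismatch of this kind can be detected by comparing the values reported for each MDT. A minimal sketch, assuming the "name=value" lines that lctl get_param prints; the function name params_consistent is hypothetical. Because lctl get_param only reports the MDTs local to the node it runs on, the output of each MDS would need to be collected first.

```shell
# params_consistent
# Reads "name=value" lines on stdin. Exit status is 0 when every line
# carries the same value (or there is no input), and 1 when values differ.
params_consistent() {
    awk -F= '
        { seen[$2]++ }
        END {
            n = 0
            for (v in seen) n++
            exit (n > 1)
        }'
}
```

On a single MDS hosting several MDTs this could be run as: lctl get_param mdt.*.enable_remote_rename | params_consistent || echo "values differ"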


Stay tuned, to be continued periodically when new information is needed....


Additional References

CUG Debug paper from 2011 by Cory Spitz and Anne Koehler

Lustre manual

Performance analysis and optimisation

Lustre Benchmarking

SAS Device Multipathing

Version 1.1