...
Actually indicated a fatal bug in the Linux kernel InfiniBand layer (RDMA), not in Lustre itself. In this kind of situation it is important to dig deeper and not assume that Lustre or LNet is the root cause.
4.1 Changelog reader
You can enable a Lustre changelog reader. Note that once a reader is registered, changelogs keep accumulating even if the service consuming them has been stopped. You have to register the changelog reader on all metadata servers. Note also that a reader ID exists only once: for example, you cannot create reader cl1, remove it, and then recreate cl1:
lctl --device lustre-MDT0000 changelog_register
Disable it with the following command. Deregistering the changelog reader clears the stored logs. You have to deregister the changelog reader on all metadata servers.
lctl --device lustre-MDT0000 changelog_deregister cl1
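Because registration and deregistration have to be repeated for every metadata target, a small helper can generate the commands. The following is only a sketch under stated assumptions: the file system is named lustre, the MDT indices are contiguous, and target names follow the <fsname>-MDTxxxx pattern with a four-digit hexadecimal index. The function names mdt_names and print_changelog_cmds are hypothetical. The script only prints the commands (a dry run); each one still has to be run on the server hosting that MDT.

```shell
#!/bin/sh
# Print MDT target names: <fsname>-MDT0000, <fsname>-MDT0001, ...
# Assumption: contiguous indices, formatted as four hex digits.
mdt_names() {
    fsname=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s-MDT%04x\n' "$fsname" "$i"
        i=$((i + 1))
    done
}

# Print (do not execute) one lctl command per MDT.
# Usage: print_changelog_cmds <fsname> <mdt_count> <lctl subcommand and arguments>
print_changelog_cmds() {
    fsname=$1
    count=$2
    shift 2
    action=$*
    mdt_names "$fsname" "$count" | while read -r mdt; do
        printf 'lctl --device %s %s\n' "$mdt" "$action"
    done
}

# Dry run for a file system with two MDTs.
print_changelog_cmds lustre 2 changelog_register
```

The same helper can print the deregister commands, for example print_changelog_cmds lustre 2 changelog_deregister cl1. Reader IDs are assigned per MDT, so check that the ID really is the same on every target before running them.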
List the current readers:
lctl get_param mdd.<lustre>-MDT0000.changelog_users
A clear indication that accumulated changelogs are starting to cause problems can be found in dmesg; rc = -28 is ENOSPC (no space left on device):
LustreError: 5144:0:(mdd_dir.c:1061:mdd_changelog_ns_store()) lustre-MDD0001: cannot store changelog record: type = 1, name = 'sh-thd-1849266489385', t = [0x24000faf3:0x1d:0x0], p = [0x240000400:0xf:0x0]: rc = -28
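To watch for this condition, the kernel log can be filtered for the message. The following is a minimal sketch that assumes the log lines have the same layout as the sample above. The function name changelog_store_errors is hypothetical. It reads log text on standard input and prints each affected MDD device once.

```shell
#!/bin/sh
# Read kernel log text on stdin and print the device name from every
# "cannot store changelog record" error that ends with rc = -28 (ENOSPC).
# Assumption: the device name sits between ") " and ": cannot store ...",
# as in the sample message.
changelog_store_errors() {
    sed -n '/cannot store changelog record.*rc = -28[[:space:]]*$/s/.*) \([^ :]*\): cannot store changelog record.*/\1/p' |
        sort -u
}
```

It could be fed from the kernel log with dmesg | changelog_store_errors. Any output means that at least one MDT could not store changelog records for lack of space.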
4.1.2 Changelog reader and orphan objects
...
Lustre lfsck has been rewritten. It can run concurrently on a production system and could be included in a cron job that runs periodically.
...