Oct 14th Session Questions

  • This is a late answer regarding the VSCode question. I primarily use VSCode. I have been using it through Remote-SSH to monitor jobs on the cluster and to run small notebooks for visualization after results are complete. What I haven’t spent time getting to work is spinning up an interactive job and then connecting VSCode Remote to that particular node rather than to the login node. I would be willing to try and test out any good practices.
    • Thanks for the heads up. This is on our high-priority list. We’ll keep the VSCode discussion alive here.

Module command not found in interactive Vorna

  • Hey, I got the problem bash: module: command not found again in interactive sessions, only on Vorna. On Turso03 and Ukko2 it works fine. The following is in my .bashrc, as we discussed last time:
# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
  • I forgot how to continue diagnosing this. Something about “lmod”.
    • You are right. There is a bug in the image. Will look into it asap.
      • This should be fixed now. Thanks for the bug report.
  • Also, I accidentally removed the bashrc text above from my .bashrc at some point and had to dig into the HPC logs to find it. Should it be in the regular Wiki?
    • Will add this to FAQ asap.

Feedback for HackMD

I’ve also noticed that I sometimes follow issues here on HackMD, but due to workload peaks I may not be able to be here every day, causing me to miss some of the discussion.
I have collected some issues below:

  • Hard to follow threads: logs are cleaned.
  • Hard to follow authorship: the only visual cues are plain bullets.
  • Hard to follow the Q&A sequence: same as above.
  • Not scalable: what if we ever see more than a couple of users here? The problems above would worsen.
  • No thread response notifications.
  • I think the ideal solution is to have a proper forum system, such as an instance of https://github.com/discourse/discourse. But this requires deploying yet another thing, unless this is not a problem.

  • A more pragmatic idea could be to follow Aalto and have a gitlab “repo” that is used for discussion (maybe see https://scicomp.aalto.fi/triton/help/). When I mentioned this in one of the HPC garages it was understood as having a bug tracker, which doesn’t really work since most user issues cannot be directly translated into bug reports. I mean to use it as a general issue tracker, whether for enhancements or for the issues that are currently raised through HackMD.

  • Of course, there can be other solutions to explore; these are only a few ideas. But any of the proposals above would solve the issues listed at the beginning of this post.

    • Thanks for the input, we will take a look. Yes, you are right about a couple of points and I agree.
    • The original intent was to have HackMD as an alternative channel to the Zoom sessions. I also wanted to filter the HackMD questions and answers into the FAQ - and while that has been done, there are unfortunate delays.
    • Additionally, the ticket system is very cumbersome for providing answers to cases that affect many people (in the case of an unexpected event, the same question is often asked by many people). It is far more efficient (and presumably more user friendly) to provide updates on ongoing progress here than, say, in e-mails to over 800 users.
    • Unfortunately the actual bug reporting system we use is not open to all users, and it is yet another parallel system.

Jobs on Vorna that get killed by the OOM-Killer leave nodes in the “drain” state.

  • I’ve run a couple of large-ish Vlasiator runs on too few nodes, resulting in them getting OOM-killed. Afterwards, sinfo shows the nodes remaining in the “drain” state indefinitely (until, I suppose, they are reset by manual intervention).

Oct 18th Session Questions

  • MPI_Bcast error on Vorna:
[vorna-457:09797] *** An error occurred in MPI_Bcast
[vorna-457:09797] *** reported by process [1289251673,9]
[vorna-457:09797] *** on communicator MPI_COMM_WORLD
[vorna-457:09797] *** MPI_ERR_OTHER: known error not in list
[vorna-457:09797] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[vorna-457:09797] ***    and potentially your MPI job)
  • This is the first time I’ve seen this. It first showed the error on vorna-456. Then, if I exclude that node, the same error shows up on the subsequent vorna-457. As I didn’t see any other error messages, I would assume this is not a bug in the code but in the system?
    • I have put the nodes down with label “Fatal bcast error”.
    • Core file dumped even when the job finished successfully
  • I have a new Vlasiator executable compiled with GCC 10. Even after the 30-node job finishes its execution successfully, there are still core files as well as error messages:
[vorna-514:8797 :0:8797]       ud_ep.c:253  Fatal: UD endpoint 0x312d700 to <no debug data>: unhandled timeout error
BFD: DWARF error: could not find variable specification at offset 9038e
  • Could this be a cryptic message about an undefined reference to __imp_function (instead of just function)?

    • Removed a bunch of debug messages below.
  • However, I did not see this on a 1-node test.

    • Will start working on this on Oct 19th. Could this be related to the memory/swap cgroup enforcement that was turned on at the end of last week…
    • You have used srun instead of mpirun, right?
    • Yes, this is with srun
      • And with the srun option --mpi=pmix_v3 (this enables UCX as the OpenMPI 4.x transport)?
      • I tried with pmix_v3, but it reported that it was not found?
        • Wow. This is news. Which node, or all of them?

Oct 20th Session Questions

  • So I will continue here with the errors that appear even after a successful run. (Maybe we should use something like Discourse to better track the same issues in a single thread, and for future reference?)
    • Setup:
module purge
module load GCC/10.2.0
module load OpenMPI/4.0.5-GCC-10.2.0

srun --mpi=pmix --cpu-bind=none ./executable
  1. Vlasiator test 1: 1 node, 16 cores, ok
  2. Vlasiator test 2: 4 nodes, 16 cores (4 per node). It finished ok, but with core files dumped and error messages:
[vorna-516:20449:0:20449] rc_verbs_iface.c:63   send completion with error: transport retry counter exceeded qpn 0x21fe wrid 0x12 vendor_err 0x81
BFD: DWARF error: could not find variable specification at offset 9038e
BFD: DWARF error: could not find variable specification at offset 9039c
BFD: DWARF error: could not find variable specification at offset 903aa
BFD: DWARF error: could not find variable specification at offset 8ebfd
BFD: DWARF error: could not find variable specification at offset 9038e
BFD: DWARF error: could not find variable specification at offset 9039c
BFD: DWARF error: could not find variable specification at offset 903aa
BFD: DWARF error: could not find variable specification at offset 8ebfd
==== backtrace (tid:  20449) ====
 0 0x0000000000020943 ucs_debug_print_backtrace()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/debug/debug.c:656
 1 0x000000000002826a uct_rc_verbs_handle_failure()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/uct/ib/rc/verbs/rc_verbs_iface.c:63
 2 0x0000000000028eac uct_rc_verbs_iface_poll_tx()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/uct/ib/rc/verbs/rc_verbs_iface.c:104
 3 0x0000000000028eac uct_rc_verbs_iface_progress()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/uct/ib/rc/verbs/rc_verbs_iface.c:138
 4 0x0000000000023aca ucs_callbackq_dispatch()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucs/datastruct/callbackq.h:211
 5 0x0000000000023aca uct_worker_progress()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/uct/api/uct.h:2346
 6 0x0000000000023aca ucp_worker_progress()  /home/smaisala/ukko2-easybuild/build/UCX/1.9.0/GCCcore-10.2.0/ucx-1.9.0/src/ucp/core/ucp_worker.c:2040
 7 0x00000000000019e4 opal_common_ucx_wait_all_requests()  common_ucx.c:0
 8 0x0000000000001f91 opal_common_ucx_del_procs_nofence()  ???:0
 9 0x0000000000001fb9 opal_common_ucx_del_procs()  ???:0
10 0x0000000000003af9 mca_pml_ucx_del_procs()  ???:0
11 0x00000000000555b2 ompi_mpi_finalize()  ???:0
12 0x0000000000436958 main()  /home/hongyang/proj/vlasiator_newBC/vlasiator.cpp:1054
13 0x0000000000022555 __libc_start_main()  ???:0
14 0x000000000043aa0c _start()  ???:0
=================================
[vorna-516:20449] *** Process received signal ***
[vorna-516:20449] Signal: Aborted (6)
[vorna-516:20449] Signal code:  (-6)
[vorna-516:20449] [ 0] /lib64/libc.so.6(+0x36400)[0x7ffb15819400]
[vorna-516:20449] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ffb15819387]
[vorna-516:20449] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ffb1581aa78]
[vorna-516:20449] [ 3] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7ffb12276da5]
[vorna-516:20449] [ 4] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/libucs.so.0(+0x227c8)[0x7ffb1227a7c8]
[vorna-516:20449] [ 5] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/libucs.so.0(ucs_log_dispatch+0xd4)[0x7ffb1227a8f4]
[vorna-516:20449] [ 6] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/ucx/libuct_ib.so.0(+0x2826a)[0x7ffb116ee26a]
[vorna-516:20449] [ 7] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/ucx/libuct_ib.so.0(+0x28eac)[0x7ffb116eeeac]
[vorna-516:20449] [ 8] /appl/opt/UCX/1.9.0-GCCcore-10.2.0/lib/libucp.so.0(ucp_worker_progress+0x5a)[0x7ffb122f1aca]
[vorna-516:20449] [ 9] /appl/opt/OpenMPI/4.0.5-GCC-10.2.0/lib/libmca_common_ucx.so.40(+0x19e4)[0x7ffb123319e4]
[vorna-516:20449] [10] /appl/opt/OpenMPI/4.0.5-GCC-10.2.0/lib/libmca_common_ucx.so.40(opal_common_ucx_del_procs_nofence+0x161)[0x7ffb12331f91]
[vorna-516:20449] [11] /appl/opt/OpenMPI/4.0.5-GCC-10.2.0/lib/libmca_common_ucx.so.40(opal_common_ucx_del_procs+0x9)[0x7ffb12331fb9]
[vorna-516:20449] [12] /appl/opt/OpenMPI/4.0.5-GCC-10.2.0/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_del_procs+0x89)[0x7ffb12338af9]
[vorna-516:20449] [13] /appl/opt/OpenMPI/4.0.5-GCC-10.2.0/lib/libmpi.so.40(ompi_mpi_finalize+0x582)[0x7ffb160ee5b2]
[vorna-516:20449] [14] /wrk/users/hongyang/debug/magnetosphere/../vlasiator_20211015[0x436958]
[vorna-516:20449] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffb15805555]
[vorna-516:20449] [16] /wrk/users/hongyang/debug/magnetosphere/../vlasiator_20211015[0x43aa0c]
[vorna-516:20449] *** End of error message ***
srun: error: vorna-516: task 1: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=69507469.0
slurmstepd: error: *** STEP 69507469.0 ON vorna-516 CANCELLED AT 2021-10-20T10:33:03 ***

This indeed looks similar to the OpenUCX issue, but the backtrace into Vlasiator only points to main(), so I have little idea what to look for next.

  • We now know more. There is an issue: pmix_v3 is missing. More details follow.
  • Vorna and Ukko2 require the option --mpi=pmix_v2
    • Because the old nodes lack IB support in RHEL 8.x, we cannot run RHEL 8.x on them. This creates a split where RHEL 8.x nodes have, and must use, pmix_v3, while CentOS 7.x nodes must use pmix_v2 in order to enable proper functionality of UCX for OpenMPI 4.x.x.
    • We are trying to find out if RedHat will at some point include support for FDR IB NICs, but so far to no avail.
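As a minimal sketch (the executable name is a placeholder), the practical difference inside a job script is:
# CentOS 7.x nodes (Vorna, Ukko2): OpenMPI 4.x over UCX via pmix_v2
srun --mpi=pmix_v2 ./executable
# RHEL 8.x nodes: pmix_v3 instead
srun --mpi=pmix_v3 ./executable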

Oct 25th Session Questions

  • I’m getting /proj/jyrilaht/opt/bin/plumed: /lib64/libm.so.6: version `GLIBC_2.27' not found (required by /proj/jyrilaht/opt/lib/libplumedKernel.so) on Ukko2. The program that uses the plumedKernel library runs fine on the login node, but submitting it to the cluster gives this error. Advice?
    • Yes, note that the login nodes run RHEL 8.4, while the Vorna and Ukko2 nodes are running CentOS 7.x. There is a distinct difference in the glibc versions.
    • Hence, if you compile for Vorna or Ukko2, you have to compile on the compute nodes, not on the login nodes.
    • This is caused by RedHat, who dropped support for the FDR InfiniBand fabric. We are fully aware of the added complications and are trying to find a solution, but so far to no avail.
      • Alright, thank you! I’ll try this method.
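A rough sketch of that workflow (the module name and build command are placeholders): open an interactive shell on a compute node and build there, so that the binary links against the compute nodes’ CentOS 7.x glibc.
srun --interactive -M ukko2 -c 4 --mem=8G -t 02:00:00 -p short --pty bash
# then, inside the compute-node shell:
module load GCC
make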

Oct 27th Session Questions

The Ukko cluster has been added to production, initially with some new GPU nodes. Specify -M ukko to use it.
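For example (job.sh is a placeholder script name):
# submit to the new Ukko cluster and check its queue
sbatch -M ukko job.sh
squeue -M ukko -u $USER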

We thank you for the feedback about HackMD and GitHub. It is very likely that we will move to GitHub, since the Garage has grown beyond its initial phase.

More about this development as soon as we have the path and timetable set. Until then.

Nov 1st Session Questions

  • Hi, I’m posting this here in case it will help someone else in the future.
    Someone in my research group had issues running programs on the cluster that were compiled on the login node using libtorch, due to the glibc version difference on the compute nodes (see the Oct 25th 2021 session).
    So we ended up needing to build libtorch from scratch on a compute node so that it could then be used to compile code that would run. After some trials and mostly errors, I managed to compile the most recent version of libtorch on a node using the following batch file:
#!/bin/bash
#SBATCH -M ukko2
#SBATCH -t 1-0
#SBATCH --job-name=libtorch_build
#SBATCH -o %x.txt
#SBATCH -e %x.txt
#SBATCH -c 24
#SBATCH --mem=2G


## Batch Script for building the latest libtorch from the pytorch repository

# requires CMake and GCC
module load CMake
module load GCC
# Instructions from https://github.com/pytorch/pytorch/blob/master/docs/libtorch.rst#building-libtorch-using-cmake
# modified based on https://discuss.pytorch.org/t/errors-when-building-static-linked-libtorch/117186
#   for parallel building and turning off python while building

git clone -b master --recurse-submodule https://github.com/pytorch/pytorch.git
mkdir pytorch-build

cd pytorch-build
cmake -DBUILD_SHARED_LIBS:BOOL=ON -DUSE_CUDA:BOOL=OFF -DCMAKE_BUILD_TYPE:STRING=Release -DCMAKE_INSTALL_PREFIX:PATH=../pytorch-install ../pytorch -DCMAKE_CXX_FLAGS:STRING=-fPIC -DBUILD_PYTHON:BOOL=OFF
cmake --build . --target install --parallel 24

Note, this was specifically for a CPU-only version of libtorch.
This script builds the library and creates a pytorch-install folder that can be used for linking.
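As a rough, untested sketch of using the result (the install path is a placeholder; libtorch is picked up via CMake’s find_package(Torch), as in the upstream libtorch examples):
module load CMake GCC
# point CMake at the pytorch-install prefix created by the batch script above
cmake -DCMAKE_PREFIX_PATH=/proj/$USER/pytorch-install -S . -B build
cmake --build build --parallel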

  • On a side note: could something like this be done (in a more professional manner) and made into a module that can then be loaded by people who need it?
    • Thanks for the info. We’ll see what can be done.

Nov 3rd Session Questions

  • Hi, it looks like there is an issue with interactive jobs on Ukko2:
srun: job 136508790 queued and waiting for resources
srun: job 136508790 has been allocated resources
srun: error: Task launch for StepId=136508790.interactive failed on node ukko2-01: Unspecified error
srun: error: Application launch failed: Unspecified error
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
  • Thanks for the notice. We noticed the issue; it was related to a configuration change that was intended to be pushed at a later time.
  • Secondly, there were certain dead jobs. All clear now, we hope.

Nov 12th Session Questions

  • Weird output while compiling in interactive session:

    g++: fatal error: Killed signal terminated program cc1plus
    compilation terminated.
    This was happening due to the job running out of memory after all.
    
    • You can adjust the memory request by adding the --mem= option to the srun line. The default values are quite small. For example: --mem=20G
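      For example, a sketch of an interactive compile session with more memory (the cluster, partition and numbers are arbitrary):
      srun --interactive -M ukko2 -c 4 --mem=20G -t 01:00:00 -p short --pty bash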

Nov 16th Session Questions

  • Is it intentional that one can ssh to Ukko3 nodes even without having a job there?
    • No, this is a bug. Thanks for reporting it. We will fix it asap. Meanwhile, we strongly advise against abusing it.

Oct 11th Session Questions

Comment

  • Slurm jobs got stuck in the “COMPLETING” state
CLUSTER: vorna
JOBID         USER               NAME       START_TIME         END_TIME  TIME_LEFT NODES CPUS MIN_MEMORY   PRIORITY STATE     REASON
69399126      USER1         sensei              N/A 2021-10-04T16:32    2:00:00     1 1 32G        395 COMPLETING       None
69398715      USER2   visit.user.1 2021-10-04T16:06 2021-10-04T16:12      14:08     1 480 512M       3183 COMPLETING NonZeroExi
69391091      USER3  2Dtimevarying 2021-10-02T20:40 2021-10-02T20:42   23:28:13     1 960 55G        240 COMPLETING       None
69385518      USER2   visit.user.2 2021-09-28T23:08 2021-09-28T23:10      18:40     1 160 512M       3182 COMPLETING       None
  • Investigating. I have also noticed jobs getting stuck in the completing state. This, however, applies only to certain jobs, not all.
    • All nodes had to be rebooted. We have to change the scheduling configuration management to stateless mode, since it seems that certain concurrent changes may cause node fallout. More info will follow asap.

There will be a short disturbance on Vakka; we have to do several switches of services between failover pairs. 111021T1342 - Interference over.

We have discovered fixes for the issues of ldlm lock timeouts and application ldlm mmap segfaults. The fix requires us to switch Lustre from 2.13 to either 2.14 (Alma & RHEL 8.4) or 2.12.7 LTS (CentOS 7.x).
We can immediately fix the ldlm lock timeouts, but this comes with a potential cost of application segfaults if binaries are executed from /wrk.

Wish to contribute? Write your opinion below if you are for or against. Do you use Lustre for binaries at this time?


If you are using, or would like to use, VSCode on the cluster as an interactive development tool, raise your hand. We are looking for best practices.

  • Personally I don’t use it often, but when I need to deal with Jupyter notebooks, I’d rather use VSCode than the native Jupyter interface (due to good notebook support + vim emulation). With this I mean I upvote the proposal, but won’t be able to give much feedback.
    • You would still be willing to test any good practices we come up with?
      • Count me in if it’s not a huge workload and it supports VSCodium (the binaries without MS telemetry). I wouldn’t want to install the proprietary version.

Oct 13th Session Questions

  • Feedback for interactive mode: due to bugs, system changes or other issues, I find that the way to access interactive mode is very unreliable. I understand the good intentions behind the “interactive” wrapper, but unless I’m mistaken, is there much point to it vs. instructing users to set an alias? I mean, an alias would be linked to the Slurm commands directly, with very low expected breakage.
    E.g. in my case I always have alias int=‘interactive’ or int=‘srun …’. At the end of the day I always type “int” to access interactive mode, so I’m wondering if the added complexity layers of the wrapper are justified.
    E.g. the wiki could instruct: printf 'alias interactive="srun ..."\n' >> $HOME/.bashrc

    • If interactive is still lingering someplace, let me know the hostname. It should not be there, since salloc and srun no longer work the way they used to. The current mechanism (from Slurm) is to use srun --interactive <resources> --pty bash.
      • srun works :). But see above for the feedback. It’s been changing a lot, from “interactive” to “srun”.
    • I understand your point. However, earlier, to have a proper interactive session it wasn’t just a matter of srun: you had to first create an allocation with salloc, then start the session inside the salloc reservation, and that was much too complex for 90% of users. This is why we wrapped salloc & srun into “interactive”. Now, after the Slurm upgrade, the entire salloc issue is no longer there, and reserving with srun --interactive and claiming with srun is far simpler.
    • This said, most people actually never used salloc to start the reservation and then claimed it with srun, but used srun directly instead - which actually did not place the processes well. :) It was way too complex.
    • I think I did write the reasoning once someplace, but that was a long time ago and probably most users missed it anyways. :)
    • This said, yes, one can create aliases, but that is quite generic Linux and I am not sure it belongs in the user guide. Perhaps we can add a note in the FAQ.
      • Thanks for the detailed reply. My note on setting aliases was because I thought the whole point of having the interactive wrapper was to provide a system-level, alias-like mechanism for requesting an interactive session. In that line of thinking, it duplicated the functionality of native Unix aliases, but was prone to break. Now I get that, before the current srun method, it really did wrap more functionality. Although the end goal still seems to be to enhance the user experience.
    • Yes, we did get some feedback about the complexities, and it was intended to cure that then-current issue.
    • On that topic, did you know that you can reserve a number of resources but use different sruns to claim them within the same interactive shell? :D E.g. you can claim less than you reserved with one srun, and then claim the other part with a second one (there is a small sketch after this thread). Don’t know if anyone wants to use that, but it’s there now. :)
      • I think my use case could make use of that in a way, so that the endless stream of tmux panes doesn’t reserve more resources than needed. But still, my interactive resource needs are so tiny that I wonder if there’s a practical point – for me. Still, cool functionality.
    • Well, I just came across it accidentally myself. One can always experiment if there is time and resources. Anyway, it now behaves more like a batch job, which could make debugging batch jobs easier.
    • Oh, before I forget: if srun now suffices to successfully get into interactive mode and claim the resources, and having the wrapper is convenient for other users in some way, one intermediate solution could be to still allow running the srun command even when the interactive wrapper exists. That way I can be happy with my alias and whoever wants to can run the wrapper :)
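    A minimal, untested sketch of the “claim the reservation in parts” idea above (program names are placeholders):
    # reserve 4 tasks interactively...
    srun --interactive -n 4 --mem=4G -M vorna -p short --pty bash
    # ...then, inside that shell, claim the reservation in parts as separate job steps
    srun -n 2 ./tool_a &
    srun -n 2 ./tool_b &
    wait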
  • Is the Zoom link somewhere? Login to Kale fails (due to the emergency?) but login to Turso also fails (connection closed by …, port 22). Now that I cannot access the clusters I don’t get the link from the login screen.

    • Turso does work; there are no known issues there. Kale has issues.
      $ ssh turso.cs.helsinki.fi
      Last login: Wed Oct 13 09:47:56 2021 from 128.214.227.210

      • Ok, so then the Turso problem must be related to my account. It still worked a few weeks/a month ago, so I wonder what has changed. I already checked the groups I belong to and they should still be the same.
        • This got solved: my Turso access depended on belonging to a group that didn’t exist anymore.
  • I’ve been periodically struggling with jobs getting stuck in the “launch failed requeued held” status

    • Has this happened in the past 7 days? If not, then the reason was interference from the old Slurm master node running in parallel in the same environment, concurrently with the new one. It has been shut down now.
    • I did start the jobs periodically.
      • The most recent struggle was this morning. Eventually I managed to get my job through by excluding nodes one by one in case one of them was faulty. “exclude=ukko2-21” finally made the job run (on node ukko2-18).
        • Do you have jobid?
        • 136466502
          • I’ll check this as soon as I can.
          • I think this may have to do with a bug in Slurm related to stateful configuration. On 20.x versions the default mechanism has changed to stateless configuration, which we are yet to implement. This is our next target (it will require node reboots).

Oct 1st Session Questions

Comment

  • Did anything change with srun on Vorna? When I tried to submit an MPI job with
srun --mpi=pmix executable

it returned the error message

srun: error: task 272 launch failed: Invalid MPI plugin name
...
slurmstepd: error:  mpi/pmix_v2: pmixp_p2p_send: vorna-517 [16]: pmixp_utils.c:471: send failed, rc=2, exceeded the retry limit

The same job script used to work earlier this week. (Maybe my memory is wrong, but it was definitely ok last week at some point.)

  • No, not this week. What does srun --mpi=list give you? Which module do you use? Sorry, I cannot replicate this:
turso03:/wrk/foo/ior/src$ /usr/bin/srun --interactive -n 4 -Mvorna --pty bash
@vorna-547:/wrk/foo/ior$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmix_v2
srun: pmi2
srun: pmix

@vorna-547:/wrk/foo/ior/src$ srun -n4 --mpi=pmix ./ior
IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Fri Oct  1 15:26:38 2021
Command line        : /wrk/foo/ior/src/./ior
Machine             : Linux vorna-547
TestID              : 0
StartTime           : Fri Oct  1 15:26:38 2021
Path                : testFile
FS                  : 1241.3 TiB   Used FS: 65.5%   Inodes: 448.7 Mi   Used Inodes: 25.0%
  • Are you sure you actually used it on the login node? I doubt that it has ever worked there… but I can of course be wrong.
    • No, I just showed the srun command I use within a job script, not from the login node.
    • I tried the same job script but using mpirun. That one worked, so I may just stay with it temporarily.
  • I just realized that! Sorry! I need either coffee or to stop working for today :D
  • mpirun will not work predictably…
  • Is there a specific node that returns the error? I tried several times over, but cannot make it fail this way… it is possible that there is something odd someplace, of course.
    • Hmm… I saw error messages from multiple nodes, so maybe this is not specific to one node.
    • I may try again during the weekend with srun. Will let you know more about this next week.
  • Just tried a multi-node ior run, no faults. OpenMPI 3.x, pmix, and pmix_v3. If I do this on the login node, it gives the expected error:
turso03:/wrk/users/juhaheli$ srun --mpi=pmix hostname

srun: error:  mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
  • Are there nodes such as 518 or 503 involved? Hmm, 503 seems not to behave… HA! I think I found it.
    • I see neither 518 nor 503 in my error report, so maybe there are more nodes affected? Not sure.
 turso03:/wrk/foo$ cat slurm-69386874.out
SLURM_NTASKS_PER_NODE=3
SLURM_NTASKS=9
srun: error: Couldn't find the specified plugin name for mpi/pmix looking at all files
srun: error: cannot find mpi plugin for mpi/pmix
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
  • Removed 503 from the system for now. It was the only one that, when present, caused the error.
  • Let me know if this resolves the issue.

The Garage will go on a weekend break. Have a good weekend! You can still write here, but we will answer next week.

Oct 4th Session Questions

  • srun --mpi=pmix is still not working:
srun: error: task 272 launch failed: Invalid MPI plugin name
srun: error: task 273 launch failed: Invalid MPI plugin name
srun: error: task 274 launch failed: Invalid MPI plugin name
srun: error: task 275 launch failed: Invalid MPI plugin name
srun: error: task 276 launch failed: Invalid MPI plugin name
srun: error: task 277 launch failed: Invalid MPI plugin name
srun: error: task 278 launch failed: Invalid MPI plugin name
srun: error: task 279 launch failed: Invalid MPI plugin name
srun: error: task 280 launch failed: Invalid MPI plugin name
srun: error: task 281 launch failed: Invalid MPI plugin name
srun: error: task 282 launch failed: Invalid MPI plugin name
srun: error: task 283 launch failed: Invalid MPI plugin name
srun: error: task 284 launch failed: Invalid MPI plugin name
srun: error: task 285 launch failed: Invalid MPI plugin name
srun: error: task 286 launch failed: Invalid MPI plugin name
srun: error: task 287 launch failed: Invalid MPI plugin name
slurmstepd: error:  mpi/pmix_v2: pmixp_p2p_send: vorna-517 [16]: pmixp_utils.c:471: send failed, rc=2, exceeded the retry limit
slurmstepd: error:  mpi/pmix_v2: _slurm_send: vorna-517 [16]: pmixp_server.c:1578: Cannot send message to /var/spool/slurmd/stepd.slurm.pmix.69391091.0, size = 3850, hostlist:
(null)
slurmstepd: error: *** STEP 69391091.0 ON vorna-417 CANCELLED AT 2021-10-02T20:42:04 ***

I tried again on Sunday night, and it worked this time. My guess is that there are still certain nodes that are not set up properly.

  • Certain Vorna nodes report an impossible error when running with mpirun:
Sat Oct  2 20:47:08 EEST 2021
Filtering is off and max number of Passes is = 	 0
Input file sw.dat is empty!
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 274 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[vorna-424:39089] 15 more processes have sent help message help-mpi-api.txt / mpi-abort

where I am sure sw.dat is not empty. (This is an internal error check in Vlasiator.) Although this looks like a different error, I believe it is caused by the same problem as in the srun case.

  • Vorna-517 seems to be okay. You wouldn’t happen to have a list of suspects? It may be that we have to reboot the whole system.

  • I found several possible nodes and rebooted 132 of those that were free right now. -> Rebooted.

  • Node hanging
    I (ykempf) have another job which left a node hanging in the COMPLETING state: 69395710 test visit.yk ykempf COMPLETI 21:24 20:00 1 vorna-506
    These are multi-node VisIt script jobs. They do what they’re supposed to do, but despite my efforts so far, the compute engine on the compute nodes doesn’t die. It times out after about 5 min, then that triggers an abort/segfault in the VisIt code and the job ends. Out of the over 40 jobs today, only this one is hanging. My current VisIt install was built and installed on Sep 8. Have there been changes/updates to SLURM/MPI/etc. since? Should I rebuild?

  • The currently running Slurm version is 20.11.7. I think that the majority of the UCX issues were resolved around that time. You could rebuild, since the UCX component alone changed a whole lot of things. If you can, opt for OpenMPI 4.x.

    • Well, this is with GCC 10.2 and OpenMPI 4.0.5. When was the last time you updated Slurm/UCX?
  • The most recent reboot to a fresh Vorna image was today, and affected: vorna-[302-307,309-360,403-405,408,411,414,417-418,420-421,423-424,427,430,433,436-437,439-440,443,446,449-455,504-517,519-531,534,537,538-541,547-548,552-555,557-560]

  • There are a fair number of nodes that were booted 14 days ago. A third grouping, vorna-[442,445,434,435], has an age of 25 days.

  • I shut down the old side services; they may have caused issues with the wrong protocol.

  • OK thanks, I’ll attempt a rebuild in the next days then. (Yann)

    • I have also noticed recently that under some unknown condition this version of Slurm pushes jobs into the “hold” state. This is probably not related, but there is definitely something odd.
  • Slurm queuing rules

    • Are there any rules for the queuing system on Vorna?
  • Well, it depends on the definition of “rules”. There are quite a few things that regulate the scheduling mechanism and would inherently constitute “rules”. That said, there are not many “limits”.

    • If none, is it possible to add some rules?
  • I am not a great fan of rules (read: “limits”). What did you have in mind? We can always review and change opinions. :)

Oct 5th Session Questions

  • Yann et al. here, there are some (a few hundred) extra jobs on Carrington. :-)

  • Ouch!

    • Okay, I get it, let’s create a temporary fix. Can you send me the allowed username list by mail and I will create a manual account block?
    • This is fixed. No more unwanted usage.
  • /proj seems to be very slow or hanging; I have both an rm and an sshfs that have been unresponsive for several minutes. Edit: now the rm returned, but the sshfs is still in progress. But that might also be an issue with going through the proxyjump hoops.

    • We’ll have a look. Thanks for the heads up!
  • Sure, let me know if it’s again me doing things in a silly way. (Yann)

    • I think there may be some processes trying to use /proj as a working directory for heavy small-file IO… We’ve seen this occasionally. Naturally that is an issue. If only I could invent a way to emphasize that it is not a good idea in so many ways (including job performance).
  • Oh no… Can you make /proj read-only when seen from a queued job? Ah, but that would already fail when I try to compile on a compute node.

    • No, we did seriously think about that, but it would not work. Instead we are trying to think the other way around, e.g. have Lustre serve /proj.
    • Can you use /wrk for compiling? Or does the hang cause fatal hangs on the whole interactive session?
  • I am compiling several things (VisIt, Vlasiator) now without issues on /proj. Just sshfs refuses to cooperate, but maybe it’s not a /proj issue. I can confirm that sshfs also refuses to mount /wrk, so I suspect it’s something else on the way to Turso. I switched the proxyjump from pangolin to melkinkari, still not going through.

    • Not sure if this is some generic networking issue; some VDI machines have also been reported to be sluggish.
    • Garage checks out for the night. Back tomorrow.

Oct 6th Session Questions

  • Hi, I’m trying to run a small job (-M ukko2,vorna -t 00:30:00 -p short -n 1 -N 1 -c 2 --mem=256M --array=1-14) and it is pending. I used sinfo -l and it shows a lot of idle nodes on Vorna, so what am I not understanding here? I thought that idle means they are available for use. I read more about sinfo and Slurm states, but maybe there is something I’m not getting. Could you explain a bit how to interpret these states?
    • Can I have your jobid please? I’ll take a look.
    • Here is one: 136458743
    • I also just submitted one more job (more demanding, but still not bad: -t 01:00:00 -c 14 --mem-per-cpu=512M --array=1-14) and its ID is 136458745
      • Great, you have discovered another bug. :) It seems, and is replicable as well, that if you specify (as it used to be) -M ukko2,vorna, then only ukko2 appears to be considered. Wondering if this is actually intended… I think I can change those for you. Shall we try (worst case is that you have to resubmit)?
      • Thanks! Well, for testing purposes, maybe it is better if I scancel -u $USER and then submit again but with -M vorna,ukko2? This way the job should run on Vorna and we get our answer.
    • Certainly.
      • Ok, I will do it now and write you ASAP. Also, I certainly intended it to submit to whichever was available first, so someday fixing that bug would be nice. But I know you all have a lot of work already, so at least it’s good to know about the bug ;-)
    • I’ll file a bug, but I’m not sure the fix is within our reach; I have to check with SchedMD whether this is the intended mode of behaviour now.
      • That did not seem to fix things; both new jobs are pending. The IDs are 136458749 and 136458751. The SBATCH had: -M vorna,ukko2 -t 01:00:00 -p short,bigmem -n 1 -c 14 --mem-per-cpu=512M --array=1-14… ok, maybe bigmem doesn’t exist on Vorna… let me try again. But the other job only says -p short… and it is also pending!
    • Yes, it does not… But nevertheless, if short,bigmem is given, it should choose either.
    • I can replicate the behaviour with a simple interactive session. E.g. for now you’d have to explicitly specify the cluster, -M vorna, and -p short (a sketch is at the end of this section).
      • OK! trying now
    • This is an annoying “feature”. It sort of defeats the purpose of federation. I’ll file a bug now.
      • Haha, indeed. Everything is running on Vorna now! Thanks for your help.
        • Critical Bug#IT4SCI-1147 filed for this issue. Thanks for reporting it.
  • Hi, vorna-512 and 546 are still hanging after some of my VisIt runs. They can be cleaned up when you have time. I built a new VisIt for Vorna yesterday; I’ll report next time I run things whether I still cause such node hangs on exit. Job IDs 69398715 and 69385518, respectively.
    • Nodes vorna-512 and 546 set to drain. Added to swapbook. Will replace.
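Regarding the -M ukko2,vorna federation issue above: until Bug#IT4SCI-1147 is resolved, an explicit header along these lines (a sketch mirroring the job above) should work:
#SBATCH -M vorna
#SBATCH -p short
#SBATCH -t 01:00:00
#SBATCH -n 1
#SBATCH -c 14
#SBATCH --mem-per-cpu=512M
#SBATCH --array=1-14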

29 Sep 2021 Session Questions

Comment

You might benefit from this material that was published some time ago.

Issue using pip in virtualenv
  • Hey, I got the following output when running pip install inside a Python virtualenv. The same error occurs for other packages (not only seaborn):
(python365) carlosto@turso03:~$ pip install seaborn
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
Could not fetch URL https://pypi.org/simple/seaborn/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/seaborn/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)) - skipping
ERROR: Could not find a version that satisfies the requirement seaborn (from versions: none)
ERROR: No matching distribution found for seaborn
WARNING: You are using pip version 19.3.1; however, version 21.2.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(python365) carlosto@turso03:~$ python --version
Python 3.6.8
(python365) carlosto@turso03:~$ pip --version
pip 19.3.1 from /proj/carlosto/python365/lib/python3.6/site-packages/pip (python 3.6)
  • However, doing the equivalent with the Anaconda3 module is successful:
(python365) carlosto@turso03:~$ deactivate
carlosto@turso03:~$ module load Anaconda3
carlosto@turso03:~$ pip install --user seaborn
Requirement already satisfied: seaborn in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (0.10.0)
Requirement already satisfied: pandas>=0.22.0 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.0.1)
Requirement already satisfied: numpy>=1.13.3 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.18.1)
Requirement already satisfied: scipy>=1.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.4.1)
Requirement already satisfied: matplotlib>=2.1.2 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (3.1.3)
Requirement already satisfied: pytz>=2017.2 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)
Requirement already satisfied: cycler>=0.10 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)
Requirement already satisfied: six>=1.5 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn) (1.14.0)
Requirement already satisfied: setuptools in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (45.2.0.post20200210)
  • Load a Python module (and check for the SSL module). Note that you should not use the FGCI modules.
  • See the bottom of the page about Python/SSL.
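A quick way to check whether a loaded Python module actually has working SSL (a sketch; the module version is just one that appears elsewhere on this page):
module purge
module load Python/3.8.6-GCCcore-10.2.0
# if this prints an OpenSSL version string, pip over HTTPS should work
python -c "import ssl; print(ssl.OPENSSL_VERSION)"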


  • I ran another array job in MATLAB after a late chat with Juha, in which he found that HOME was getting full, so we made a symbolic link in WRK USER (a dir called matlab-profile) to put all of these MATLAB java log files and other stuff there. It appears to work quite well, so that was likely the cause of the parallel pool failure. BUT, out of 32 jobs, there are still 2 that failed last night due to MATLAB being unable to open a parallel pool. Critically, for those two jobs, it says that my local job cluster file (which is now correctly pointing to the WRK folder) was corrupt, even though we deleted it all and created a new one by launching a fresh MATLAB from the login node yesterday. The error says: Error using parallel.Cluster/createConcurrentJob (line 1145) Unable to write to MAT-file /wrk/users/$USER/matlab-profile/local_cluster_jobs/R2019a/Job6.in.mat. The file may be corrupt.
    • Don’t copy it here for now. Have you checked the file that Matlab points to? I am wondering if it is possible that /wrk/users/$USER/matlab-profile/local_cluster_jobs/R2019a/Job6.in.mat is actually subject to concurrency of sorts… or a race condition as in: https://www.mathworks.com/matlabcentral/answers/545174-how-can-i-work-around-a-race-condition-on-a-parallel-computing-job-storage-location
      • Thanks I’ll read this link
        • I can look. By the way, we can do this now or I can just come to the Garage at 9:45. What do you prefer? Whatever is most convenient for you.
      • The Garage will be fine too. Based on the reading, this is not tied to the job being an array job, but that probably does make it more likely. This race condition could theoretically occur (by chance) every time you submit more than one concurrent Matlab, depending on timing. I think Google is a good help here for checking how to avoid the Matlab parallel-run race condition. There is a patch for this issue; please review that it does not do anything bad before using it. :) Will close Bug #IT4SCI-1132.
        • Yep. This solves the problem and appears to me to not do anything “bad”
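For future reference, one common way to sidestep this kind of race (not necessarily what the patch above does) is to give each job or array task its own job storage directory; a rough, untested sketch for a batch script (the script name is a placeholder):
# hypothetical per-task MATLAB job storage, to avoid concurrent writes to the same Job*.mat files
JOBDIR=/wrk/users/$USER/matlab-profile/job_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID:-0}
mkdir -p "$JOBDIR"
matlab -batch "c = parcluster('local'); c.JobStorageLocation = '$JOBDIR'; parpool(c); run('my_script.m')"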

  • Hi, this is me again, who learned a lot yesterday about what should and shouldn’t be done on Lustre. I modified the input folder structure of my jobs that kept getting stuck, but this issue still remains: quite often the Slurm status of the job will go from regular PENDING to PENDING with nodelist(reason) “(launch failed requeued held)”

    • I’ve been wondering about that. It is not related to Lustre, but has something to do with the job submission…
    • It would be helpful to see the exact job script (we can look at this in the Garage if you have time today).
      • Today is hard for garaging again, but my script isn’t long. I can copy it here if that’s ok?
  • That’ll be good.

    • Here goes
      #SBATCH -J shrt_mid_111_1.0_10
      #SBATCH -o data.out
      #SBATCH -e data.err
      #SBATCH -p long
      #SBATCH -N 1
      #SBATCH -n 1
      #SBATCH -c 1
      #SBATCH -t 14-00:00:00
      #SBATCH --mail-type=ALL
      #SBATCH --mail-user=<censored>
      
      RANDOM=24710
      seed1=$RANDOM
      seed2=$RANDOM
      seed3=$RANDOM
      
      source /proj/jyrilaht/modules_for_femocs_lammps.sh
      /proj/jyrilaht/lammps_femocs/src/lmp_serial -v seed1 $seed1 -v seed2 $seed2 -v seed3 $seed3 -in in.LAMMPS
      
    • Hopping off train now, will check this page in 25 min
      • Straight away: there is nothing wrong with your script as such. I will have to take a look at the system logs (I have manually released your jobs, btw) for any hints as to why the system decides that your jobs are to be held.
      • Ah. Try adding the option -M ukko2,vorna (or just either of the two subclusters if you have a preference, e.g. -Mvorna or -Mukko2). It seems that somehow a federated JobID provokes a hold. Not entirely sure yet why that happens.
        • I’m always submitting these with sbatch -M ukko2 submit.sh. (vorna doesn’t have a 14-day queue/partition)
        • And the behaviour is a little unpredictable. Usually I’ve canceled jobs like these and resubmitted, and they have sometimes started running normally… Maybe I’ll wait a bit now before spamming again, while the investigation is underway. These are the last ~15 jobs that I’d like to get running, so there is no tremendous hurry.
      • They start mostly because I release them with this: for i in $(squeue -Mukko2 -O JobID,ReasonList |grep launch | uniq |sort -n|awk '{print $1}'); do scontrol release $i; done
        • Oh so cancelling =/= releasing?
      • They are held by system, but now I really wonder why this occurs… For a moment I thought that there is some kind of user-based limiter, but that is not the case.
      • Actually, it states: Reason=launch_failed_requeued_held, which would indicate that the job has been requeued…
        • Hmm yeah, but they seem to be stuck in this state indefinitely… And when I launch some other jobs in the same partition, they usually start immediately, so it’s not like the partition is crowded.
      • I check this as soon as possible. Have to check something else for now…
  • I’m also having the same issue with Python virtual environments. It seems like the SSL or OpenSSL libraries are missing after some recent updates?

(virtual_python3.8.2) hongyang@turso03:~/proj/virtual_python3.8.2$ pip list
Package    Version
---------- -------
pip        20.0.2 
setuptools 46.1.3 
wheel      0.34.2 
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.

Or, if I just follow my previous workflow for loading the virtual environments without module load OpenSSL/1.1.1e-GCCcore-9.3.0:

hongyang@turso03:~/proj$ virtualenv virtual_python3.8.2/
Traceback (most recent call last):
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/bin/virtualenv", line 5, in <module>
    from virtualenv.__main__ import run_with_catch
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/__init__.py", line 3, in <module>
    from .run import cli_run, session_via_cli
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/run/__init__.py", line 9, in <module>
    from ..seed.wheels.periodic_update import manual_upgrade
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/seed/wheels/__init__.py", line 3, in <module>
    from .acquire import get_wheel, pip_wheel_env_run
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/seed/wheels/acquire.py", line 13, in <module>
    from .bundle import from_bundle
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/seed/wheels/bundle.py", line 6, in <module>
    from .periodic_update import periodic_update
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/site-packages/virtualenv/seed/wheels/periodic_update.py", line 10, in <module>
    import ssl
  File "/appl/opt/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/ssl.py", line 98, in <module>
    import _ssl             # if we can't import it, let the error propagate
ImportError: libssl.so.10: cannot open shared object file: No such file or directory
  • In case this helps debugging either or both issues, these are the modules I’m loading with source /proj/jyrilaht/modules_for_femocs_lammps.sh

    • OpenMPI/2.1.2-GCC-6.4.0-2.28
    • CMake/3.10.2-GCCcore-6.4.0
  • Still looking for the reason why the system sends jobs into hold. There is a pretty good chance that this is related to some obscure configuration bug (probably related to GRES - the Slurm Generic Resource setup).

Case Python Virtualenv & SSL (pip & virtualenv)

First need to determine:

  • Are these dysfunctional Pythons from the FGCI repository? If yes, note that many of those modules have internal conflicts and other issues. Many problems originate from the use of the old repository.

  • Also note that you do not want to use system Python. Always use modules.

  • Do you experience the issues if you explicitly load OpenSSL? module load OpenSSL/1.1.1e-GCCcore-9.3.0

  • Unfortunately, the problem remains after loading a module for Python and loading the OpenSSL module:

carlosto@turso03:~$ module purge
carlosto@turso03:~$ module load Python/3.6.6-foss-2018b
carlosto@turso03:~$ source python365/bin/activate
(python365) carlosto@turso03:~$ module load OpenSSL/1.1.1e-GCCcore-9.3.0

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/9.3.0     2) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-9.3.0

(python365) carlosto@turso03:~$ pip install seaborn
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)': /simple/seaborn/
Could not fetch URL https://pypi.org/simple/seaborn/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/seaborn/ (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.",)) - skipping
ERROR: Could not find a version that satisfies the requirement seaborn (from versions: none)
ERROR: No matching distribution found for seaborn
WARNING: You are using pip version 19.3.1; however, version 21.2.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(python365) carlosto@turso03:~$
  • This starts to look like an issue of RHEL 8.4 vs. CentOS 7 and Python module compatibility. Ukko2 and Vorna nodes are still running CentOS 7, and hence this should work there (for example in an interactive session).

  • Connecting and querying system Python for Vorna

carlosto@vorna-502:~$ alias int
alias int='srun --interactive -n4 --mem=4G -Mvorna --time=05:00:00 -pshort --pty bash'
carlosto@turso03:~/preprocessing$ int
srun: job 69386708 queued and waiting for resources
srun: job 69386708 has been allocated resources
carlosto@vorna-502:~$ source python365/bin/activate
(python365) carlosto@vorna-502:~$ python --version
python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
(python365) carlosto@vorna-502:~$ python
python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
(python365) carlosto@vorna-502:~$ pip --version
/carlosto/python365/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
  • Loading Python module on Vorna
carlosto@vorna-502:~/preprocessing$ module load Python/3.6.6-foss-2018b
(python365) carlosto@vorna-502:~$ python --version
python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
(python365) carlosto@vorna-502:~$ python
python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
(python365) carlosto@vorna-502:~$ pip --version
/carlosto/python365/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
(python365) carlosto@vorna-502:~$ module load Python/3.6.6-foss-2018b
(python365) carlosto@vorna-502:~$ python --version
Python 3.6.6
(python365) carlosto@vorna-502:~$ pip --version
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
pip 18.0 from /appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip (python 3.6)
(python365) carlosto@vorna-502:~$ pip install seaborn
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
Collecting seaborn
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f1612ea88d0>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/seaborn/
^COperation cancelled by user
^CTraceback (most recent call last):
  File "/appl/opt/Python/3.6.6-foss-2018b/bin/pip", line 11, in <module>
    load_entry_point('pip==18.0', 'console_scripts', 'pip')()
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/__init__.py", line 310, in main
    return command.main(cmd_args)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/basecommand.py", line 183, in main
    pip_version_check(session, options)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/utils/outdated.py", line 113, in pip_version_check
    all_candidates = finder.find_all_candidates("pip")
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/index.py", line 452, in find_all_candidates
    for page in self._get_pages(url_locations, project_name):
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/index.py", line 597, in _get_pages
    page = self._get_page(location)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/index.py", line 715, in _get_page
    return HTMLPage.get_page(link, session=self.session)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/index.py", line 824, in get_page
    "Cache-Control": "max-age=600",
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/requests/sessions.py", line 525, in get
    return self.request('GET', url, **kwargs)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_internal/download.py", line 396, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/cachecontrol/adapter.py", line 53, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/requests/adapters.py", line 445, in send
    timeout=timeout
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/connection.py", line 314, in connect
    conn = self._new_conn()
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip/_vendor/urllib3/util/connection.py", line 69, in create_connection
    sock.connect(sa)
KeyboardInterrupt
  • I interrupted it because it was taking too long, only for it to keep erroring. Now loading the OpenSSL module on Vorna as well:
(python365) carlosto@vorna-502:~$ module load OpenSSL/1.1.1e-GCCcore-9.3.0

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/9.3.0     2) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-9.3.0

(python365) carlosto@vorna-502:~$ pip install seaborn
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
Collecting seaborn
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7ff6929c16d8>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/seaborn/
^COperation cancelled by user
(python365) carlosto@vorna-502:~$
  • Repeating on Ukko2. System Python:
(python365) carlosto@vorna-502:~$ sed -i "s/vorna/ukko2/g" ~/.bashrc
(python365) carlosto@vorna-502:~$ exit
srun: error: vorna-502: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=69386708.interactive
carlosto@turso03:~/preprocessing$ source ~/.bashrc
carlosto@turso03:~/preprocessing$ alias int
alias int='srun --interactive -n4 --mem=4G -Mukko2 --time=05:00:00 -pshort --pty bash'
carlosto@turso03:~/preprocessing$ int
srun: job 136456935 queued and waiting for resources
srun: job 136456935 has been allocated resources
carlosto@ukko2-21:~/preprocessing$ source ../python365/bin/activate
  • Python module on Ukko2:
(python365) carlosto@ukko2-21:~/preprocessing$ module load Python/3.6.6-foss-2018b
(python365) carlosto@ukko2-21:~/preprocessing$ python --version
Python 3.6.6
(python365) carlosto@ukko2-21:~/preprocessing$ pip --version
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
pip 18.0 from /appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site-packages/pip-18.0-py3.6.egg/pip (python 3.6)
(python365) carlosto@ukko2-21:~/preprocessing$ pip install seaborn
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
Collecting seaborn
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fb6b7e2b908>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/seaborn/
^COperation cancelled by user
  • OpenSSL on Ukko2:
(python365) carlosto@ukko2-21:~/preprocessing$ module load OpenSSL/1.1.1e-GCCcore-9.3.0

The following have been reloaded with a version change:
  1) GCCcore/7.3.0 => GCCcore/9.3.0     2) zlib/1.2.11-GCCcore-7.3.0 => zlib/1.2.11-GCCcore-9.3.0

(python365) carlosto@ukko2-21:~/preprocessing$ pip install seaborn
Error processing line 1 of /home/carlosto/.local/lib/python3.6/site-packages/matplotlib-3.1.2-py3.6-nspkg.pth:

  Traceback (most recent call last):
    File "/appl/opt/Python/3.6.6-foss-2018b/lib/python3.6/site.py", line 168, in addpackage
      exec(line)
    File "<string>", line 1, in <module>
    File "<frozen importlib._bootstrap>", line 568, in module_from_spec
  AttributeError: 'NoneType' object has no attribute 'loader'

Remainder of file ignored
Collecting seaborn
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f49f6561668>: Failed to establish a new connection: [Errno 101] Network is unreachable',)': /simple/seaborn/
^COperation cancelled by user
  • Is there any way to quickly solve this issue, even temporarily? I really need to pip install autoreject, and I'm running out of time.
    • This should work on compute nodes in interactive sessions. The issue is related to the RHEL 8.4 version we are running on the login node.
      • Please read the logs I provided. They contain some extensive testing on Vorna and Ukko2 interactive sessions, replicating the error there.

You may need to set proxies if not set:

export https_proxy=http://www-cache.cs.helsinki.fi:3128
export http_proxy=http://www-cache.cs.helsinki.fi:3128
export HTTP_PROXY=http://www-cache.cs.helsinki.fi:3128
export HTTPS_PROXY=http://www-cache.cs.helsinki.fi:3128
  • Additionally, if you explicitly log in to turso01, which has a CentOS 7 base image, this works.
    • Will try setting the proxies and/or logging in to turso01 and report back soon.
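  • A quick check from inside the interactive session before retrying pip (a sketch; assumes curl is available on the node and uses the cache host given above):
export https_proxy=http://www-cache.cs.helsinki.fi:3128
export http_proxy=http://www-cache.cs.helsinki.fi:3128
# should print an HTTP status line (e.g. "HTTP/1.1 200 OK") if the proxy route works
curl -sI https://pypi.org/simple/ | head -n1
# then retry the install inside the virtualenv
pip install seaborn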

Sep 30th Session Questions

  • By the way, the first thing they recommend here is to stop MATLAB from writing logs to $HOME ;-) https://scicomp.aalto.fi/triton/apps/matlab/

    • Well, well, this is great. :)
  • Hi, is there any way I can get ZSH and Oh My ZSH on the cluster? I read a few things on how to do this without sudo access and wanted to cross-check whether there is some existing guide for the clusters.

    • If you have a way to do it without privileges, go ahead. However, there is not enough demand to warrant a global install at this time.
    • Thank you :)
  • I would also very much like to have ZSH, even as the default. If you spend so much time in the terminal, these small things really add up. I'm quite happy to have it available on Triton, and even more so as the default there. Because of such things (ZSH, small general utilities like FZF that the admins were happy to add…) I generally find Triton friendlier to use than our cluster. Please take this as kind feedback, I really appreciate our cluster as well :)

    • I would presume that the login node alone would not be sufficient. Since Slurm inherits your environment (variables), and pretty much everything else, from the login session, use of zsh may be a bit more complex than expected.
    • I'll create a feature request. Bug IT4SCI-1139 filed for the ZSH request.
    • Thank you! :)
      • The login node now has the zsh package installed for test purposes. It is not tested and no guarantees are given. Compute nodes don't have zsh, since that is a much bigger operation. A minimal sketch of launching zsh on the login node only is shown below.
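  • A minimal sketch of dropping into zsh on the login node without changing the login shell (an assumption about paths and behaviour, not a supported setup; compute nodes keep bash):
# Sketch for the end of ~/.bashrc: start zsh only in interactive shells on the
# login node, assuming the test package installed it at /usr/bin/zsh.
if [[ $- == *i* ]] && [[ $(hostname) == turso* ]] && [ -x /usr/bin/zsh ]; then
    exec /usr/bin/zsh -l
fi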




The MATLAB code which used to run doesn't run now at all. It throws an error when trying to open the parallel pool. The errors give slightly different reasons, but they center on something in my profile related to the MATLAB installation being corrupt. One even says to contact MathWorks. I'm going to remove the parallel pool and let this run as a batch job, and I hope it finishes without taking too much time. I hope we can figure out what is happening, since a big reason for us to use the HPC is obviously to run a parallel pool in MATLAB. Really strange error. Let me know if I can do anything to help you figure it out. If you need to reach me, please email or call to catch me, since I might not keep an eye here. Here are the error reports for jobs submitted individually (but rapidly one-by-one, so they were all running at the same time). Of note, one of the errors says something quite interesting: "The job storage metadata file '/home/$USER/.matlab/local_cluster_jobs/R2019a/matlab_metadata.mat' is corrupt. For assistance recovering job data, contact MathWorks Support Team. Otherwise, delete all files in the JobStorageLocation and try again."

Comment

JOB 1
-rw-r--r-- 1 $USER grp-$USER-lab 725 Sep 28 16:45 err-136455687.txt
Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using save
Unable to write file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job239.in.mat: No such
file or directory.

JOB 2
-rw-r--r-- 1 $USER grp-$USER-lab 774 Sep 28 16:45 err-136455686.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Cluster/createConcurrentJob (line 1145)
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job239.in.mat.
The file may be corrupt.

JOB 3
-rw-r--r-- 1 $USER grp-$USER-lab 774 Sep 28 16:45 err-136455685.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Cluster/createConcurrentJob (line 1145)
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job239.in.mat.
The file may be corrupt.

JOB 4
-rw-r--r-- 1 $USER grp-$USER-lab 774 Sep 28 16:45 err-136455684.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Cluster/createConcurrentJob (line 1145)
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job239.in.mat.
The file may be corrupt.

JOB 5
-rw-r--r-- 1 $USER grp-$USER-lab 964 Sep 28 16:46 err-136455694.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using
parallel.internal.cluster.FileStorage/addConstructorToMetadata (line
613)
The job storage metadata file
‘/home/$USER/.matlab/local_cluster_jobs/R2019a/matlab_metadata.mat’ is
corrupt. For assistance recovering job data, contact MathWorks Support
Team. Otherwise, delete all files in the JobStorageLocation and try
again.

JOB 6
-rw-r--r-- 1 $USER grp-$USER-lab 730 Sep 28 16:46 err-136455698.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using save
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job240.in.mat.
The file may be corrupt.

JOB 7
-rw-r--r-- 1 $USER grp-$USER-lab 688 Sep 28 16:46 err-136455696.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using save
Can not write file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job240.in.mat.

JOB 8
-rw-r--r-- 1 $USER grp-$USER-lab 774 Sep 28 16:46 err-136455697.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Cluster/createConcurrentJob (line 1145)
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job240.in.mat.
The file may be corrupt.

JOB 9
-rw-r--r-- 1 $USER grp-$USER-lab 707 Sep 28 16:46 err-136455700.txt

Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Job/submit (line 346)
The operation “submit” can only be performed on jobs in state
“pending”.

  • I think that something went awry with /home/$USER/.matlab. First guess: you filled up the quota of $HOME. Second guess: there is something else wrong.

    • How can I check the quota of HOME? Can you please send me the exact command I should run? Thanks!
  • quota

  • That said, you could move /home/$USER/.matlab to, for example, /wrk/users/$USER/matlab.rescue-28.09.2021 and then initialize MATLAB again (I'd guess it does this on your first run if you do not have .matlab in your home directory).

    • Do you mean that, from the turso login, I should run this line: mv /wrk/users/$USER/matlab.rescue-28.09.2021 /home/$USER/.matlab
  • Other way around, first source, then destination:
    mv /home/$USER/.matlab /wrk/users/$USER/matlab.rescue-28.09.2021

  • The second matter is that you might want to symlink (ln is the command to do it; see the man page for exact syntax) .matlab from your $HOME to either /proj/$USER/suitable-directory-here or /wrk/users/$USER/suitable-directory-here, since, as said, the home quota is rather small and cache files, for example, may fill it up quickly.

    • OK. I will read about ln now and may then post question here
    • This (/proj/$USER/suitable-directory-here) should be where --chdir= is pointing for that sbatch?
  • If you decide to create a symlink for .matlab and point it somewhere else, it just means that $HOME will have a link to the other location, which then contains all the actual data. This is a way to prevent the .matlab directory contents from exceeding the quota, if that is the root cause.

    • OK. I just ran this line: mv /home/$USER/.matlab /wrk/users/$USER/matlab.rescue-28.09.2021
    • Should I immediately try an array job again, or a single job with a parallel pool?
    • Or should I now do ln /wrk/users/$USER/matlab.rescue-28.09.2021 /home/$USER/.matlab
    • Or is it that I should cp /home/$USER/.matlab /proj/group/grp-$USER-lab/ and then do ln /home/$USER/.matlab /proj/group/grp-$USER-lab/
  • Wait!

    • OK!
  • You can create an empty directory which will be the destination of the .matlab contents. Name it so that it tells you what it is, for example mkdir /home/$USER/matlab-profile. Then make a symbolic link to this directory in your home directory; I think it is something like (lower-case -s) ln -s /home/$USER/matlab-profile .matlab

    • OK, so I am doing this from within the dir /home/$USER/?
  • Yes, you get to your home directory by typing just cd

  • ln example:

ln -s  /wrk/users/juhaheli/foobarbar/ foobarbar
lrwxrwxrwx    1 juhaheli hyad-all     30 Sep 28 19:50 foobarbar -> /wrk/users/juhaheli/foobarbar/
  • I think you got the idea? :)

    • yep. trying now…
  • I wasn't sure if it should be with a slash, but I thought so based on your example, so I ran this code:
    $USER@turso03:~$ mkdir /home/$USER/matlab-profile
    $USER@turso03:~$ ln -s /home/$USER/matlab-profile .matlab
    $USER@turso03:~$
    $USER@turso03:~$ ln -s /home/$USER/matlab-profile/ .matlab

  • Now, in your home directory, if you write ls -ltrad .matlab:
    $USER@turso03:~$ ls -ltrad .matlab
    lrwxrwxrwx 1 $USER hyad-all 26 Sep 28 19:53 .matlab -> /home/$USER/matlab-profile

  • Hehe… you should have created the directory as /proj/$USER/matlab-profile or /wrk/users/$USER/matlab-profile instead of /home/$USER/matlab-profile. Aside from that little tweak, you have the idea. :) You can remove a symbolic link just with rm.

  • OK. So mkdir /proj/group/grp-$USER-lab/matlab-profile ?

  • Not the group one. No, your own. This is your own .matlab profile.

  • Ah, I forgot that there is /proj/$USER/; I'll repeat ln with that then. Do I need to remove the link I made above?

  • Yes, remove the link. And the unnecessary directory (/home/$USER/matlab-profile).

    • Can I just do rm -R /home/$USER/matlab-profile? Or do I need to do something with the ln command?
  • just these: rm .matlab and rmdir /home/$USER/matlab-profile.

    • So, this is rm .matlab from within the HOME dir? Or within /home/$USER/matlab-profile?
  • Just your homedir.

  • Wow, is it normal for it to blink at me in red on this last line?!?!?
    $USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ mkdir /wrk/users/$USER/matlab-profile
    $USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ cd
    $USER@turso03:~$ rm .matlab
    $USER@turso03:~$ rmdir /home/$USER/matlab-profile
    rmdir: failed to remove ‘/home/$USER/matlab-profile’: Directory not empty
    $USER@turso03:~$ rmdir -R /home/$USER/matlab-profile
    rmdir: invalid option -- ‘R’
    Try ‘rmdir --help’ for more information.
    $USER@turso03:~$ rmdir -R /home/$USER/matlab-profile/
    rmdir: invalid option -- ‘R’
    Try ‘rmdir --help’ for more information.
    $USER@turso03:~$ rmdir /home/$USER/matlab-profile/
    rmdir: failed to remove ‘/home/$USER/matlab-profile/’: Directory not empty
    $USER@turso03:~$ rm -R /home/$USER/matlab-profile/
    $USER@turso03:~$ ln -s /wrk/uers/$USER/matlab-profile/ .matlab
    $USER@turso03:~$ ls -ltrad .matlab
    lrwxrwxrwx 1 $USER hyad-all 31 Sep 28 20:04 .matlab -> /wrk/uers/$USER/matlab-profile/

  • Come to garage.

  • ok
    You might benefit from this material that was published some time ago.
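  • Putting the thread above together, the intended sequence was roughly the following (a sketch; the paths are the examples used in the discussion, adjust to your own):
# 1) move the possibly corrupt profile out of the way
mv /home/$USER/.matlab /wrk/users/$USER/matlab.rescue-28.09.2021
# 2) create the new profile directory on WRK (or under /proj/$USER), not in $HOME
mkdir -p /wrk/users/$USER/matlab-profile
# 3) from your home directory, symlink .matlab to it
cd
ln -s /wrk/users/$USER/matlab-profile .matlab
# 4) verify that the link points where you expect
ls -ltrad .matlab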

Issue using pip in virtualenv
  • Hey, I got the following output when running pip install inside a Python virtualenv. The same error occurs for other packages (not only seaborn):

(python365) carlosto@turso03:~$ pip install seaborn
WARNING: pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)’: /simple/seaborn/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)’: /simple/seaborn/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)’: /simple/seaborn/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)’: /simple/seaborn/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by ‘SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)’: /simple/seaborn/
Could not fetch URL https://pypi.org/simple/seaborn/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host=‘pypi.org’, port=443): Max retries exceeded with url: /simple/seaborn/ (Caused by SSLError(“Can’t connect to HTTPS URL because the SSL module is not available.”,)) - skipping
ERROR: Could not find a version that satisfies the requirement seaborn (from versions: none)
ERROR: No matching distribution found for seaborn
WARNING: You are using pip version 19.3.1; however, version 21.2.4 is available.
You should consider upgrading via the ‘pip install --upgrade pip’ command.
(python365) carlosto@turso03:~$ python --version
Python 3.6.8
(python365) carlosto@turso03:~$ pip --version
pip 19.3.1 from /proj/carlosto/python365/lib/python3.6/site-packages/pip (python 3.6)

  • However, doing the equivalent with the Anaconda3 module is successful:

(python365) carlosto@turso03:~$ deactivate
carlosto@turso03:~$ module load Anaconda3
carlosto@turso03:~$ pip install --user seaborn
Requirement already satisfied: seaborn in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (0.10.0)
Requirement already satisfied: pandas>=0.22.0 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.0.1)
Requirement already satisfied: numpy>=1.13.3 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.18.1)
Requirement already satisfied: scipy>=1.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (1.4.1)
Requirement already satisfied: matplotlib>=2.1.2 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from seaborn) (3.1.3)
Requirement already satisfied: pytz>=2017.2 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)
Requirement already satisfied: cycler>=0.10 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)
Requirement already satisfied: six>=1.5 in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->seaborn) (1.14.0)
Requirement already satisfied: setuptools in /appl/opt/Anaconda3/2020.02/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (45.2.0.post20200210)

27 Sep 2021 Session Questions

Comment

  • JupyterHub claims that I do not have account affiliations set correctly.
    Error: HTTP 500: Internal Server Error (Error in Authenticator.pre_spawn_start: RuntimeError sbatch: error: invalid partition specified: jupyter sbatch: error: Problem with submit to cluster ukko2: Invalid partition name specified sbatch: error: Problem with submit to cluster vorna: Invalid account or account/partition combination specified sbatch: error: Can't run on any of the specified clusters sbatch: error: There is a problem talking to the database: Invalid account or account/partition combination specified. Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.)

    • Please log in to Turso; this recreates (or at least should recreate) your affiliations (the Slurm upgrade included a cleanup of existing database entries, including your share usage).
  • When I run an interactive session and load MATLAB, I get through stages of my MATLAB code that simply involve searching for a file (dir) on /wrk/users/USER and loading it (small size, <512MB) very quickly. However, when this MATLAB code is submitted as a batch job with -c18 and --mem-per-cpu=16G, the code runs through this part much more slowly than in the interactive session. I tried moving the opening of the MATLAB parallel pool to just before the parfor loop (which comes after this file-loading stuff) instead of at the beginning of the script, but that did not help. Any ideas why this might be the case and how to improve it?

    • Which node is the batch job using (it is handy to include ‘hostname’ in the job script)? If the job lands on Vorna, for example, it will be slower than on Ukko2.
  • All tests were made with Ukko2

    • Do you also use srun to claim the resources in the batch job? E.g. do you start the MATLAB process with srun? If not, by default you only get one CPU core inside your batch job reservation.
  • yes. Here is my payload:
    umask 007
    module purge
    module load MATLAB
    srun matlab -nodisplay -nodesktop -r “run AttentionGranger_cluster(R,F).m; exit(0)”

    • Add the following line to your payload (just under umask), then execute again and see how it behaves. This may help to identify an ill-behaving node or other such issue:
      hostname
      * ok I will do this
    • Curious minds wonder what would happen if you actually put the following in the batch script:
      srun -c18 matlab -nodisplay -nodesktop -r "run AttentionGranger_cluster(R,F).m; exit(0)"
      * let’s see! I’ll try
      *
  • seff appears not to work again. You can see the job ID there and when it completed (last write time of out-136452522.txt), but seff fails to return any info:

-rw-r--r-- 1 $USER grp-$USER-lab 4.6K Sep 25 00:10 out-136452522.txt
$USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ seff 136452522 -M ukko2
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Usage: seff [Options] <Jobid>
       Options:
       -h    Help menu
       -v    Version
       -d    Debug mode: display raw Slurm data
       -M    Cluster name
  • It is a bit tricky… but I figured out that if I srun an interactive session on ukko2, then I can run seff on a job ID that completed on ukko2; seff won't work from the turso login alone.
    • Try seff -M ukko2 136452522, as the usage output above states: seff [Options] <Jobid>. I.e. -M goes before the job ID.
  • Oops! ok…that works.

28 Sep 2021 Session Questions

  • Well, I ran my first array batch job. Feels good. My array.txt had 32 lines. My directory has 32 err-%j and 32 out-%j.txt files, so that's good. What's strange is that some results files (which should be on WRK) are missing, and squeue -u USER shows nothing running for me. Now, some of the err files appear to have an error… any idea what happened? Here is the error:
$USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ cat err-136454388.txt
/appl/opt/MATLAB/2019a/bin/matlab: line 1335: lsb_release: command not found
slurmstepd: error: execve(/usr/bin/epilog): Resource temporarily unavailable
slurmstepd: error: execve(/usr/bin/epilog): Resource temporarily unavailable
And here is my batch:
#!/bin/bash
#SBATCH --chdir=/proj/group/grp-$USER-lab/AttentionProject/
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nelson.$USER@helsinki.fi
#SBATCH -M ukko2
#SBATCH -t 12:00:00
#SBATCH -p bigmem,short
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 17
#SBATCH --mem-per-cpu=4G
#SBATCH -o out-%j.txt
#SBATCH -e err-%j.txt
#SBATCH --array=1-32

n=$SLURM_ARRAY_TASK_ID                  # define n
line=`sed "${n}q;d" arrayparams.txt`    # get n:th line (1-indexed) of the file

umask 007
module purge
module load MATLAB
srun matlab -nodisplay -nodesktop -r "$line run AttentionGranger_cluster(R,F).m; exit(0)"
  • Based on the above setting of --chdir, the outputs etc. should be written to the location specified by --chdir. Basically this sets the working path ("chdir" means "change directory").
  • You can use sacct to dig through the old jobs (example shows start time -S): sacct -u <username> -Mukko2 -X -S 2021-09-27
  • Based on this, most jobs have completed, but some have failed. Jobs 1-4 were cancelled, and on the second attempt they failed on timeout. Now, if you know the node on which this occurred, that would be very useful, since the error you saw may indicate an issue with the epilogue script (a Slurm configuration component) on one node.
    • I am going to remove all out and err files and rerun this anew to get a clear picture of what is happening. My out file reports the hostname now too.
  • In my WRK dir I have a lot of files like fa1c-5726-5599-8156 and like mex_HyI1bNh9Uv_device.o. Maybe I used some program from an external source that generated them, but before I remove them I want to make sure they aren't something that needs to be there. Do you have any ideas?
    • Unfortunately I cannot say what those files are. You can check the file type with file <filename>. If the file type is ASCII and not binary, then you could look at the contents with cat (at least in a sensible way). *.o appears to be an object file.
    • That said, maybe they are part of a compilation?
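    • For example (a sketch, using the file names from the question):
file fa1c-5726-5599-8156          # prints the detected file type
file mex_HyI1bNh9Uv_device.o      # a *.o is typically reported as an ELF relocatable (object) file
cat fa1c-5726-5599-8156           # only if file reports it as ASCII text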

29 Sep 2021 Session Questions

  • Hi all, we have a very strange bug. I sent an array job that uses some MATLAB code. It ran 10 of 32 files just fine, but then threw errors on the remaining files (same code, just different data files in the array). The error has to do with the MATLAB parallel pool. Notably, I had -c10 in SBATCH and also tried srun -c10 matlab in the payload. 10 files… -c10. The errors were:
    $USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ cat err-136454546.txt
    /appl/opt/MATLAB/2019a/bin/matlab: line 1335: lsb_release: command not found
    Error using parallel.Cluster/parpool (line 86)
    Parallel pool failed to start with the following error. For more detailed
    information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Job/preSubmit (line 592)
Unable to read MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job224.in.mat. File might
be corrupt.

Obviously, this is a MATLAB-internal issue, not the code, which runs fine.
Now, I decided to run it all again with -c10 to see if it replicates using only srun matlab… (no -c10 there, because up until yesterday I never put resources in the payload and everything worked fine). After this test, NO files run successfully and they all have the error:
$USER@turso03:/proj/group/grp-$USER-lab/AttentionProject$ cat err-136454723.txt
/appl/opt/MATLAB/2019a/bin/matlab: line 1335: lsb_release: command not found
Error using parallel.Cluster/parpool (line 86)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile ‘local’ in the Cluster Profile Manager.

Error in AttentionGranger_cluster (line 412)
PP = parpool(cluster,cluster.NumWorkers);

Error in run (line 91)
evalin(‘caller’, strcat(script, ‘;’));

Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
678)
Failed to start pool.
Error using parallel.Cluster/createConcurrentJob (line 1145)
Unable to write to MAT-file
/home/$USER/.matlab/local_cluster_jobs/R2019a/Job225.in.mat.
The file may be corrupt

  • Is there a possibility that you had some concurrent write issue?
    • Not that I know of, unless there is something different about array jobs. My --chdir is on PROJ, but that is only where I keep the sbatch code, the MAT code, and the array txt file. The data are written to WRK with unique file names for each job. The err and out files are written to PROJ but have unique names with -%j. The input data are stored on WRK and each job pulls in unique files from there. The only thing I can think of is that there is an Excel file in the working directory that is a single file. It is conceivably possible that all jobs try to read this file at once, although they do not write to it. And that reading of the Excel file occurs before the parallel pool is opened in MATLAB.
    • I didn't look at all 32 jobs, but all I checked ran on ukko2-pekka. This was in my out-%j.txt from hostname.
    • I will now try a single job using this code, without an array
      • Ok.
    • If I run without an array - just a single job - then everything runs fine. Maybe it is requesting -c10 per instance of MATLAB, and with 32 jobs (32 MATLABs) you need -c320!??!!?! Any ideas? Otherwise, I can submit these as separate jobs.
      • Well, Array jobs are all separate. For every line in your array.params file the script starts another job. This is why it is called “Array”.
      • I think that somehow there is now some concurrency that occurs and causes errors, but I cannot quite figure out what it is. Of course you can send the jobs individually, no issue there.
      • I will have to check whether the way array jobs are processed has somehow changed from the old version of Slurm. Certainly I did not expect this kind of behavior.
    • OK. I will for now run each file individually. Do you want to open a ticket for this? Basically, we should figure out why a parallel pool cannot open in an array job. We just need a batch script that runs a .m file, and that file just opens a parallel pool and closes it. Here is my typical code for this:
    • Batch (need to change m file name):
      #!/bin/bash
      #SBATCH --chdir=/proj/group/grp-$USER-lab/AttentionProject/
      #SBATCH --mail-type=ALL
      #SBATCH --mail-user=nelson.$USER@helsinki.fi
      #SBATCH -M ukko2
      #SBATCH -t 01:30:00
      #SBATCH -p bigmem,short
      #SBATCH -n 1
      #SBATCH -N 1
      #SBATCH -c 10
      #SBATCH --mem-per-cpu=1G
      #SBATCH -o out-%j.txt
      #SBATCH -e err-%j.txt

umask 007
hostname
module purge
module load MATLAB
srun matlab -nodisplay -nodesktop -r “R=4;F=4; run AttentionGranger_cluster(R,F).m; exit(0)”
* M-file (it only needs this, but the name of the function should be changed and there is no need for the R and F variables anymore):
function AttentionGranger_cluster(R,F)
sz = getenv(‘SLURM_CPUS_ON_NODE’);
sz = str2num(sz);
cluster = parcluster(‘local’);
cluster.NumWorkers = sz;
PP = parpool(cluster,cluster.NumWorkers);
PP.IdleTimeout = 4500;

  • Major Bug IT4SCI-1132 tracks the MATLAB parpool issue.
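  • One workaround that may be worth testing while the bug is open (an assumption on our side, not a confirmed fix): the local profile's JobStorageLocation lives under ~/.matlab by default, so concurrent array tasks can collide there. Setting MATLAB_PREFDIR per task in the batch payload gives each task its own preference/job-storage directory, roughly like this:
# Sketch: isolate each array task's MATLAB preferences (paths are examples)
export MATLAB_PREFDIR="/wrk/users/$USER/matlab-prefs/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
mkdir -p "$MATLAB_PREFDIR"
srun matlab -nodisplay -nodesktop -r "$line run AttentionGranger_cluster(R,F).m; exit(0)"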

  • I managed to reproduce the bug where the node can’t read some input files!!

    • I noticed a job was running abnormally on ukko2-04

    • I ran a test job there, with these essential parts of the submit script:

      #SBATCH --nodelist ukko2-04
      ...
      ls
      cat in.fem
      
    • in.fem is one of the input files that my job wasn't reading

    • The result: in.fem was listed in ls, but stderr said this
      cat: in.fem: No such file or directory

    • So this thing happens at least on ukko2-04, at least sometimes.

      • Could you include full path please?
        • To in.fem?
      • Yes, please.
        • Here: /wrk/users/jyrilaht/CVHD_Cu/CVHD_paper/tip_slices/110_tip_thin/110/tall_top/field_0.02/4/in.fem
      • Cannot say for certain at this time why this would occur. Have to investigate.
        • Great, thank you! I can try avoiding this node in the meantime. Can inform you if I find this on other nodes.
      • One thing: it should not prevent seeing the file, but are you sure that there was no concurrent access to the file at the time you tried to open it with cat (e.g. the file being written at the time, etc.)?
      • Note that when you open a file handle for writing, the file is not necessarily available to other processes until you close the file.
        • Good point. I had the job that should read in.fem still running while I tested, but I’ll try again now after cancelling that job.
      • Handling files from the login node while they are being used by a running job may sometimes cause interesting situations. Hence, never run, for example, tail -f against a file that is used by a running process, since you gain a lock on the file. The same applies to removing files that some process is writing to at the same time.
        • Actually, this concurrent reading thing might explain another issue I was encountering
      • /wrk is a Lustre-based system, and it does have a strict locking schema. E.g. file locks are not advisory but enforced.
        • Yes, I see now that I had not read this part of the mandatory read carefully enough. I suspect most of my issues will resolve when I take this into account.
      • Good!
      • We have the locking in place to protect user data; previous experience has shown that the possibility of uncontrolled concurrent access was way too prone to accidental oops moments. You can still do many interesting things, but you just have to think about how to do them properly (e.g. simultaneous writes to large files over many nodes, etc.).
        • This wasn’t explicitly mentioned in the Lustre User Guide: is it safe to copy output files to local machine using e.g. rsync, while the job is still running? For long jobs I am used to doing this to monitor that nothing goes wrong. But if it also locks output files, this monitoring is likely to be the exact reason for jobs going wrong in the first place…
      • There are really two questions here: 1) Is it really necessary to monitor the progress during runtime? 2) Would you rather have steps in the execution that write a marker file saying the job has reached point X and this is the current state?
      • If you want I can jump to garage to have a chat.
        • Thanks, but I’m in a shared office and don’t have a headset even, so voice chatting is not an option right now.
      • No trouble. I’d recommend not to touch constantly used files. Instead I would then opt to write temporary “checkpoints” which are opened, written, and closed, and the process would never touch them again once created.
      • For I/O efficiency, I would actually advise opening files for writing at the time you start the job, and then closing them once you are done, on either completion or error. This is typical practice with Lustre, since open/close operations can be (in their millions) very expensive metadata operations.
      • However, in this scenario you could not, and should not, touch those files/directories during runtime.
        • I see. I don’t have that much control over how the program I use opens and closes files, since it is an extensive piece of software written by many people over many years, elsewhere. It is open source so I could check how the files are handled, but maybe modifying that could still easily break the software… As for writing temporary checkpoint files, that should be possible. I’ll add that to the script in the next jobs I submit. ls is still safe to see them while the job is running, right? :D
      • Yes. Actually, many operations are “safe”. However, the thing here is that you can only write from one place at one time. Now, defining “write” is the hard part. Let's say that you have a job that creates a file with “touch step-1-complete”. Your process will not touch that file again, for example with rm. All is well.
      • Now, say that you run “tail -f step-1-complete” for an indefinite time, and then elsewhere another process (a batch job), without your direct knowledge, does rm step-1-complete. tail is holding a lock on the file, rm tries to remove it, and neither can proceed, so Lustre has to do the only thing it can do: wait.
      • Now, multiply this, say, 1000-fold. Each lock contention leaves a process open on the metadata service, meaning that there are 1000 processes waiting for the contention to resolve. And this is still an easy situation; if you do an actual write, like writing output to a file, and at the same time try to remove the file, that would create an irresolvable conflict.
      • A long way of saying that short read operations are safe. ls is never really an issue. cat on a file is seldom an issue, but holding a file open with vi, for example, will become an issue, because the other process cannot write to the file your vi holds locked.
      • Hence, to simplify your own life, it is a good idea not to directly touch files your jobs use (you can never be 100% sure what state the batch job(s) are in). :)
        • Alright, thanks for the very clear explanation! This is way more complicated than I thought. I’ll be much more careful in the future
      • It seems so now, but it will become easier with time. The mechanism enables tons of possibilities, such as accessing (for writing and reading) very large files from multiple different positions at the same time, etc. Will cover this at some point in the user guide. :)
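      • A small sketch of the checkpoint-marker pattern discussed above, as it could look in a batch payload (run_stage_1 and run_stage_2 are hypothetical placeholders for your own commands; the paths are examples):
        # Write a marker once a stage completes and never touch it again;
        # progress can then be followed from the login node with a plain ls.
        STATE_DIR="/wrk/users/$USER/myrun/state"
        mkdir -p "$STATE_DIR"
        run_stage_1 && touch "$STATE_DIR/step-1-complete"
        run_stage_2 && touch "$STATE_DIR/step-2-complete"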

20 Sep 2021 Session Questions

  • Kale gpu-short queue became drained due to “kill task failed”. Resolved.

21 Sep 2021 Session Questions

  • The module command does not work on Vorna and Ukko2. E.g. the Lmod path points to the wrong version on CentOS 7.9.
    • Check your ~/.bashrc and make sure that you have the following set up (use your favorite editor, or vi, to edit the file):
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi
  • slurm command does not work on turso03
    • Some commands are no longer available/supported on version 20.11.7. At this moment, slurm is one of those.
  • Interactive session srun hangs unexpectedly.
    • The interactive session mechanism has changed. To submit an interactive session one has to do:
      srun --interactive -M vorna (other resources) --pty bash
      To use the reserved resources, one has to do:
      srun (command)
      If you do not use --interactive, and try to use srun, the srun will not do anything, but neither will it ever report any error message. Ctrl-c will break this deadlock.
  • Old --workdir option no longer works.
    • This has changed: --workdir is now --chdir.
  • Interactive sessions (compute nodes) have no man pages.
    • Yes, they do not, and probably will not. The reason is that the nodes are stateless, so everything extra eats into the total available memory.

22 Sep 2021 Session Questions

  • I was just using an interactive session that I opened with srun --interactive -M vorna --pty bash. One thing that was not clear is whether I can add -c 4 and -mem 1G to set up resources; that gave an error. Also, when I ran the above srun line, after about 5 to 10 minutes I got this error: srun: Force Terminated job 69352853 slurmstepd: error: *** STEP 69352853.interactive ON vorna-502 CANCELLED AT 2021-09-21T13:53:39 DUE TO TIME LIMIT *** srun: error: vorna-502: task 0: Killed srun: launch/slurm: _step_signal: Terminating StepId=69352853.interactive srun: Force Terminated StepId=69352853.interactive
    • Not sure I got what you meant, but like this?
turso: srun --interactive -M vorna --mem=1G -c4 --pty bash
srun: job 69354719 queued and waiting for resources
srun: job 69354719 has been allocated resources
vorna-513:~# srun -c4 hostname
vorna-513
  • Thanks. I ran that before with a space after the -c, and maybe that was the problem. The second issue was that if I just put srun --interactive -M vorna --pty bash, then I got a timeout error printed and was kicked off, which was unexpected.

    • Ah, srun still needs the reservation arguments. It is just that inside that session you will need to use srun again to utilize the resources.
    • srun --interactive -M vorna --mem=1G -c4 -n4 -t1-0 -pshort --pty bash
    • srun -c4 -n4 --mem=1G hostname
  • OK, so here is what we discussed. The old interactive cpu 1 1 was a wrapper that doesn't work anymore. Therefore, we need to do the following: first request some resources using srun --interactive (resources… for example -M vorna --mem=1G -c4 -t1-0 -pshort). The second step is then to claim those requested resources with srun -c4 --mem=1G (note that you cannot claim time or partition; those things are already requested and aren't something you need to “claim” from the hardware).

  • It is also helpful to know that to fix the module issue above, people should log in, do vi ~/.bashrc, insert those lines, save, log out, and log back in.

    • Added note above.
  • Finally, there was an open issue about whether Jupyter could load a 35GB variable (or perhaps let's say 50GB for safety) when you use MATLAB there. It is very helpful to be able to process large files on turso, but then those large files need to be loaded in Jupyter for plotting.

    • I will transfer this to a bug report.
  • When trying to run a Jupyter notebook, I get the following error:
    Error: HTTP 500: Internal Server Error (Error in Authenticator.pre_spawn_start: RuntimeError sbatch: error: invalid partition specified: jupyter sbatch: error: Problem with submit to cluster ukko2: Invalid partition name specified sbatch: error: Problem with submit to cluster vorna: Requested node configuration is not available sbatch: error: Can't run on any of the specified clusters sbatch: error: There is a problem talking to the database: Requested node configuration is not available. Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.)

    • This error originates from the fact that vorna has been removed from the federation where JupyterHub gets the resources from. I will fix the issue today (22.09.2021) by moving GPU resources and point Hub configuration to the new side. Thanks for the note!
      • OK. Thanks. It turns out that the (new!) uni computer from Dustin is, according to helpdesk, “broken”, so it would be great if my students could load that large file into JupyterHub for plotting, since I don't have any spare local PCs. One note from our end (specific to us, obviously) is that we just need to be able to plot, so the GPUs aren't necessary, just lots of memory ;-) If you could drop me a note over email to let me know when we should try again, that would help, because I won't always have an eye here. And really, thanks for all of this!
    • Critical Bug #IT4SCI-1118 tracks this feature request. I’ll drop a line as soon as ready.
  • In case others encounter the same issue: jobs submitted in Turso via the old system (login nodes turso01 and turso02) were not visible to me with squeue on the default Turso login node, which is now turso03. Issue was fixed by specifying the login node when ssh’ing, as in ssh username@turso01.cs.helsinki.fi

    • Thanks for the pointer! I will add notice to old login node MOTD’s about this.
    • MOTD updated.
      • Great! I just realized I’m not seeing MOTD when I ssh onto the new side…? Old side MOTD is fine
      • Also, what will happen after 23rd Sep to the jobs submitted on the old side, after the system “will have no resources”? The jobs I have there will still be running for almost 2 weeks… I see this was answered in the update section!
  • small related question…is there a reason a user should choose turso01 versus 02, versus 03 when logging in? Maybe there are resource differences, but it is a bit opaque to people like me who are perhaps novice users

    • By default, log in to turso. Do not specify a number unless you know that you have something on the old side. We will shut down the rest of the resources on the old side very soon.
      • OK!!
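    • If you find yourself typing the full host name often, a ~/.ssh/config entry on your own machine can hide the details (a sketch; the plain turso.cs.helsinki.fi alias and the user name are assumptions, adjust to what you actually use):
# ~/.ssh/config (sketch)
Host turso
    HostName turso.cs.helsinki.fi
    User yourusername
# after which a plain `ssh turso` is enough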
  • I just tried to log into turso03 and got this message. Sorry, but it is too much out of my expertise to know what to do! (base) lm0-970-22770:~ $USER$ ssh turso03.cs.helsinki.fi

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:1ajfpj2SuPu1NieK2DvPHHh63EzLW8jPTxRcTNoUixU.
Please contact your system administrator.
Add correct host key in /Users/$USER/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/$USER/.ssh/known_hosts:5
ECDSA host key for turso03.cs.helsinki.fi has changed and you have requested strict checking.
Host key verification failed.
  • This is a bit annoying. It should not happen if you log into turso (without a number, it redirects to turso03). However, you'd have to take out the turso03-related line from the ~/.ssh/known_hosts file.

  • E.g. vi ~/.ssh/known_hosts, move to the appropriate line, and remove it (in vi, dd removes the whole line).

    • I also had a similar issue. Every time I logged into Turso, ssh would complain about an offending key. This helped! To clarify to the op, Offending ECDSA key in /Users/$USER/.ssh/known_hosts:5 means that you have to remove line 5 in .ssh/known_hosts
    • I will follow your advice on turso
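  • An alternative to editing known_hosts by hand is the standard OpenSSH helper, run on your own machine (host name as in the error above):
# remove the stale entry for the host, then reconnect and accept the new key
ssh-keygen -R turso03.cs.helsinki.fi
ssh turso03.cs.helsinki.fi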
  • I also have the following issue: occasionally, the node where a job lands does not see/have access to all the input files that it should. I'm trying to reproduce it consistently, just a sec.

    • Note that GPU nodes (ukko2-g01 and ukko2-g02) are now being moved to the new side. You can access them from turso.
    • You cannot do this at this time. I will update this asap as the nodes boot up and are ready to be used on the new side.
      • Alright, great! The last time, this happened when submitting from the old side, where test queue jobs apparently sometimes land on GPU nodes. Although ukko2-g01 and ukko2-g02 are listed in sinfo in the test queue on the new side, too…
    • Just a moment now…
    • GPU nodes just booted to the new side.
      • Can’t reproduce as of now. I recall the last time this happened was on a GPU node, but now they seem to read everything fine. I’ll update if I encounter this again.
    • This may have been an issue on the old side. Will keep an eye on it if it reappears.

23 Sep 2021 Session Questions

  • Thanks for the Jupyter notebook with larger memory; it works great for loading the variable from PROJ or WRK. One issue is that plotting with the MATLAB kernel doesn't work. In the past, on my local machine, I used this MATLAB kernel (https://am111.readthedocs.io/en/latest/jmatlab_install.html), followed their instructions, and was able to plot without any issues, but with the kernel installed on this Jupyter I was not able to do any plotting, which is mainly why we are using it. Maybe try this other kernel? I didn't see any guidance in the user guide about how to set this up.

    • Unfortunately (unless there is other MATLAB/Jupyter expertise here to answer) I do not have an answer to this one at this time. I think the Jupyter instance has some pre-installed MATLAB kernels, if you browse around.
  • I will keep trying, but at least the one I link to above should work. I'm also not an expert in Jupyter.

  • I just saw that vi Batch_xxxx on PROJ opens a blank page. I also got some complaints about a swap file; I recovered and then deleted the swap. Any idea what happened there? - Please ignore this!!! I figured it out. I had forgotten about a detached screen that had it open. Oops.

  • Bug report (unless I have something wrong, but the manual says this didn't change): with #SBATCH --mail-type=ALL and
    #SBATCH --mail-user=xx.xx@helsinki.fi, I think the email function is not working with sbatch. At least I haven't received any emails for start, completion, or failure.

    • I have the same issue. I thought my jobs hadn’t started yet, but thanks to this I went and checked :D Thanks!

23 Sep 2021 Session Questions

  • Bug report: I think seff isn't working. Here is what I get:
seff -M ukko2 136451454
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = (unset),
	LC_ALL = (unset),
	LC_CTYPE = "UTF-8",
	LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
Can't locate Slurmdb.pm in @INC (you may need to install the Slurmdb module) (@INC contains: /usr/lib64/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/share/perl5) at /usr/bin/seff line 12.
BEGIN failed--compilation aborted at /usr/bin/seff line 12.
  • Thanks for the report, will file a bug shortly.
    • Bug IT4SCI-1124 tracks this issue.
    • Bug IT4SCI-1124 has been resolved (240921T08:34)

24 Sep 2021 Session Questions

  • This is Nelson. I need to ask a quick question today at the garage, but I have a meeting from 9 to 10:30. I think I can leave that one a bit early. Could I come in right at the end of the garage, at 10:10? My question is quick! Just a bit of planning about resources and the best place for my job.
    • Sure, no trouble. We’ll probably hang there beyond 10:30 anyway.

Always write new questions to the bottom of the page (which means above this line :).

You can write questions here anytime before the sessions to be answered during the Zoom session. If you are not typing, please leave HackMD in view mode.

2 Sep 2021 Session Questions

Comment

  • Kale GPU nodes down (2nd September): I just tried to get any GPU in both the gpu and gpu-short queues, but I cannot get any. I suspect that it may be a problem with the GPU nodes being down.
    • There has been a notice in the MOTD and a wall message. GPU nodes have returned to service.
    • The rest of the compute nodes are waiting for a reboot. The reason for this is a severe issue with glibc. We will reboot the nodes and return them to service as soon as the previous workload has finished. Meanwhile, the nodes do not accept new jobs.

1 Sep 2021 Session Questions

Comment

  • Recently the Vlasiator code I've been running has reported a huge memory imbalance between nodes on Vorna. For example, in a small 2-node test:
(MAIN) Starting simulation with 32 MPI processes and 1 OpenMP threads per process
(MEM) Resident per node (avg, min, max): 1.10637 0.140255 2.07249
(MEM) High water mark per node (GiB) avg: 1.10637 min: 0.140255 max: 2.07249 sum (TiB): 0.00216088 on 2 nodes
---------- tstep = 0 t = 0 dt = 0.0361007 FS cycles = 1 ----------

where we observed 0.14 GB memory usage on one node and 2.07 GB on the other, a 15x difference. After searching for the causes of this, we found the imbalance is somehow related to the Slurm job scheduler. In the imbalanced run, I specified nodes like this:

#SBATCH -N 2               # Total # of nodes
#SBATCH -n 32              # Total # of mpi tasks

while in another test we did

#SBATCH --nodes 2
#SBATCH -c 1
#SBATCH --tasks 32
#SBATCH --tasks-per-node 16

In the latter case, the code reported almost perfect memory usage across nodes:

(MAIN) Starting simulation with 32 MPI processes and 1 OpenMP threads per process
(MEM) Resident per node (avg, min, max): 1.16969 1.16612 1.17325
(MEM) High water mark per node (GiB) avg: 1.172 min: 1.167 max: 1.17701 sum (TiB): 0.00228906 on 2 nodes
---------- tstep = 0 t = 0 dt = 0.0199046 FS cycles = 1 ----------

These test results may indicate that in the 1st case, it is not guaranteed that the MPI tasks will spread evenly across nodes. If that is true, it would be very surprising to us since Vorna has 16 cores/node.

  • --mem= is not the same as --mem-per-cpu=.

  • Remember that with --mem-per-cpu you have to think of your total memory requirement and then divide it by the number of CPUs to get the correct value.

  • Another note on --threads-per-core=2: the options -n (ntasks) and -c (cores) are not the same. E.g., if you intend to ask for 2 threads per core, -n1 -c2 is not the same as -n1 --threads-per-core=2.

  • The former means two cores per task, while the latter means two threads (HT) per core.

  • Updates to the previous issue: specifying --mem-per-cpu helps a little, but it does not solve the problem. The key option required for balancing that I found is --ntasks-per-node: without it, there are always obvious imbalances in memory usage.

    • Ah, of course. I actually missed this entirely. Yes, if not specified, then Slurm packs the nodes more or less by optimizing "free capacity". Very good catch!
    • I actually read the above #SBATCH --tasks-per-node 16 as #SBATCH --ntasks-per-node 16 and therefore thought that this was already covered. A sketch spelling out the placement explicitly follows below.
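
    • A minimal batch header sketch, assuming Vorna-style 16-core nodes (the memory value is purely illustrative), that pins both task placement and memory explicitly:

#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --ntasks-per-node=16   # force even task placement across the two nodes
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G       # per-core memory; use --mem=32G instead for a per-node limit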


-- a period of time events were not recorded here --

24 May 2021 Session Questions

Comment

  • I found one failed node from my tests during the weekend and another one in another job script from my colleague. To run with 32 nodes for my job, I added #SBATCH -x vorna-436,vorna-339 to the job script.

    • Thanks, I will add those nodes to suspect list now:
    • suscpect-MPI root 2021-05-24T10:41:30 vorna-[339,436]
  • For the job issue that can be run on the login node and in interactive mode but not in a batch job script: I did some extra tests. Interestingly enough, in another directory which is not touched by a currently running job, the script can be run successfully. So I think your assumption is correct: for some unknown reason, a Slurm job stops us from doing I/O under the same working directory. This is new to me, because on other clusters/supercomputers I have used that are equipped with either Slurm or PBS there is no such restriction. It would be very interesting to learn what mechanism prevents us from doing so on Vorna!

    • There are several different locking schemes for Lustre. We have elected (simply due to the need to protect data integrity) to use flock. Many installations elect to use localflock, or no flock at all, when users know exactly how to handle concurrent file access, or where small I/O operation performance needs to be maximized.
    • flock does have a drawback then, and that is the need to consider and understand Lustre locking somewhat more in depth. Basically lock management is handled by a VAX/VMS-style distributed lock manager, which maintains global lock coherency. Nothing, however, prevents you from using the proper functions flock() and fcntl() to lock files and directories in the desired manner.
    • We did try localflock in the past, and that turned out to be a grave mistake, because we, somewhat wrongly, assumed that users can handle concurrent operations and design things accordingly without one overseeing agent. This led to rather severe FS corruption.
    • An additional difference is that, unlike the majority, we have a ZFS backend on both MDTs and OSTs instead of ldiskfs. This may cause some peculiarities, but should not be related to locking.
    • I also have to point out that the majority of supercomputers, especially large installations, have node- or socket-based scheduling, while for us the smallest unit is a core. Naturally this changes the in-node behaviour, and also the mechanism by which ldlm manages locks.
    • Conclusion: You should probably have a look at the working directory structures and manage the I/O on a per-job-id basis as much as possible. You can, for example, create temporary working directories keyed to job IDs; a minimal sketch follows below. Some more details.
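
    • A minimal sketch of the per-job-id working directory approach (the program name is hypothetical), placed inside the batch script:

# create a per-job working directory under $WRKDIR and do the I/O there
JOBTMP="$WRKDIR/tmp.$SLURM_JOB_ID"
mkdir -p "$JOBTMP"
cd "$JOBTMP"
srun ./my_program              # hypothetical executable
rm -rf "$JOBTMP"               # clean up once the results have been copied elsewhere
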
  • Has anyone used ParaView on Vorna before? I have succeeded in installing and using the headless version (rendering without display), but I am having some issues with the server/client setup with X window.

    • Sounds familiar, but not sure if it is commonly used. It could benefit from VDI, but X is probably too poor for this. Wondering if it could run under JupyterHub. We have VDI machines that use Kappa for scratch storage for fast I/O access.
    • In case you want to experiment, we have JupyterHub available (see the Hub User Guide). The Hub uses Vorna and Ukko2 GPU compute nodes, and we can add more nodes to the pool as needed. Would this work?

7 Apr 2021 Session Questions

Comment

  • I’m doing DTI tractography with a script that worked a while (6 months) ago, but now the output files aren’t saved correctly. It takes a lot of time (too much; my last time limit was 2 days and the script used to run in a couple of hours) to write the output files, and eventually the job gets cancelled on the time limit. Other than that the script seems to run smoothly and there are no errors. The script writes an output file, but it is too small to be correct. I have tried to increase the memory limit and the time limit, to make sure that there is enough space for the files in my wrkdir and that I’m using the right version of matlab (v7.3), but nothing seems to help. I’m using ExploreDTI (a Matlab-based program).
    • Some questions: which cluster(s) are you using? Where are you writing to (assuming $WRKDIR)?
    • I am using Ukko2 and yes, I am writing to $WRKDIR
    • Is your script identical to what they were 6 months or so ago?
    • Yes
    • Would you like to join the Zoom session?
    • Yes, just a minute
    • The probable cause is the Infiniband agent issue where a wrong configuration file stalled the interconnect network, which was visible to users as very slow I/O.
  • The problem seems to continue: the script has now been running for 5 hours and there is no end in sight. The output file appears to be updated constantly (the file modification time changes), but the file size does not grow.
    • Okay, so the problem is somewhere else then. A quick test alongside all other usage shows writes going through at about 8.4 GB/s and reads at about 10 GB/s for 28 GB files, so this is unlikely to be a global problem. If you take a look at the output file (cat filename), what does it look like?
    • Does each process have its own output file, so that the processes don’t accidentally try to write to the same file at the same time?
  • Yes, they do. I updated the ExploreDTI code folder and will try again. What is strange is that there are no error messages that could help figure out what is going on…
    • Have you tried this in an interactive session? In a batch script, #SBATCH -e error-file-name writes STDERR to a file if any errors occur.
    • If writing to a file fails, or if a process cannot write to a file, there is not necessarily an error message; for example, when the file is locked by one process, another process simply waits for the lock to be released. This does not happen, though, if the same file is not accessed concurrently from several different places.
    • In an interactive session it might be easier to see directly, for debugging purposes, what the job is doing at any given moment.
  • I will try an interactive session next.
    • One additional question: what modules are you using? E.g. module list
  • I am using Matlab. It looks a bit better now! Most likely the problem is exactly that the file is locked; for some reason my job file has had the command: srun … run(matlab_skripti.m). I suspect that this is the problem. That job file did work earlier, though, which is a bit strange.
  • This did not solve the problem after all. On a new attempt I get a similar message: ‘Job step creation temporarily disabled, retrying’
    • That message appears if you accidentally try to start several overlapping job steps with srun, in which case the first one completes but the others get stuck… For example like this: srun srun <process>
  • Yes, that is how I understood it, and that is why I thought I had solved the problem. I get the same message even though (in my opinion) there should be nothing wrong with the command (srun matlab -nodesktop -nosplash -r skripti.m).
    • Could I see the whole batch script, just in case a typo or something similar has crept in somewhere?
  • #!/bin/bash
    #SBATCH -M ukko2
    #SBATCH --job-name=csd5
    #SBATCH -c 24
    #SBATCH -t 1-0
    #SBATCH --mem=60G
    #SBATCH -p short
    #SBATCH --workdir=/wrk/users/elisasah/work
    #SBATCH -o result1.txt
    #SBATCH -e error1.txt

module purge
module load MATLAB
srun hostname
srun matlab -nodesktop -nosplash -r "run('/proj/elisasah/do_CSD_run1.m');"
* Thanks. One more thing: we currently have a broken IB fabric on Turso, and we are fixing it right now.

  • Could this problem be caused by that?
    • Very possibly; it looks like we have several users whose traffic over the IB network stalls. I will get back to you asap.
    • About the script: you should probably leave out the srun hostname, or use plain hostname if you want to know which compute node the job ran on.
  • Now the analyses seem to go through without problems! Thanks for the help!
    • Thank you for your patience. Excellent news: there was a transient link problem that was found yesterday.

30 Mar 2021 Session Questions

Comment

  • It seems that since the default user rights changed on Vorna, the module list command does not work, nor does module load. It returns command not found.
    • Thanks for the bug report. We have filed critical Bug# IT4SCI-861 for this issue and will resolve asap.
    • Issue has been fixed. Log out and back in and module command works as it should.

29 Mar 2021 Announcements

Comment

  • We have received a number of questions starting with: “Could you please install (insert your favorite sw packages here)”. The short answer is no, we cannot install every conceivable sw package as a module; this would make it all but impossible to find anything, not to mention that nobody would be able to keep track of what is obsolete and what is not. That said, creating modules is really easy. I mean, really. You can also share them with your colleagues and keep track of versions and dependencies.

  • We have changed the default permissions of user directories to reduce the risk of abuse. User directories are now set to mode 700. See the user guide notice for details. This may have affected a limited number of user processes if they have used other users’ directories.

  • Updated user guide: Not all module repositories are set as system defaults. You can control the available repositories; see the HPC Environment User Guide, Software Modules section, for details.

26 Mar 2021 Session Questions

Comment

  • When using the module load command on the login node, things work; however, the same set of modules (for example module restore) does not work on compute nodes. Why is that?
    • We have modules in two repositories: local, which always works, and cvmfs, which has some issues. Unfortunately some modules have the default flag on the cvmfs repo. Major Bug# IT4SCI-834 has been filed against this issue. The most likely reason is that the target node has a failing automount, or the cvmfs repository does not respond.
    • You could drop the cvmfs modules from your environment with: module unuse /cvmfs/fgi.csc.fi/modules/el7/all

17 Mar 2021 Session Questions


* Some Vorna nodes report that they cannot chdir to /wrk
* This is due to some mounts being dropped. We have fixed a bug related to node health checking, and now it should catch missing mounts and automatically drain the node should this happen, preventing user jobs from landing on nodes that are not healthy enough to have them.

* Singularity fails immediately with an error about quota exceeded.
* This is because the /home quota is very small. Therefore you should redirect .cache, .singularity and .matlab to $WRKDIR. This can be done with a symbolic link, or with a parameter that points to a correct cache location if allowed by the software; see the sketch below.
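
* A minimal sketch of the symbolic-link approach, assuming the caches live directly under $HOME (run once on a login node):

# move existing caches out of /home and replace them with links to $WRKDIR
for d in .cache .singularity .matlab; do
    mkdir -p "$WRKDIR/$d"
    [ -d "$HOME/$d" ] && mv "$HOME/$d"/* "$WRKDIR/$d"/ 2>/dev/null
    rm -rf "$HOME/$d"
    ln -s "$WRKDIR/$d" "$HOME/$d"
done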

* I am trying to run a tensorflow script. I loaded Python/3.6.6-intel-2018b cuDNN, created a virtual environment and installed tensorflow and tensorflow-gpu on it. Then I have a batch script where I use: module purge, module load Python cuDNN, the source command to activate the virtual environment, and the code to run the .py script. However I get this error when running the batch script: /proj/$USER/myTensorflow/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory srun: error: ukko2-g01: task 0: Exited with exit code 127
* This error is due to using different Python versions when creating the virtual env and when running the batch job; see the sketch below.
* The FAQ wrongly stated a specific version in the section where the virtual environment was created and then used defaults for the actual launch. This entry has been corrected.
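
* A minimal sketch of keeping the versions consistent (module and file names follow the question above and are otherwise illustrative). First, when creating the environment on a login node:

module purge
module load Python/3.6.6-intel-2018b cuDNN
python -m venv /proj/$USER/myTensorflow

Then load exactly the same modules in the batch script before activating the environment:

module purge
module load Python/3.6.6-intel-2018b cuDNN
source /proj/$USER/myTensorflow/bin/activate
srun python my_script.py       # hypothetical script name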


11 Mar 2021 Session Questions

*  Can one use Zarr on Lustre? 

    *  Yes, you can; this is like a FS on a file and causes no issues at all. You may get some performance difference by adjusting the stripe count; lfs getstripe -c filename shows the current striping. See the [Lustre user guide](https://wiki.helsinki.fi/display/it4sci/Lustre+User+Guide) for examples of how to alter stripe settings.

    *  You can also use HDF5, as mentioned, although the performance in this particular case may not be up to par. A brief striping example follows below.
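
    *  A minimal striping sketch (file and directory names are illustrative; see the Lustre user guide above for details):

lfs getstripe -c mydata.zarr                   # show the current stripe count of a file
lfs setstripe -c 4 /wrk/users/$USER/zarr_data  # new files created in this directory use 4 OSTs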

* Login has wrong Garage times

    * Thanks! Corrected on every login node.

* rclone no longer works on Turso 

    * This appears to be an issue with the proxy. Without proxy settings rclone works, with them it does not. The proxy settings were wrong: HTTPS should not be at the beginning of the environment variable.

* Memory and cpu efficiency question

    * You can use --mem=size to define memory correctly for a single node. You can define memory usage per CPU, but this is practical only when you are running jobs where multiple nodes need to guarantee the same memory availability for all processes.

    * The ONCOSYS partition has oversubscription on, so you can ask for more CPUs. You can experiment to find the number of CPUs that provides the best performance.

* tmux attach fails

    * tmux does not like having its socket in a Lustre-backed /tmp directory. If this is changed to another /tmp location, it works. One should not create sockets on Lustre.

    * We need to fix this with global variable/wrapper:

    * alias tmux="tmux -S /run/user/$UID/tmux-socket"

10 Mar 2021 Session Questions

* Hello, I can't get that zoom link working

    * Thanks, corrected. The link above points to the documentation, where you will find details of the Garage and also the current link to the meeting. This documentation is persistent and will remain so, even if there are meeting changes in the future.

18 Aug 2020 Session Questions

Comment

  • Is there a multiprocessing zip program in ukko2?
    • Yes, there is a pbzip2 available.
  • is it Ok to run tar/gzip, etc in the login node?
    • Yes, however, you can easily start an interactive session and run them on the compute node. This is much faster.
  • Another small issue I am having with my jobs is that I usually get a segfault error at the end, but I know that my python code runs smoothly until the end and returns.
    • This sounds like localflock related issue if the executable is on the WRKDIR. This issue will be corrected on Aug 26th maintenance break.
  • is there one way to monitor memory usage of current jobs?
    • Yes, you can use seff after running a job. This will give you the high watermark, which allows you to adjust memory limits accordingly.
  • Which profilers are available for Python?
    • Guppy does well, and has a graphical view of the actual memory usage during the course of the program.
  • Where is the maintenance message? In the login message?
    • We send them out by e-mail. We’ll add them to the login messages as well. Thanks for the input.
  • I tried to use ssh authorized keys to simplify my login to pangolin from home, but I couldn’t. Is it blocked? I can do it between pangolin and ukko2.
    • Do ssh-agent bash, then ssh-add to add your key, and then ssh to pangolin (assuming your public key is in place…).
  • Do I have to use ssh-agent; ssh-add every time before ssh?
    • No, not if you start ssh-agent during login (see your bash profile, for example); a sketch is below.
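
    • A minimal sketch for the bash profile (this assumes the key is ~/.ssh/id_rsa; adjust to your setup):

# in ~/.bash_profile: start ssh-agent once per login and load the default key
if [ -z "$SSH_AUTH_SOCK" ]; then
    eval "$(ssh-agent -s)" > /dev/null
    ssh-add ~/.ssh/id_rsa 2>/dev/null
fi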


* Hi, thanks for the previous answers about IB. A new question: what is the max memory a single node can provide? According to the ukko2 instruction page, it seems the max capacity of a node would be 3 TB. For some reason I need to test a very large learning space which cannot be distributed; is there any node/way I can request more than 3 TB of memory for a single node? Thanks
* The largest memory we have available is about 3 TB. There are two such nodes on ukko2. Would it be possible to reduce the size to accommodate it in a smaller memory space?
* Your options are (*apart from actually acquiring hardware that has more memory*):
* Try to reduce the size of the data set and the memory requirements. Also use profiling tools to find out if there is actually a requirement for such an amount of memory, or if you have a memory leak instead. If the memory usage is real, and there are no leaks:
* Try to find a mechanism that allows you to parallelize (*MPI*) the workload over multiple nodes.
* Last, we'd have to think of how to find funding for an extremely large memory node.
* 3 TB of memory is a very large capacity, and going beyond this capacity for a single node would soon become prohibitively expensive (*at least if there is limited usage*).
* Thanks for the quick reply. Yeah, it sounds a bit stupid, but it's more of an anti-test: I'm trying to find a lower bound of single-agent learning performance, like how slow it can be... I'll try to find out if there's a memory leak, or compress the space a bit. :+1:
* Sounds fair, and pretty much the only option at this time. Unfortunately the cost of "unlimited memory" might be a tad too much, unless your project can find deep pockets. :)
* Haha, I'll try :)
* Administrative question: I'd like to reboot my controller(s). Is there a particular way to do this? Or can I just reboot one of them at any time, since I have 2 controllers?
* Controllers of what exactly? If we assume that there is a failover present, then one by one. However, this depends on the appliance/system.
* Let's assume this is some LSI/NetApp appliance with two controllers and built-in failover; then you can update and reboot one of them at a time without an interruption (*at least in theory*). Practical implications may differ.


04 Aug 2020 Session Questions
* Hi, thanks for the answers about IB and module creation. Could you kindly give a link that talks about using TCP over IB, especially in this kind of Slurm environment? Thanks.
* Each compute node has several network interfaces (*in the case of ukko2, there are interfaces named bond0 and, for example, ib0*). The latter interface and the associated addresses are Infiniband.
* Since we do have tcp/ip-over-ib enabled on the compute nodes by default, you should be able to direct your program to use the Infiniband interfaces instead of ethernet. Specifics depend somewhat on the program you use, and on how the workers connect to the master node (*e.g. how they determine the addresses of the hosts*).
* Presumably the program has some mechanism to determine which address to use for communications (*for example Spark determines the address by issuing ifconfig -a, then reversing the order and taking the first one on the list, presumably to avoid taking a loopback address - which makes life interesting if the real-world interface list differs from the anticipated one*).
* The key here is to find out which mechanism the program uses to obtain the node addresses; a quick check is sketched below.
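
* A quick way to check the address bound to the Infiniband interface on a node, assuming it is named ib0 as above:

ip -o -4 addr show ib0         # IPv4 address on the Infiniband interface
getent hosts $(hostname -s)    # compare with the address the hostname resolves to
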
* How to use Python virtual environments on the clusters?
* [Comprehensive examples of virtual environment usage can be found in the wiki](FAQ & Scientific Software Use Cases#1.0PythonVirtualENV)
* Note that because of the small quota on $HOME, users are required to redirect the cache to another location, such as $WRKDIR (*--cache-dir=cache directory*); a brief sketch is below.
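
* A brief sketch (module, environment and package names are illustrative):

module load Python
python -m venv $PROJ/myenv
source $PROJ/myenv/bin/activate
pip install --cache-dir=$WRKDIR/pip-cache numpy    # keep the pip cache out of $HOME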

* You recently added samba-sharing for the work directories. I would like to mount it e.g. on melkinpaasi. How do you suggest I do this?
* Umm, so /home/ad/ukko2-wrk/users/$USER was not working earlier today (I have not been able to use it for a while), but it seems to be working again. Nvm.
* Yes, it is working now, because I just started it up again. I did not know it was down before you asked the question about it (it seems that there was an issue with the server sharing the FS and it was rebooted but the services failed to start). Thanks!
* You can also access the directories from your desktop (assuming within CS network) with Samba.

21 Jul 2020 Session Questions
Comment

  • Could we add more detailed instructions for creating modules for your own software? For instance, using a specific program such as SUMO as an example to showcase step by step how to create the modules and share them publicly would be greatly helpful.
    • We have an easy example of how to create your own modules in the Ukko2 User Guide, using a binary that has no dependency trees. Additionally, and especially if dependencies are needed, you can always take a look at the .lua files of the existing modules, such as gcc for example.
    • Modules only set up the appropriate paths and environment variables for the correct binaries and libraries. Hence the system is not overly complicated, but really worth learning the basics of.
    • You can easily share modules within a group if you store the modules in the /proj/group/ directory, or even with a wider audience if so desired. Our brief example shows how this is done; a short sketch also follows below.
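
    • A short sketch of using a shared module tree (the path and module name are hypothetical):

module use /proj/mygroup/modulefiles    # make the group-owned module tree visible
module avail                            # the shared modules now show up in the listing
module load mysoftware/1.0
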
  • As for the inter-node redis communications, I’ve got the answer from the community, it really uses TCP and currently doesn’t support infiniband.
    • Ethernet/tcp communications between the compute nodes are inefficient at best, and non-functional at worst, since those interfaces are used only for certain NFS traffic, node operating system provisioning and other administrative tasks. You could use tcp over IB though.
  • Is this a good place to post user bug reports/PSAs like the following (possibly spurious) node error reports:
    • You can report issues here as well as through the helpdesk. Albeit there is no immediate answer here, it is frequently followed (holidays excepted).
      A quick note about something not working right may be easier to write here. An additional benefit is that other users who combat the same issues may well benefit from the input. This is especially true if it is not certain that the error or issue is caused by an actual fault in the system.
  • Encountered error with vorna-436 (and vorna-435) from srun, fixed with #SBATCH -x vorna-435,vorna-436:
srun: error: Task launch for 3097413.1 failed on node vorna-436: Invalid node name specified
srun: error: Application launch failed: Invalid node name specified
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3097413.1 ON vorna-435 CANCELLED AT 2020-07-01T15:00:48 ***
srun: error: vorna-435: tasks 0-3: Killed
srun: Terminating job step 3097413.1
  • vorna-520 was misbehaving on 2020-07-06; it terminates the local process with exit code 127, other nodes terminate as well, but the job remains in the queue.
    • These two reported events are related to an issue which was resolved on 20th Jul. We discovered an ill mount point and a dysfunctional sssd which caused cascading errors over a longer time across Ukko2 and Vorna. Eventually both the Vorna and Ukko2 login nodes practically hung. Unfortunately vacation times caused a delay in the responsiveness around the issue.

23 Jun Session Questions

  • Shared-memory backing issue using mpirun; the job hangs with the following slurm.out. The user’s workaround is to resubmit or to use srun.
    We’ll talk about this in the session because it is quite a common and important case.
    --------------------------------------------------------------------------
    It appears as if there is not enough space for /wrk/users/mjalho/openmpi-sessions-1049308@vorna-513_0/20816/1/6/vader_segment.vorna-513.2 (the shared-memory backing
    file). It is likely that your MPI job will now either abort or experience
    performance degradation.
Local host: vorna-513
Space Requested: 4194312 B
Space Available: 0 B
--------------------------------------------------------------------------
[vorna-513:29632] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[vorna-512:29333] 10 more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs
[vorna-512:29333] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[vorna-512:29333] 4 more processes have sent help message help-opal-shmem-mmap.txt / target full

This is a common case, related to how MPI jobs are launched on the system. You should really use srun, because it is a far more efficient way to launch tasks than doing it the hard way with mpirun. That said, mpirun does work if it is necessary to use it.
MPI Implementation Considerations

OpenMPI 2.x does not require specific options for Infiniband, since it is defined as the default.
However, other MPI implementations do require specific settings.

Intel MPI requires the user to point explicitly to the PMI library:
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

More details of Slurm + impi:
https://software.intel.com/en-us/articles/how-to-use-slurm-pmi-with-the-intel-mpi-library-for-linux

OpenMPI 3.x requires the user to explicitly set pmix for srun:
--mpi=pmix
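
Put together, a hedged sketch of the launch lines above inside a batch script (the program name is hypothetical):

# OpenMPI 3.x: tell srun to use the PMIx plugin
srun --mpi=pmix ./my_mpi_program

# Intel MPI: point to the PMI library first, then launch with srun
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./my_mpi_program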

  • I’m trying to deploy a distributed machine learning algorithm on ukko2, yet I couldn’t let a node access the Redis on another node. Could it be the firewall that blocks this kind of communication?

Are you deploying it in a Docker container like the documentation at https://redislabs.com/blog/introduction-redis-ml/ suggests? Containers are not yet fully supported on the cluster.
Nope. I’m using this https://docs.ray.io/en/master/deploying-on-slurm.html. I think it just inits tasks on different nodes and requires to access the redis in the nodes.
Note that if your communications use ethernet (tcp) you are bound to have issues. You have to use Infiniband (RDMA). (Would you like to join the Zoom meeting? The topic is interesting.)
ok, I’ll join :)
It is highly likely that the cause is that Redis is using the ethernet/tcp interface instead of Infiniband, hence failing to communicate between the nodes. An interactive session can be used to debug the master-slave layout, even down to the level of specifying the exact nodes you wish to use with the -w nodename option.

  • Interactive session launch on ukko2 warns about drained nodes upon allocation. Something to be worried about?

First check if there are any maintenance breaks about to take place on the cluster. These are normally announced in the cluster message of the day (MOTD) shown upon login. If there is a break coming up, the allocation system will not let you reserve any resources that extend into the time period of the break. You can reserve resources that terminate before the start of the break.
Maintenance break for Ukko2, and Vorna is from 08:00 till 17:00.

  • Can you mount your $WRKDIR from kale to your laptop?

At the moment you cannot. However, you can mount $PROJ from within CS. $PROJ is also mounted on pangolin, melkki and melkinkari.

  • gpu-short queue does not respect the parameter -pgpu-short. Job lands on gpu queue instead.

There is an issue with the partition configuration: first of all, gpu-short should have one GPU that is not a member of the gpu queue.
The reason for this error is that the given script had the option -p twice, one specifying -p gpu-short and the other -p gpu. This means that Slurm will place the job in the queue that is first on the list of queues, in this case gpu.
The -p option can take multiple queue names as an argument. If used, queue eligibility is checked in the order of the arguments, as in the example below.
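
For example, giving both queues on a single -p line (a minimal sketch) lets Slurm check eligibility in this order:

#SBATCH -p gpu-short,gpu    # checked in order; the job lands in the first eligible queue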

  • There should be an new module installed but I cannot see it on the module avail list. How can I see it?

You can use module --ignore_cache option to ignore your current cache and reread the module repository instead of getting your cached repository.
This was the last session before the vacations. Thank you for attending; we will see you again during the next session on 21st Jul.

16 Jun Session Questions:

  • This is regarding the usage of a particular software, GROMACS on Ukko2, Kale and Vorna. The default module installed on these machines appears to be 2018.3-fosscuda-2018b installed at the location /app/modulefiles/all. Other versions are also available and are installed at location /cvmfs/fgi.csc.fi/modules/el7/all. One such version installed at this location is 2016.4 which I would like to use. However when I try to load the module when submitting my job (module load GROMACS/2016.4), it appears to randomly work sometimes and fails to load in other instances with the following error: Lmod has detected the following error. The following modules are unknown: “GROMACS/2016.4”
Please check the spelling or version number. Also try “module spider …”
slurmstepd: error: execve(): gmx_mpi: No such file or directory

Could you please let me know about the reason for this error and how to correct it? Thanks


We now know the reason for this behaviour. It is due to some nodes not being able to mount the cvmfs repositories. The nodes so affected are now out of the system. In case you witness this kind of behaviour, please do report the issue to it4sci<at>helsinki.fi and we can react to it immediately. It also helps if we have information on the affected node.

  • mpirun reports that “network drive may affect performance”. Is this something to worry about?
  • I’ve started getting segfaults again in Ukko2 and Kale. I’ve used the same code last week without any issues. Is this a problem with the cluster?
Does this apply to python scripts too?

This may happen with certain binaries on WRKDIR, which is based on Lustre. In this case, we recommend that you use PROJ instead.
Yes, this also applies to Python scripts that are sensitive to the caching and striping of Lustre.
That said, please keep only your executables on PROJ and leave datasets and actual read-write operations on Lustre.

  • Are there instructions available on how to install Anaconda in my home directory on ukko2 (from Windows)?

Yes, you can use Anaconda on the clusters. You can find step-by-step instructions for this in the Ukko2 user guide.

  • I was having problems using rsync to send files to ukko via the pangolin home/ad/ukko2-wrk/ folder link (various folder permission problems). It seems the alternative way via ssh jump pangolin->ukko2 is working OK at the moment.

Moving files from your home computer (Linux) to ukko2 outside the university network could be done like this (“the alternative way”):

~$ rsync -av --progress -e "ssh -A $user@pangolin.it.helsinki.fi ssh" /my/directory/location $user@ukko2.cs.helsinki.fi:/wrk/users/user/destination

In Windows, use WinSCP. On the launch screen click “Advanced” and set up an SSH tunnel there via e.g. pangolin.it.helsinki.fi, then log in to ukko2.cs.helsinki.fi on the WinSCP launch screen. When the connection is established, you can drag and drop your files between your computer and ukko2.

  • icc just halts (last week)

The Intel C compiler license had (elusively) expired on the cluster. It has now been renewed and the compiler should be working again.

Session Feedback

  • One thing that could be improved concerns building Singularity containers. If I am not mistaken, one needs to run it as root.
    There is a way to run it as ‘fakeroot’, but that is not activated on ukko2.
