Error waiting on exit barrier

Hello

I am using distributed PyTorch. The environment is a Singularity container with NCCL 2.9.9, running on InfiniBand nodes of an HPC cluster managed by Slurm. I normally run on 2 nodes with 1 GPU each, or 2 nodes with 4 GPUs each. The code is YOLOv6 from GitHub.

torch 1.12
torchvision 0.13

I initialize the process group like this:
dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", rank=args.rank, world_size=args.world_size, timeout = timedelta(seconds=7200))
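For reference, rank and world size come from the torchrun environment; this is roughly how they get filled in (a simplified sketch, not the exact YOLOv6 code, assuming the standard RANK / WORLD_SIZE / LOCAL_RANK variables that torchrun exports per worker):

```python
# Simplified sketch (not the exact YOLOv6 code): deriving rank / world size
# from the environment variables that torchrun sets for each worker process.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # per-node process index
rank = int(os.environ.get("RANK", 0))                # global rank across nodes
world_size = int(os.environ.get("WORLD_SIZE", 1))    # total number of processes

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl" if dist.is_nccl_available() else "gloo",
    rank=rank,
    world_size=world_size,
    timeout=timedelta(seconds=7200),
)
```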

The process works with a small dataset, but when the dataset is big, this happens:

Inferencing model in train datasets.:  88%|▉| 1092/1241 [05:07<00:41,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1093/1241 [05:07<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1094/1241 [05:07<00:40,  **3.62iERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 304.7670805454254 seconds**
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _exit_barrier
    store_util.barrier(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:580 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd8 (0x7fff6e2f2428 in /opt/conda/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xcc (0x7fff6e2ecd8c in /opt/conda/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0x1a0 (0x7fff9f641880 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4c (0x7fff9f642cac in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x90 (0x7fff9f642dc0 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fff9f6052fc in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb29d38 (0x7fffa2249d38 in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x25622c (0x7fffa197622c in /opt/conda/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x2c4960 (0x132f94960 in /opt/conda/bin/python)
frame #9: _PyObject_MakeTpCall + 0xcc (0x132d4e26c in /opt/conda/bin/python)
frame #10: <unknown function> + 0x2a4bb4 (0x132f74bb4 in /opt/conda/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x8418 (0x132d36038 in /opt/conda/bin/python)
frame #12: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #15: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #18: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x6f40 (0x132d34b60 in /opt/conda/bin/python)
frame #20: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #22: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #23: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #25: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #26: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #27: PyVectorcall_Call + 0x90 (0x132d4d8a0 in /opt/conda/bin/python)
frame #28: _PyObject_Call + 0x1a8 (0x132d4dbe8 in /opt/conda/bin/python)
frame #29: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #31: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #32: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x7e18 (0x132d35a38 in /opt/conda/bin/python)
frame #34: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #36: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #37: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #38: _PyObject_FastCallDictTstate + 0x8c (0x132d4e55c in /opt/conda/bin/python)
frame #39: _PyObject_Call_Prepend + 0xd4 (0x132d4e914 in /opt/conda/bin/python)
frame #40: <unknown function> + 0xef654 (0x132dbf654 in /opt/conda/bin/python)
frame #41: _PyObject_Call + 0x94 (0x132d4dad4 in /opt/conda/bin/python)
frame #42: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #43: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #44: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #45: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #46: <unknown function> + 0x5c5f4 (0x132d2c5f4 in /opt/conda/bin/python)
frame #47: PyVectorcall_Call + 0x90 (0x132d4d8a0 in /opt/conda/bin/python)
frame #48: _PyObject_Call + 0x1a8 (0x132d4dbe8 in /opt/conda/bin/python)
frame #49: PyCFunction_Call + 0x44 (0x132d4dce4 in /opt/conda/bin/python)
frame #50: _PyEval_EvalFrameDefault + 0x389c (0x132d314bc in /opt/conda/bin/python)
frame #51: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #52: _PyFunction_Vectorcall + 0xd8 (0x132d4de08 in /opt/conda/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x7104 (0x132d34d24 in /opt/conda/bin/python)
frame #54: <unknown function> + 0x160244 (0x132e30244 in /opt/conda/bin/python)
frame #55: _PyEval_EvalCodeWithName + 0xa4 (0x132e306a4 in /opt/conda/bin/python)
frame #56: PyEval_EvalCodeEx + 0x74 (0x132e30744 in /opt/conda/bin/python)
frame #57: PyEval_EvalCode + 0x48 (0x132e307c8 in /opt/conda/bin/python)
frame #58: <unknown function> + 0x1b0cbc (0x132e80cbc in /opt/conda/bin/python)
frame #59: <unknown function> + 0x1b0e30 (0x132e80e30 in /opt/conda/bin/python)
frame #60: PyRun_FileExFlags + 0x114 (0x132e85b84 in /opt/conda/bin/python)
frame #61: PyRun_SimpleFileExFlags + 0x230 (0x132e85f20 in /opt/conda/bin/python)
frame #62: PyRun_AnyFileExFlags + 0xc8 (0x132e86918 in /opt/conda/bin/python)
frame #63: Py_RunMain + 0xa00 (0x132d3bc60 in /opt/conda/bin/python)

t/s]
Inferencing model in train datasets.:  88%|▉| 1095/1241 [05:07<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1096/1241 [05:08<00:40,  3.62it/s]
Inferencing model in train datasets.:  88%|▉| 1097/1241 [05:08<00:39,  3.63it/s]
Inferencing model in train datasets.:  88%|▉| 1098/1241 [05:08<00:39,  3.62it/s]
Inferencing model in train datasets.:  89%|▉| 1099/1241 [05:09<00:39,  3.63it/s]

So, what is happening? After 2 epochs the evaler (the YOLOv6 evaluator) runs, and I guess this call comes from the evaler when it tries to save the model or checkpoint. Ironically, all the prints from the main training process still appear, so even after this error the process does not stop and runs to the end. Even the inference progress bar ("Inferencing model in train datasets") finishes.
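In case it matters, this is the kind of pattern I am wondering about (a hypothetical sketch, not the actual YOLOv6 code; `evaler` and `ckpt_path` are placeholders): if only rank 0 runs the slow evaluation and checkpoint saving, the other ranks reach the end of the script much earlier unless something makes them wait.

```python
# Hypothetical sketch (not the actual YOLOv6 code) of the pattern I suspect:
# rank 0 does the slow evaluation / checkpoint saving while the other ranks
# have nothing left to do.
import torch
import torch.distributed as dist

def end_of_epoch(rank, model, evaler, ckpt_path):
    if rank == 0:
        evaler.run(model)                          # slow with a big dataset
        torch.save(model.state_dict(), ckpt_path)  # checkpoint only on rank 0
    # Without a barrier here, the non-zero ranks finish the script while rank 0
    # is still evaluating, and their agents start waiting on the exit barrier.
    dist.barrier()
```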

This is the Slurm script:

#!/bin/bash
#SBATCH --job-name=yolo_test
#SBATCH --qos=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=160
#SBATCH --gres=gpu:4
#SBATCH --output=/gpfs/scratch/X/X/slurm_logs/airurban/yolo_test_%j.out
#SBATCH --error=/gpfs/scratch/X/X/slurm_logs/airurban/yolo_test_%j.err


#export NCCL_DEBUG_SUBSYS=COLL
#export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_SHOW_CPP_STACKTRACES=1
export NCCL_IB_TIMEOUT=22

# ---> I'VE BEEN USING THESE VARIABLES A LOT, WITH THE SAME RESULTS
export NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_DESYNC_DEBUG=1
#export ENABLE_NCCL_HEALTH_CHECK=1

echo "SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo Node IP: $head_node_ip

export MASTER_ADDR=$head_node_ip
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_GPUS_ON_NODE))

echo "MASTER_ADDR="$MASTER_ADDR
echo "Head Node IP:="$head_node_ip
echo "SLURM_PROCID="$SLURM_PROCID
echo "SLURM_NNODES="$SLURM_NNODES
echo "SLURM_JOB_ID="$SLURM_JOB_ID
echo "SLURM_GPUS_ON_NODE="$SLURM_GPUS_ON_NODE
echo "WORLD_SIZE="$WORLD_SIZE


### Loading environment
module load cuda/10.2 cudnn/8.0.5 nccl/2.9.9 singularity
srun singularity exec --nv /apps/SINGULARITY/images/numpy1.26-torch-vision.sif torchrun \
		--nproc_per_node $SLURM_GPUS_ON_NODE \
		--nnodes $SLURM_NNODES \
		--rdzv_id $SLURM_JOB_ID \
		--rdzv_backend c10d \
		--rdzv_endpoint $MASTER_ADDR:29500 \
		--log_dir "/gpfs/scratch/X/X/slurm_logs/airurban/torchrun_logs" \
		 tools/train.py \
			--batch 32 \
			--bs_per_gpu=8 \
			--conf configs/yolov6n.py \
			--data data/dataset.yaml \
			--fuse_ab \
			--device 0,1,2,3 \
			--workers 32 \
			--eval-interval 1 \
			--epochs 2 

I’ve been reading a lot about this in forums and topics here, and according to them the timeout is pretty important. My question is: why is the 7200 s not being taken into account? When the dataset is not huge, the process works fine all the way to the end, without errors. When it is huge, something gets delayed at some point, but the main process is not waiting for its children, is it? So why is increasing the timeout not working? The 7200 s value does seem to be picked up; if I set the log level to maximum, I see:


[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:29500 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:417] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:462] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n03]:48738.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n03]:48740.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n03]:48738.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n03]:48740.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n14]:42492.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n14]:42492.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (10.2.1.39, 29500).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:29500.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [p9r3n14]:42494.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:29500 on [p9r3n14]:42494.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
[I debug.cpp:47] [c10d] The debug level is set to DETAIL.
Using 4 GPU for training... 
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:710] [c10d - trace] The server socket on [p9r3n03]:54679 is not yet listening (errno: 111 - Connection refused), will retry.
training args are: Namespace(data_path='data/dataset.yaml', conf_file='configs/yolov6n.py', img_size=640, batch_size=32, epochs=2, workers=32, device='0,1,2,3', eval_interval=1, eval_final_only=False, heavy_eval_range=50, check_images=False, check_labels=False, output_dir='./runs/train', name='exp', dist_url='env://', gpu_count=0, local_rank=0, resume=False, write_trainbatch_tb=False, stop_aug_last_n_epoch=15, save_ckpt_on_last_n_epoch=-1, distill=False, distill_feat=False, quant=False, calib=False, teacher_model_path=None, temperature=20, fuse_ab=True, bs_per_gpu=8, rank=0, world_size=8, save_dir='runs/train/exp26')

Initializing process group... 
[I socket.cpp:417] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:462] [c10d - debug] The server socket is attempting to listen on [::]:54679.
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:54679.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n03]:55806.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n03]:55806.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51710.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51714.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51716.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51712.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51718.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51710.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51722.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51714.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51716.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:582] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (p9r3n03.power.cte, 54679).
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:648] [c10d - trace] The client socket is attempting to connect to [p9r3n03]:54679.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51718.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51722.
[I socket.cpp:276] [c10d - debug] The server socket on [::]:54679 has accepted a connection from [p9r3n14]:51720.
[I socket.cpp:725] [c10d] The client socket has connected to [p9r3n03]:54679 on [p9r3n14]:51712.
[I ProcessGroupNCCL.cpp:587] [Rank 7] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 7200000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 7] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 5] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 7200000
USE_HIGH_PRIORITY_STREAM: 0

Why does this keep failing at ~300 seconds? I’ve read that by default the init timeout is 30 minutes. So, are there two different timeouts? My process probably also takes around 30 minutes at that point, but the error blames the 300-second timeout.
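If I understand the traceback correctly, the exit barrier is just a key exchange over the rendezvous TCPStore, with its own timeout that is separate from the 7200 s I pass to init_process_group. Below is a minimal, hypothetical sketch of that pattern (host, port, and key names are made up): a get() on a key that no other process ever sets fails with the same "Socket Timeout" once the store timeout expires.

```python
# Minimal, hypothetical sketch of the pattern in the traceback: a TCPStore.get()
# on a key that nobody sets times out after the *store* timeout, independently
# of the timeout given to init_process_group.
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore(
    "127.0.0.1", 29501,                 # made-up host/port, just for the demo
    world_size=1, is_master=True,
    timeout=timedelta(seconds=10),      # short timeout so the demo fails fast
    wait_for_workers=False,
)
store.set("exit_barrier/rank0", "done")  # this process announces it is done
try:
    store.get("exit_barrier/rank1")      # the other rank never sets its key
except RuntimeError as e:
    print("timed out waiting on the store:", e)
```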

I’ve seen other people report what looks like the same issue in other topics here.

Please, any thoughts are welcome. Thanks.

I forgot to mention the CUDA and NCCL versions:
CUDA 10.2
NCCL 2.9.9

Could you update PyTorch to the latest stable or nightly release and check if you would still run into these issues?

Hello
It’s an HPC research facility, so updating is not easy and sometimes not possible; for now, that is the case. However, a new HPC cluster is “opening its doors” soon (weeks, maybe a few months). So, if there is no further advice for tracking down the bug with the current configuration, I’ll post back here in a few weeks with new information. Otherwise, any new advice is welcome.