Dataloader slowing down and getting stuck

Hi all,
I’m using the PyTorch DataLoader for 3D medical imaging. The only thing that really happens in my dataset is that cropped patches from 3D volumes (saved as .npy files) are read from the local SSD. However, I recently noticed that the dataloader gets stuck or slows down significantly after hours of training. I get messages like

kernel: [438302.215353] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [python:1397221]

Usually I have four Python processes running, one for each of the four GPUs in our server. If one of the processes is stopped with Ctrl-C, the others start running again.
I reduced the number of workers from 8 to 5 and found that, in the middle of training, the time per epoch went up by a factor of 20. Ctrl-C stopped all processes while the dataloader was trying to get data and showed something like

data = self._data_queue.get(timeout=timeout), line 990, in _try_get_data
self.not_empty.wait(remaining), line 180, in get
gotit = waiter.acquire(True, timeout), line 316, in wait

I’m using a server with four Quadro P6000 GPUs, one AMD EPYC 7352 24-core CPU, and a custom version of Ubuntu 20.04 that our IT created.
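For reference, here is a minimal sketch of what my dataset does (class name, paths, and patch size are placeholders, not my actual code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class PatchDataset(Dataset):
    """Returns a random cropped patch from a 3D volume stored as .npy."""

    def __init__(self, volume_paths, patch_size=(64, 64, 64)):
        self.volume_paths = volume_paths
        self.patch_size = patch_size

    def __len__(self):
        return len(self.volume_paths)

    def __getitem__(self, idx):
        vol = np.load(self.volume_paths[idx])  # read from local SSD
        # pick a random corner so the patch fits inside the volume
        z, y, x = [np.random.randint(0, s - p + 1)
                   for s, p in zip(vol.shape, self.patch_size)]
        pz, py, px = self.patch_size
        patch = vol[z:z + pz, y:y + py, x:x + px]
        return torch.from_numpy(np.ascontiguousarray(patch))


# loader = DataLoader(PatchDataset(paths), batch_size=2,
#                     num_workers=8, pin_memory=True)
```

One such DataLoader runs per GPU process, so with 8 workers each there are 32 worker processes reading from the same SSD.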

Does anyone know what could be going on and how to fix this?

Many thanks!