DataLoader hangs during training

I am developing with torchvision and its infrastructure, such as Dataset and DataLoader. Over the last months on Windows 10 there were no issues with this code or with how I used it.
Yesterday I moved to a fresh Linux installation and set up the whole environment: CUDA, Python 3.9, and the packages from the requirements.txt created from the venv I used on the Windows 10 workstation. The GPU memory (2 GB) is no longer sufficient, even with a batch size of 1 on the data loaders.
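
To quantify the memory part, this is roughly how I check the usage right after a forward/backward pass with batch size 1 (a minimal sketch, not taken from my training code):

    import torch

    # Report how much GPU memory this process has actually used so far.
    print(torch.cuda.get_device_name(0))
    print(f"allocated:      {torch.cuda.memory_allocated(0) / 1024**2:.0f} MiB")
    print(f"peak allocated: {torch.cuda.max_memory_allocated(0) / 1024**2:.0f} MiB")
    print(f"reserved:       {torch.cuda.memory_reserved(0) / 1024**2:.0f} MiB")
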
But the strangest things happen with the multiprocessing data loaders, regardless of whether I use a single worker or several (4). In the code snippet below the timeout is reached almost every time; only occasionally is a record returned by the call self._data_queue.get(timeout=timeout):

    def _try_get_data(self, timeout=_utils.MP_STATUS_CHECK_INTERVAL):
        # Tries to fetch data from `self._data_queue` once for a given timeout.
        # This can also be used as inner loop of fetching without timeout, with
        # the sender status as the loop condition.
        #
        # This raises a `RuntimeError` if any worker died unexpectedly. This error
        # can come from either the SIGCHLD handler in `_utils/signal_handling.py`
        # (only for non-Windows platforms), or the manual check below on errors
        # and timeouts.
        #
        # Returns a 2-tuple:
        #   (bool: whether successfully get data, any: data if successful else None)
        try:
            data = self._data_queue.get(timeout=timeout)
            return (True, data)
        except Exception as e:
            # At timeout and error, we manually check whether any worker has
            # failed. Note that this is the only mechanism for Windows to detect
            # worker failures.
            failed_workers = []
            for worker_id, w in enumerate(self._workers):
                if self._workers_status[worker_id] and not w.is_alive():
                    failed_workers.append(w)
                    self._mark_worker_as_unavailable(worker_id)
            if len(failed_workers) > 0:
                pids_str = ', '.join(str(w.pid) for w in failed_workers)
                raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
            if isinstance(e, queue.Empty):
                return (False, None)
            import tempfile
            import errno
            try:
                # Raise an exception if we are this close to the FDs limit.
                # Apparently, trying to open only one file is not a sufficient
                # test.
                # See NOTE [ DataLoader on Linux and open files limit ]
                fds_limit_margin = 10
                fs = [tempfile.NamedTemporaryFile() for i in range(fds_limit_margin)]
            except OSError as e:
                if e.errno == errno.EMFILE:
                    raise RuntimeError(
                        "Too many open files. Communication with the"
                        " workers is no longer possible. Please increase the"
                        " limit using `ulimit -n` in the shell or change the"
                        " sharing strategy by calling"
                        " `torch.multiprocessing.set_sharing_strategy('file_system')`"
                        " at the beginning of your code") from None
            raise
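
Since the error path above explicitly mentions the open-files limit and the sharing strategy, this is how I check both on the Linux machine (a small diagnostic sketch, not part of my training code):

    import resource
    import torch.multiprocessing as mp

    # Equivalent of `ulimit -n`: the soft limit is what the workers run into first.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file descriptor limit: soft={soft}, hard={hard}")

    # Current sharing strategy; the error message suggests 'file_system' as an alternative.
    print(f"sharing strategy: {mp.get_sharing_strategy()}")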

I started debugging into the details, and what I see is that the Dataset (a Subset) is properly set in the _MultiProcessingDataLoaderIter.
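
This is the kind of quick check I run from a Python prompt on the data_loader_train that is set up below (it pokes at internal attributes, so it may differ between PyTorch versions):

    it = iter(data_loader_train)    # _MultiProcessingDataLoaderIter when num_workers > 0
    print(type(it).__name__)
    print(type(it._dataset))        # torch.utils.data.dataset.Subset in my case
    print(len(it._dataset))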

This is how I set up the data loaders:

        # split the dataset in train and test set
        train_size = int(len(dataset) * train_val_split)
        val_size = len(dataset) - train_size
        train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

        # define training and validation data loaders
        data_loader_train = torch.utils.data.DataLoader(
            train_set, batch_size=config['batch_size'], shuffle=True, num_workers=4,
            collate_fn=utils.collate_fn)

        data_loader_val = torch.utils.data.DataLoader(
            val_set, batch_size=config['batch_size'], shuffle=False, num_workers=1,
            collate_fn=utils.collate_fn)
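
To take the model and the GPU out of the picture, I also time the bare training loader like this (the helper name is just illustrative):

    import time

    # Iterate the training loader on its own and time each batch, to see whether
    # the delays already occur while fetching data, independent of any training.
    def time_loader(loader, max_batches=20):
        start = time.perf_counter()
        for i, batch in enumerate(loader):
            print(f"batch {i}: {time.perf_counter() - start:.2f}s")
            start = time.perf_counter()
            if i + 1 >= max_batches:
                break

    time_loader(data_loader_train)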

I can observe that right from the start the timeout is reached three times in a row, and only then is the first record returned from the dataset. After a few records have been returned I keep getting timeouts again, and this pattern continues, so training is not feasible due to the enormous delays.
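
To see whether the workers themselves start and keep producing items, I can also add logging via worker_init_fn (the logging function is mine; the argument itself is a regular DataLoader parameter):

    import os
    import torch

    # Print from inside each worker process right after it starts, so the console
    # shows whether all workers come up and which dataset/seed they received.
    def log_worker_init(worker_id):
        info = torch.utils.data.get_worker_info()
        print(f"worker {worker_id}: pid={os.getpid()}, "
              f"dataset={type(info.dataset).__name__}, seed={info.seed}")

    data_loader_train = torch.utils.data.DataLoader(
        train_set, batch_size=config['batch_size'], shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn, worker_init_fn=log_worker_init)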

Why is the behavior different on Linux? When I set num_workers to 0, the blocking is gone.
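
The only variation I still see to experiment with is the start method: the DataLoader has a multiprocessing_context argument for that, but whether it changes anything here is exactly what I don't know:

    # Same training loader as above, but with the workers started via 'spawn'
    # instead of Linux's default 'fork', which is closer to how Windows behaves.
    data_loader_train = torch.utils.data.DataLoader(
        train_set, batch_size=config['batch_size'], shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn, multiprocessing_context='spawn')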