Training stops due to Caught RuntimeError in DataLoader worker process 0 with a large dataset of files

Hi everyone,

I’m training a model on a SLURM cluster. My dataset is a folder tree with almost 7000 MRI files (it’s large). I’m getting the following errors (full error trace here, at the very end):

  • RuntimeError: unable to open shared memory object </torch_3368351_1991753163> in read-write mode
  • RuntimeError: Caught RuntimeError in DataLoader worker process 0.

I’m using torch.multiprocessing.set_sharing_strategy('file_system') to avoid this. I’ve seen many approaches to solving this error, like these ones, but I would like to understand what is really happening and what is causing the problem, so I can avoid it in future developments. After seeing this, I remembered that I am computing my loss with the following function:

from typing import Sequence

import torch


def weighted_average(
    inputs: Sequence[torch.Tensor], weights: torch.Tensor
) -> torch.Tensor:
    """
    Computes the weighted average of the inputs.

    Args:
        inputs: a sequence of tensors with the same shape.
        weights: a 1-D tensor with one weight per input tensor.

    Returns:
        Weighted average of the inputs.
    """
    _msg_len_ = f"Expected lengths to be equal len(inputs) == len(weights) but got {len(inputs) = } and {len(weights) = }."
    assert len(inputs) == len(weights), _msg_len_
    weighted_sum = [torch.mul(w_i, in_i) for in_i, w_i in zip(inputs, weights)]
    return torch.stack(weighted_sum, dim=0).sum(0) / weights.sum()
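
For context, and going back to the set_sharing_strategy line above, this is roughly how the strategy and the DataLoader are wired up. It is a simplified sketch rather than my actual training script: MRIFolderDataset, the .nii.gz glob, the paths and the dummy volume it returns are all placeholders.

import glob

import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, Dataset

# Switch the sharing strategy before any DataLoader workers exist,
# so forked workers inherit it.
torch.multiprocessing.set_sharing_strategy("file_system")


class MRIFolderDataset(Dataset):
    """Placeholder standing in for the real dataset over the MRI folder tree."""

    def __init__(self, root: str) -> None:
        self.paths = sorted(glob.glob(f"{root}/**/*.nii.gz", recursive=True))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # The real code loads and preprocesses the MRI volume at self.paths[idx].
        return torch.zeros(1, 64, 64, 64)


def worker_init_fn(worker_id: int) -> None:
    # Forked workers inherit the strategy; spawned workers may not,
    # so it is repeated in each worker as a precaution.
    torch.multiprocessing.set_sharing_strategy("file_system")


train_loader = DataLoader(
    MRIFolderDataset("/path/to/mri"),  # placeholder path
    batch_size=4,
    num_workers=8,
    worker_init_fn=worker_init_fn,
)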

Questions

  • The weighted_average function above builds a Python list (weighted_sum). Could this be the root of the problem, or is it more likely related to the large number of files?
  • If not, would using HDF5 solve the problem and allow me to use the default multiprocessing sharing strategy? (A sketch of what I have in mind is below.)
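
For reference on the second question, this is the kind of HDF5-backed dataset I have in mind, assuming the ~7000 volumes were first packed into a single file. It is only a sketch: the h5py dependency, the file layout and the "volumes" key are assumptions, not something I have implemented yet. The point of interest is that each DataLoader worker opens its own file handle lazily instead of sharing one across processes.

import h5py
import torch
from torch.utils.data import Dataset


class HDF5MRIDataset(Dataset):
    """Sketch of a dataset backed by a single HDF5 file of MRI volumes."""

    def __init__(self, h5_path: str) -> None:
        self.h5_path = h5_path
        self._file = None  # opened lazily, once per worker process
        with h5py.File(h5_path, "r") as f:
            # Assumed layout: one dataset "volumes" of shape (N, C, D, H, W).
            self._length = f["volumes"].shape[0]

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, idx: int) -> torch.Tensor:
        if self._file is None:
            # Open inside the worker, not in __init__, so the handle
            # is never shared across processes.
            self._file = h5py.File(self.h5_path, "r")
        volume = self._file["volumes"][idx]  # numpy array
        return torch.from_numpy(volume).float()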

You might be running out of shared memory, so you could try to increase it. Also, if you are using containers, make sure you are giving their runtime enough shared memory (e.g. via --ipc=host for Docker containers).
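
If it helps, a quick way to see how much shared memory the job actually gets is to look at /dev/shm from inside the job or container. A minimal sketch (the 0.5 GiB threshold is just an arbitrary example):

import shutil

# On Linux, the shared memory objects named in the error live in /dev/shm,
# which the DataLoader workers use to pass tensors to the main process.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**30:.1f} GiB, "
      f"used={used / 2**30:.1f} GiB, free={free / 2**30:.1f} GiB")

# Arbitrary example threshold: flag suspiciously small shared memory.
if free < 0.5 * 2**30:
    print("Shared memory looks too small for multi-worker data loading.")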