Hi everyone,
I’m training a model on a SLURM cluster. My dataset is a folder tree with almost 7,000 MRI files (it’s large). I’m getting the following errors (the full error trace is here, at the very end):
RuntimeError: unable to open shared memory object </torch_3368351_1991753163> in read-write mode
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
To avoid this, I’m calling torch.multiprocessing.set_sharing_strategy('file_system'), roughly as in the sketch below.
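For reference, this is a simplified sketch of my setup (DummyMRIDataset, the tensor shape and the loader arguments are placeholders, not my real code):

import torch
import torch.multiprocessing
from torch.utils.data import Dataset, DataLoader

# Share tensors between DataLoader workers and the main process via named
# files instead of passed file descriptors.
torch.multiprocessing.set_sharing_strategy('file_system')

class DummyMRIDataset(Dataset):
    """Stand-in for my real dataset of ~7,000 MRI volumes (illustrative only)."""

    def __len__(self) -> int:
        return 7000

    def __getitem__(self, idx: int) -> torch.Tensor:
        # The real code loads and preprocesses one MRI file here.
        return torch.zeros(1, 64, 64, 64)

loader = DataLoader(DummyMRIDataset(), batch_size=4, num_workers=8)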
I’ve seen many approaches for solving this error, like these ones, but I would like to understand what is actually happening and what causes the problem, so that I can avoid it in future work. After seeing this, I remembered that I am computing my loss with the following function:
import torch
from typing import Sequence


def weighted_average(
    inputs: Sequence[torch.Tensor], weights: torch.Tensor
) -> torch.Tensor:
    """
    Computes the weighted average of the inputs.

    Args:
        inputs: a sequence of tensors with the same shape.
        weights: a 1-D tensor with one weight per input tensor.

    Returns:
        Weighted average of the inputs.
    """
    _msg_len_ = (
        f"Expected lengths to be equal, len(inputs) == len(weights), "
        f"but got {len(inputs) = } and {len(weights) = }."
    )
    assert len(inputs) == len(weights), _msg_len_
    # Multiply each input by its weight, then sum and normalise by the total weight.
    weighted_sum = [torch.mul(w_i, in_i) for in_i, w_i in zip(inputs, weights)]
    return torch.stack(weighted_sum, dim=0).sum(0) / weights.sum()
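For context, a minimal call looks like this (the numbers are made up, not my real losses):

losses = [torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.5)]
weights = torch.tensor([1.0, 2.0, 1.0])

loss = weighted_average(losses, weights)
print(loss)  # (0.8*1.0 + 0.3*2.0 + 0.5*1.0) / 4.0 = tensor(0.4750)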
Questions
- The above function uses a list, weighted_sum. Could this be the root of the problem, or is it more likely related to the large number of files?
- If not, would switching to HDF5 (a sketch of what I have in mind is at the end of this post) solve the problem and let me use the default multiprocessing sharing strategy?
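In case it helps with the second question, this is roughly the HDF5 variant I have in mind (a sketch using h5py; the file path and the "volumes" dataset key are placeholders, and the file is opened lazily so each worker gets its own handle):

import h5py
import torch
from torch.utils.data import Dataset

class H5MRIDataset(Dataset):
    """Sketch of an HDF5-backed dataset (names are placeholders)."""

    def __init__(self, h5_path: str, key: str = "volumes"):
        self.h5_path = h5_path
        self.key = key
        self._file = None  # opened lazily so each DataLoader worker gets its own handle

    def __len__(self) -> int:
        with h5py.File(self.h5_path, "r") as f:
            return f[self.key].shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        if self._file is None:
            self._file = h5py.File(self.h5_path, "r")
        # h5py returns a NumPy array; convert it to a tensor for the model.
        return torch.from_numpy(self._file[self.key][idx])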