Runtime Error related to shared memory

Hi,

I got an error that might be related to the DataLoader and shared memory. The error below pops up at random.

RuntimeError: false INTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp:263, please report a bug to PyTorch. unable to open shared memory object </torch_43891_16> in read-write mode
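The object name in the message, /torch_43891_16, is a POSIX shared memory name, and on Linux such objects appear as files under /dev/shm. A quick check of how much space is free there and how many torch_ segments are lying around may help narrow things down. The snippet below is only a minimal sketch, assuming a Linux node; inspect_shm is a hypothetical helper name, not part of PyTorch.

```python
import os
import shutil


def inspect_shm(path="/dev/shm", prefix="torch_"):
    """Return (free space in GiB, sorted names of segments starting with prefix)."""
    usage = shutil.disk_usage(path)
    # The "torch_" prefix is taken from the object name in the error message.
    segments = sorted(name for name in os.listdir(path) if name.startswith(prefix))
    return usage.free / 2**30, segments


if __name__ == "__main__":
    # /dev/shm is where Linux exposes POSIX shared memory objects.
    if os.path.isdir("/dev/shm"):
        free_gib, segments = inspect_shm()
        print(f"/dev/shm free: {free_gib:.1f} GiB, torch segments: {len(segments)}")
```

Running this on each node while training is in progress would show whether /dev/shm is close to full or whether segments pile up over time.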

I am using DDP, launched with torch.distributed.run, to train a model on the ImageNet dataset in a multi-node, multi-GPU setup. I’ve tried several possible solutions, such as setting ulimit -n <max value> and calling torch.multiprocessing.set_sharing_strategy('file_system'), but neither worked.
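For reference, the ulimit -n change can also be applied from inside the training script with Python's standard resource module. This is a minimal sketch, assuming a Linux/Unix host; raise_open_file_limit is a hypothetical helper name, and it can only raise the soft limit up to the hard limit the process already has.

```python
import resource


def raise_open_file_limit():
    """Raise the soft open-file limit to the hard limit; return (soft, hard) afterwards."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Keep the existing limit if the OS refuses the change.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open-file limit: soft={soft}, hard={hard}")
```

It would be called at the top of the training script, before torch.multiprocessing.set_sharing_strategy('file_system') and before any DataLoader is created, so that the worker processes inherit the raised limit.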

Each node has an AMD CPU with 64 processors, 2 A100 GPUs, and 250 GB of RAM.

I previously set num_workers=24 in the DataLoader, since each node has 64 processors in total, but training still failed.

In my latest attempt I set num_workers=12 and ulimit -n <max value>. It works for now, but the job is still running and I cannot be sure the error won’t pop up later.

May I ask whether num_workers is related to this error? If so, what would be a proper way to choose num_workers?

Thanks!

The code worked only once, with export OMP_NUM_THREADS=32 and num_workers=12, when training on two nodes. If I increase the number of nodes to 4, it fails again.
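On choosing num_workers: one common rule of thumb (a rough heuristic, not an official PyTorch recommendation) is to split the CPUs on a node evenly among the training processes on that node, typically one per GPU, and to leave a core or so for each main process. The sketch below assumes that rule; suggest_num_workers and its parameters are hypothetical names used only for illustration.

```python
import os


def suggest_num_workers(cpus_per_node, procs_per_node, reserve_per_proc=1, cap=None):
    """Rule-of-thumb DataLoader worker count for one training process.

    Illustrative heuristic only: each process gets an equal share of the CPUs,
    minus a reserve for its main thread, optionally limited by a cap.
    """
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")
    share = cpus_per_node // procs_per_node
    workers = max(0, share - reserve_per_proc)
    if cap is not None:
        workers = min(workers, cap)
    return workers


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # Assumes 2 training processes per node (one per GPU), as in the setup above.
    print(suggest_num_workers(cpus, procs_per_node=2, cap=12))
```

With 64 processors and 2 GPUs per node, this gives an upper bound of 31 workers per process, so num_workers=12 is well within the CPU budget. Every extra worker also means more batches passed through shared memory at once, so a lower cap is a reasonable thing to experiment with when this error appears.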

Thanks for posting the question, @yushu. Someone else posted a similar problem before; do you think this might help you resolve the issue? How to cache an entire dataset in multiprocessing?

Thanks for your advice. I’ve looked at the link you referred to; unfortunately, it does not seem to be related to my issue.
Currently I assume it is related to the PyTorch version. I previously worked with PyTorch 1.10 + CUDA 11.3 on Miniconda. Now that I have switched back to PyTorch 1.9 + CUDA 11, the error no longer seems to occur. It is a workaround, but why it happened before still needs debugging.