Hi,
I'm getting an error that seems related to the DataLoader and shared memory; the error below pops up randomly.
RuntimeError: falseINTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp:263, please report a bug to PyTorch. unable to open shared memory object </torch_43891_16> in read-write mode
I am using DDP (torch.distributed.run) to train a model on the ImageNet dataset with a multi-node, multi-GPU setup. I've tried several possible solutions, such as setting ulimit -n <max value> or calling torch.multiprocessing.set_sharing_strategy('file_system'), but neither works.
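For reference, this is roughly where I call it, assuming it has to run at the top of the training script before any DataLoader is created (the script layout here is just a placeholder):

```python
import torch.multiprocessing as mp

# Switch from the default 'file_descriptor' sharing strategy to 'file_system',
# so tensors passed between DataLoader workers are shared via files in shared
# memory rather than open file descriptors.
mp.set_sharing_strategy('file_system')
```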
Each node has an AMD CPU with 64 processors, 2 A100 GPUs, and 250 GB of RAM. I previously set num_workers=24 in the DataLoader since the node has 64 processors in total, but it still failed.
In my latest attempt I set num_workers=12 and ulimit -n <max value>; it works for now, but the job is still running, so I can't be sure the error won't pop up again later.
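For context, here is a simplified sketch of how the DataLoader is set up in each process; the dataset is just a random-tensor stand-in for my ImageNet pipeline, and the batch size is made up:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Stand-in dataset; the real pipeline loads ImageNet.
train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                              torch.randint(0, 1000, (1024,)))

# Launched with torch.distributed.run, so RANK/WORLD_SIZE come from the environment.
dist.init_process_group(backend="nccl")

sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    batch_size=256,      # placeholder value
    sampler=sampler,
    num_workers=12,      # was 24 in the earlier runs that failed
)
```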
May I ask whether num_workers could be related to this error? And if so, what would be a proper way to choose num_workers?
Thanks!