I got an error that might be related to the DataLoader and shared memory; the error below pops up randomly.
RuntimeError: falseINTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp:263, please report a bug to PyTorch. unable to open shared memory object </torch_43891_16> in read-write mode
I am using DDP (torch.distributed.run) to train a model on the ImageNet dataset with a multi-node, multi-GPU setup. I've tried several possible solutions, such as setting ulimit -n <max value> or calling torch.multiprocessing.set_sharing_strategy('file_system'), but neither works.
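For reference, this is roughly how I applied the sharing-strategy workaround (a minimal sketch with a toy dataset standing in for ImageNet; batch size and worker count are just the values from my runs):

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Switch the IPC mechanism from file descriptors to the file system so the
# DataLoader workers don't exhaust the per-process open-file limit.
mp.set_sharing_strategy('file_system')

def worker_init_fn(worker_id):
    # Re-apply the strategy inside every worker process as well, since it is
    # per-process state (mostly a belt-and-braces measure).
    mp.set_sharing_strategy('file_system')

# Toy dataset standing in for the real ImageNet dataset.
dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                        torch.zeros(256, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=24,          # value from my original run
    pin_memory=True,
    worker_init_fn=worker_init_fn,
)
```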
Each node has an AMD CPU with 64 cores, 2 A100 GPUs, and 250 GB of RAM.
I previously set num_workers=24 in the DataLoader, since the node has 64 cores in total, but it still failed.
In the latest attempt I set num_workers=12 and ulimit -n <max value>. It works for now, but the code is still running and I can't be sure the error won't pop up later.
Could num_workers be related to this error? And if so, what would be a proper way to choose num_workers?
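For context, this is the kind of heuristic I'm currently experimenting with to pick num_workers; the reserve and cap values are just my own guesses, not an official recommendation:

```python
import os

def pick_num_workers(gpus_per_node: int, reserve: int = 4, cap: int = 12) -> int:
    """Rough heuristic: split the node's CPU cores evenly across the DDP
    processes (one per GPU), keep a few cores free for the main processes,
    and cap the result so the total worker count stays moderate."""
    total_cores = os.cpu_count() or 1
    per_process = max(1, (total_cores - reserve) // gpus_per_node)
    return min(per_process, cap)

# On my nodes: 64 cores, 2 GPUs -> (64 - 4) // 2 = 30, capped at 12
# (12 is the value I'm currently trying).
print(pick_num_workers(gpus_per_node=2))
```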
The code worked only once, with export OMP_NUM_THREADS=32 and num_workers=12 in two-node training. If I increase the node count to 4, it fails again.
Thanks for your advice. I've looked at the link you referred to; unfortunately, it doesn't seem related to my issue.
Currently I assume it is related to the PyTorch version. I previously worked with PyTorch 1.10 + CUDA 11.3 on Miniconda. After switching back to PyTorch 1.9 + CUDA 11, there seems to be no such error. It is a workaround, but it would still be good to debug why it happened before.
RuntimeError: falseINTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp:323, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)
I get a similar error with torch 1.10.1 and CUDA 11.3. I'm using DistributedDataParallel to train on a 2-GPU node. The bug only occurs when setting pin_memory=True in the DataLoader. However, setting it to False isn't a great solution, since my GPUs are under-utilized without pin_memory=True.
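My loader setup looks roughly like this (simplified sketch; the dataset here is a stand-in for my real image dataset, and the worker count is illustrative). With pin_memory=True the assert eventually fires; with False it doesn't:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real one is an image dataset.
dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                        torch.zeros(256, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,       # illustrative worker count
    pin_memory=True,     # flipping this to False makes the error go away for me
)

for images, labels in loader:
    # non_blocking transfers only help when host memory is pinned,
    # which is why disabling pin_memory hurts GPU utilization.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
```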