Runtime Error related to shared memory

yushu · January 22, 2022, 11:05pm

Hi,

I got an error that might be related to the DataLoader and shared memory. The error below pops up randomly.

RuntimeError: falseINTERNAL ASSERT FAILED at /opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp:263, please report a bug to PyTorch. unable to open shared memory object </torch_43891_16> in read-write mode

I am using DDP with torch.distributed.run to train a model on the ImageNet dataset in a multi-node, multi-GPU setup. I’ve tried several possible solutions, such as setting ulimit -n <max value> or calling torch.multiprocessing.set_sharing_strategy('file_system'), but neither works.
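For reference, the ulimit -n step can also be applied from inside the training script. The following is a minimal sketch, assuming a Unix-like system; `raise_nofile_limit` is a hypothetical helper name and not part of PyTorch. It uses Python's standard `resource` module to raise the soft open-file limit to the hard limit and returns the limits that are actually in effect afterwards.

```python
import resource


def raise_nofile_limit():
    """Raise the soft open-file limit to the hard limit, if possible.

    Hypothetical helper for illustration. Returns the (soft, hard)
    limits in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some systems reject the request; keep the existing limits.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("open-file limits (soft, hard):", raise_nofile_limit())
```

A limit changed this way applies only to the current process and to children it starts afterwards, so it would need to run before the DataLoader workers are created, and on every node.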

Each node has an AMD CPU with 64 processors, 2 A100 GPUs, and 250 GB of RAM.

I previously set num_workers=24 in the DataLoader, since the node has 64 processors in total, but it still failed.

In my latest attempt I set num_workers=12 and ulimit -n <max value>. It works for now, but the code is still running and I cannot be sure the error won’t pop up later.
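One way to gain some confidence while a long job is running is to log how full the shared-memory mount is over time. The following is a minimal sketch, assuming Linux, where POSIX shared-memory objects live under /dev/shm; `shm_usage` is a hypothetical helper name. Whether a full /dev/shm is actually the cause of this error is not established in this thread.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the given mount.

    Hypothetical helper for illustration. Returns None if the path
    does not exist or cannot be inspected (for example on non-Linux systems).
    """
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return None
    return usage.total, usage.used, usage.free


if __name__ == "__main__":
    result = shm_usage()
    if result is None:
        print("/dev/shm is not available on this system")
    else:
        total, used, free = result
        print(f"/dev/shm: {used / 2**30:.2f} GiB used of {total / 2**30:.2f} GiB")
```

Calling this every few hundred iterations and watching whether the used amount keeps growing could help tell a slow leak apart from a one-off spike.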

May I ask whether num_workers is related to this error? If so, what would be a proper way to choose num_workers?
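As a starting point for the num_workers question, a common rule of thumb (not an official PyTorch recommendation) is to split the node's CPU cores across the training processes on that node and keep a small reserve for each main process. The sketch below assumes one DDP process per GPU; `workers_per_process` and its `reserve` parameter are hypothetical names used only for illustration.

```python
def workers_per_process(cpu_count, procs_per_node, reserve=1):
    """Upper bound on DataLoader workers for each training process.

    Heuristic only: divide the cores evenly among the processes on the
    node, then hold back `reserve` cores for each main training process.
    """
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")
    return max(0, cpu_count // procs_per_node - reserve)


# The node described above: 64 processors and 2 GPUs (one process per GPU).
print(workers_per_process(64, 2))  # 31
```

This gives an upper bound rather than a target. Every worker holds its own file descriptors and shared-memory segments, and a large OMP_NUM_THREADS competes for the same cores, so a value well below the bound may be the more stable choice.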

Thanks!

yushu · January 24, 2022, 8:05pm

The code worked only once, with export OMP_NUM_THREADS=32 and num_workers=12 in two-node training. If I increase the number of nodes to 4, it fails again.

wanchaol · January 25, 2022, 7:16am

Thanks for posting the question, @yushu. Someone else posted a similar problem before. Do you think this might help you resolve the issue? How to cache an entire dataset in multiprocessing?

yushu · January 25, 2022, 8:42pm

Thanks for your advice. I’ve looked at the link you referred to; unfortunately, it does not seem related to my issue.
Currently I assume it is related to the PyTorch version. I previously worked with PyTorch 1.10 + CUDA 11.3 on Miniconda. Now that I have switched back to PyTorch 1.9 + CUDA 11, the error no longer seems to occur. It is a workaround, but I still haven’t been able to debug why it happened before.

Benjamin_Therien · March 1, 2023, 6:25am

RuntimeError: falseINTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1639180588308/work/aten/src/ATen/MapAllocator.cpp":323, please report a bug to PyTorch. unable to mmap 8 bytes from file <filename not specified>: Cannot allocate memory (12)

I get a similar error with torch 1.10.1 and CUDA 11.3. I’m using torch.distributed.dataparallel to train on a 2-GPU node. The bug only occurs when setting pin_memory=True in the DataLoader. However, setting it to False is not a great solution, as my GPUs are under-utilized without pin_memory=True.
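Error 12 (ENOMEM) from mmap can mean that memory is genuinely exhausted, but on Linux it can also mean that the process has reached its maximum number of memory mappings. The following Linux-only sketch checks the second possibility by comparing the current mapping count with the vm.max_map_count limit; `mapping_usage` is a hypothetical helper name, and whether this limit is the cause here is an assumption, not something confirmed in this thread.

```python
from pathlib import Path


def mapping_usage(pid="self"):
    """Return (mappings_in_use, max_map_count) for a process on Linux.

    Hypothetical helper for illustration. Returns None when the /proc
    files are missing or unreadable (for example on non-Linux systems).
    """
    maps_file = Path(f"/proc/{pid}/maps")
    limit_file = Path("/proc/sys/vm/max_map_count")
    try:
        with maps_file.open() as f:
            in_use = sum(1 for _ in f)  # one line per memory mapping
        limit = int(limit_file.read_text())
    except (OSError, ValueError):
        return None
    return in_use, limit


if __name__ == "__main__":
    print("mappings (in use, limit):", mapping_usage())
```

If the count in the failing process is close to the limit, that would point at mapping exhaustion rather than a lack of RAM.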