CUDA error: peer mapping resources exhausted

ptrblck · December 8, 2022, 9:47pm

You might be setting the env variable too late in your actual script. Once the CUDA context is created, this env variable won’t have any effect anymore, which is why I usually recommend exporting it in your current terminal or prepending it to your python command in the terminal.
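If you would rather drive this from Python than from the shell, here is a minimal sketch of the same idea: launch the workload in a fresh process whose environment already contains the variable, so it is in place before any CUDA context can be created. I am assuming the variable in question is `CUDA_VISIBLE_DEVICES`. The helper names `env_with_visible_devices` and `run_with_visible_devices` are made up for illustration, and the child command is only a stand-in for your real training script.

```python
import os
import subprocess
import sys


def env_with_visible_devices(device_ids):
    """Return a copy of the current environment with CUDA_VISIBLE_DEVICES set.

    Assumption: CUDA_VISIBLE_DEVICES is the variable being discussed.
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


def run_with_visible_devices(device_ids, args):
    """Run a fresh Python process that starts with the variable already set."""
    return subprocess.run(
        [sys.executable, *args],
        env=env_with_visible_devices(device_ids),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for a real training script: the child only reports what it sees.
    child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    result = run_with_visible_devices([0, 1], ["-c", child_code])
    print(result.stdout.strip())
```

This should behave like prepending the variable to the command in the terminal, since the child process inherits it at startup. Setting `os.environ` inside the script itself can also work, but only if it happens before anything initializes CUDA, which is easy to get wrong.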

That is a good idea, as DP suffers from the overhead of cloning the model’s state_dict in each forward pass as well as from imbalanced GPU memory usage. DDP should thus give you better performance.