OOM when using DDP

Hi, I'm currently stuck on an OOM issue.

| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
Traceback (most recent call last):
  File "main.py", line 228, in <module>
    main(args)
  File "main.py", line 45, in main
    utils.init_distributed_mode(args)
  File "/home/user/detr/util/misc.py", line 304, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/user/.conda/envs/vt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory

I don't know why this happens, even though no memory is allocated and no processes are running on the GPU.

Also, I found that even a simple tensor load onto the GPU fails (i.e. a = torch.randn((2, 3)).to('cuda') raises OOM).
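For reference, here is a minimal sketch of the check I ran (the values in the comments are what I would expect from an idle GPU, not my exact output):

import torch

# the GPU looks idle to PyTorch...
print(torch.cuda.is_available())       # True
print(torch.cuda.device_count())       # e.g. 4
print(torch.cuda.memory_allocated(0))  # 0

# ...yet even a tiny allocation fails
a = torch.randn((2, 3)).to('cuda')     # RuntimeError: CUDA error: out of memory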

I solved this issue by rebooting the machine, but I wonder: is there any other solution besides a reboot?

You could try to reset your device via:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

as it seems to be in a “bad state” (e.g. after using hibernate/sleep in Linux).
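After reloading the module, any small allocation works as a quick sanity check that the device recovered, e.g. (just a sketch):

import torch

# if the driver state was the problem, this should now succeed
a = torch.randn((2, 3), device='cuda')
print(a)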

Thanks for the reply.
If I reboot the machine as you've suggested, won't the same problem stop happening for good?

No, I don't think a single reboot will solve the issue for future runs.
Depending on the root cause, you might run into it again. For example, I have to reset the device every time after using Ubuntu's sleep mode.

Oh that’s too bad… Thanks anyway.

Ah, one more question!
Will a fixed seed be reset after a reboot or after reinstalling the CUDA driver? (I.e. I use torch.manual_seed(seed) for reproducibility, and I wonder whether any initialization changes after a reboot.)

No, the pseudorandom number generator should not change unless the reboot also changed your setup, e.g. by updating CUDA (cuRAND) etc. However, even in this case you would need to rebuild PyTorch to see any differences.
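As a quick illustration (a minimal sketch; the concrete values depend on your PyTorch/CUDA build, not on reboots):

import torch

# seeding is purely software-side: the same seed reproduces the same
# sequence on the same PyTorch build, regardless of how often you reboot
torch.manual_seed(0)                     # also seeds all CUDA devices
print(torch.randn(2, 3))                 # identical across runs
print(torch.randn(2, 3, device='cuda'))  # identical across runs on the same setup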