OOM when using DDP

Hi, I'm currently stuck on an OOM issue.

| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
Traceback (most recent call last):
  File "main.py", line 228, in <module>
    main(args)
  File "main.py", line 45, in main
    utils.init_distributed_mode(args)
  File "/home/user/detr/util/misc.py", line 304, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/user/.conda/envs/vt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory

I don't know why this happens, even though no memory is allocated and no processes are running on the GPU.

Also, I found that even a simple tensor load onto the GPU fails (i.e. a = torch.randn((2, 3)).to('cuda') raises OOM).
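For reference, here is a minimal sketch of the check I ran (the values in the comments are what I would expect from an idle GPU, not my exact output):

import torch

# the GPU looks idle to PyTorch...
print(torch.cuda.is_available())       # True
print(torch.cuda.device_count())       # e.g. 4
print(torch.cuda.memory_allocated(0))  # 0

# ...yet even a tiny allocation fails
a = torch.randn((2, 3)).to('cuda')     # RuntimeError: CUDA error: out of memory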

I solved this issue by rebooting the machine, but I wonder: is there any other solution besides a reboot?

You could try to reset your device via:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

as it seems to be in a “bad state” (e.g. after using hibernate/sleep in Linux).
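After reloading the module, any small allocation works as a quick sanity check that the device recovered, e.g. (just a sketch):

import torch

# if the driver state was the problem, this should now succeed
a = torch.randn((2, 3), device='cuda')
print(a)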

Thanks for the reply.
If I reboot the machine as you've suggested, won't the same problem stop happening for good?

No, I don't think a single reboot will solve the issue for future runs.
Depending on the root cause, you might run into it again. For example, I have to reset the device every time after using Ubuntu's sleep mode.

Oh that’s too bad… Thanks anyway.

Ah, one more question!
Will a fixed seed be reset after a reboot or after reinstalling the CUDA driver? (I.e. I use torch.manual_seed(seed) for reproducibility, and I wonder whether any initialization changes after a reboot.)

No, the pseudorandom number generator should not change unless the reboot also changed your setup, e.g. by updating CUDA (cuRAND) etc. However, even in this case you would need to rebuild PyTorch to see any differences.
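As a quick illustration (a minimal sketch; the concrete values depend on your PyTorch/CUDA build, not on reboots):

import torch

# seeding is purely software-side: the same seed reproduces the same
# sequence on the same PyTorch build, regardless of how often you reboot
torch.manual_seed(0)                     # also seeds all CUDA devices
print(torch.randn(2, 3))                 # identical across runs
print(torch.randn(2, 3, device='cuda'))  # identical across runs on the same setup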