| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
Traceback (most recent call last):
File "main.py", line 228, in <module>
File "main.py", line 45, in main
File "/home/user/detr/util/misc.py", line 304, in init_distributed_mode
File "/home/user/.conda/envs/vt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
I don't know why this happens even though no memory is allocated and no process is running on the GPU.
I also found that even a simple tensor load on the GPU fails, i.e. a = torch.randn((2, 3)).to('cuda') raises an OOM error.
I solved the issue by rebooting the machine, and I wonder: is there any other solution besides a reboot?
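As a quick diagnostic before resorting to a reboot, a minimal sketch (assuming PyTorch is installed; the helper name is my own) that probes whether the device can allocate at all, mirroring the tiny allocation that failed above:

```python
import torch

def cuda_sanity_check():
    """Try a tiny allocation on the GPU and report the result."""
    if not torch.cuda.is_available():
        return "no-cuda"  # no visible device (or a driver problem)
    try:
        torch.randn(2, 3, device="cuda")  # same tiny allocation as in the question
        return "ok"
    except RuntimeError as exc:
        return "error: " + str(exc)  # e.g. "CUDA error: out of memory"

print(cuda_sanity_check())
```

If this fails while nvidia-smi shows no running processes and no memory in use, the failure is usually in the driver state rather than in any user process, which is why killing processes doesn't help but a reboot does.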
No, I don't think a single reboot will solve the issue for future runs.
Depending on the root cause, you might run into it again. E.g. I have to reset the device every time after using Ubuntu's sleep mode.
Ah, one more question!
Will the fixed seed reset after a reboot or after reinstalling the CUDA driver? (i.e. I use torch.manual_seed(seed) for reproducibility, and I wonder whether the initialization changes after a reboot.)
No, the pseudorandom number generator should not change unless the reboot also changed your setup, i.e. updated CUDA (cuRAND) etc. However, even in this case you would need to rebuild PyTorch to see any differences.
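The determinism within a single setup is easy to verify directly. A minimal sketch, assuming PyTorch is installed, showing that the same seed reproduces the same values:

```python
import torch

# Re-seeding resets PyTorch's CPU generator, so the same seed
# yields the same sequence within one install.
torch.manual_seed(42)
a = torch.randn(2, 3)

torch.manual_seed(42)
b = torch.randn(2, 3)

print(torch.equal(a, b))  # prints True
```

Running this before and after a reboot should print True both times, unless the reboot also changed the underlying library versions.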