Hi, I'm stuck on an OOM issue right now.
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
Traceback (most recent call last):
  File "main.py", line 228, in <module>
    main(args)
  File "main.py", line 45, in main
    utils.init_distributed_mode(args)
  File "/home/user/detr/util/misc.py", line 304, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/user/.conda/envs/vt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
I don't know why this happens, even though no memory is allocated and no processes are running on the GPU.
I also found that even loading a simple tensor onto the GPU fails (i.e., a = torch.randn((2, 3)).to('cuda') raises OOM).
I solved this by rebooting the machine, but I wonder: is there any solution other than a reboot?