Keep getting CUDA OOM error with PyTorch failing to allocate all free memory

No.
I am benchmarking some methods, and this error happens only with a single method, but I am not sure what causes it. I am not sure it is method-dependent; it could be PyTorch version 1.10.0 or something else.

I made several changes to the code, and things seem OK now (on other servers), but I haven't tried the method that caused the OOM yet. Also, other servers showed the same issue back then, so it is not a T4 issue.

If you are using DDP, the first thing you could try is changing the backend to MPI (check with your server admins for the most stable backend). In version 1.10.0 the backend was causing many issues; now, with MPI, things are fine, but this is server-dependent.
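A minimal sketch of what I mean by changing the backend; the process group is created once per process before wrapping the model in DDP. The MPI backend requires a PyTorch build with MPI support, and the rank/world size then come from the MPI launcher, so this is only an illustration of where the switch happens, not a drop-in config:

```python
import torch.distributed as dist

# The backend is chosen when the process group is initialized.
# "nccl" is the usual GPU default; "gloo" and "mpi" are alternatives.
# With the MPI backend, rank and world size are provided by the MPI launcher (e.g. mpirun).
dist.init_process_group(backend="mpi")
```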

For validation, I was accidentally using the DDP-wrapped model. Now I use the underlying model.
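Roughly what that change looks like, as a sketch (it assumes the process group is already initialized, and `local_rank` plus the tiny `nn.Linear` are placeholders for your own setup):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = 0  # placeholder; normally comes from the launcher / environment
model = nn.Linear(16, 4).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

# Training goes through ddp_model so gradients are synchronized.
# For validation, use the underlying module (ddp_model.module),
# not the DDP wrapper itself.
eval_model = ddp_model.module
eval_model.eval()
with torch.no_grad():
    x = torch.randn(8, 16, device=local_rank)
    preds = eval_model(x)
```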

Search for memory leaks.
You may be tracking something that is not detached from the graph.
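The classic pattern is accumulating the raw loss tensor across iterations, which keeps every iteration's computation graph alive. A small self-contained example of the fix (the toy model and data are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_losses = []
for step in range(100):
    x = torch.randn(32, 16, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    loss = nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Bad: running_losses.append(loss)  # keeps the whole graph alive, memory grows every step
    running_losses.append(loss.detach().item())  # store a plain Python float instead
```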

I'll try the current version of the code on the T4 and report back.

I think torch 1.10.0 has an issue that causes OOM.
If you still can't fix it, I recommend downgrading to torch 1.9.0.

Thanks
