I’m encountering random OOM errors during model training. The error looks like this:
RuntimeError: CUDA out of memory. Tried to allocate **8.60 GiB** (GPU 0; 23.70 GiB total capacity; 3.77 GiB already allocated; **8.60 GiB** free; 12.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
As you can see, PyTorch tried to allocate 8.60 GiB, which is exactly the amount of memory reported as free in the exception, and failed.
This OOM error keeps popping up randomly during my training, so I can’t pin it to one specific operation. Sometimes the last call in the traceback is a conv1d, sometimes it’s some backward(). The amount of memory it “tried to allocate” also changes every time, but it always matches the reported “free” memory exactly.
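For reference, this is a minimal sketch of how I can log the allocator’s view of memory every few iterations to correlate with the OOM (the helper name and the call site are just for illustration, not code from my project):

```python
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    """Print the CUDA allocator's view of GPU memory at a given point in training."""
    allocated = torch.cuda.memory_allocated(device) / 2**30  # memory held by live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**30    # memory cached by the allocator
    free, total = torch.cuda.mem_get_info(device)            # memory as seen by the driver
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB "
          f"driver-free={free / 2**30:.2f} / {total / 2**30:.2f} GiB")

# e.g. inside the training loop, every N steps:
# log_cuda_memory(f"iter {step}")
```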
I’ve been working on training a one-shot NAS model with the single-path method, i.e. the subnetwork used for the forward and backward pass is different at every iteration. I suspect this adds further randomness to this strange error.
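To make the setting concrete, the training loop looks roughly like the toy sketch below (the module names, channel sizes, and sampling logic are simplified stand-ins for my actual supernet): because a different path is sampled each step, the set and size of intermediate activations changes from iteration to iteration.

```python
import random
import torch
import torch.nn as nn

class ChoiceBlock(nn.Module):
    """Toy single-path choice block: several candidate ops, one used per forward."""
    def __init__(self, channels: int):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])

    def forward(self, x, choice: int):
        return self.candidates[choice](x)

supernet = nn.ModuleList([ChoiceBlock(64) for _ in range(8)]).cuda()
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.1)

for step in range(100):
    x = torch.randn(32, 64, 256, device="cuda")
    # A different random path is sampled every iteration.
    path = [random.randrange(3) for _ in supernet]
    for block, choice in zip(supernet, path):
        x = block(x, choice)
    loss = x.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```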
I’ve done some research on my own, like setting PYTORCH_CUDA_ALLOC_CONF according to the PyTorch docs, and also setting PYTORCH_NO_CUDA_MEMORY_CACHING. Each of these env variables seems to make the problem go away, which points to the caching mechanism of the PyTorch memory allocator as the source. Perhaps the allocator is trying to cache all of the free memory and failing, I think?
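Concretely, I set them before CUDA is initialized, roughly like this (the max_split_size_mb value of 128 is just the one I happened to try, not a recommendation):

```python
import os

# Must be set before the first CUDA allocation (safest: before importing torch).
# max_split_size_mb limits how large a cached block the allocator may split,
# which is what the error message suggests for fighting fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Alternatively, disable the caching allocator entirely (very slow, debugging only):
# os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # imported after setting the env vars on purpose
```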
But both of these env variables greatly reduce training speed. So I wonder if someone has encountered a similar case before and figured out how to solve it without hurting training efficiency. Thank you very much!