I’m training a model that splits its layers across separate GPUs via model parallelism. "cuda:0" starts at around 600MB and slowly keeps creeping up until it causes a CUDA out-of-memory error (12GB per GPU). The other GPU stays under 600MB.
I’ve tried sys.getsizeof(object), which doesn’t tell you much for tensors. I’m also now running torch.cuda.empty_cache() and gc.collect(), but I’m still getting this issue.
Is there a method I can use to troubleshoot where the memory leak is?
I found some tools here: pytorch.org/docs/stable/cuda.html under "Memory management".
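For anyone else hitting this: a minimal sketch of how those tools can be used to find where usage grows. The helper name `log_cuda_memory` is my own; the calls to `torch.cuda.memory_allocated`, `memory_reserved`, and `memory_summary` are from the docs page above.

```python
import torch

def log_cuda_memory(tag, device="cuda:0"):
    """Return a one-line report of allocated/reserved CUDA memory.

    Call this between steps in the training loop (after forward, after
    backward, after the optimizer step) to see which step the usage
    grows on. Falls back to zeros when CUDA is unavailable.
    """
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    else:
        alloc = reserved = 0.0
    return f"{tag}: allocated={alloc:.1f}MB reserved={reserved:.1f}MB"

# Example placement in a training loop:
#   print(log_cuda_memory("after forward"))
#   print(log_cuda_memory("after backward"))
# torch.cuda.memory_summary() prints a fuller per-device breakdown.
```

If allocated memory keeps climbing after every iteration even with `empty_cache()`, something is holding references to GPU tensors across iterations.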
Figured out my problem: I was using the outputs for some statistical calculations and forgot to move them back to the CPU first. That was causing tensors to be duplicated on the device they resided on. Calling .to('cpu') on all of them before passing them to the stats code fixed it pronto.
Edit: This ended up just moving the memory overflow to the CPU. I changed it to .detach(), with no need to move anything off the GPU, and that seems to resolve the issue.
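To illustrate why .detach() fixes this (a minimal sketch with a toy model, not my actual training code): keeping a live loss/output tensor around also keeps its whole autograd graph alive, so memory grows every iteration. Detaching breaks that reference.

```python
import torch

model = torch.nn.Linear(4, 1)
losses = []

for step in range(3):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()
    loss.backward()
    model.zero_grad()
    # Appending `loss` directly would keep each step's entire autograd
    # graph alive on the device, so memory creeps up every iteration.
    # .detach() returns a tensor sharing storage but cut off from the
    # graph, so the graph can be freed after backward().
    losses.append(loss.detach())  # or loss.item() for a plain float

# The detached tensors carry no graph and don't require grad.
print([l.requires_grad for l in losses])
```

The same applies to model outputs fed into stats code: detaching them (or calling .item() on scalars) is enough; they don't need to leave the GPU.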