I am tightly memory bound and currently debugging PyTorch memory usage. I am running my script with the `CUDA_LAUNCH_BLOCKING=1` environment variable set.
I always get an OOM error during the forward pass of a custom `torch.autograd.Function`. I can't explain why the failing line allocates at all; I think it shouldn't:

```
CUDA out of memory. Tried to allocate 5.32 GiB (GPU 1; 31.75 GiB total capacity; 21.18 GiB already allocated; 1.09 GiB free; 29.22 GiB reserved in total by PyTorch)
```

It happens while executing (`jac` is very big…):

```python
jac[cond].mul_(negative_slope)
```

This should just multiply some values in place and not allocate! At first I suspected the traceback was pointing at the wrong line because of asynchronous CUDA execution, but the error persists with `CUDA_LAUNCH_BLOCKING=1`.
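To make the pattern concrete, here is a minimal, self-contained sketch of what that line does; the names, shapes, and values are made up for illustration (the real `jac` is many GiB and lives on `cuda:1`):

```python
import torch

# Toy stand-ins for my real tensors (the real jac is huge).
jac = torch.randn(1000, 1000)   # hypothetical Jacobian buffer
cond = jac < 0                  # boolean mask with the same shape as jac
negative_slope = 0.01

# The line that raises the OOM on the GPU:
jac[cond].mul_(negative_slope)
```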
Another thing I am wondering about: `jac.requires_grad` returns `True`. Is this expected inside the `forward` function? I am computing the gradient myself in `backward`, so autograd should not need to track it…
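For context, here is a stripped-down toy version of what I mean (the Function, names, and shapes are made up, not my real code). Since grad mode is disabled while `forward` runs, I would expect a tensor created there not to require grad:

```python
import torch

class ToyLeaky(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, negative_slope):
        # Ops inside forward are not recorded by autograd.
        print("grad enabled in forward:", torch.is_grad_enabled())
        # Hypothetical stand-in for my Jacobian computation:
        jac = torch.where(x > 0, torch.ones_like(x),
                          torch.full_like(x, negative_slope))
        print("jac.requires_grad inside forward:", jac.requires_grad)
        ctx.save_for_backward(jac)
        return torch.where(x > 0, x, x * negative_slope)

    @staticmethod
    def backward(ctx, grad_out):
        (jac,) = ctx.saved_tensors
        # One grad per forward input; negative_slope is a float, so None.
        return grad_out * jac, None

x = torch.randn(8, requires_grad=True)
y = ToyLeaky.apply(x, 0.01)
y.sum().backward()
```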
And last, while trying to see which tensors are kept alive, I run:
```python
import gc

import torch

def memReport():
    tensors = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensors.append(obj)
        except Exception:
            # gc.get_objects() can return objects that raise on inspection
            pass
    # Sort by size in bytes, smallest first, and print the 30 largest
    tensors = sorted(tensors, key=lambda a: a.element_size() * a.nelement())
    for t in tensors[-30:]:
        print(f"{(t.element_size() * t.nelement()) / 1e9:0.3f}: {t.shape}")
    total = sum(t.element_size() * t.nelement() for t in tensors) / 1e9
    print(f"total found: {total:0.3f}")
```
(I got it from the forums.) It reports a total of 8.855 (GB, since the helper divides by 1e9), which is obviously nowhere near the 21.18 GiB the allocator says is already allocated. Where are the additional tensors?
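For comparison, the allocator's own counters could also be read directly (these are the stock `torch.cuda` APIs; `cuda:1` is the device from the error message, guarded so the snippet also runs on a CPU-only machine):

```python
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:1")  # the GPU from the error message
    # Bytes held by live tensors vs. bytes reserved by the caching allocator
    print(f"allocated: {torch.cuda.memory_allocated(dev) / 2**30:0.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(dev) / 2**30:0.2f} GiB")
    # Full per-pool breakdown
    print(torch.cuda.memory_summary(dev))
```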