Debugging memory usage & in-place modifications

I am tightly memory bound and currently debugging PyTorch memory usage. I am running my script with the "CUDA_LAUNCH_BLOCKING": "1" env var.

I always get an out-of-memory error inside the forward pass of a custom torch.autograd.Function. I can't explain why it allocates; I think it shouldn't:

CUDA out of memory. Tried to allocate 5.32 GiB (GPU 1; 31.75 GiB total capacity; 21.18 GiB already allocated; 1.09 GiB free; 29.22 GiB reserved in total by PyTorch)

while executing (jac is very big…):


This should just multiply some values and not allocate! I thought it was just a synchronisation issue, but I am already running it with "CUDA_LAUNCH_BLOCKING": "1".
Another thing I am wondering about: jac.requires_grad returns True. Is this expected inside the forward function? I am calculating the gradient myself, so it should not be needed…

And last, to inspect the live tensors I run:

import gc

import torch

def memReport():
    tensors = []
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            tensors.append(obj)  # was missing: collect the tensor, not just its size
    tensors = sorted(tensors, key=lambda t: t.element_size() * t.nelement())
    for t in tensors[-30:]:
        print(f"{(t.element_size() * t.nelement()) / 1e9:0.3f}: {t.shape}")
    print(f"total found: {sum(t.element_size() * t.nelement() for t in tensors) / 1e9:0.3f}")

(got it from the forums)
it reports a total of 8.855 GB… which is obviously not the 21.18 GiB from the error message. Where are the additional tensors?
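Part of the gap is that gc.get_objects() only sees tensors referenced from Python; tensors saved for backward are held by the C++ autograd graph, and the caching allocator also reserves memory beyond what is currently allocated. A sketch for getting the allocator's own view (all of these are standard torch.cuda calls):

```python
import torch

if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    # bytes actually occupied by live tensors
    print(f"allocated: {torch.cuda.memory_allocated(dev) / 1e9:0.3f} GB")
    # bytes held by the caching allocator (>= allocated)
    print(f"reserved : {torch.cuda.memory_reserved(dev) / 1e9:0.3f} GB")
    # detailed per-pool breakdown
    print(torch.cuda.memory_summary(dev, abbreviated=True))
```

Comparing memory_allocated against your gc-based total should show how much is held by saved activations and other non-Python-visible references.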

Is jac a parameter? In that case it would be expected to have a .grad field even if it is used by a custom autograd layer, as the .grad field would be used by the optimizer to update the parameter values during optimizer.step().

jac is a tensor needed in the computational graph, but not a PyTorch parameter. It's an argument to the autograd forward function with respect to which I need to calculate gradients.

After more experiments, it crashed at the jac[cond].mul_(self.negative_slope) line. I really need to get the allocation down (if this line really is the culprit); I don't see why it needs to allocate here.


It looks like the following code performs better:

mul = torch.ones_like(input)
mul[cond] = self.negative_slope
mul = mul.unsqueeze(1).unsqueeze(1).unsqueeze(1)

Is this possible? It looks like it's faster and doesn't use as much memory.
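Yes, this can be cheaper: the weight tensor has the (small) shape of input, and unsqueezing it to (N, 1, 1, 1) lets broadcasting stretch it across jac without materialising a masked copy. A sketch with hypothetical shapes (assuming input is per-sample of shape (N,) and jac is (N, A, B, C)):

```python
import torch

inp = torch.tensor([0.5, -1.0, 2.0])   # hypothetical (N,) input
jac = torch.ones(3, 2, 2, 2)           # hypothetical (N, A, B, C) jacobian
negative_slope = 0.01

# small weight tensor: 1 where inp >= 0, negative_slope elsewhere
mul = torch.ones_like(inp)
mul[inp < 0] = negative_slope

# (N,) -> (N, 1, 1, 1), then broadcast in place over jac; the only
# new allocation is the tiny weight tensor itself
jac.mul_(mul.view(-1, 1, 1, 1))
print(jac[1, 0, 0, 0].item())  # 0.01
```

The mul.unsqueeze(1).unsqueeze(1).unsqueeze(1) in your snippet produces the same (N, 1, 1, 1) shape as the view here. The key point is doing the final multiply in place (jac.mul_(mul)) so jac is reused instead of allocating a second jac-sized result.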