Out of memory issue

Hello! I have this network:

import torch
import torch.nn as nn

class Rocket_E_NN(nn.Module):
    def __init__(self):
        super().__init__()
        
        softplus_value = 5
        
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1),          # B,  32, 32, 32
            modifiedSoftplus(softplus_value),
            nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 16, 16
            modifiedSoftplus(softplus_value),
            nn.Conv2d(32, 64, 4, 2, 1),          # B,  64,  8,  8
            modifiedSoftplus(softplus_value),
            nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  4,  4
            modifiedSoftplus(softplus_value),
            nn.Conv2d(64, 256, 4, 1),            # B, 256,  1,  1
            modifiedSoftplus(softplus_value),
            View((-1, 256*1*1)),                 # B, 256
            nn.Linear(256, 2),                   # B, 2
        )
            
    def forward(self, x):
        z = self.encoder(x)
        return z

class Rocket_D_NN(nn.Module):
    def __init__(self):
        super().__init__()
        
        softplus_value = 5
        
        self.decoder = nn.Sequential(
            nn.Linear(2, 256),               # B, 256
            View((-1, 256, 1, 1)),               # B, 256,  1,  1
            modifiedSoftplus(softplus_value),
            nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
            modifiedSoftplus(softplus_value),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
            modifiedSoftplus(softplus_value),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
            modifiedSoftplus(softplus_value),
            nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
            modifiedSoftplus(softplus_value),
            nn.ConvTranspose2d(32, 3, 4, 2, 1),  # B, 3, 64, 64
        )
            
    def forward(self, z):
        x = self.decoder(z)
        return x
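
(View and modifiedSoftplus are my own small helper modules. If you want to run the code above, rough stand-ins would be something like this; the softplus one is only a guess using plain nn.Softplus, not my exact activation:)

class View(nn.Module):
    # reshapes the input to the given shape, e.g. View((-1, 256))
    def __init__(self, shape):
        super().__init__()
        self.shape = shape

    def forward(self, x):
        return x.view(self.shape)

class modifiedSoftplus(nn.Module):
    # stand-in only: here just nn.Softplus, with softplus_value used as beta
    def __init__(self, value):
        super().__init__()
        self.act = nn.Softplus(beta=value)

    def forward(self, x):
        return self.act(x)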

So it is an encoder that takes an image and reduces it to a bottleneck of size 2, and a decoder that reconstructs the original image from the bottleneck. This works very well and the reconstructed image looks OK. Now I want to calculate the Jacobian of the output image with respect to the bottleneck layer, so the Jacobian will have 3x64x64x2 entries. Here is my code for that:

def jacobian(inputs, outputs):
    return torch.stack([torch.autograd.grad([outputs[:, i].sum()], [inputs], create_graph=True)[0]
                        for i in range(outputs.size(1))], dim=-1)

z = model_E(x_0)
output = model_D(z)
output = output.view(output.size(0), output.size(1)*output.size(2)*output.size(3))
J = jacobian(z,output)
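
Just to spell out the sizes (assuming batch size 1):

out_elems = 3 * 64 * 64        # 12288 output values per image
latent = 2
# so with batch size 1:
#   output after .view():  (1, 12288)
#   z (the bottleneck):    (1, 2)
#   J:                     (1, 2, 12288)  -- the 3x64x64x2 numbers I mentioned
# and the list comprehension in jacobian() calls torch.autograd.grad once
# per output value, i.e. 3*64*64 = 12288 times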

So I tested the code above on a much smaller example and it does what I want. My problem is that when I try it on this network I get this error: CUDA out of memory. Tried to allocate 8.00 MiB (GPU 0; 7.94 GiB total capacity; 7.54 GiB already allocated; 4.94 MiB free; 3.01 MiB cached)

So I am a bit confused. First of all, why do I run out of memory when computing the Jacobian? The gradients needed for the Jacobian are also computed during backprop when I train the network, so why does it work for backprop but not here? Second, based on the error message I have 7.94 GiB in total and 7.54 GiB are used, so I should have about 400 MiB free, while I only need 8 MiB. Am I misunderstanding the message? And lastly, can someone please help me with this? Thank you so much!

I think this is wrong: "the gradients needed for computing the Jacobian are also computed during backprop when I train the network".
Normal backprop starts from the loss function (a single value); if you backprop from multiple values it is much more expensive (for example 3x64x64 times more expensive, since you need one backward pass per output value).

Secondly, it says `4.94 MiB free`, not 400 MiB.

Thank you for your reply! About the first issue: when I train the NN, the output of the network becomes part of the loss (I compare the output pixels with the real ones). But when I backprop the loss, each pixel in the output image has its derivative taken with respect to the previous layers anyway (which is what I need here). My point is: aren't the derivatives that I need here also computed during the actual NN training? The only difference is that during training I take the mean over the pixels to get just one number, but there are still 3x64x64 values. For example, assuming I have the chain a->b->c->d->loss and I want the derivative of d with respect to b, isn't that derivative also computed when I take the derivative of the loss with respect to b? Is this wrong?

For the second issue, you are right, it says 4.94 MiB free, but then where is the rest? The math doesn't add up.
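
I guess I can at least print what PyTorch itself reports right before the Jacobian call, something like this (just a diagnostic sketch; I believe memory_reserved is called memory_cached in older versions):

# memory actually occupied by tensors
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
# memory PyTorch has reserved from the GPU (allocated + cached)
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")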

1st issue: I guess you need to revise your calculus knowledge a bit; I'm not the best at explaining this, but I'll try:
a->b->c->d->loss
We start with loss.gradient = [1] and backprop to d.
Let's say d.value = [1, 2, 3] and d.gradient = [4, 5, 6]; you can backprop and compute a.gradient = [x, y].

Now let's say d.gradient = [7, 8, 9] instead. Is a.gradient still [x, y], or can it at least be figured out quickly from [x, y] somehow? No, you have to backprop all over again.

When computing the Jacobian, I think you are computing a.gradient for all of these cases:
d.gradient = [1, 0, 0]
d.gradient = [0, 1, 0]
d.gradient = [0, 0, 1]
which can't be recovered from the original backprop where d.gradient = [4, 5, 6].
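
To put the same picture in code (a toy example where d = M @ a, so the full Jacobian of d with respect to a is just M; everything here is made up for illustration):

import torch

a = torch.randn(2, requires_grad=True)
M = torch.randn(3, 2)
d = M @ a                                  # Jacobian of d w.r.t. a is exactly M

# normal backprop with d.gradient = [4, 5, 6]: ONE weighted mix of the rows of M
g, = torch.autograd.grad(d, a, grad_outputs=torch.tensor([4., 5., 6.]),
                         retain_graph=True)
# g == 4*M[0] + 5*M[1] + 6*M[2]; the individual rows cannot be read back out of g

# full Jacobian: one backprop per unit vector [1,0,0], [0,1,0], [0,0,1]
rows = [torch.autograd.grad(d, a, grad_outputs=e, retain_graph=True)[0]
        for e in torch.eye(3)]
J = torch.stack(rows)                      # equals M

Your jacobian() function is doing exactly that second loop, once per output value, so for a 3x64x64 image that is 12288 separate backward passes.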
2nd issue: I think the elves stole the rest.