Invalid gradient shape after discarding filters during training

Edit: I’ve solved the problem, so I rewrote my initial post: first what I initially thought the problem was, and then the actual problem I had (which was embarrassingly simple).

I initially thought that by using torch.nn.Parameter I was creating new parameters all the time, and that this was why I ran out of memory. In fact, the following code does run out of memory:

with torch.no_grad():
    for mod in self.layer.modules():
        if isinstance(mod, (torch.nn.Conv2d, torch.nn.Conv3d)):
            pp = mod.weight
            del mod.weight
            mod.weight = torch.nn.Parameter(pp.data)

whereas the following does not run out of memory (note: I tested this on my actual script, not on the toy example from my initial post):

with torch.no_grad():
    for mod in self.layer.modules():
        if isinstance(mod, (torch.nn.Conv2d, torch.nn.Conv3d)):
            pp = mod.weight
            del mod.weight
            mod.weight = pp

However, I examined what’s in memory (following this), and I noticed that the number of objects in memory (including torch.nn.Parameters) and their total size increase at first but then stay constant in both cases. How is it possible that the script runs out of memory when the number of tensors and parameters, and their total size, stay constant?
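
For reference, the kind of check I mean is the usual gc-based tensor census, roughly like the sketch below (a sketch only, not necessarily the exact snippet from the link):

import gc
import torch

def tensor_census():
    # Count every live tensor that Python knows about (Parameters are Tensors too).
    count, total_bytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
                total_bytes += obj.numel() * obj.element_size()
        except Exception:
            pass  # some objects raise on inspection; skip them
    print(f"{count} tensors, {total_bytes / 1e6:.1f} MB of tensor data")

Note that this only counts tensors that exist as Python objects, which turns out to matter for the explanation below.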

Solution
In my code I kept track of the loss in this way:

tr_loss_iteration = loss(output, Y)
tr_loss += tr_loss_iteration  # Running total to compute the average loss over the epoch.

Apparently, tr_loss was accumulating something, although at the time I wasn’t sure what, because, as I said, the number of tensors and parameters in memory stayed constant. The most likely explanation is that tr_loss_iteration still references the autograd graph of its iteration, so summing it into tr_loss keeps every iteration’s graph alive; the tensors saved inside those graphs seem to live on the C++ side, which would explain why they never showed up in my Python-object counts. Anyway, I fixed the memory leak by simply doing:

tr_loss += tr_loss_iteration.cpu().detach().numpy()
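
An equivalent fix, assuming the loss is a scalar tensor, is to accumulate a plain Python number with .item(), or to keep a tensor but cut it from the graph with .detach():

tr_loss += tr_loss_iteration.item()      # Python float, no graph attached
# or:
# tr_loss += tr_loss_iteration.detach()  # tensor without the autograd graph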

Thanks!