Best practices for implementing parameter sharing in a non-deep-learning model

Hi,

I am trying to implement a pseudo-likelihood parameter estimation method for a Potts model on images. I am using PyTorch because it seems convenient to leverage its GPU / auto-differentiation capabilities. However, I am not sure how I should implement some specific details of my model. I have two questions:

Question 1:
This model is over-parameterized (four different interaction parameters per neighbourhood: top, bottom, right and left, a bit like a 2D convolution with a different kernel for every patch of the image), and I am not sure how my weight tensor should be set up so that backpropagation works properly. For now, because I don’t want diagonal parameters in my model, my weight initialization code looks like this:

        # same shape as im2col(image): (1, C*k*k, nb_patchs)
        param_size = self.unfold(torch.ones(img_size).view((1, 1, *img_size))).size()
        nb_patchs = param_size[-1]
        # full per-neighbour array, zero everywhere by default
        param_array = torch.zeros(param_size)
        # the only trainable (leaf) tensor: one value per direction and patch
        self.param_vector = torch.ones((4, nb_patchs), requires_grad=True)
        # copy the four axis-aligned neighbour rows in-place;
        # the diagonal rows and the centre stay at zero
        param_array[:, 1, :] = self.param_vector[0, :]
        param_array[:, 3, :] = self.param_vector[1, :]
        param_array[:, 5, :] = self.param_vector[2, :]
        param_array[:, 7, :] = self.param_vector[3, :]
        self.params = param_array

I create a param_array tensor that has the same shape as im2col(image), and I set some of its values in-place with a grad-enabled tensor. After calling backward, self.params.grad is None (it is a non-leaf tensor with grad_fn=CopySlices), while self.param_vector.grad is not None.

This suggests to me that Autograd can differentiate through the in-place assignment just fine, and that this code guarantees that the diagonal elements of every neighbourhood (along the second dimension of self.params) stay at zero and are never updated.

However, the autograd documentation (Autograd mechanics — PyTorch 1.8.1 documentation) suggests that in-place operations are a bad idea, because they don’t actually free up any memory. I am not concerned about memory, but does that still mean I should change my implementation? If yes, how should I “blank out” some elements of the parameter tensor so that they aren’t updated? My current understanding is that this can be done at the Tensor level (with the requires_grad flag), but is there a way to do it for a specific element of a tensor?
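
For what it’s worth, here is a minimal sketch of one way to do exactly that without any in-place writes: keep the (4, nb_patchs) vector as the only tensor with requires_grad=True and assemble the full unfolded-shape array functionally with torch.cat, so the excluded rows are plain constants. The 9-row / 3×3 layout and the helper name build_param_array are assumptions read off the snippet above, not your actual code.

    import torch

    def build_param_array(param_vector, nb_patchs):
        """Assemble the (1, 9, nb_patchs) unfolded-shape array from the four
        trainable rows without any in-place writes.  Rows 1, 3, 5 and 7 (the
        axis-aligned neighbours, matching the indices set in the snippet
        above) come from param_vector; every other row is a constant zero and
        therefore never receives a gradient."""
        zero_row = torch.zeros(1, nb_patchs)
        rows = [
            zero_row,            # 0: diagonal neighbour, excluded
            param_vector[0:1],   # 1: axis-aligned neighbour
            zero_row,            # 2: diagonal neighbour, excluded
            param_vector[1:2],   # 3: axis-aligned neighbour
            zero_row,            # 4: centre pixel
            param_vector[2:3],   # 5: axis-aligned neighbour
            zero_row,            # 6: diagonal neighbour, excluded
            param_vector[3:4],   # 7: axis-aligned neighbour
            zero_row,            # 8: diagonal neighbour, excluded
        ]
        return torch.cat(rows, dim=0).unsqueeze(0)

Calling build_param_array(param_vector, nb_patchs) wherever the expanded array is needed produces the same values as the snippet above, and only the four rows backed by param_vector ever accumulate gradient; the zero rows behave exactly like elements whose requires_grad is off.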

Question 2:
My model is implemented as a torch.nn.Module subclass. Should my parameters be an attribute of the class? I have two other functions that compute parts of the total log-likelihood of the batch, and a forward function that returns a scalar (the log-likelihood of the batch given the current parameters). Having the model parameters as an attribute of the class is a convenient way to pass them around, but is that actually how it should be done? I think this is causing some “Trying to backward through the graph a second time, but the saved intermediate results have already been freed” errors for me, as if the likelihood of a batch depended on the computation done for the previous batch.
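
For reference, a hedged sketch of how the forward pass could avoid that: rebuild the expanded tensor from the leaf parameter on every call instead of reusing a graph-connected self.params built once in __init__. This is only a fragment meant to live on the nn.Module subclass described above; build_param_array is the helper sketched under Question 1 and log_pseudo_likelihood is a stand-in for the two helper functions mentioned in the question.

    def forward(self, batch):
        # Rebuild the (1, 9, nb_patchs) array from the leaf parameter on
        # every call, so each batch gets its own graph.  If the array is
        # instead built once and cached as self.params, every batch's loss
        # shares that CopySlices node, and a second backward() can hit the
        # "backward through the graph a second time" error once the first
        # backward() has freed the shared buffers.
        params = build_param_array(self.param_vector, self.nb_patchs)
        # stand-in for the functions that sum the conditional
        # log-likelihood terms of the pseudo-likelihood
        return self.log_pseudo_likelihood(batch, params)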

Sorry for the long post, and thanks in advance to anybody who can help me :slight_smile: !

  1. You have to distinguish between Python statements that use the parameters and initialization actions. The latter should be done without gradient tracking, inside a “with torch.no_grad(): …” context; the presence of in-place ops doesn’t matter in no_grad mode.
  2. Yes, attributes, but the tensors should be wrapped in nn.Parameter to “register” them (for the optimizer and other enumerators). Basically, the fact that you have a “non-deep” model is not relevant here; you can construct things the same way as the tutorial DL models.
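
For concreteness, a minimal sketch of what both points could look like for the snippet in the question. The 3×3 unfold with padding=1, the class name and the attribute names are assumptions, not a definitive implementation; combined with the forward sketched earlier it gives a module whose only trainable tensor is param_vector.

    import torch
    import torch.nn as nn

    class PottsPseudoLikelihood(nn.Module):
        def __init__(self, img_size):
            super().__init__()
            # assumed 3x3 neighbourhood; padding=1 gives one patch per pixel
            self.unfold = nn.Unfold(kernel_size=3, padding=1)
            # initialization action: pure shape bookkeeping, so no gradient
            # tracking is needed and in-place ops are harmless here
            with torch.no_grad():
                probe = torch.ones(1, 1, *img_size)
                self.nb_patchs = self.unfold(probe).size(-1)
            # the only trainable tensor, wrapped in nn.Parameter so that it
            # is registered and returned by self.parameters()
            self.param_vector = nn.Parameter(torch.ones(4, self.nb_patchs))

An optimizer then picks it up as usual, e.g. torch.optim.SGD(model.parameters(), lr=1e-2), and loss.backward() followed by optimizer.step() updates only param_vector.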