I have written a custom neural net with 12 convolutional blocks, each containing 7 layers. However, for the last 2 blocks, I decompose the CNN parameters of each layer into two sets, of which only one is updated for a certain number of iterations.
An example of my model definition for a single decomposed CNN layer, defined inside an nn.Module, looks like this:
self.conv_1_parameters_1 = nn.Parameter(torch.Tensor(3, 12, 3, 3).normal_(0, 0.01), requires_grad=True)  # first dim (size 3) indexes the parameter subsets selected by k
self.conv_1_parameters_2 = nn.Parameter(init.xavier_normal_(torch.Tensor(3, 32, 32 * 12, 1, 1)), requires_grad=True)  # init is torch.nn.init
So during the forward pass, I select both self.conv_1_parameters_1 and self.conv_1_parameters_2 along their first dimension, based on the input to the forward function (the index into the first dimension is given by k):
coeffs_1 = self.conv_1_parameters_2[k].view(32, 32, 12)
conv_w_1 = torch.einsum('cvm, mki-> cvki', coeffs_1, self.conv_1_parameters_1[k]).contiguous().view(32, 32, 3, 3)
x = F.conv2d(x_input, conv_w_1, padding=1)
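To sanity-check the weight construction in isolation, here is a standalone snippet with random stand-in tensors of the same shapes (diagnostic only, not part of my model code):

import torch

p1 = torch.randn(3, 12, 3, 3)           # same shape as conv_1_parameters_1
p2 = torch.randn(3, 32, 32 * 12, 1, 1)  # same shape as conv_1_parameters_2
k = 0
coeffs = p2[k].view(32, 32, 12)                    # (32, 32, 12)
w = torch.einsum('cvm, mki-> cvki', coeffs, p1[k]) # (32, 32, 3, 3)
assert w.shape == (32, 32, 3, 3)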
I am using two different optimizers to update the parameters:
all_parameters = dict(self.model.named_parameters())
encoder_paras = [v for k, v in all_parameters.items() if ('encoder' in k)]
parameter_set_1 = [v for k, v in all_parameters.items() if ('coeff' in k)]  # these are just the parameters corresponding to self.conv_1_parameters_1 in the last 2 conv blocks
parameter_set_2 = [v for k, v in all_parameters.items() if ('combiner' in k)]  # these are just the parameters corresponding to self.conv_1_parameters_2 in the last 2 conv blocks
self.optimizer_1 = torch.optim.Adam(encoder_paras + parameter_set_1, lr=0.0001)  # update all the parameters in conv blocks 0 to 9 (the encoder parameters) and only parameter_set_1 in the last 2 conv blocks
self.optimizer_2 = torch.optim.Adam(encoder_paras + parameter_set_2, lr=0.0001)  # update all the parameters in conv blocks 0 to 9 (the encoder parameters) and only parameter_set_2 in the last 2 conv blocks
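As a sanity check (diagnostic code only, not part of the training script), I can confirm that the name filters pick up the intended tensors and that everything requires grad:

print([k for k in all_parameters if 'coeff' in k])     # should list the decomposed params of blocks 10 and 11
print([k for k in all_parameters if 'combiner' in k])
assert all(p.requires_grad for p in encoder_paras + parameter_set_1 + parameter_set_2)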
The training loop looks something like this (sketched with hypothetical names x, target, and criterion):
for i in range(K):  # i selects the parameter subset for the last 2 conv blocks
    out = self.model(x, i)            # pass the input through the whole model with index i
    loss = criterion(out, target)
    self.optimizer_1.zero_grad()
    loss.backward()
    self.optimizer_1.step()           # update the parameters for this index using optimizer_1
So after this loop ends, I should see changes only to parameter_set_1 in the last two conv blocks. But these values don't change at all. I printed the grad for parameter_set_1 and it is not None. I am not detaching any tensor in between. All the parameters in conv blocks 0 to 9 are updated by the optimizer; the problem is only with conv blocks 10 and 11, which use the decomposed parameters.
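One way to verify this is a minimal before/after diff (diagnostic only, wrapped around the training loop above):

before = [p.detach().clone() for p in parameter_set_1]
# ... run the training loop above ...
for b, p in zip(before, parameter_set_1):
    print((p.detach() - b).abs().max().item())  # comes out as 0.0, i.e. no change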
What am I missing here?
Update: I figured out that the gradients are very small, on the order of 10^-16.
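This is roughly how I measured the gradient magnitudes after loss.backward() (diagnostic only; the 'coeff' filter matches the same names as parameter_set_1 above):

for name, p in self.model.named_parameters():
    if 'coeff' in name and p.grad is not None:
        print(name, p.grad.abs().max().item())  # on the order of 1e-16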