If this is the way you update your weights and you don't want gradients to flow back through this op (which I expect is what you want), you should wrap the op in a with torch.no_grad(): block. That way the autograd engine knows you are not trying to build a differentiable op, just changing the values of the weights:
# More code
optimizer.step()
with torch.no_grad():
    # masked_fill_ writes the scalar threshold into every masked entry
    # (masked_scatter_ expects one source element per True in the mask)
    self.conv1.weight.masked_fill_(self.conv1.weight > self.w_max, self.w_max)
No, I do want to learn both weights and the threshold. Currently I have to do it manually:
loss = loss + 0.001 * model.w_max ** 2
optimizer.zero_grad()
loss.backward()
w_max_grad = torch.sum(model.conv1.weight.grad[model.conv1.weight >= model.w_max])
model.w_max.grad.data += w_max_grad
optimizer.step()
Note that I impose an L2 penalty on the threshold's growth, so before updating the threshold I have to add the clipped weights' gradients to the gradient it gets from the L2 loss.
This works. However, when I do the same thing for activation clipping, I don't need to do anything manually: the gradients are accumulated correctly in the backward pass.
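To illustrate why activation clipping needs no manual step, here is a minimal, self-contained sketch (the names a_max and x are made up for illustration): torch.min is differentiable with respect to both arguments, so autograd routes the gradient to the threshold wherever the activation was clipped.

```python
import torch

# Hypothetical learnable clipping threshold and a toy activation tensor
a_max = torch.tensor(1.0, requires_grad=True)
x = torch.tensor([0.5, 2.0, 3.0], requires_grad=True)

# Clip the activations at a_max; torch.min broadcasts the 0-dim threshold
y = torch.min(x, a_max)
y.sum().backward()

# Where x < a_max the gradient stays on x; where x >= a_max it flows
# to a_max instead, with no manual bookkeeping
print(x.grad)      # tensor([1., 0., 0.])
print(a_max.grad)  # tensor(2.)
```

Weight clipping behaves differently only because the in-place clamp happens outside the graph, which is why the threshold's gradient has to be patched by hand there.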
You can do the clipping within the torch.no_grad() block and remove the .data as well.
To zero out gradients, you can do model.zero_grad() and that will zero out all the .grad fields of all the Parameters in the net.
It does something like:
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad.zero_()
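Putting the pieces together, here is a minimal sketch of one full update step under these suggestions. The module Net, its layer sizes, and the dummy input are hypothetical; only the conv1 and w_max names mirror the snippets in this thread.

```python
import torch
import torch.nn as nn

# Toy module with a conv layer and a learnable clipping threshold
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 4, 3)
        self.w_max = nn.Parameter(torch.tensor(0.1))

torch.manual_seed(0)
model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 1, 8, 8)
# Task loss (made up) plus the L2 penalty on the threshold
loss = model.conv1(x).pow(2).mean() + 0.001 * model.w_max ** 2

model.zero_grad()  # zeroes the .grad field of every Parameter
loss.backward()

# Fold the clipped weights' gradients into the threshold's gradient;
# inside no_grad() there is no need to reach for .data
with torch.no_grad():
    mask = model.conv1.weight >= model.w_max
    model.w_max.grad += model.conv1.weight.grad[mask].sum()
optimizer.step()

# Clip the weights in place, outside the autograd graph
with torch.no_grad():
    model.conv1.weight.clamp_(max=model.w_max.item())
```

The only manual step left is the one-line gradient patch before optimizer.step(); everything else is ordinary training code.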