Check the outputs of your model and make sure they were not set to zero by F.relu.
Here is a small example showing how a negative init could create the zero gradients you are seeing:
import torch
import torch.nn.functional as F

# negative init: all diagonal values are below zero, so F.relu zeroes them
# out and no gradient flows back to x
x = torch.randn(4, 4) - 10.
x.requires_grad_()
y = torch.diagonal(x)
z = F.relu(y)
z.mean().backward()
print(x.grad)
# tensor([[0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.],
#         [0., 0., 0., 0.]])

# positive init: all diagonal values are above zero, so F.relu passes them
# through and each diagonal entry receives the mean's gradient of 1/4
x = torch.randn(4, 4) + 10.
x.requires_grad_()
y = torch.diagonal(x)
z = F.relu(y)
z.mean().backward()
print(x.grad)
# tensor([[0.2500, 0.0000, 0.0000, 0.0000],
#         [0.0000, 0.2500, 0.0000, 0.0000],
#         [0.0000, 0.0000, 0.2500, 0.0000],
#         [0.0000, 0.0000, 0.0000, 0.2500]])
It’s not using your full model, just the last operations, but the same effect would apply to your setup if the activations before the F.relu are all negative.
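If you want to check this in your actual model, one option is to register a forward hook on the ReLU module and look at how many of its outputs are zero. This is only a minimal sketch; the toy model and the check_zeros hook below are hypothetical placeholders, so adapt them to your own modules:

import torch
import torch.nn as nn

# hypothetical toy model just for illustration
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU())

def check_zeros(module, inp, out):
    # print the fraction of activations the ReLU set to zero
    print(f"{module.__class__.__name__}: {(out == 0).float().mean().item():.2%} zeros")

# register the hook on the ReLU module
model[1].register_forward_hook(check_zeros)

out = model(torch.randn(4, 10))

If the printed fraction is close to 100%, the gradients upstream of that ReLU will be (almost) all zero, which would match the behavior you are describing.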