Weights not properly updating

I am trying to train a few different types of multilayer networks. In each of these networks, the weights and biases of every layer are not updating, except for the weights and biases of the final layer and the normalization of the final LSTM layer.

I printed out the gradients for each layer, and although for some layers they are really small (around 10^-8 or 0), for other layers they are around 10^-2 or 10^-1, and those weights also don't update. The networks contain LSTM layers, convolutional layers, batch and layer normalizations, and fully connected layers. I am using the ReLU activation for all layers except the LSTM layers, where I am using tanh. For each layer, the layer operation is applied first, then the batch/layer normalization, and then the activation function. How can I make sure that the weights update for the gradients that are not too small, and how can I combat the really small gradients? Also, is there a way to programmatically check whether the weights changed for each layer?

To see if your weights got updated, you could use the following scheme:

# Save init values
old_state_dict = {}
for key in model.state_dict():
    old_state_dict[key] = model.state_dict()[key].clone()

# Your training procedure
...

# Save new params
new_state_dict = {}
for key in model.state_dict():
    new_state_dict[key] = model.state_dict()[key].clone()

# Compare params
for key in old_state_dict:
    if not (old_state_dict[key] == new_state_dict[key]).all():
        print('Diff in {}'.format(key))

Let's see if the weight updates are just small or not happening at all.
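
To distinguish between the two cases, you could also look at the gradient magnitudes directly. A minimal sketch (assuming model is your network and this runs right after calling backward()):

for name, param in model.named_parameters():
    if param.grad is None:
        # no gradient reached this parameter at all
        print('{}: no gradient'.format(name))
    else:
        # mean absolute gradient as a rough measure of the update size
        print('{}: {:.3e}'.format(name, param.grad.abs().mean().item()))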


@ptrblck According to the test code, the weights and biases for the final linear layer and the final layer normalization (for the LSTM layer) are changing; however, the weights and biases for the rest of the normalizations and layers are not updating. I printed out the gradients to check if there is a problem there and found that a large minority of the gradients are 0, with the rest mostly being in the 10^-1 to 10^-4 region. Is there a way to fix these two problems (which are probably related)?

A proper weight initialization might avoid the vanishing gradients.
Also, normalization layers will help (which you are using already).

Are you using pure PyTorch code in your model?
Can you make sure you are not detaching some tensors?
Also, a small code snippet might help, if that’s possible.
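
As a side note on the detaching question: if a tensor is detached anywhere in the forward pass, every parameter upstream of that point stops receiving gradients, which would match the symptom of only the last layers updating. A tiny illustration with placeholder layers (not your actual architecture):

import torch
import torch.nn as nn

fc1 = nn.Linear(4, 4)
fc2 = nn.Linear(4, 1)

x = torch.randn(2, 4)
out = fc2(fc1(x).detach())  # detach() cuts the graph between fc1 and fc2
out.mean().backward()

print(fc1.weight.grad)  # None -> fc1 would never be updated
print(fc2.weight.grad)  # a proper gradient tensor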

@ptrblck For one of my models I am using a PyTorch loss function. I do some calculations beforehand, but in the end the loss is handled by PyTorch:

a = torch.sum(outputs * one_hot)
F.mse_loss(a, desired).backward(retain_graph=True)
self.optimizer.step()

For another model I implemented the loss function:

loss = -(obj_func + tau * entropy)
loss.backward(retain_graph=True)
self.optimizer.step()

Yet the results for both are the same: the last layer and layer normalization are updating, but the remaining layers are not, with the gradient distribution mentioned above.

I am trying to use Kaiming and Xavier initialization, but the initialized weights are not being saved to the model. The code I am using to try to accomplish this:

new_weights = collections.OrderedDict()
for key, value in network.state_dict().items():
    try:
        if 'bias' in key:
            new_weights[key] = torch.full_like(network.state_dict()[key], 0)
        elif 'ltsm' in key:
            new_weights[key] = torch.nn.init.xavier_uniform_(value, gain=5/3)
        elif 'batch' in key or 'norm' in key:
            new_weights[key] = torch.nn.init.xavier_uniform_(value)
            # Is this the right initialization to use for batch and layer normalization?
        else:
            new_weights[key] = torch.nn.init.kaiming_uniform_(value, a=0, mode='fan_out', nonlinearity='relu')
    except:
        new_weights[key] = torch.full_like(network.state_dict()[key], 0)

I initially tried to directly reassign the weights but that did not work.

Could you try to fix the weight init mentioned in your other thread?
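
As a starting point: building the new OrderedDict does not by itself write anything back into the model, so it would still need to be loaded with network.load_state_dict(new_weights). An alternative is to apply the init functions to the parameters in-place via model.apply. A minimal sketch of that approach, assuming the ReLU/tanh split described above (the isinstance checks here are placeholders, not your exact layers):

import torch.nn as nn

def init_weights(m):
    # Kaiming init for conv/linear layers that feed into ReLU
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    # Xavier init with the tanh gain (5/3) for the LSTM weight matrices
    elif isinstance(m, nn.LSTM):
        for name, param in m.named_parameters():
            if 'weight' in name:
                nn.init.xavier_uniform_(param, gain=nn.init.calculate_gain('tanh'))
            else:
                nn.init.zeros_(param)

network.apply(init_weights)
# or, if you keep the state-dict approach:
# network.load_state_dict(new_weights)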