Hi there,
I am trying to implement sparsely connected weight matrices for my simple 3-layer feedforward model. To do this I created a mask for each of my layers with a certain percentage of zeros, the idea being that I zero out the same set of weights after every optimizer step so that my layers are not fully connected. But I am running into a problem: when I do an element-wise multiplication of the mask with the weight matrices, the weights stop changing in subsequent backward passes. To check whether the mask itself is causing the issue, I just multiplied my weight matrices by the scalar 1.0, and this recreates the issue. What might be happening here? I checked and gradients still get calculated. It’s just that the loss doesn’t go down anymore and the weights don’t change. Does doing this multiplication somehow disconnect the weights from the graph?
My model:
import torch
import torch.nn as nn

class TSP(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(TSP, self).__init__()
        self.sc1 = nn.Linear(input_size, hidden_size)
        self.sc2 = nn.Linear(hidden_size, input_size)
        torch.nn.init.normal_(self.sc1.weight, mean=0, std=0.1)
        torch.nn.init.normal_(self.sc2.weight, mean=0, std=0.1)

    def forward(self, x):
        x = torch.relu(self.sc1(x))
        x = self.sc2(x)
        return x

    def predict_hidden(self, x):
        x = torch.relu(self.sc1(x))
        return x
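This is roughly how I apply the masks after each optimizer step (simplified; mask1 and mask2 are precomputed masks matching the weight shapes):

for x_batch, y_batch in loader:
    optimizer.zero_grad()
    output = model(x_batch)
    loss = criterion(output, y_batch)
    loss.backward()
    optimizer.step()
    # re-apply the sparsity pattern after the update
    # (this is the step that seems to break training)
    model.sc1.weight = nn.Parameter(model.sc1.weight * mask1)
    model.sc2.weight = nn.Parameter(model.sc2.weight * mask2)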
To recreate the issue, all that is needed is the following, after which the weights stop getting updated:
model.sc1.weight = nn.Parameter(1. * model.sc1.weight)
model.sc2.weight = nn.Parameter(1. * model.sc2.weight)
Recreating new tensors or parameters will detach them from the computation graph and create new leaf variables. In addition, the optimizer was created with references to the original parameters, so it keeps updating those old tensors while the newly assigned parameters are never stepped, which is why you still see gradients but no weight changes.
Use the parameter directly inside the forward
instead of assigning a new parameter to the internal attribute.
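You can verify this with a quick check (an illustrative snippet using a standalone nn.Linear and SGD, not your exact model); after the reassignment the optimizer still points at the old parameter object:

import torch
import torch.nn as nn

lin = nn.Linear(4, 4)
optimizer = torch.optim.SGD(lin.parameters(), lr=0.1)

old_weight = lin.weight
lin.weight = nn.Parameter(1. * lin.weight)  # new leaf tensor replaces the registered parameter

opt_params = [p for group in optimizer.param_groups for p in group['params']]
print(any(p is lin.weight for p in opt_params))   # False -> the new parameter is never stepped
print(any(p is old_weight for p in opt_params))   # True  -> the optimizer still updates the old tensor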
Thanks for your answer!
So if I want to apply my mask to the weights, I should declare it within the model class and apply it in the forward pass?
This is how I currently produce the mask:
# Make a mask for the sparse network. Input params are layer dimensions and sparsity level
def sparse_weight_mask(input_size, output_size, sparsity_level):
    mask = torch.zeros(output_size, input_size, requires_grad=False)
    # For every row in the mask, set a random selection of sparsity_level% of the weights to 1
    for i in range(output_size):
        mask[i, torch.randperm(input_size)[:int(input_size * sparsity_level)]] = 1
    mask.requires_grad = True
    return mask
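For reference, I currently create the masks to match the layer shapes roughly like this (the sizes here are just examples):

input_size, hidden_size, sparsity_level = 100, 50, 0.2
mask1 = sparse_weight_mask(input_size, hidden_size, sparsity_level)  # shape matches sc1.weight: (hidden_size, input_size)
mask2 = sparse_weight_mask(hidden_size, input_size, sparsity_level)  # shape matches sc2.weight: (input_size, hidden_size)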
Yes, the important part is to reuse the already initialized trainable parameter and to use it in the forward,
e.g. via:
def forward(self, x):
    x = self.weight * mask * x
    ...
I don’t know if the mask should also be trained, but if so you might also want to initialize it in the model’s __init__
method.
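Something like the following could work as a starting point (just a sketch, with the class named SparseTSP here for illustration, assuming the masks stay fixed; it reuses your sparse_weight_mask helper, detaches its output so the buffers don’t require gradients, and uses the functional linear call so the registered parameters stay connected to the optimizer):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTSP(nn.Module):
    def __init__(self, input_size, hidden_size, sparsity_level):
        super().__init__()
        self.sc1 = nn.Linear(input_size, hidden_size)
        self.sc2 = nn.Linear(hidden_size, input_size)
        nn.init.normal_(self.sc1.weight, mean=0, std=0.1)
        nn.init.normal_(self.sc2.weight, mean=0, std=0.1)
        # Fixed masks registered as buffers: saved in the state_dict and moved
        # to the model's device, but not updated by the optimizer.
        self.register_buffer('mask1', sparse_weight_mask(input_size, hidden_size, sparsity_level).detach())
        self.register_buffer('mask2', sparse_weight_mask(hidden_size, input_size, sparsity_level).detach())

    def forward(self, x):
        # Mask the effective weight inside the forward pass; the parameters
        # themselves are never replaced, so the optimizer keeps updating them.
        x = torch.relu(F.linear(x, self.sc1.weight * self.mask1, self.sc1.bias))
        x = F.linear(x, self.sc2.weight * self.mask2, self.sc2.bias)
        return x

If the masks should be trainable instead, you could register them as nn.Parameters in __init__ and they would receive gradients through the same multiplication. Note that with this approach the underlying parameters keep their original dense values; only the effective weight used in the forward pass is sparse.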