Freezing Individual Weights

I am trying out a PyTorch implementation of the Lottery Ticket Hypothesis. For that, I want to freeze the weights in the model that are zero. Is the following a correct way to implement it?

for name, p in model.named_parameters():
    if 'weight' in name:
        tensor = p.data.cpu().numpy()
        grad_tensor = p.grad.data.cpu().numpy()
        grad_tensor = np.where(tensor == 0, 0, grad_tensor)
        p.grad.data = torch.from_numpy(grad_tensor).to(device)

The question is tricky. First of all, the paper seems suspicious… at least to me.

I only fast-forwarded through it. They seem to optimize the computation to reduce the heat…

Definitely not the memory, because the shape of the tensors stays the same.
The original idea of parameters is that they should not be read-only.

If you freeze weights, then how can you learn?
If a huge fraction of your weight values is 0, you haven’t initialized your layer well, or the mean and std are not kept normalized going through the layers.
I would ignore this paper.

If for some reason you would still like to deal with the paper (say, you would like to implement it):

  • Try using model.parameters() instead of model.named_parameters(), since named parameters are a subset of parameters.

  • .numpy() converts a tensor to a numpy array, so don’t assign that to a tensor. In your case, tensor and grad_tensor should be tensors with the same shape.

  • Use torch.from_numpy(arr) if you have numpy arrays that you would like to convert to tensors.

  • Although PyTorch code runs on both CPU and GPU, I would not hardcode .cpu() or .cuda(); instead, always use .to(device), unless you are positive you will only run your code on the GPU. To detect the device you can use this:

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
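
Putting those points together, here is a minimal sketch of the tensor ↔ numpy round-trip with explicit device handling (the nn.Linear layer is just a hypothetical stand-in for your model):

import numpy as np
import torch
import torch.nn as nn

# detect the device once and reuse it everywhere
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = nn.Linear(10, 4).to(device)   # hypothetical toy model

for name, p in model.named_parameters():
    if 'weight' in name:
        arr = p.data.cpu().numpy()                   # tensor -> numpy array (must be on CPU)
        restored = torch.from_numpy(arr).to(device)  # numpy array -> tensor, moved back to the device
        assert restored.shape == p.shape             # shapes round-trip unchanged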

I don’t find the paper suspicious. I believe it conveys a nice property: “Almost all the time, if you give me a network, I can point you to a subnetwork that is capable of giving comparable accuracies.”

It is true that even though we call it compression, the pruned weights are still allocated in memory as zeros and the shape is retained, but that is more of an implementation flaw than the result the paper is trying to convey. As I mentioned earlier, the result, “Almost all the time, if you give me a network, I can point you to a subnetwork that is capable of giving comparable accuracies”, is loud and clear.

Also, by freezing weights we are freezing them to zero, potentially removing that neuron, so the pruned model starts to act like a new network altogether. A rough sketch of that idea follows below.
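
As a rough sketch (not the exact code from the paper), you can keep an explicit binary mask of the surviving weights and multiply the gradients by it, so the pruned weights never move away from zero:

import torch
import torch.nn as nn

model = nn.Linear(8, 3)   # hypothetical toy layer standing in for a real network

# 1 where the weight survives, 0 where it has been pruned to zero
masks = {name: (p.data != 0).float()
         for name, p in model.named_parameters() if 'weight' in name}

def apply_masks(model, masks):
    # call this after loss.backward() and before optimizer.step()
    for name, p in model.named_parameters():
        if name in masks and p.grad is not None:
            p.grad.data.mul_(masks[name])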

I think it’s a fascinating paper, and the following follow-up papers on the Lottery Ticket Hypothesis will help you understand my fascination:

  1. Stabilizing the Lottery Ticket Hypothesis
  2. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask

This question was answered on Stack Overflow by a user named jodag. I have modified it and added it here.

SOLUTION:

What you have seems like it would work, provided you do it after loss.backward() and before optimizer.step() (referring to the common usage of these variable names). That said, it seems a bit convoluted. Also, if your weights are floating-point values, then comparing them to exactly zero is probably a bad idea; we can introduce an epsilon to account for this.

The modified code:

EPS = 1e-6       # changed: tolerance for treating a weight as pruned
for name, p in model.named_parameters():
    if 'weight' in name:
        tensor = p.data.cpu().numpy()
        grad_tensor = p.grad.data.cpu().numpy()
        grad_tensor = np.where(np.abs(tensor) < EPS, 0, grad_tensor)     # changed: compare |weight| against EPS
        p.grad.data = torch.from_numpy(grad_tensor).to(device)
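
For reference, the same masking can be done entirely in PyTorch without the numpy round-trip, which also avoids moving data off the GPU (a sketch under the same assumption that it runs after loss.backward() and before optimizer.step()):

import torch

EPS = 1e-6
model = torch.nn.Linear(8, 3)   # stand-in for your already-pruned model

with torch.no_grad():
    for name, p in model.named_parameters():
        if 'weight' in name and p.grad is not None:
            p.grad[p.abs() < EPS] = 0.0   # zero the gradient wherever the weight is (near) zero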