I am trying out a PyTorch implementation of the Lottery Ticket Hypothesis. For that, I want to freeze the weights in a model that are zero. Is the following a correct way to implement it?
```python
for name, p in model.named_parameters():
    if 'weight' in name:
        tensor = p.data.cpu().numpy()
        grad_tensor = p.grad.data.cpu().numpy()
        grad_tensor = np.where(tensor == 0, 0, grad_tensor)
        p.grad.data = torch.from_numpy(grad_tensor).to(device)
```
The question is tricky. First, the paper is suspicious… at least to me.
I just fast-forwarded through it. They seem to optimize the computation to reduce the heat…
Definitely not the memory, because the shape of the tensors stays the same.
The original idea of parameters is that they should not be read-only.
If you freeze weights, then how can you learn?
If a huge number of your weight values are 0, you haven't initialized your layer well, or the mean and std are not normalized going through the layers.
I would ignore this paper.
If for some reason you would still like to deal with the paper (say you would like to implement it):

Try using `model.parameters()` instead of `model.named_parameters()`, since named parameters are a subset of parameters.

`.numpy()` converts a tensor to a NumPy array, so don't assign that to a tensor. In your case `tensor` and `grad_tensor` should be tensors with the same shape.

Use `torch.from_numpy(arr)` if you have NumPy arrays and would like to convert them to tensors.

Although CUDA code runs on both CPU and GPU, I would not hardcode `.cpu()` or `.cuda()`; instead, always use `.to(device)`, unless you are positive you will only ever run your code on GPU. To detect the device you can use this:

```python
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
```
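As an illustrative sketch of the device-agnostic tensor/NumPy round trip the points above describe (the tensor values here are arbitrary):

```python
import numpy as np
import torch

# Pick the device once, then move tensors with .to(device) everywhere.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

t = torch.randn(3, 3, device=device)

# .numpy() only works on CPU tensors, so move to CPU first.
arr = t.cpu().numpy()                    # torch.Tensor -> np.ndarray
back = torch.from_numpy(arr).to(device)  # np.ndarray  -> torch.Tensor

assert back.shape == t.shape
```

Note that `torch.from_numpy` shares memory with the source array when the result stays on CPU, so mutating `arr` in place would also mutate the tensor in that case.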
I don't find the paper suspicious. I believe it conveys a nice property: "Almost all the time, if you give me a network, I can point you to a subnetwork that is capable of giving comparable accuracies."
It is true that even though we call it compression, the pruned weights are still allocated in memory as zeros and the shape is retained, but that is more of an implementation flaw than the result the paper is trying to convey. As I mentioned earlier, the result, "Almost all the time, if you give me a network, I can point you to a subnetwork that is capable of giving comparable accuracies", is loud and clear.
Also, by freezing weights we are freezing them to zero, effectively removing those neurons, hence it starts to act like a new network altogether.
I think it's a fascinating paper, and the following follow-up papers to the Lottery Ticket Hypothesis will help you understand my fascination.
 Stabilizing the Lottery Ticket Hypothesis
 Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask
This question was answered on Stack Overflow by a user named jodag. I have modified the answer and added it here.
SOLUTION:
What you have seems like it would work, provided you do it after `loss.backward()` and before `optimizer.step()` (referring to the common usage of these variable names). That said, it seems a bit convoluted. Also, if your weights are floating-point values, then comparing them to exactly zero is probably a bad idea; we can introduce an epsilon to account for this.
The modified code:

```python
EPS = 1e-6  # Changes: small tolerance instead of exact zero
for name, p in model.named_parameters():
    if 'weight' in name:
        tensor = p.data.cpu().numpy()
        grad_tensor = p.grad.data.cpu().numpy()
        grad_tensor = np.where(np.abs(tensor) < EPS, 0, grad_tensor)  # Changes
        p.grad.data = torch.from_numpy(grad_tensor).to(device)
```

(Note the `np.abs(...)`: comparing `tensor < EPS` directly would also zero the gradients of all negative weights, not just the pruned ones.)
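For context, here is a minimal end-to-end sketch showing where this masking step sits in a training loop (between `loss.backward()` and `optimizer.step()`). The toy model, random data, and the way pruning is simulated are all placeholders, not part of the original answer:

```python
import numpy as np
import torch
import torch.nn as nn

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Toy model and data, purely illustrative.
model = nn.Linear(4, 2).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Simulate pruning: zero out roughly half of the weights.
with torch.no_grad():
    mask = torch.rand_like(model.weight) > 0.5
    model.weight *= mask

EPS = 1e-6
x = torch.randn(8, 4, device=device)
y = torch.randn(8, 2, device=device)

loss = criterion(model(x), y)
loss.backward()

# Zero the gradients of pruned (zero) weights so they stay frozen.
for name, p in model.named_parameters():
    if 'weight' in name and p.grad is not None:
        tensor = p.data.cpu().numpy()
        grad_tensor = p.grad.data.cpu().numpy()
        grad_tensor = np.where(np.abs(tensor) < EPS, 0, grad_tensor)
        p.grad.data = torch.from_numpy(grad_tensor).to(device)

optimizer.step()
# With plain SGD (no momentum or weight decay), the pruned
# weights receive a zero update and therefore remain zero.
```

With momentum or weight decay, the update is no longer purely gradient-driven, so the pruned weights could drift away from zero; the simple SGD setting above avoids that.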