Hi Everyone,
I am trying to understand whether the following two ways of building a sparse weight matrix are equivalent in PyTorch. Let me add some context:
I am training sparse neural networks with a specific structured sparsity pattern, and I have looked at two methods for doing so.

Parameterized Approach
In this approach, P_j is a permutation matrix with 1s in the places where the weights are supposed to be nonzero in the final W matrix. The P_j are predefined and not trainable, no two of them overlap, and each has 1s in D positions.
alpha is a binary vector of size J whose entries are either 1 or 0; it dictates which permutation matrices are selected for the final W matrix.
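In code, my construction looks roughly like the toy sketch below. All names here (`Ps`, `alpha`, `V`) are mine for illustration, and I am assuming the trainable values live in a dense parameter `V` that the selected masks pick entries out of:

```python
import torch

torch.manual_seed(0)

# Toy sizes: N x N weight, J disjoint permutation masks, K of them selected.
N, J, K = 8, 4, 2

# J disjoint permutation masks (cyclic shifts of the identity), each with D = N ones.
Ps = torch.stack([torch.eye(N).roll(shifts=j, dims=1) for j in range(J)])

# alpha is a fixed 0/1 selector, not trainable.
alpha = torch.zeros(J)
alpha[torch.randperm(J)[:K]] = 1.0

# Assumed trainable dense parameter holding the actual weight values.
V = torch.randn(N, N, requires_grad=True)

# Parameterized approach: W = sum_j alpha_j * (P_j elementwise V),
# so only the selected masks contribute nonzeros.
W = (alpha.view(J, 1, 1) * Ps * V).sum(dim=0)
```

With K masks of D ones each, W ends up with K × D nonzeros, matching the count described below.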
Masking Approach
This approach also uses the P matrices to decide the positions of the nonzeros in the final W matrix.
M is the final mask after adding K permutation matrices.
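A sketch of this variant, using the same hypothetical names as above (`Ps` for the stack of fixed permutation masks, `V` for the assumed trainable dense parameter):

```python
import torch

torch.manual_seed(0)

# Toy sizes: N x N weight, J disjoint permutation masks, K of them chosen.
N, J, K = 8, 4, 2
Ps = torch.stack([torch.eye(N).roll(shifts=j, dims=1) for j in range(J)])

idx = torch.randperm(J)[:K]   # indices of the K chosen permutation matrices
M = Ps[idx].sum(dim=0)        # binary mask with K * D nonzeros (the P_j are disjoint)

V = torch.randn(N, N, requires_grad=True)  # assumed trainable dense parameter
W = M * V                                  # masking approach: W = M elementwise V
```

Note that if the same V is shared, the two constructions agree algebraically: since the P_j are disjoint and alpha is 0/1, Σ_j α_j (P_j ⊙ V) = (Σ_j α_j P_j) ⊙ V = M ⊙ V.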
Comparison
To compare the two methods, we ensure that the resulting W has the same number of nonzeros. Let’s take the example of K = 30.
 In the first approach, we randomly set 30 alphas to 1 and leave the rest at 0. This gives a W matrix with 30 × D nonzeros.
 For the second approach, we use the indices of the randomly picked alphas and sum the corresponding P matrices to form M.
 I also set the random seed to be the same for both experiments to initialize W.
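Given this setup, one numerical check I could run is to build W both ways from the same seed and verify that the forward values and the gradients with respect to the shared dense parameter agree (again, `Ps`/`V` are my hypothetical names, not anything from a library):

```python
import torch

torch.manual_seed(0)
N, J, K = 8, 4, 2
Ps = torch.stack([torch.eye(N).roll(shifts=j, dims=1) for j in range(J)])
idx = torch.randperm(J)[:K]
alpha = torch.zeros(J)
alpha[idx] = 1.0

V1 = torch.randn(N, N, requires_grad=True)
V2 = V1.detach().clone().requires_grad_(True)   # same init, separate graph

W1 = (alpha.view(J, 1, 1) * Ps * V1).sum(dim=0)  # parameterized approach
W2 = Ps[idx].sum(dim=0) * V2                     # masking approach

x = torch.randn(N)
W1.matmul(x).sum().backward()
W2.matmul(x).sum().backward()
print(torch.allclose(W1, W2), torch.allclose(V1.grad, V2.grad))  # prints: True True
```

If a check like this passes but training still diverges, the difference is probably elsewhere (optimizer state on the masked-out entries, weight decay, how the mask is reapplied after each step), not in the forward/backward of the weight construction itself.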
I am using these weight formulations for an MLP on CIFAR10. My MLP has the following specifications:
 Layer1: 3072x3072 Weight Matrix
 Layer2: 10x3072 Weight Matrix
In both experiments, I am sparsifying Layer1 and keeping Layer2 dense.
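For reference, the masked network looks roughly like this toy module (sizes shrunk from 3072/10 for the example; `MaskedMLP`, `M`, and `V` are my names, and I assume the mask is precomputed and stored as a non-trainable buffer):

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Toy version of the setup: masked first layer, dense output layer."""
    def __init__(self, M, num_classes=10):
        super().__init__()
        d = M.shape[0]
        # Assumed trainable dense parameter; mask is applied in the forward pass.
        self.V = nn.Parameter(torch.randn(d, d) * d ** -0.5)
        self.register_buffer("M", M)          # fixed binary mask, not trainable
        self.out = nn.Linear(d, num_classes)  # Layer2 stays dense

    def forward(self, x):
        h = torch.relu(x @ (self.M * self.V).t())  # sparsified Layer1
        return self.out(h)

M = torch.eye(8)                 # stand-in mask just for the demo
model = MaskedMLP(M)
y = model(torch.randn(2, 8))
```

Because the mask multiplication happens inside `forward`, the gradient reaching `V` is automatically zeroed at the masked positions on every step.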
The dense accuracy of the network is 57.34% (not great, but this network is just a toy example for our sparsity research).
When I choose K = 30 and run the experiments, I get an accuracy of around 50.1% with the parameterized approach and 56.4% with the masking approach.
Does anyone have any suggestions on why that might be the case?
Are the two approaches equivalent in terms of PyTorch's gradient flow?
Any suggestions on how I can debug this?