Issue with using sparse matrix representation for a pruned model

So, I have trained a model and pruned it using unstructured pruning, which set some of the weights to zero. But the inference time hasn't improved much, since the zero weights still participate in the matrix multiplications at inference. Since autograd doesn't support sparse matrix computation, I retrained the pruned model with its dense layers and only afterwards converted the weights to sparse matrices. But when I use the model for evaluation, it gives me the following error:

y_preds = model(X_batch)
File "/sys_apps_01/python/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/jupyterhubenc/e132293/Pruning/utils/nn_architecture.py", line 105, in forward
out = self.fc1(x)
File "/sys_apps_01/python/python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/sys_apps_01/python/python310/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: addmm_sparse_dense: expected either 'mat1' to have sparse layout and 'mat2' to have strided layout, got 'mat1' with layout Strided and 'mat2' with layout Sparse

The evaluation code:

import time

import torch

def evaluation(model, test_loader, device, y_test_data):
    model.eval()
    all_predictions = []

    start_time = time.time()

    with torch.no_grad():
        for X_batch in test_loader:
            X_batch = X_batch.to(device)
            y_preds = model(X_batch)
            all_predictions.append(y_preds.view(-1, 2).cpu().numpy())

    end_time = time.time()
    inference_time = end_time - start_time

    print("Inference time:", inference_time, "seconds")

Is there any way I can fix this, or can I remove the zero weights from the pruned model in order to reduce the inference time?
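For context on the error itself: F.linear computes input @ weight.T, and PyTorch's sparse addmm kernel only accepts a sparse first operand with a dense (strided) second one, so a sparse weight sitting inside a stock nn.Linear hits exactly this layout mismatch. A minimal inference-only workaround, as a sketch assuming 2-D inputs of shape (batch, in_features) (the SparseLinear name is my own illustration, not a PyTorch class): keep the weight in sparse COO layout and put it first in the multiplication by computing (W @ x^T)^T.

import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    # Inference-only stand-in for a pruned nn.Linear whose weight is sparse.
    def __init__(self, linear):
        super().__init__()
        # Keep the pruned weight in sparse COO layout; zeros are dropped.
        # These are plain attributes, so move the wrapped layer to the target
        # device before converting: model.to(device) will not move them.
        self.weight = linear.weight.detach().to_sparse()
        self.bias = None if linear.bias is None else linear.bias.detach()

    def forward(self, x):
        # torch.sparse.mm requires the sparse matrix as the first operand,
        # so compute (W @ x^T)^T instead of x @ W^T as F.linear does.
        out = torch.sparse.mm(self.weight, x.t()).t()
        if self.bias is not None:
            out = out + self.bias
        return out

With that, a pruned layer can be swapped in place, e.g. model.fc1 = SparseLinear(model.fc1), and the evaluation loop above runs unchanged.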

A similar thread discusses the same issue.

This link shows how to actually remove the weights via structured pruning.
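For reference, a minimal illustration of structured pruning with torch.nn.utils.prune (this only zeroes whole output channels; actually shrinking the layer still means rebuilding a smaller nn.Linear from the surviving rows, which is presumably what the linked post addresses):

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 100)
# Zero the 30% of output channels (rows of weight) with the smallest L2 norm
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)
prune.remove(layer, "weight")  # make pruning permanent by baking in the mask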

Hi @Soumya_Kundu, custom structured pruning will definitely reduce the parameter count, but is there any way I can do the same thing with unstructured pruning by completely removing the zero weights? Is this not achievable with a sparse matrix representation?

Specifically for that, you can have a read of this: Pruning doesn't affect speed nor memory usage · Issue #36214 · pytorch/pytorch · GitHub

TLDR Code:

import torch
import torch.nn.utils.prune as prune

# Dense 100 x 100 tensor; every entry is stored
t = torch.randn(100, 100)
torch.save(t, 'full.pth')

# Zero out the 90% of entries with the smallest absolute value
p = prune.L1Unstructured(amount=0.9)
pruned = p.prune(t)
torch.save(pruned, 'pruned.pth')  # same size on disk: zeros are still stored

# Sparse COO layout stores only the nonzero entries (plus their indices)
sparsified = pruned.to_sparse()
torch.save(sparsified, 'sparsified.pth')

I am not sure how you can integrate this into an end-to-end pipeline.
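That said, one rough way to wire it into the evaluation pipeline, as a sketch rather than a definitive recipe (sparsify_linears is a hypothetical helper of mine; it reuses the SparseLinear wrapper sketched earlier in the thread):

import torch.nn as nn

def sparsify_linears(model, max_density=0.5):
    # Hypothetical helper: replace every nn.Linear whose weight is mostly
    # zeros with the SparseLinear wrapper from earlier in the thread.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            density = (child.weight != 0).float().mean().item()
            if density < max_density:
                setattr(model, name, SparseLinear(child))
        else:
            sparsify_linears(child, max_density)  # recurse into submodules
    return model

model = sparsify_linears(model)
evaluation(model, test_loader, device, y_test_data)

One caveat worth stating: sparse kernels usually only pay off at very high sparsity, and at moderate sparsity torch.sparse.mm can actually be slower than the dense matmul, so it is worth timing both paths before committing.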