Sparse Layers for Sequential NN Models

I’m looking for a method to sparsify a simple network as described below:

    model = torch.nn.Sequential(
        collections.OrderedDict(
            [
                ("layer1", torch.nn.Linear(num_A, num_A)),
                ("act1", torch.nn.Tanh()), 
                ("layer2", torch.nn.Linear(num_A, num_B)),
                ("act2", torch.nn.Tanh()), 
                ("layer3", torch.nn.Linear(num_B, num_B)),
            ]
        )
    )

I am using torch.nn.utils.prune.custom_from_mask to prune the weights I want to be zero, by sending a mask matrix to the device that is 99% zeros and 1% ones.

    # move the 99%-sparse mask and the layer to the device, then apply the mask
    matrix = matrix.to(device)
    module1 = model.layer1.to(device)
    torch.nn.utils.prune.custom_from_mask(module1, name='weight', mask=matrix)
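
For reference, a minimal self-contained sketch of this setup (the size and the 1% density are placeholders; in my case the mask comes from elsewhere):

    import torch
    import torch.nn.utils.prune as prune

    num_A = 1000                                     # placeholder size
    layer = torch.nn.Linear(num_A, num_A)

    # mask with ~1% ones and ~99% zeros
    mask = (torch.rand(num_A, num_A) < 0.01).float()

    prune.custom_from_mask(layer, name="weight", mask=mask)

    # the pruning reparametrization keeps a dense weight_orig and a dense weight_mask,
    # so the layer's memory footprint does not shrink
    print(layer.weight_orig.shape, layer.weight_mask.shape)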

The model size after masking is still as large as a fully connected network's. The results I'm getting with this strategy look good, but I need the network to be genuinely sparse to deploy it at scale: materializing the mask as a dense matrix runs into memory limits on both the CPU and GPU, and building the layer fully connected does the same on the GPU.
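
To make the memory concern concrete, a rough back-of-the-envelope comparison (placeholder size) of the dense mask versus the same matrix in sparse COO format:

    import torch

    n = 3000                                           # placeholder size
    dense = torch.zeros(n, n)
    dense[torch.rand(n, n) < 0.01] = 1.0               # ~1% nonzeros
    sparse = dense.to_sparse().coalesce()

    dense_bytes = dense.nelement() * dense.element_size()
    sparse_bytes = (sparse.values().nelement() * sparse.values().element_size()
                    + sparse.indices().nelement() * sparse.indices().element_size())
    print(f"{dense_bytes / 1e6:.1f} MB dense")         # ~36 MB
    print(f"{sparse_bytes / 1e6:.1f} MB sparse")       # ~1.8 MB (values + int64 indices)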

Any suggestions would be appreciated.

Hi Dan!

If you intend to use your network just for inference (e.g., after deployment
to a smaller device), you could do something like:

>>> import torch
>>> print (torch.__version__)
1.12.0
>>>
>>> _ = torch.manual_seed (2022)
>>>
>>> lin = torch.nn.Linear (3, 5)
>>> mask = torch.randint (2, (5, 3))
>>>
>>> with torch.no_grad():
...     lin.weight.mul_ (mask)   # weight has a lot of zeros but not in sparse format
...
Parameter containing:
tensor([[-0.0000,  0.4872,  0.0000],
        [-0.1372, -0.0000, -0.0000],
        [ 0.0000,  0.3245, -0.0000],
        [ 0.0000,  0.0000, -0.3115],
        [ 0.0000,  0.2717,  0.0000]], requires_grad=True)
>>>
>>> input = torch.randn (3)
>>>
>>> output = lin (input)
>>>
>>> lin.weight = torch.nn.Parameter (lin.weight.to_sparse())   # weight is now in sparse format
>>> lin.weight
Parameter containing:
tensor(indices=tensor([[0, 1, 2, 3, 4],
                       [1, 0, 1, 2, 1]]),
       values=tensor([ 0.4872, -0.1372,  0.3245, -0.3115,  0.2717]),
       size=(5, 3), nnz=5, layout=torch.sparse_coo, requires_grad=True)
>>>
>>> outputB = lin (input)
>>>
>>> torch.equal (output, outputB)   # same result
True
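
Applied to your Sequential model, the same idea could be wrapped in a small helper along these lines (the helper name is made up, and if the pruning reparametrization from custom_from_mask is still attached you would want prune.remove() first so the masked weight becomes the real .weight). It's worth double-checking that a Linear with a sparse weight handles your batched input shapes on your pytorch version:

    import torch
    import torch.nn.utils.prune as prune

    def to_sparse_inference(model):
        # convert every Linear weight to sparse COO, for inference only
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                if prune.is_pruned(module):
                    prune.remove(module, "weight")   # bake the mask into .weight
                module.weight = torch.nn.Parameter(module.weight.detach().to_sparse())
        return model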

Note that you can't use this for training, as pytorch does not yet support
sparse-tensor backpropagation.

Best.

K. Frank

Thank you, K. Frank. I think this will do nicely for deploying the model.

I do need a way to train dense-sparse and sparse-sparse layers.

I’m taking a look at the RigL PyTorch implementation here: rigl-torch · PyPI

I was able to modify the rigl-torch implementation to prune with a static sparse mask instead of the random selection from a uniform distribution used in the paper. Compared with the torch.nn.utils.prune.custom_from_mask approach, training is much slower with RigL, but the resulting model is several times smaller and inference is several times faster.
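
For context, the masking side of that comparison is essentially a standard training loop with custom_from_mask applied up front; a minimal sketch with placeholder sizes and a placeholder loss:

    import torch
    import torch.nn.utils.prune as prune

    torch.manual_seed(0)
    lin = torch.nn.Linear(64, 32)
    mask = (torch.rand(32, 64) < 0.01).float()        # ~99% sparse, as in my setup
    prune.custom_from_mask(lin, name="weight", mask=mask)

    opt = torch.optim.SGD(lin.parameters(), lr=0.1)
    for _ in range(100):
        x = torch.randn(8, 64)
        loss = lin(x).pow(2).mean()                   # placeholder loss
        opt.zero_grad()
        loss.backward()
        opt.step()            # updates weight_orig; forward uses weight_orig * weight_mask

    print((lin.weight == 0).float().mean())           # still ~99% zeros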