Applying a custom mask to a convolution kernel in a CNN

Is this the correct way to specify custom manipulation of the weights of a convolution layer?

import torch
import torch.nn as nn


class MaskedConv3d(nn.Module):
    def __init__(self, channels, filter_mask):
        super().__init__()
        self.kernel_size = tuple(filter_mask.shape)
        self.filter_mask = nn.Parameter(filter_mask)  # mask tensor
        self.conv = nn.Conv3d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=self.kernel_size,
        )

    def forward(self, x):
        self._mask_conv_filter()
        return self.conv(x)

    def _mask_conv_filter(self):
        with torch.no_grad():
            self.conv.weight.data = self.conv.weight.data * self.filter_mask

Specifically, in the last line of code I’m using the .data attribute rather than the tensors themselves, since otherwise I get the following error:

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

Original code:

        with torch.no_grad():
            self.conv.weight = self.conv.weight * self.filter_mask

Thanks
Barak

No, you shouldn’t use the .data attribute, as it might yield silent errors and could break your code in various ways.

The error message points to a mismatch between a tensor and the expected nn.Parameter.
Try to wrap the new weight into a parameter via:

with torch.no_grad():
    self.conv.weight = nn.Parameter(self.conv.weight * self.filter_mask)

Also, since self.filter_mask is used in a no_grad() block only, I assume it won’t be trained and can thus be registered as a buffer via:

self.register_buffer('filter_mask', filter_mask)
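
Putting both suggestions together, a minimal sketch could look like this (assuming the masking only needs to be applied once, right after the conv layer is created):

import torch
import torch.nn as nn


class MaskedConv3d(nn.Module):
    def __init__(self, channels, filter_mask):
        super().__init__()
        # non-trainable mask; moves with the module on .to(device)
        self.register_buffer('filter_mask', filter_mask)
        self.conv = nn.Conv3d(
            in_channels=channels,
            out_channels=channels,
            kernel_size=tuple(filter_mask.shape),
        )
        # replace the weight once with its masked version
        with torch.no_grad():
            self.conv.weight = nn.Parameter(self.conv.weight * self.filter_mask)

    def forward(self, x):
        return self.conv(x)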

Thanks @ptrblck for your prompt answer.

This actually broke the backward pass for me; it seems the weight matrix stays the same after backward is called. Maybe something changed between PyTorch versions?

The gradient should not be affected by the masking since it was applied inside the no_grad() context manager. @BBLN

I meant that the gradients for the weights themselves stopped working at all, maybe because the parameter object is being replaced during the module’s forward pass.

I know this thread is old, but I’m having the same issue and wondering if anyone has got this masked approach working?

That is, I can successfully use the posted code from @ptrblck to mask a 2d convolution, but when I print out the weights of that masked convolution during training, none of them are updating. So it seems that the act of masking the convolution stops all of the weights (both those inside and outside of the desired mask) from updating. This problem persists even if I omit the with torch.no_grad(): line.

Note that my approach replaces the original parameter with the masked one. This would also mean that this operation should be performed during or right after model initialization and before passing the final parameters to the optimizer.
If you replace the parameters after they were already passed to the optimizer they won’t be updated.
Let me know if this approach works for you or if you need to mask the weight in each forward pass etc.
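
For reference, the intended ordering would be something like this (a minimal sketch; the model and mask arguments are just placeholders):

model = MaskedConv3d(channels=4, filter_mask=torch.ones(3, 3, 3))

# create the optimizer only after the parameter was replaced,
# otherwise it still references the old (discarded) weight tensor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)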

Ah okay, thanks for the explanation.

Basically I want a cross-shaped 2d convolution wherein only the cross weights are learned during training and the off-cross corner elements are fixed at 0. I thought this approach would work because, like the OP, I’m masking right before the forward pass. But perhaps I need to do something different?

Thanks

For reference, here is what I’m currently doing:

def _get_filter_mask(self):
    """This function creates the filter mask used to generate a cross-shaped convolution

    Returns:
        filter_mask (torch.Tensor): cross-shaped filter mask

    """
    idx = int((self.conv_size - 1) / 2)
    filter_mask = torch.zeros(size=[self.conv_size, self.conv_size])
    filter_mask[idx, :] = 1
    filter_mask[:, idx] = 1

    self.register_buffer('filter_mask', filter_mask)

def _mask_conv_filter(self):
    try:
        self.filter_mask
    except AttributeError:
        self._get_filter_mask()

    self.conv.weight = nn.Parameter(self.conv.weight * self.filter_mask)

def forward(self, x):
    """e2e with conv2d filter (i.e, not using line filters as in CrossFilter1D)

    Args:
        x (tensor): input tensor

    Returns:
        tensor: e2e output
    """
    pad_size = int((self.conv_size - 1) / 2)
    x_pad = torch.nn.functional.pad(x, (pad_size, pad_size, pad_size, pad_size))
    self._mask_conv_filter()

    return self.conv(x_pad)

Thanks for sharing the code and describing your use case.
My approach won’t work in this case, since you are replacing the parameters with a new object in each forward pass.
Instead of doing this, you could apply the mask in-place on the parameter via:

def forward(self, x):
    ...
    with torch.no_grad():
        self.conv.weight.mul_(mask)
    # use self.conv here

Wrapping the in-place manipulation into the no_grad block hides it from Autograd, so this operation won’t be added to the computation graph.
Note that the next parameter update could create non-zero values again, which is why you would want to apply the masking in each forward pass.
Let me know, if this works for you.

Awesome, thank you! this appears to be working.

Now I’m doing this:

def _get_filter_mask(self):
    """This function creates the filter mask used to generate a cross-shaped convolution

    Returns:
        filter_mask (torch.Tensor): cross-shaped filter mask

    """
    idx = int((self.conv_size - 1) / 2)
    filter_mask = torch.zeros(size=[self.conv_size, self.conv_size])
    filter_mask[idx, :] = 1
    filter_mask[:, idx] = 1

    self.register_buffer('filter_mask', filter_mask)

def forward(self, x):
    """e2e with conv2d filter (i.e, not using line filters as in CrossFilter1D)

    Args:
        x (tensor): input tensor

    Returns:
        tensor: e2e output
    """
    try:
        self.filter_mask
    except AttributeError:
        self._get_filter_mask()

    pad_size = int((self.conv_size - 1) / 2)
    x_pad = torch.nn.functional.pad(x, (pad_size, pad_size, pad_size, pad_size))

    with torch.no_grad():
        self.conv.weight.mul_(self.filter_mask)

    return self.conv(x_pad)

Now the cross-shaped convolution weights are updating every epoch. Am I right in assuming that the off-cross elements (now 0) will still receive gradients and be updated with every backprop? As you said, this will cause them to be nudged toward slightly non-zero values, which will then be overwritten back to 0 on the next forward pass?

In other words, overwriting the off-cross elements to 0 (using filter_mask) doesn’t mean that those elements are actually being excluded from the gradient calculation, right? In the case of a 3x3 filter, this is basically like taking a 9-dimensional gradient and constraining 4 of those dimensions to start from 0 weight at every step. Is this a problem? Naively, I would’ve thought a solution where the 4 off-cross elements were just excluded from the gradient calculation altogether was more principled, but maybe that’s not possible (or I’m just wrong).

Your explanation is correct and you can also verify it using this simple example:

conv = nn.Conv2d(1, 1, 3)
optimizer = torch.optim.Adam(conv.parameters())
x = torch.randn(1, 1, 24, 24)

print(conv.weight)
# Parameter containing:
# tensor([[[[-0.1785,  0.2719, -0.1714],
#           [-0.3003,  0.1127, -0.2276],
#           [-0.0389,  0.2258,  0.2369]]]], requires_grad=True)

mask = torch.randint(0, 2, (1, 1, 3, 3))
print(mask)
# tensor([[[[1, 1, 0],
#           [0, 0, 0],
#           [1, 1, 1]]]])

with torch.no_grad():
    conv.weight.mul_(mask)
print(conv.weight)
# Parameter containing:
# tensor([[[[-0.1785,  0.2719, -0.0000],
#           [-0.0000,  0.0000, -0.0000],
#           [-0.0389,  0.2258,  0.2369]]]], requires_grad=True)

out = conv(x)
out.mean().backward()
print(conv.weight.grad)
# the masked positions still receive non-zero gradients
# tensor([[[[-0.0447, -0.0345, -0.0218],
#           [-0.0624, -0.0594, -0.0353],
#           [-0.0708, -0.0702, -0.0538]]]])

optimizer.step()
print(conv.weight)
# the update pushed the masked positions away from zero again
# Parameter containing:
# tensor([[[[-0.1775,  0.2729,  0.0010],
#           [ 0.0010,  0.0010,  0.0010],
#           [-0.0379,  0.2268,  0.2379]]]], requires_grad=True)

To entirely remove some elements from the computation graph you could try to split the tensor into a trainable parameter and a static tensor. In the forward method you could then concatenate or stack the different parts and use the functional API via F.conv2d.
This post gives you an example using a linear layer.
I’m sure there might be a more elegant approach now, but I would need to play around with it a bit more.
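
A rough, untested sketch of that idea (it scatters the 5 trainable cross values into a zero kernel instead of concatenating the parts, but the principle is the same):

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossConv2d(nn.Module):
    """3x3 conv whose corner weights are fixed at zero and excluded from training."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # flat indices of the cross positions inside a 3x3 kernel
        mask = torch.zeros(3, 3)
        mask[1, :] = 1
        mask[:, 1] = 1
        self.register_buffer('cross_idx', mask.flatten().nonzero().squeeze(1))
        # only the 5 cross weights (per in/out channel pair) are trainable
        self.cross_weight = nn.Parameter(torch.randn(out_channels, in_channels, 5) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        out_c, in_c, _ = self.cross_weight.shape
        # static zero kernel; only the cross positions get the trainable values
        kernel = x.new_zeros(out_c, in_c, 9)
        kernel[:, :, self.cross_idx] = self.cross_weight
        kernel = kernel.view(out_c, in_c, 3, 3)
        return F.conv2d(x, kernel, self.bias, padding=1)

Since only cross_weight and bias are registered as parameters, the corner positions never receive gradients or optimizer updates.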

Right, makes sense! Thanks very much.

I’ll look into the stacking approach you mentioned.

Is it likely to make a difference? Is the masking approach actually problematic or is it functionally going to be pretty much equivalent to the stacking approach?

I would assume it should be mathematically equivalent, since in the end you are using zeros in the desired indices of the conv filter, regardless of whether you are manually resetting these values or recreating your conv filter.
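
A quick numerical check of that claim (toy shapes, just comparing the two ways of building the filter):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
mask = torch.tensor([[0., 1., 0.],
                     [1., 1., 1.],
                     [0., 1., 0.]]).view(1, 1, 3, 3)
weight = torch.randn(1, 1, 3, 3)
x = torch.randn(1, 1, 8, 8)

# masking an existing filter vs. assembling it from the cross values plus fixed zeros
out_masked = F.conv2d(x, weight * mask)
out_rebuilt = F.conv2d(x, torch.where(mask.bool(), weight, torch.zeros_like(weight)))
print(torch.allclose(out_masked, out_rebuilt))  # True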

Cool, that’s what I figured. I suppose then the only potential difference is in training time, since in the masking approach I’m learning (and overwriting) 4 weights that I don’t need to learn

Thanks very much for your help!!

I’m having an error pop up suggesting that my filter_mask is not on the GPU. It appears that this is because I used the self.register_buffer approach you mentioned, which seems to have stored the mask on the CPU. Is there any way to store buffer items on the GPU, or should I just do the following in forward:

# Mask convolution, constraining it to a cross shape
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
with torch.no_grad():
    self.conv.weight.mul_(self.filter_mask.to(device))

return self.conv(x)

Registering the mask as a buffer will make sure that the model.to("cuda") call will move it to the GPU as well. If the mask is still on the CPU afterwards, could you post a minimal example showing this issue?

Yeah, I’m calling model.to("cuda"), but that does not appear to move the registered buffer along with the rest of the model.

# initialization of network
model = bcn(e2e=args.e2e, e2n=args.e2n, n2g=args.n2g, dropout=args.dropout, dim=args.n_parcels,
            cross=args.model_arch, conv_size=args.conv_size, alpha=args.alpha, binary=binary)
model.to("cuda")

My model is set up like I’ve shown above, with the mask being defined inside the model (_get_filter_mask). But when I run it I get an error saying some tensors are on cpu and others on cuda. If I force the buffered mask onto cuda as well, then it runs.

Your code is unfortunately not executable, and register_buffer works properly for me. One difference is that your mask is only registered lazily inside the first forward call, i.e. after model.to("cuda") has already run, so the buffer is created on the CPU; registering it in __init__ avoids this:

import torch
import torch.nn as nn

class MaskedConv3d(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(
            in_channels=1,
            out_channels=1,
            kernel_size=3,
        )
        self.register_buffer("mask", torch.randint(0, 2, (1, 1, 3, 3)))

model = MaskedConv3d()
print(model.conv.weight.device)
# cpu
print(model.mask.device)
# cpu

model.to("cuda")
print(model.conv.weight.device)
# cuda:0
print(model.mask.device)
# cuda:0