Permute (swap axes) issue in XNOR network

Hello,
I have a network like this:

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__();
        self.block1=nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=8, kernel_size=(1,9), stride=(1,2)),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.Conv2d(in_channels=8, out_channels=64, kernel_size=(1,5), stride=(1,2)),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size = (1,50))
        );
        self.block2 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(3,3), stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size = (1,4)),
            nn.Flatten(),
            nn.Linear(64, 10, bias=True)
        );

        self.output = nn.Sequential(
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.block1(x);
        x = x.permute((0, 2, 1, 3));
        x = self.block2(x);
        y = self.output(x);
        return y;

This works fine as a full precision network.
However, when I convert this network to an XNOR network, the gradient somehow vanishes and the model does not learn at all.
If I remove x = x.permute((0, 2, 1, 3)); and change the kernel_size of the conv layers in self.block2, it works and produces some reasonable accuracy. From this, I conclude that the problem is the permute after self.block1.

I am wondering if anyone knows what is going on here. Any idea would be great. @ptrblck, could you please help me in this regard?

I am looking forward to any help.

Kind regards,
Mohaimen

Could you explain the reason for this permutation and print the shape of x before permuting it?
I assume the permute op is related to the XNOR model, but I’m unfortunately not deeply familiar with it.

@ptrblck In my original network, I take a raw audio time series that has the shape (ch, h, w) = (64, 1, 151) just before the permute operation. After the permute, the shape is (1, 64, 151). Thus I can treat it as image data and use 2D kernels in the conv layers. The network I provided above is an example of the full precision (original) network, and it works great.
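To make the shapes concrete, this is roughly what the permute does (the batch size of 4 below is just a placeholder):

import torch

x = torch.randn(4, 64, 1, 151)   # (batch, channels, height, width), as produced by block1
x = x.permute(0, 2, 1, 3)        # move the 64 channels into the height dimension
print(x.shape)                   # torch.Size([4, 1, 64, 151])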

When I am converting it to XNOR, the blocks look like this (see the sketch after this list):
conv → bn → relu → maxpool
bn → BinConv2d → relu → maxpool
…
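A rough sketch of the two block layouts listed above (the channel sizes and pooling kernels are placeholders, and BinConv2d is the module defined further down in this post):

import torch.nn as nn

# full precision layout: conv → bn → relu → maxpool
fp_block = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=(1, 9), stride=(1, 2), bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2))
)

# binarized layout: bn → BinConv2d → relu → maxpool
bin_block = nn.Sequential(
    nn.BatchNorm2d(8),
    BinConv2d(ch_in=8, ch_out=64, kernel_size=(1, 5), stride=(1, 2)),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2))
)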
Now, my forward function looks like this:

def forward(self, x):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.clamp_(min=0.01)
        x = self.block1(x);
        x = x.permute((0, 2, 1, 3));
        x = self.block2(x);
        y = self.output[0](x);
        return y;

I have BinActive function:

class BinActive(torch.autograd.Function):
    '''
    Binarize the input activations using sign(); the backward pass is a
    straight-through estimator.
    '''
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        input = input.sign()
        return input

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # straight-through estimator: block gradients where |input| >= 1
        grad_input[input.ge(1)] = 0
        grad_input[input.le(-1)] = 0
        return grad_input


class BinConv2d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size=1, stride=1, padding=0, bias=False):
        super(BinConv2d, self).__init__()
        self.ch_in = ch_in
        self.ch_out = ch_out
        self.layer_type = 'BinConv2d'
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.bias = bias;

        self.conv = nn.Conv2d(self.ch_in, self.ch_out, kernel_size=self.kernel_size, stride=self.stride, padding=self.padding, bias=False)
        # nn.init.kaiming_normal_(self.conv.weight, nonlinearity='relu'); # kaiming with relu is equivalent to he_normal in keras

    def forward(self, x):
        x = BinActive.apply(x)
        x = self.conv(x)
        return x
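As a quick sanity check on the straight-through estimator in BinActive, a minimal example (the input values are arbitrary):

import torch

x = torch.tensor([-2.0, -0.5, 0.3, 1.5], requires_grad=True)
y = BinActive.apply(x)   # tensor([-1., -1., 1., 1.])
y.sum().backward()
print(x.grad)            # tensor([0., 1., 1., 0.]) -- gradients are blocked where |x| >= 1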

I hope this clarifies your question. Please let me know if you need any more information.

Can you clarify a few things for me? If I’m not wrong, XNOR is just a special type of convolutional network which is faster to implement, right?

@ptrblck the full precision network is producing state-of-the-art accuracy; the XNOR version is the issue. The gradient flowing through it is always zero. Now, if I remove the permute operation and just treat the data as 1D, it works and the gradient is not zero, but then it is not the same network, right?

XNOR is not faster to implement. You are basically implementing a 1-bit network: all the weight/activation values are either -1 or 1.

I am plotting the gradients. It seems that removing the conv layers after the permute and keeping only the linear layer does not make the gradient all zeros. The gradient becomes all zeros as soon as there is even a single conv layer after the permute. Does that mean the back-propagation is somehow failing to handle the permute operation? But if that were the case, how would the full precision network work? No clue so far.
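For reference, I am checking the per-layer gradients roughly like this (model and loss stand for my model instance and training loss):

# after the backward pass, print a summary statistic of each parameter's gradient
loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.abs().mean().item())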

To isolate the issue a bit more: the first block after the permute is BatchNorm → BinActive → Conv2d → ReLU. All the elements of the output tensor of this Conv2d are zeros. I have checked the output of BinActive; the values are not all -1s, there are 1s as well. I have initialised the conv using self.conv.apply(lambda m: nn.init.kaiming_normal_(m.weight, nonlinearity='relu')); which ensures that the initial weights are not zeros.

Any idea what could be going wrong? This only happens for the binary conv after the permute, not before it.
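Roughly, I am inspecting the intermediate outputs with a forward hook like this (model.block2[2] is just a placeholder index for the conv after the permute; adjust it to the actual position in the Sequential):

def print_stats(name):
    def hook(module, inputs, output):
        # print the magnitude of this layer's output during the forward pass
        print(name, 'abs max:', output.abs().max().item(), 'abs mean:', output.abs().mean().item())
    return hook

handle = model.block2[2].register_forward_hook(print_stats('conv_after_permute'))
# run a forward pass here, then remove the hook with handle.remove()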

Could you post a minimal code snippet using the aforementioned block of layers, which would reproduce this issue, please?

@ptrblck I have created a simple project that reproduces the problem. You can run main.py or the notebook main.ipynb. A real data file is included in the zip as well; all you need to do is unzip it and run main.
https://drive.google.com/file/d/1-ChQ41zKnpjVbWegUwOMEzzOTlsmSXtk/view?usp=sharing

Please let me know if you have any questions.

Thanks for the code. I’m unfortunately unsure how to reproduce the “working” behavior without the permutation, as the code doesn’t provide any switch, and removing the op yields shape mismatches in the batchnorm layer and, after fixing those, an activation which is too small for the model.

@ptrblck this should be the model without permute.

class SimpleBinNetNoPermute(nn.Module):
    def __init__(self):
        super(SimpleBinNetNoPermute, self).__init__()

        conv1 = self.make_conv2d_layer(1, 8, (1, 9), (1, 2))
        conv2 = self.make_binconv2d_layer(8, 64, (1, 5), (1, 2))
        conv3 = self.make_binconv2d_layer(64, 32, (1,3), padding=1)

        sfebs = conv1 + conv2 + [nn.MaxPool2d(kernel_size=(1, 50))]
        self.sfeb = nn.Sequential(
            *sfebs
        )

        tfebs = conv3 + [nn.MaxPool2d(kernel_size = (1,2))]
        self.tfeb = nn.Sequential(
            *tfebs
        )

        fcn = nn.Linear(7200, 10, bias=True)
        fcn.apply(lambda m: nn.init.kaiming_normal_(m.weight, nonlinearity='sigmoid'))
        self.output = nn.Sequential(
            nn.Flatten(),
            fcn,
            nn.Softmax(dim=1)
        )

        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.zero_().add_(1.0)

    def forward(self, x):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                if hasattr(m.weight, 'data'):
                    m.weight.data.clamp_(min=0.01)
        x = self.sfeb(x)
        x = self.tfeb(x)
        print(x.shape)
        y = self.output(x)
        return y

    def make_conv2d_layer(self, in_ch, out_ch, k_size, stride=(1,1), padding=0):
        conv = nn.Conv2d(in_channels=in_ch, out_channels=out_ch, kernel_size=k_size, stride=stride, padding=padding, bias=False)
        conv.apply(lambda m: nn.init.kaiming_normal_(m.weight, nonlinearity='relu'))
        return [
            conv,
            nn.BatchNorm2d(out_ch),
            nn.ReLU()
        ]

    def make_binconv2d_layer(self, in_ch, out_ch, k_size, stride=(1,1), padding=0):
        return [
            nn.BatchNorm2d(in_ch),
            BinConv2d(ch_in=in_ch, ch_out=out_ch, kernel_size=k_size, stride=stride, padding=padding),
            nn.ReLU()
        ]
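For reference, a quick shape probe for this model (the input length below is only a guess that happens to make the flattened size come out to 7200 here; the real length comes from the data):

import torch

model = SimpleBinNetNoPermute()
x = torch.randn(2, 1, 1, 30225)      # placeholder batch of 2, guessed input length
with torch.no_grad():
    feats = model.tfeb(model.sfeb(x))
print(feats.shape)                   # torch.Size([2, 32, 3, 75])
print(feats.flatten(1).shape)        # torch.Size([2, 7200]) -- matches nn.Linear(7200, 10)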

I have provided a switch in the zip folder now: you can pass True or False to trainer.train(permute=True/False). Could you please download the zip file again?

If I’m not wrong, you want x to change from (None, 64, 1, 151) to (None, 1, 64, 151), right? I am not quite sure how PyTorch would compute the backward gradient graph after one changes the data dimensions.
Back-propagation is calculated using the chain rule, and it probably will not work as intended in case of a change in dimensions. I am of course not certain, but you could think in this direction for a solution to your problem.

This is not the problem. Backprop works with the loss and the existing weight tensors of the conv layers in this case; it does not depend on the input dimensions, the data dimensions of the batchnorm, the activations, and so on. This network already works as expected as a full precision network.

@ptrblck it would be great to know if you have any findings.

Okay… If it works normally as a full precision network, then I really don’t know what is going wrong. You must’ve tried different hyperparameters for training too, right? This is interesting, but I’m punching above my weight class here, so I’m out :sweat_smile:

@ptrblck I am just experimenting with a small input length to see what is actually going on, and I am seeing this peculiar behaviour from the binary conv layer.
For example:
The kernels are: [[[[-0.4636, 0.6064]], [[-1.1505, -0.9865]]], [[[-0.1688, -0.3571]], [[-1.7502, -0.6587]]]]
Inputs to the layer: [[[[0., 0., 1., 1.]], [[0., 0., 1., 1.]]]]
The output should be: [[[[0., -0.3801, -1.9952]], [[0., -1.0158, -2.9348]]]]
However, it produces the output: [[[[0., 0., 0.]], [[0., 0., 0.]]]]
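Double checking the hand computation with F.conv2d and the printed kernels (the third value differs slightly from my number above only because the kernels are printed with rounding):

import torch
import torch.nn.functional as F

w = torch.tensor([[[[-0.4636, 0.6064]], [[-1.1505, -0.9865]]],
                  [[[-0.1688, -0.3571]], [[-1.7502, -0.6587]]]])   # shape (2, 2, 1, 2)
x = torch.tensor([[[[0., 0., 1., 1.]], [[0., 0., 1., 1.]]]])        # shape (1, 2, 1, 4)
print(F.conv2d(x, w))
# tensor([[[[ 0.0000, -0.3801, -1.9942]],
#          [[ 0.0000, -1.0158, -2.9348]]]])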

I have no clue what is going on.

This does not seem to be good news: no one is putting their thoughts here. Probably this is not a known issue, or it might be a case-specific one.

So, I found the problem and solved it. Before the permute the tensor shape is [64, 32, 1, 151]; after the permute it becomes [64, 1, 32, 151], so the conv layer after the permute has only a single input channel. The mean-centring of the weights then fails: with a single input channel, the per-filter mean over the channel dimension is the weight itself, so subtracting it zeroes the weights, which explains the all-zero conv outputs. I am skipping the mean-centring for this particular layer, and that solves the issue.
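A minimal demonstration of why this happens, assuming the training code mean-centres each filter over the input-channel dimension of the weight tensor (dim 1), as in common XNOR-Net PyTorch implementations:

import torch

w = torch.randn(32, 1, 3, 3)                 # (out_ch, in_ch=1, kH, kW)
centred = w - w.mean(dim=1, keepdim=True)    # the mean over a size-1 dimension is w itself
print(centred.abs().max())                   # tensor(0.) -- every centred weight is zero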