Expected stride to be a single integer value or a list of N values to match the convolution dimensions, but got stride=[2,2]

I’m converting a list of 3d numpy arrays (from an .npz file) into a tensor like so:

data = np.load("path.npz", encoding='bytes')
data = torch.from_numpy(data['arr_0']).unsqueeze(1).float()
train_kwargs = {'data_tensor': data}

The images are 8-channel, 96x96 pixels, and currently I’m using a batch size of 64, so my tensor object has a size of
torch.Size([64, 1, 8, 95, 95]).

Here is a portion of the code I’ve been using for the encoder part of my network:

def __init__(self, z_dim=10, nc=3):
    super(VAE, self).__init__()
    self.z_dim = z_dim
    self.nc = nc
    self.encoder = nn.Sequential(
        nn.Conv2d(nc, 32, 4, 2, 1),          # B,  32, 32, 32
        nn.ReLU(True),
        nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 16, 16
        nn.ReLU(True),
        nn.Conv2d(32, 64, 4, 2, 1),          # B,  64,  8,  8
        nn.ReLU(True),
        nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  4,  4
        nn.ReLU(True),
        nn.Conv2d(64, 256, 4, 1),            # B, 256,  1,  1
        nn.ReLU(True),
        View((-1, 256*1*1)),                 # B, 256
        nn.Linear(256, z_dim*2),             # B, z_dim*2
    )
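
For reference, the spatial size after each of these conv layers can be sanity-checked with the standard Conv2d output-size formula; a minimal sketch, assuming square inputs:

def conv2d_out(size, kernel, stride=1, padding=0):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

size = 64
for kernel, stride, padding in [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]:
    size = conv2d_out(size, kernel, stride, padding)
    print(size)    # 32, 16, 8, 4, 1 for a 64x64 input (48, 24, 12, 6, 3 for 96x96)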

When I start training, I call my network to return mu and logvar for reparameterizing at every iteration, like so:

x_recon, mu, logvar = self.net(x)

When I do this, I’m getting the above error. Changing nc to 8 results in the same error. I want to keep these images as 8 channels for now, and continue to do 2d convolutions on them. Is there an easy way to do this?
I’m considering doing 3d convolutions later on, and perhaps you could also answer how I would input that image as a 3-dimensional image, as opposed to an 8-channel image?
Thanks in advance.

Your input should have the dimensions [batch_size, channels, height, width].
Just remove the additional 1 and your model should run.

x = x.squeeze(1)
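
For instance, just to illustrate the shape change with a dummy tensor of the size described above:

import torch

x = torch.randn(64, 1, 8, 95, 95)
x = x.squeeze(1)    # removes the singleton dimension at index 1
print(x.shape)      # torch.Size([64, 8, 95, 95])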

Oh cool, totally missed that! So @ptrblck, what about inputting a 3d image for 3D convolutions? Do I treat the z dimension as a channel?

For 3d convolutions your input should have the dimensions [batch_size, channels, depth, height, width].
The convolution will be applied on the channels like in the two dimensional case.
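
A minimal sketch of that setup, assuming you treat the 8 frames as the depth dimension with a single channel:

import torch
import torch.nn as nn

x = torch.randn(64, 1, 8, 95, 95)   # [batch_size, channels, depth, height, width]
conv = nn.Conv3d(1, 32, kernel_size=3, stride=1, padding=1)
print(conv(x).shape)                # torch.Size([64, 32, 8, 95, 95])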

I think it depends on your use case, what z exactly means. Are you working with medical images?

Well not exactly, just time-series frames. I know this is asking a lot, I’m pretty new at this, but I have one more question. My current system is running with inputs of 64 x 8 x 95 x 95, but I’m getting the error that input and target shapes do not match, with 256 x 8 x 64 x 64 being the target size.

That is the encoder to my network up top. The decoder looks similar but opposite. Like so:

self.decoder = nn.Sequential(
        nn.Linear(z_dim, 256),               # B, 256
        View((-1, 256, 1, 1)),               # B, 256,  1,  1
        nn.ReLU(True),
        nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
        nn.ReLU(True),
        nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
        nn.ReLU(True),
        nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
        nn.ReLU(True),
        nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
        nn.ReLU(True),
        nn.ConvTranspose2d(32, nc, 4, 2, 1),  # B, nc, 64, 64
    )

Is there any way you could explain why the reconstruction output is coming back in that shape? Thanks in advance.

I don’t quite understand why your batch size is larger in the target than in your input.
From the shape calculation in the encoder code, it looks like you are passing a 64x64 tensor.

What kind of criterion are you using? BCELoss?

Well this is my recon_loss function, and I’m using MSE.

def reconstruction_loss(x, x_recon, distribution):
    batch_size = x.size(0)
    assert batch_size != 0

    if distribution == 'bernoulli':
        recon_loss = F.binary_cross_entropy_with_logits(x_recon, x, size_average=False).div(batch_size)
    elif distribution == 'gaussian':
        x_recon = F.sigmoid(x_recon)
        recon_loss = F.mse_loss(x_recon, x, size_average=False).div(batch_size)
    else:
        recon_loss = None

    return recon_loss
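
A minimal call, just to illustrate that x and x_recon need matching shapes (this assumes torch.nn.functional is imported as F for the function above):

import torch

x = torch.rand(64, 8, 64, 64)          # dummy target batch
x_recon = torch.randn(64, 8, 64, 64)   # dummy reconstruction with the same shape
print(reconstruction_loss(x, x_recon, 'gaussian'))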

Gaussian is being run in this case, but it isn’t called until after x_recon is output from my autoencoder. The recon loss is obviously computed on the decoder output; during training I call:

x = Variable(cuda(x, self.use_cuda))
x_recon, mu, logvar = self.net(x)

This is based off almost exactly on: https://github.com/1Konny/Beta-VAE

Could you try to pass 64x64 images into your encoder, as this seems to be the height and width of your decoder?
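
One way to do that resize on the fly would be F.interpolate; a minimal sketch, assuming the [64, 8, 95, 95] batch from above:

import torch
import torch.nn.functional as F

x = torch.randn(64, 8, 95, 95)
x_small = F.interpolate(x, size=(64, 64), mode='bilinear', align_corners=False)
print(x_small.shape)    # torch.Size([64, 8, 64, 64])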

If I resize my image to 64X64, I get the following: invalid argument 2: size ‘[-1 X 256]’ is invalid for input with 16256 elements at /pytorch…

I’ve added some debug info to your model and it seems to work now:

class View(nn.Module):
    def __init__(self, size):
        super(View, self).__init__()
        self.size = size
    
    def forward(self, x):
        x = x.view(self.size)
        return x
    
class Print(nn.Module):
    def __init__(self):
        super(Print, self).__init__()

    def forward(self, x):
        print(x.shape)
        return x


class VAE(nn.Module):
    def __init__(self, z_dim=10, nc=3):
        super(VAE, self).__init__()
        self.z_dim = z_dim
        self.nc = nc
        self.encoder = nn.Sequential(
            nn.Conv2d(nc, 32, 4, 2, 1),          # B,  32, 32, 32
            nn.ReLU(True),
            nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 16, 16
            nn.ReLU(True),
            nn.Conv2d(32, 64, 4, 2, 1),          # B,  64,  8,  8
            nn.ReLU(True),
            nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  4,  4
            nn.ReLU(True),
            nn.Conv2d(64, 256, 4, 1),            # B, 256,  1,  1
            nn.ReLU(True),
            Print(),
            View((-1, 256*1*1)),                 # B, 256
            Print(),
            nn.Linear(256, z_dim*2),             # B, z_dim*2
        )

        self.decoder = nn.Sequential(
            nn.Linear(z_dim*2, 256),               # B, 256
            Print(),
            View((-1, 256, 1, 1)),               # B, 256,  1,  1
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
            nn.ReLU(True),
            nn.ConvTranspose2d(32, nc, 4, 2, 1),  # B, nc, 64, 64
        )

    def forward(self, x):
        x = self.encoder(x)
        print(x.shape)
        x = self.decoder(x)
        return x

model = VAE()

x = torch.randn(1, 3, 64, 64)
output = model(x)
print(output.shape)

Perfect, thanks so much for the help! I’m just curious: do you know what I would need to do differently if I were to feed in 96x96 images instead of 64x64?

The easiest way would be to add a nn.Upsample(96, mode='bilinear') layer at the end of your decoder.
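
A minimal sketch of what that layer does to the decoder output, assuming an 8-channel batch:

import torch
import torch.nn as nn

up = nn.Upsample(96, mode='bilinear')
x = torch.randn(64, 8, 64, 64)   # decoder output before upsampling
print(up(x).shape)               # torch.Size([64, 8, 96, 96])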

@ptrblck, thanks so much for all your help!


@ptrblck, in the encoder when I call View, I actually go from a tensor of [64, 256, 3, 3] to [576, 256]. From that point on, 576 is treated as the batch size, which messes up my final dimensionality, and I get the error that input and target shapes do not match, with the only difference being the batch dimension.

Do you know why the View function is doing this? And how to keep the batch size constant?
Thanks

If you call your model with an image of size 64x64, it should work.
Generally, I would use view as:

x = x.view(x.size(0), -1)

This will keep the batch size constant and put all remaining values into dim1.

Right, I’m trying to feed in 96x96 images; the only real difference I made was to upsample at the end. But I’m getting mismatch errors when I try to use the view the way you mentioned. Specifically, in the linear layer just following the view call, I get the error size mismatch, m1: [256 x 576], m2: [256 x 20] at

This is what I’m getting at every layer. Do you see what the problem is, or how I can fix it?
torch.Size([64, 32, 48, 48])
torch.Size([64, 32, 24, 24])
torch.Size([64, 64, 12, 12])
torch.Size([64, 64, 6, 6])
torch.Size([64, 256, 3, 3])
torch.Size([576, 256])
torch.Size([576, 20])
torch.Size([576, 10])
torch.Size([576, 256])
torch.Size([576, 256, 1, 1])
torch.Size([576, 64, 4, 4])
torch.Size([576, 64, 8, 8])
torch.Size([576, 32, 16, 16])
torch.Size([576, 32, 32, 32])
torch.Size([576, 8, 64, 64])
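
For what it’s worth, the jump from 64 to 576 looks like it comes from View((-1, 256)) folding the 3x3 spatial grid into the batch dimension; a quick check:

import torch

x = torch.randn(64, 256, 3, 3)        # encoder output for 96x96 inputs
print(x.view(-1, 256).shape)          # torch.Size([576, 256]); 64 * 3 * 3 = 576
print(x.view(x.size(0), -1).shape)    # torch.Size([64, 2304]); batch size preserved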

Something is still strange with your View().
I have to say I’m not a huge fan of nn.Sequential if you use a more complicated model, but this code should work:

class VAE(nn.Module):
    def __init__(self, z_dim=10, nc=3):
        super(VAE, self).__init__()
        self.z_dim = z_dim
        self.nc = nc
        self.encoder = nn.Sequential(
            nn.Conv2d(nc, 32, 4, 2, 1),          # B,  32, 32, 32
            nn.ReLU(True),
            nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 16, 16
            nn.ReLU(True),
            nn.Conv2d(32, 64, 4, 2, 1),          # B,  64,  8,  8
            nn.ReLU(True),
            nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  4,  4
            nn.ReLU(True),
            nn.Conv2d(64, 256, 4, 1),            # B, 256,  1,  1
            nn.ReLU(True),
            Print(),
            View((1, -1)),                 # B, 256
            Print(),
            nn.Linear(256*3*3, z_dim*2),             # B, z_dim*2
        )

        self.decoder = nn.Sequential(
            nn.Linear(z_dim*2, 256*3*3),               # B, 256
            Print(),
            View((1, -1, 3, 3)),               # B, 256,  1,  1
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
            nn.ReLU(True),
            nn.ConvTranspose2d(32, nc, 4, 2, 1),  # B, nc, 64, 64
        )

    def forward(self, x):
        x = self.encoder(x)
        print(x.shape)
        x = self.decoder(x)
        return x

model = VAE()

x = torch.randn(1, 3, 96, 96)
output = model(x)
print(output.shape)
> torch.Size([1, 3, 96, 96])

Man, thank you for all your help and quick responses. But it’s still not working. I think it’s worth mentioning that I’m using a channel size of 8. Even if I change the linear layer to nn.Linear(256*8*8, z_dim*2), it gives me: size mismatch, m1: [1 x 147456], m2: [16384 x 20]

The spatial dimensions shouldn’t change if you use more or fewer channels. Could you post your code so that I could have a look?
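
A quick sketch to illustrate that, using the first conv layer of the encoder as an example:

import torch
import torch.nn as nn

for in_channels in (3, 8):
    conv = nn.Conv2d(in_channels, 32, 4, 2, 1)
    out = conv(torch.randn(1, in_channels, 96, 96))
    print(out.shape)    # torch.Size([1, 32, 48, 48]) for both channel counts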

Ok good to know. And here is the current state of my code:

import torch
import torch.nn as nn
import torch.nn.init as init
from torch.autograd import Variable
from torchvision import models
from torchsummary import summary

def reparametrize(mu, logvar):
    std = logvar.div(2).exp()
    eps = Variable(std.data.new(std.size()).normal_())
    return mu + std*eps


class View(nn.Module):
    def __init__(self, size):
        super(View, self).__init__()
        self.size = size

    def forward(self, tensor):
        return tensor.view(self.size)


class Print(nn.Module):
    def __init__(self):
        super(Print, self).__init__()

    def forward(self, x):
        print(x.shape)
        return x


class BetaVAE_H(nn.Module):
    """Model proposed in original beta-VAE paper (Higgins et al, ICLR, 2017)."""

    def __init__(self, z_dim=10, nc=8):
        super(BetaVAE_H, self).__init__()
        self.z_dim = z_dim
        self.nc = nc
        self.encoder = nn.Sequential(
            nn.Conv2d(nc, 32, 4, 2, 1),          # B,  32, 48, 48
            Print(),
            nn.ReLU(True),
            nn.Conv2d(32, 32, 4, 2, 1),          # B,  32, 24, 24
            Print(),
            nn.ReLU(True),
            nn.Conv2d(32, 64, 4, 2, 1),          # B,  64, 12, 12
            Print(),
            nn.ReLU(True),
            nn.Conv2d(64, 64, 4, 2, 1),          # B,  64,  6,  6
            Print(),
            nn.ReLU(True),
            nn.Conv2d(64, 256, 4, 1),            # B, 256,  3,  3
            Print(),
            nn.ReLU(True),
            View((1, -1)),                       # B, 256
            Print(),
            nn.Linear(256*8*8, z_dim*2),         # B, z_dim*2
            Print(),
        )
        self.decoder = nn.Sequential(
            Print(),
            nn.Linear(z_dim, 256*3*3),           # B, 256
            Print(),
            View((1, -1, 3, 3)),                 # B, 256,  1,  1
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(256, 64, 4),      # B,  64,  4,  4
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 64, 4, 2, 1), # B,  64,  8,  8
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), # B,  32, 16, 16
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(32, 32, 4, 2, 1), # B,  32, 32, 32
            Print(),
            nn.ReLU(True),
            nn.ConvTranspose2d(32, nc, 4, 2, 1), # B,  nc, 64, 64
            Print(),
            nn.Upsample(96, mode='bilinear'),
        )

        self.weight_init()

    def weight_init(self):
        for block in self._modules:
            for m in self._modules[block]:
                kaiming_init(m)

    def forward(self, x):
        #print("x val: " + str(x.size()))
        distributions = self._encode(x)
        mu = distributions[:, :self.z_dim]
        logvar = distributions[:, self.z_dim:]
        z = reparametrize(mu, logvar)
        #print("z val: " + str(z.size()))
        x_recon = self._decode(z)

        return x_recon, mu, logvar

    def _encode(self, x):
        return self.encoder(x)

    def _decode(self, z):
        return self.decoder(z)


def kaiming_init(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        init.kaiming_normal(m.weight)
        if m.bias is not None:
            m.bias.data.fill_(0)
    elif isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
        m.weight.data.fill_(1)
        if m.bias is not None:
            m.bias.data.fill_(0)