Got different behavior when switching from training to test stage

Jacky_Liu · May 6, 2018, 9:01am

My model behaves differently in training and in testing.
The code ran fine during the training stage and produced no error messages.
However, when I switched to testing, the error below occurred.

At the testing stage, the input data has the same dimensions as in the training stage.
Why would conv2d behave differently, and how can I fix it?

Error message

Traceback (most recent call last):
  File "main.py", line 256, in <module>
    main()
  File "main.py", line 250, in main
    test()
  File "main.py", line 240, in test
    output = model(batch_x)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 111, in forward
    out_conv1 = self.conv1(c_out)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: expected stride to be a single integer value or a list of 1 values to match the convolution dimensions, but got stride=[2, 2]

Training code

def train():
    EPOCH = 20
    BATCH_SIZE = 10
    model = Net()
    model.train()
    
    writer = SummaryWriter()
    
    optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-6)
    criterion  = nn.L1Loss(size_average=False)
    
    ## Deal with data
    dset = dataset.DatasetIter('path', 
                  'path', 
                  'path')
    
    data_size = len(dset)
    loader = torch.utils.data.DataLoader(
        dset, 
        batch_size=BATCH_SIZE,
        shuffle=True,
        num_workers=4)


    for k in range(EPOCH):
        for step, (batch_x, batch_y) in enumerate(loader):
            optimizer.zero_grad()
            output = model(batch_x)
            
            loss = criterion(output, batch_y)
            
            loss.backward()
            optimizer.step()
            
            writer.add_scalar('loss', loss.data, k*(data_size/BATCH_SIZE) + step) #tensorboard scalar
            
            printProcess(block, data_size, step, k, EPOCH, loss, BATCH_SIZE)
        
        ## Save checkpoint for each epoch
        check_str = 'checkpoint_{}.pt'.format(k)
        torch.save(model.state_dict(), check_str)
    ## Save final model
    torch.save(model.state_dict(), 'path.pt')
    writer.export_scalars_to_json("./all_scalars.json")
    writer.close()

conv

def conv(batchNorm, in_planes, out_planes, kernel_size=3, stride=1):
    if batchNorm:
        return nn.Sequential(
            nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.LeakyReLU(0.1,inplace=True)
        )
    else:
        return nn.Sequential(
            nn.Conv2d(in_planes, out_planes, kernel_size=kernel_size, stride=stride, padding=(kernel_size-1)//2, bias=True),
            nn.LeakyReLU(0.1,inplace=True)
        )

Testing code

def test():
    BATCH_SIZE = 1
    
    checkpoint_pytorch = 'path'
    if os.path.isfile(checkpoint_pytorch):
        checkpoint = torch.load(checkpoint_pytorch,
            map_location=lambda storage, loc: storage.cuda(0))
    else:
        print('No checkpoint')
        return  # without a checkpoint there is nothing to load or test

    model = Net()
    model.load_state_dict(checkpoint)  
    model.cuda()
    model.eval()

    dset = dataset.DatasetIter('path', 
                'path', 
                'path')

    loader = torch.utils.data.DataLoader(
        dset, 
        batch_size=BATCH_SIZE,
        shuffle=False,
        num_workers=1)
    err = 0
    ans = []

    for step, (batch_x, batch_y) in enumerate(loader):
        print('batch_x', batch_x.shape)
        output = model(batch_x.cuda())  # the model was moved to the GPU above, so the input must be too
        output = output.data.cpu().numpy()
        ans.append(output[0])
        print(output[0])

pisymbol · May 6, 2018, 10:03am

I am actually running into a very similar issue myself. I get that exact runtime exception if I simply pass an image Tensor of the right dimensions to my model directly during inference.

Not sure what I’m doing wrong either…

Jacky_Liu · May 6, 2018, 10:37am

After I removed model.eval(), all the error messages were gone.

This looks like a bug in PyTorch 0.4.0.

Setting model.train(False) also raises the same error message.
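
For context, model.eval() and model.train(False) flip the same switch, which is consistent with both calls leading to the same error. Below is a minimal sketch of that equivalence. The small `block` is a hypothetical stand-in built like the conv helper above, not the Net from this thread, and the input size is an arbitrary assumption.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one conv block; this is NOT the Net from the thread.
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(8),
    nn.LeakyReLU(0.1, inplace=True),
)

# eval() sets training=False on the module and all of its submodules.
block.eval()
flags_after_eval = [m.training for m in block.modules()]

# train(False) does the same thing.
block.train()
block.train(False)
flags_after_train_false = [m.training for m in block.modules()]

# The mode changes how BatchNorm normalizes, not the shapes a layer accepts.
x = torch.randn(2, 3, 16, 16)
block.train()
out_train = block(x)   # BatchNorm uses the statistics of this batch
block.eval()
out_eval = block(x)    # BatchNorm uses its running statistics

print(flags_after_eval == flags_after_train_false)
print(out_train.shape, out_eval.shape)
```

In this sketch, switching modes changes only the values BatchNorm produces. If a model errors in one mode and not the other, it may be worth checking whether its forward() branches on self.training.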

pisymbol · May 6, 2018, 11:13am

I wasn’t using model.eval() so I’m still stuck. No idea what I could be doing wrong.

ptrblck · May 6, 2018, 11:21am

What is the shape of batch_x before passing it to the model?
Did you make sure to add a batch dimension, so that it’s: [1, channels, height, width]?
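
A minimal sketch of adding the batch dimension to a single image before calling a conv layer. The layer and the image size here are assumptions chosen for illustration, not taken from either model in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical layer; the channel counts and stride are illustrative only.
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

image = torch.randn(3, 54, 96)   # a single image: [channels, height, width]
batch = image.unsqueeze(0)       # add the batch dimension: [1, 3, 54, 96]

out = conv(batch)
print(batch.shape, out.shape)
```

unsqueeze(0) returns a new view with a leading dimension of size 1, so the original image tensor is left unchanged.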

Jacky_Liu · May 6, 2018, 2:46pm

The shape of batch_x is torch.Size([1, 2, 3, 540, 960]) at the testing stage.

I don’t think it’s a dimension problem in my case, since all the error messages go away after I comment out model.eval().

ptrblck · May 6, 2018, 3:21pm

That shape is wrong for an nn.Conv2d layer, which expects a 4-dimensional input of [batch, channels, height, width].
Do you have 2 images? If so, you could remove dimension 0 with batch_x.squeeze_(0).

Could you post your model architecture so that I can check whether this is an internal bug?
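
A minimal sketch of the shape issue, using a small stand-in for the reported [1, 2, 3, 540, 960] tensor. The layer is hypothetical, and option 2 is only an assumption about how an image pair could be fed to a 2D convolution. How the actual Net handles the pair is not shown in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical layer; the channel counts and stride are illustrative only.
conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

# Small stand-in for the reported shape [1, 2, 3, 540, 960].
batch_x = torch.randn(1, 2, 3, 54, 96)


def conv_raises(layer, x):
    """Return True if calling layer(x) raises a RuntimeError."""
    try:
        layer(x)
    except RuntimeError:
        return True
    return False


# A 5-dimensional input is not a valid input for nn.Conv2d.
rejected = conv_raises(conv, batch_x)

# Option 1: drop the leading dimension of size 1, as suggested above.
pair = batch_x.squeeze(0)                # [2, 3, 54, 96]
out_pair = conv(pair)

# Option 2 (assumption): fold the image-pair dimension into the batch dimension.
b, n, c, h, w = batch_x.shape
folded = batch_x.view(b * n, c, h, w)    # [2, 3, 54, 96]
out_folded = conv(folded)

print(rejected, out_pair.shape, out_folded.shape)
```

squeeze(0) only removes the dimension when its size is 1, so it would leave a training batch of 10 unchanged. The view in option 2 works for any batch size.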

Jacky_Liu · May 6, 2018, 6:06pm

Thanks!
I do have 2 images for my network.
I made a Docker image to save you the time of building the environment.

docker pull jjkka132/convbug
cd /notebooks/convBug
python3 main.py

Jacky_Liu · May 8, 2018, 5:49am

@ptrblck Could you reproduce the error?