CUDNN_STATUS_NOT_SUPPORTED with large batch_size when using BatchNorm

liangstein · September 4, 2017, 3:36am

Hi everyone,
I'm using PyTorch 0.2 compiled with cuDNN 6.0.21. When the network contains batchnorm, batch_size can’t exceed 140000; otherwise it fails with:
"RuntimeError: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input."
Without batchnorm, batch_size can be as large as GPU memory allows. Does this mean that batchnorm doesn’t work with a very large batch_size, or is it a bug?

smth · September 4, 2017, 4:37am

Hi Xiao,

It seems like a technical bug that I can fix.
Is there a small script you can provide to reproduce this?

zonemercy · September 30, 2017, 8:07pm

Hi smth,

I have a similar problem: batch_size can’t exceed 140000, otherwise I get the same error that Xiao reported.

PyTorch:0.2.0_3, CUDNN VERSION:6021

If I set torch.backends.cudnn.enabled=False, there is no error.
If I use nn.BatchNorm1d(1, affine=False), there is also no error.

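import torch
import torch.nn as nn
from torch.autograd import Variable  # Variable is used below (PyTorch 0.2 API)

torch.backends.cudnn.enabled = True
# Input is explicitly contiguous, so the hint in the error message does not apply
x = Variable(torch.rand(140000, 1).contiguous()).cuda()
print(torch.backends.cudnn.version())
bn = nn.BatchNorm1d(1)
bn.cuda()
xbn = bn(x)  # raises CUDNN_STATUS_NOT_SUPPORTED
xbn.size()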

BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True)
-----------------------------------------------------
RuntimeError        Traceback (most recent call last)
<ipython-input-55-d78b6a19b222> in <module>()
     10 bn.cuda()
     11 
---> 12 xbn = bn(x)
     13 xbn.size()

/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.pyc in __call__(self, *input, **kwargs)
    222         for hook in self._forward_pre_hooks.values():
    223             hook(self, input)
--> 224         result = self.forward(*input, **kwargs)
    225         for hook in self._forward_hooks.values():
    226             hook_result = hook(self, input, result)

/usr/local/lib/python2.7/dist-packages/torch/nn/modules/batchnorm.pyc in forward(self, input)
     35         return F.batch_norm(
     36             input, self.running_mean, self.running_var, self.weight, self.bias,
---> 37             self.training, self.momentum, self.eps)
     38 
     39     def __repr__(self):

/usr/local/lib/python2.7/dist-packages/torch/nn/functional.pyc in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
    637                training=False, momentum=0.1, eps=1e-5):
    638     f = torch._C._functions.BatchNorm(running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled)
--> 639     return f(input, weight, bias)
    640 
    641 

RuntimeError: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
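
A minimal sketch of scoping the cuDNN workaround mentioned above (setting the enabled flag to False), so that the flag is restored even if the forward pass raises. `temporarily_set` and `fake_cudnn` are hypothetical names for illustration, not PyTorch API; the stand-in object lets the sketch run without a GPU or PyTorch installed.

```python
import contextlib
import types


@contextlib.contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the old value."""
    previous = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and on exceptions, so the flag never stays flipped.
        setattr(obj, name, previous)


# Stand-in for torch.backends.cudnn (assumption: only the `enabled` attribute matters here).
fake_cudnn = types.SimpleNamespace(enabled=True)

with temporarily_set(fake_cudnn, "enabled", False):
    print(fake_cudnn.enabled)  # prints False: cuDNN would be bypassed in this block

print(fake_cudnn.enabled)  # prints True: the original setting is back
```

With PyTorch installed, one would presumably pass torch.backends.cudnn in place of the stand-in and call bn(x) inside the with-block, assuming the enabled attribute can be read and assigned as in the script above. This only limits how long cuDNN stays disabled; it does not fix the underlying bug.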

smth · September 30, 2017, 8:41pm

Thank you, I’ve sent a fix in https://github.com/pytorch/pytorch/pull/2919
It will be part of the next release.