The input to the network is expected to be in BCHW form, i.e. a 4-dimensional tensor, where the first dimension is the batch size, the second is the number of image channels (3 for color, 1 for grayscale), the third is the image height, and the fourth is the image width.
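For reference, a correctly shaped batch looks something like this (the batch size here is just illustrative):

```python
import torch

# A batch of 8 RGB images at CIFAR-10 resolution, in BCHW order:
# (batch, channels, height, width)
x = torch.randn(8, 3, 32, 32)
print(x.shape)  # torch.Size([8, 3, 32, 32])
```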
Your input is 2048x1x1 according to your error message. So PyTorch thinks the last two dimensions are height and width, i.e. that you have a 1-pixel image. And if you try to do 2x2 pooling on a single pixel, you get the error you see (you need at least 4 pixels in a 2x2 grid).
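Here is a minimal reproduction of that kind of failure (using MaxPool2d as a stand-in; a fixed-size AvgPool2d behaves the same way):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)   # a 2x2 pooling window
tiny = torch.randn(1, 2048, 1, 1)    # a "1-pixel image" with 2048 channels

# Raises a RuntimeError: the calculated output size would be 2048x0x0,
# because a single pixel cannot fill a 2x2 pooling window.
pool(tiny)
```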
I suspect you have an error in the way you transform images into your input tensor. Are you using torchvision.datasets?
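If so, a quick way to sanity-check the shapes coming out of the DataLoader (the path and batch size here are placeholders):

```python
import torch
from torchvision import datasets, transforms

# ToTensor() turns each PIL image into a CxHxW float tensor (3x32x32 for CIFAR-10);
# the DataLoader then stacks those into a BCHW batch.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=8)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([8, 3, 32, 32])
```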
How big are CIFAR-10 images? I think they’re only 32x32, right? It’s possible that you are using a network that is “too deep” for these images, because it tries to do too much pooling / downsampling.
Given the error you saw, I would double-check that (1) your input tensors really are BCHW, and (2) your input tensors have enough height and width to survive all the downsampling in your network.
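One way to check (2) is to print the output shape of every layer with forward hooks. This is just a sketch of a helper, not something from your code; the trace will stop at whichever layer fails:

```python
import torch
import torch.nn as nn

def report_shapes(model, input_size=(1, 3, 32, 32)):
    """Print the output shape of every leaf module, so you can see
    where the feature map shrinks to 1x1."""
    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: print(name, tuple(out.shape))))
    try:
        with torch.no_grad():
            model(torch.zeros(input_size))
    finally:
        for h in hooks:
            h.remove()
```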
I think you are right. CIFAR-10 images are 32x32, so your argument is reasonable.
So I will try removing the AvgPool layer, so that the last fc layer receives the 2048x1x1 feature map directly (i.e. 2048 features after flattening).
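I don’t know your exact model, but as a sketch with a torchvision-style ResNet (assuming the pooling layer is exposed as model.avgpool, as in torchvision’s implementation), that change might look like:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(num_classes=10)

# Replace the average-pooling layer with a no-op. With 32x32 inputs, the feature
# map reaching this point is already 2048x1x1, so flattening it still gives the
# 2048 features the final fc layer expects.
model.avgpool = nn.Identity()

out = model(torch.randn(8, 3, 32, 32))
print(out.shape)  # torch.Size([8, 10])
```

An alternative that keeps the pooling step is to swap the fixed-size pool for nn.AdaptiveAvgPool2d((1, 1)), which handles small feature maps gracefully.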