Confused by CNN outputs

I’m trying to get my head around Conv2d. Here are two bits of code I’ve seen, from the MNIST and CIFAR10 examples in PyTorch.

The MNIST one has a 1-channel input of a 28x28 image and produces 10 outputs, but the CIFAR10 one takes in 3 channels of a larger 32x32 image and produces only 6 outputs, even though they both look to have the same kernel size of 5. I’m clearly missing something fundamental here!

MNIST
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1 input channel -> 10 output channels
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10 input channels -> 20 output channels
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)                  # 320 = 20 channels * 4 * 4 spatial
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
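
Tracing the shapes through this net shows where the 320 comes from (a quick sketch, assuming a dummy 1 x 28 x 28 input):

import torch

net = Net()
x = torch.randn(1, 1, 28, 28)  # one dummy MNIST-shaped image
# conv1 (kernel 5): 28 -> 24, max_pool2d(2): 24 -> 12
# conv2 (kernel 5): 12 -> 8,  max_pool2d(2): 8 -> 4
# flattened size: 20 channels * 4 * 4 = 320
print(net(x).shape)  # torch.Size([1, 10])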

CIFAR10
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)        # 3 input channels -> 6 output channels, kernel_size=5
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)       # 6 input channels -> 16 output channels
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 400 = 16 channels * 5 * 5 spatial
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
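
The same shape trace for this net (a sketch, assuming a dummy 3 x 32 x 32 input):

import torch

net = Net()
x = torch.randn(1, 3, 32, 32)  # one dummy CIFAR10-shaped image
# conv1 (kernel 5): 32 -> 28, pool(2, 2): 28 -> 14
# conv2 (kernel 5): 14 -> 10, pool(2, 2): 10 -> 5
# flattened size: 16 channels * 5 * 5 = 400
print(net(x).shape)  # torch.Size([1, 10])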

class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)

Without double-checking the actual parameters to Conv2d, I assume the first two parameters are not input/output sizes but numbers of channels.

The number of channels is the depth of a stack of images. Like, if you have r/g/b channels, you’d have three images on top of each other, forming a stack of 3 images. That’s 3 channels. But you can, and normally do, have more than 3 channels. Typically, as you go through the CNN, each layer will make the stack deeper (more images per stack), but the width/height of the stack of images will gradually decrease.

i.e., the input to a CNN is not 3-dimensional but 4-dimensional:

  • batch size
  • number of channels (depth of each stack of images)
  • height of image
  • width of image

Think of each batch of inputs as being a bunch of cubes. Each example is a cube of num_channels * image_height * image_width.
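
A minimal sketch of this, using a made-up batch of random data (the 8 and the 32x32 here are just example values):

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)  # batch of 8 RGB images, 32x32: (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
y = conv(x)
print(y.shape)  # torch.Size([8, 6, 28, 28]): the stack got deeper (3 -> 6), the images smaller (32 -> 28)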


Yes, the first two parameters are the input and output channel counts:

class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)

I’m looking at the first layer in each example:

mnist:   self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
cifar10: self.conv1 = nn.Conv2d(3, 6, 5)  # I assume the 5 here is the same as kernel_size=5 in the line above

So input channels of 1 for MNIST and 3 for CIFAR10 make sense.

But how are the outputs 10 for MNIST but only 6 for CIFAR10? These are the numbers I cannot understand.

In my head, MNIST’s 1 x 28 x 28 with a kernel of 5 produces 1 x 24 x 24,
and CIFAR10’s 3 x 32 x 32 with a kernel of 5 produces 3 x 28 x 28.

It’s the concept of how the images get ‘deeper’ that I’m missing.

(Thinking of things as cubes when it’s ‘Conv2d’ is a little confusing. Why not ‘Conv3d’?)

Oh, hang on. Am I just making this more complicated than it is?!

Are the 6 and 10 just arbitrary choices, so we’ll create that many clones of the smaller image…

Well, they’re not really clones, but yeah, you can make the output channel count any number you want (within the bounds of available memory, etc.).
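
A quick sketch to confirm this, feeding dummy MNIST- and CIFAR10-shaped inputs through the first layers above (the 64 is just an arbitrary alternative choice):

import torch
import torch.nn as nn

mnist_x = torch.randn(1, 1, 28, 28)  # one fake 1-channel 28x28 image
cifar_x = torch.randn(1, 3, 32, 32)  # one fake 3-channel 32x32 image

print(nn.Conv2d(1, 10, kernel_size=5)(mnist_x).shape)  # torch.Size([1, 10, 24, 24])
print(nn.Conv2d(3, 6, 5)(cifar_x).shape)               # torch.Size([1, 6, 28, 28])

# The 10 and 6 are free choices: each output channel gets its own learned 5x5 filter.
print(nn.Conv2d(3, 64, 5)(cifar_x).shape)              # torch.Size([1, 64, 28, 28]) works just as well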

Great, thanks! I was thinking it had been calculated as part of the convolutions. D’oh 🙂