Question about convolution of multiple channels

Recently, I’ve been learning about convolution by way of grayscale images like MNIST. The idea of convolution made pretty good sense, and the code worked fine.

Then I moved on to a CNN for CIFAR-10, and it was also straightforward. I didn’t think about the differences, because the same code that worked on grayscale images worked fine on these new color images.

But then I realized that I didn’t understand something:

Suppose you have a 3-channel input to a conv layer with shape (W, H, 3), and you want 4 output channels that preserve the width and height of the input. As I understand it, the output should then have shape (W, H, 4).

Now, in my mind, I take a single 2D kernel and try to apply it to a 3D input with shape (W, H, 3). I assume that I apply that same convolution to each of the three channels, which gives me an output of (W, H, 3). In other words, my filter is a 3D stack, where each layer is an identical copy of that 2D kernel.

If I do the same thing with the other 3 kernels I end up with a collection of 4 outputs with shapes

[(W, H, 3), (W,H,3), (W, H, 3), (W, H, 3)].

The overall shape of this thing is actually (4, W, H, 3).

A contrived example:

import numpy as np

X = np.arange(60).reshape(4, 5, 3)
# Assuming that each filter returns X unchanged.
Y = np.array([X, X, X, X])
print(Y.shape)
# (4, 4, 5, 3)

So this leads to some questions:

  1. If the kernel is 3D, is each layer of the stack always identical, or can they be different?
  2. Is the final shape (4, W, H, 3) in the above example correct? Or are the results of each filter averaged across the 3 channels to get the shape (4, W, H)?
  3. If results are combined or averaged, how so exactly?
  4. What if I want to apply each filter to only a subset of the inputs to the layer?

Sorry if this is a lot. I’m trying to implement LeNet-5, and there is a middle layer where a given filter only acts on a few inputs instead of all of them. This is very confusing, so I’m trying to figure out how to make it all work.


No, this is not correct. Each kernel has the same number of channels as the input.

In our earlier discussion, Why add an extra dimension to convolution layer weights?, we agreed on the following. If the input has shape [3, h, w] and we want 10 output channels, we run 10 different filters, each of size [3, k, k]. At every position, a filter multiplies its [3, k, k] weights elementwise with the [3, k, k] input window and sums everything into a single number, so each filter produces one [1, h, w] response (assuming padding that preserves h and w). Stacking the 10 responses along the channel dimension gives [10, h, w]. As for the values: if you picture a [1, k, k] filter as a 2D Gaussian, picture a [3, k, k] filter as a 3D Gaussian.
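
To make the shapes concrete, here is a minimal NumPy sketch of the scheme described above. It is an illustration rather than how any particular framework implements it: the helper name `conv2d_multichannel` is made up, it assumes an odd kernel size with zero padding so that h and w are preserved, and it computes cross-correlation (no kernel flip), which is what deep learning libraries usually call convolution.

```python
import numpy as np

def conv2d_multichannel(x, filters):
    """Apply filters [C_out, C_in, k, k] to x [C_in, H, W]; returns [C_out, H, W].

    Hypothetical helper: assumes odd k and zero padding ("same" output size).
    """
    c_out, c_in, k, _ = filters.shape
    assert x.shape[0] == c_in, "each filter needs one slice per input channel"
    pad = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(h):
            for j in range(w):
                window = xp[:, i:i + k, j:j + k]  # shape [C_in, k, k]
                # Sum over channels AND spatial positions: one number per filter.
                out[o, i, j] = np.sum(window * filters[o])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))            # input  [3, h, w]
filters = rng.standard_normal((10, 3, 3, 3))  # 10 filters of size [3, k, k]
y = conv2d_multichannel(x, filters)
print(y.shape)  # (10, 8, 8)
```

Note that the 3 input channels never show up in the output shape: they are summed away inside each filter, so the only channel dimension left is the number of filters.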


  1. The values come from some function (or distribution), which leads to different values in each channel. They can be identical, but there is no reason for that.
  2. Based on my example, no. The filters have shape [4, 3, f, f], but the output is [4, h, w]. There is no averaging: each filter's response already has only one channel, and we simply stack the responses.
  3. See 2. The products over all 3 input channels are summed inside each filter, not averaged.
  4. I am not really sure what you mean here. Do you mean skipping some windows in the convolution, or skipping some filters?
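
If point 4 is about the LeNet-5 style layer, where each output map is connected to only some of the input maps, one possible reading is to zero out the channels of each filter that it should not see. Below is a rough sketch under that assumption. The names `conv2d_subset` and `connections` are hypothetical, the connection table is made up for illustration (it is not the table from the LeNet-5 paper), and the kernel size is assumed odd with zero padding.

```python
import numpy as np

def conv2d_subset(x, filters, connections):
    """x: [C_in, H, W]; filters: [C_out, C_in, k, k].

    connections[o] lists the input channels that filter o is allowed to see.
    Hypothetical helper: assumes odd k and zero padding ("same" output size).
    """
    c_out, c_in, k, _ = filters.shape
    # Build a 0/1 mask per (filter, input channel) pair.
    mask = np.zeros((c_out, c_in, 1, 1))
    for o, chans in enumerate(connections):
        mask[o, chans] = 1.0
    masked = filters * mask  # unused channels contribute exactly zero
    pad = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    for i in range(h):
        for j in range(w):
            window = xp[:, i:i + k, j:j + k]  # [C_in, k, k]
            out[:, i, j] = np.sum(masked * window, axis=(1, 2, 3))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 10, 10))
filters = rng.standard_normal((4, 6, 5, 5))
# Illustrative table only: each filter sees 3 contiguous input channels.
connections = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
y = conv2d_subset(x, filters, connections)
print(y.shape)  # (4, 10, 10)
```

In a framework you would typically keep the full [C_out, C_in, k, k] weight tensor and multiply it by a fixed mask like this on every forward pass, so the masked weights never affect the output.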

