Recently, I’ve been learning about convolution by way of grayscale images like MNIST. The idea of convolution made pretty good sense, and the code worked fine.

Then I moved on to trying a CNN on CIFAR-10, and it was also straightforward. I didn’t think about the differences, because the same code that worked on grayscale images worked fine on these new color images.

But then I realized that I didn’t understand something:

Suppose you have a 3-channel input to a conv layer with shape (W, H, 3), and you want 4 output channels that preserve the width and height of the input. As I understand it, the output should have shape (W, H, 4).

Now, in my mind, I take a single 2D kernel and apply it to a 3D input of shape (W, H, 3). I **assume** that I apply that same convolution to each of the three channels, which gives an output of shape (W, H, 3). In other words, my filter is a 3D stack where each layer is an identical copy of that 2D kernel.
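To make my assumption concrete, here is a sketch in plain numpy. The `conv2d_same` helper is something I wrote for illustration (a naive zero-padded cross-correlation), and the kernel values are made up:

```python
import numpy as np

def conv2d_same(img, k):
    # Naive 2D cross-correlation with zero padding ('same' output size).
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.empty_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i+kh, j:j+kw] * k)
    return out

X = np.arange(60, dtype=float).reshape(4, 5, 3)  # (H, W, 3) input
k = np.ones((3, 3)) / 9.0                        # one 2D kernel (made up)

# My assumption: apply the SAME 2D kernel to each of the 3 channels,
# then stack the per-channel results back along the channel axis.
out = np.stack([conv2d_same(X[:, :, c], k) for c in range(3)], axis=-1)
print(out.shape)  # (4, 5, 3)
```

So under this assumption, one filter maps (W, H, 3) to (W, H, 3).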

If I do the same thing with the other 3 kernels, I end up with a collection of 4 outputs with shapes

[(W, H, 3), (W, H, 3), (W, H, 3), (W, H, 3)].

The overall shape of this collection is actually (4, W, H, 3).

A contrived example:

```
import numpy as np

X = np.arange(60).reshape(4, 5, 3)
# Assuming that each filter returns X unchanged.
Y = np.array([X, X, X, X])
Y.shape
>>> (4, 4, 5, 3)
```

So this leads to some questions:

- If the kernel is 3D, is each layer of the stack always identical, or can they be different?
- Is the final shape `(4, W, H, 3)` of the above example correct? Or are the results of each filter averaged across the 3 channels to give shape (4, W, H)?
- If the results are combined or averaged, how exactly?
- What if I want to apply each filter to only a subset of the inputs to the layer?
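For the second and third questions, here is what I imagine the "collapse across channels" possibility would look like, if that is indeed what happens (the mean here is just a guess; it could equally be a sum):

```python
import numpy as np

# Hypothetical per-filter result with my assumed shape (W, H, 3).
per_filter = np.arange(60, dtype=float).reshape(4, 5, 3)

# One possibility I'm asking about: collapse the channel axis
# (by average, or maybe sum?) so each filter yields one (W, H) map.
collapsed = per_filter.mean(axis=-1)
print(collapsed.shape)  # (4, 5)

# Four such filters stacked would then give (4, W, H):
Y = np.stack([collapsed] * 4)
print(Y.shape)  # (4, 4, 5)
```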

Sorry if this is a lot. I’m trying to implement LeNet-5, and there is a middle layer where a given filter only acts on a few of the inputs instead of all of them. This is confusing, so I’m trying to figure out how to make it all work.
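For that last part, my best guess is that each filter gets only a slice of the input channels, something like this (the connection table below is made up for illustration, not the actual LeNet-5 table):

```python
import numpy as np

# Hypothetical connection table in the spirit of LeNet-5's middle layer:
# filter i only sees the listed input channels (indices are made up).
table = {0: [0, 1], 1: [1, 2], 2: [0, 2], 3: [0, 1, 2]}

X = np.arange(60, dtype=float).reshape(4, 5, 3)

# Slice out just the connected channels before convolving with filter 0:
subset_for_filter_0 = X[:, :, table[0]]
print(subset_for_filter_0.shape)  # (4, 5, 2)
```

Is that the right way to think about it?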