Hi Make!
The short answer is that doing so has been found empirically to be useful (admittedly,
not a very satisfying answer).
With the disclaimer that intuition – even that of experts – about why various neural-network
techniques work is notoriously unreliable, let me give you my thoughts:
Let’s consider a series of 3x3 convolutions (with padding, so that the spatial size of the
image doesn’t change). Let’s start with a three-channel (RGB) input, and imagine that
the number of channels is increased to, say, 8 and then 16 and then 32 and so on.
Consider a specific pixel location, more or less in the middle of the image. After the first
convolution, the pixel at that location depends on the 3x3 square of pixels from the
original image that surrounds that specific location. (All 8 channel values for that pixel
after the first convolution depend on all 3 channel values for all 9 of the surrounding
pixels in the original image.) After the second convolution, that pixel depends on the
5x5 square of surrounding pixels (again, with all channels depending on all channels).
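
To make the shapes concrete, here is a minimal sketch of such a stack (the specific channel counts and the 64x64 image size are just illustrative, not anything from your network):

```python
import torch
import torch.nn as nn

# a stack of padded 3x3 convolutions whose out_channels grow 3 -> 8 -> 16 -> 32;
# padding = 1 keeps the spatial size fixed, so only the channels dimension grows
convs = nn.Sequential(
    nn.Conv2d (3, 8, kernel_size = 3, padding = 1),    # receptive field: 3x3
    nn.Conv2d (8, 16, kernel_size = 3, padding = 1),   # receptive field: 5x5
    nn.Conv2d (16, 32, kernel_size = 3, padding = 1),  # receptive field: 7x7
)

x = torch.randn (1, 3, 64, 64)   # a (hypothetical) RGB image, batch of one
print (convs (x).shape)          # torch.Size([1, 32, 64, 64])
```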
Think of the channels as being “features” rather than colors. Those features may or
may not encode color information (depending on how the network is trained). Speaking
figuratively, if the original pixel in question is blue and one of the immediately adjacent
pixels is, say, green, one of the channels after the first convolution might “detect” this
as a “blue-green-edge” feature, and after another couple of convolutions, this may
lead to an “iris” feature that then leads to an “eye” feature.
So the series of convolutions with increasing numbers of out_channels is converting
spatial information (for which color – encoded in the three original color channels – may
be important) into features (for which color may or may not still be important).
That is, we are increasing the number of channels so that the network can process spatial
information into feature information.
Now consider a commonplace architecture where convolutions are followed by
downsampling, say a series of 3x3 convolutions interleaved with 2x2 max-pools. The
network is now dumping the spatial information into features attached to a shrinking
number of pixels, leading, potentially, to an output image that consists of a single
pixel – that is, no spatial information – with all of the features encoded in the channels
dimension being derived from the entire image.
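
Again, purely as an illustrative sketch (with a made-up 16x16 input and made-up channel counts):

```python
import torch
import torch.nn as nn

# 3x3 convolutions interleaved with 2x2 max-pools; each pool halves the
# spatial size, so the "image" shrinks to a single pixel whose channels
# summarize the entire input
net = nn.Sequential(
    nn.Conv2d (3, 8, kernel_size = 3, padding = 1),
    nn.MaxPool2d (2),                                  # 16x16 -> 8x8
    nn.Conv2d (8, 16, kernel_size = 3, padding = 1),
    nn.MaxPool2d (2),                                  # 8x8 -> 4x4
    nn.Conv2d (16, 32, kernel_size = 3, padding = 1),
    nn.MaxPool2d (2),                                  # 4x4 -> 2x2
    nn.Conv2d (32, 64, kernel_size = 3, padding = 1),
    nn.MaxPool2d (2),                                  # 2x2 -> 1x1
)

x = torch.randn (1, 3, 16, 16)
print (net (x).shape)            # torch.Size([1, 64, 1, 1]) -- no spatial information left
```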
In such a situation, one could well imagine that none of those features directly contain
any color information. (Of course, if you trained a network to distinguish red-tinged
images from blue-tinged images, you would likely have a red-vs-blue feature encoded
somewhere in the channels dimension.)
Best.
K. Frank