I was going through the chapter Subclassing nn.Module (Chapter 8, page 209) of the Deep Learning with PyTorch book when I came across the following analogy, but I am not able to fully comprehend it:

Second, in one layer, there is not a reduction of output size with regard to input
size: the initial convolution. If we consider a single output pixel as a vector of 32 elements (the channels), it is a linear transformation of 27 elements (as a convolution of
3 channels × 3 × 3 kernel size)—only a moderate increase. In ResNet, the initial convolution generates 64 channels from 147 elements (3 channels × 7 × 7 kernel size). [6]
So the first layer is exceptional in that it greatly increases the overall dimension (as in
channels times pixels) of the data flowing through it, but the mapping for each output pixel considered in isolation still has approximately as many outputs as inputs. [7]

Can someone help with the explanation?


Footnotes:
[6] The dimensions in the pixel-wise linear mapping defined by the first convolution were emphasized by Jeremy
Howard in his fast.ai course (https://www.fast.ai).

[7] Outside of and older than deep learning, projecting into high-dimensional space and then doing conceptually simpler (than linear) machine learning is commonly known as the kernel trick. The initial increase in the
number of channels could be seen as a somewhat similar phenomenon, but striking a different balance
between the cleverness of the embedding and the simplicity of the model working on the embedding.

So maybe this was overly terse in the book (entirely my fault), and thank you for asking rather than just being dissatisfied with our book.
I should caution that this, to me, is about having a useful intuition rather than postulating strict and extremely deep “this is a law of nature”-type absolute statements.

The background for this (in the paragraph before the one you quote) is that, for the other layers, conventional wisdom says one would typically reduce the mathematical dimension (i.e., the number of elements) of the activations in a classification network.
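As a toy illustration (the layer sizes here are my own, not from the book), you can track the number of activation elements through a small ResNet-like stack of layers: the first convolution is the exception that increases the element count, and the later layers shrink it again.

```python
# Toy sketch (hypothetical layer sizes): track how the number of activation
# elements changes through a small ResNet-like stem.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # 3 * 224 * 224 = 150,528 input elements
layers = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # like ResNet's first conv
    nn.MaxPool2d(3, stride=2, padding=1),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
)
for layer in layers:
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape), x.numel())
# The first conv yields 64 * 112 * 112 = 802,816 elements (an increase over
# the input), after which the element count decreases layer by layer.
```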

Now the first layer does things differently, and we may ask whether we can find an intuition for why it takes the form it does, e.g. in ResNet.
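To make the "pixel-wise linear mapping" from the quoted passage concrete, here is a small sketch (my own, not from the book) checking that a single output pixel of ResNet's first convolution is exactly a linear map from the 147-element patch (3 channels × 7 × 7) to 64 channels:

```python
# Sketch: one output pixel of a 7x7 conv over 3 channels is a 147 -> 64
# linear transformation of the flattened input patch.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=7, bias=True)
x = torch.randn(1, 3, 7, 7)          # exactly one 3x7x7 patch

out = conv(x)                        # shape (1, 64, 1, 1): one output pixel
# The same 64 values, computed as an explicit linear map on 147 elements:
W = conv.weight.reshape(64, 147)     # flatten each of the 64 filters
manual = W @ x.reshape(147) + conv.bias
print(torch.allclose(out.reshape(64), manual, atol=1e-5))  # True
```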
There are three parts (because I am splitting the footnote in two):

As mentioned in the footnote, starting the processing by embedding the image in a high-dimensional space and then working with that is a tried-and-true approach and so is using information in the “neighbourhood” of a given point. This is a parallel to kernel embeddings.

The other question is then why not use vastly more channels, say 512, right there. And there the intuition I would suggest is that if the "pointwise/patchwise" embedding is to a very high dimension, one would not expect to gain much.

Imagine having 1 input channel, using 1x1 convolutions in the beginning, and embedding to N channels. This would mean you take the scalar (1 pixel) and embed it in some N-dimensional space, and do this with all pixels. But now you just have all your data in a very sparsely populated space and have not gained any insight at all (try linearly embedding a sequence of 1-d points into 2-dimensional space and see if you find it very satisfying). So this line of thinking suggests that it does not make much sense to have more output channels than the patch size. In ResNet this is satisfied (7x7x3 = 147 → 64); in our example (27 → 32), it is not quite true that we keep the number of output channels smaller than the patch size.
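The 1x1-convolution thought experiment can be checked directly. In this sketch (mine, not the book's), a bias-free 1x1 conv embeds single-channel pixels into N dimensions, and every embedded pixel turns out to be a scalar multiple of the same weight vector, so the embedded data has rank 1 and carries no extra structure:

```python
# Sketch: linearly embedding scalar pixels into N dimensions with a 1x1 conv
# puts all pixels on one line through the origin (rank-1 data).
import torch
import torch.nn as nn

N = 32
embed = nn.Conv2d(1, N, kernel_size=1, bias=False)
x = torch.randn(1, 1, 8, 8)              # 64 scalar "pixels"

out = embed(x)                           # shape (1, N, 8, 8)
vectors = out.reshape(N, -1).T           # one N-dimensional vector per pixel
print(torch.linalg.matrix_rank(vectors).item())  # 1: no structure gained
```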

Finally, kernel machines use this high-dimensional embedding trick, and their kernels can be more elaborate than what a conv layer learns. But after the embedding, kernel machines use extremely simple (linear) classifiers, whereas in deep learning we typically have a rather rich structure afterwards. (And to me, this is a bit like a vexation image: you could answer "which part of your model is the feature extractor and which part is the classification head (and where does the classifier end and the loss start)?" in many different ways, and would get interesting parallels with other methods. Here I suggest, perhaps outside the usual convention, taking a moment to look at the net as if the first conv layer were the feature extractor and the remainder the classifier, which is then, per the footnote, fancier than the classifiers in kernel machines.)

I hope this elaboration helps clear it up a bit; do not hesitate to ask if something is still leaving something to be desired.