What is a neuron in a convolutional layer of CNN?

I realize that this is not exactly on topic, since this is a more conceptual question, but I was hoping to get some insight.

From the famous CS231n lecture notes:

Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in 2D space (along width and height), but always full along the entire depth of the input volume.

Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5x5x3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
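For reference, this count can be checked in PyTorch (a quick sketch of my own, with a single filter just for illustration):

import torch

# one 5x5 filter over a 3-channel input: the weights span the full depth
conv = torch.nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
print(conv.weight.shape)  # torch.Size([1, 3, 5, 5]) -> 5*5*3 = 75 weights
print(conv.bias.shape)    # torch.Size([1])          -> the +1 bias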

What is the neuron here? There are only kernels in a convolutional layer, so why bring neurons into this?

Moreover, what does it mean to connect each neuron to only a local region of the input volume? When we convolve, we are not looking at a single small region; we in fact go over the whole image by sliding our kernel over it.

Finally, are filter size and receptive field the same thing? The lecture notes seem to indicate they are.

Hi Ajeel!

The short story is that “neuron” is not always well defined.

  1. It’s a jargon term.

  2. The analogy between machine-learning neural networks and actual
    neurons and brains is quite weak and often stretched beyond the
    breaking point.

  3. A lot of the neural-network literature, including refereed papers,
    uses language unhelpfully loosely, so even if there were an agreed-upon
    definition for “neuron,” you wouldn’t be able to count on people using
    the term consistently with that definition.

Having said that:

Neural networks were inspired by biological neurons (but to say that they
are modelled on biological neurons would be an overstatement).

Here’s a cartoon view of a neuron (that is connected to other neurons):

It has a main cell body, a bunch of “tentacles” called dendrites that
receive signals from other neurons, and an axon that carries the signal
somewhere else that ends in some more tentacles (“axon terminals”) that
transmit the signal to the dendrites of other neurons through synapses.

The basic idea is that the connections between two neurons can have
different strengths and can be excitatory or inhibitory. Because a neuron
receives inputs from multiple other neurons, the simplified picture is that
if the excitatory inputs are stronger, in aggregate, than the inhibitory inputs,
the neuron will “fire” and may fire more or less strongly.

In PyTorch language, consider:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
    torch.nn.ReLU(),
)

Focus on the first ReLU and the second Linear (and ignore batch and
other dimensions):

For each of the two outputs of the second Linear, there are ten inputs.
You can think of the weight property of the second Linear as being the
strengths of the synaptic connections between ten input neurons and two
output neurons. Because the weighted inputs are summed in the matrix
multiplication inside of the Linear, you can think of this step as aggregating
together the signals from the input neurons, with positive weights
corresponding to excitatory synaptic connections and negative weights
corresponding to inhibitory connections. Then the bias and subsequent
non-linear activation – in this case, the second ReLU – determine whether
and how strongly the output neuron will fire.
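Continuing the snippet above, you can make this aggregation explicit by computing the second Linear and ReLU by hand (the variable names here are mine):

x = torch.randn(10)                       # activations of the ten "input neurons"
hidden = model[1](model[0](x))            # first Linear + ReLU

lin2 = model[2]                           # second Linear: weight has shape [2, 10]
# sum the weighted "synaptic" inputs, add the bias, then "fire" through ReLU
manual = torch.relu(lin2.weight @ hidden + lin2.bias)
print(torch.allclose(manual, model(x)))   # True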

Let’s try to push the biological-neuron analogy as far as we can. Where,
exactly, is the neuron? The second Linear, and therefore the second
ReLU, have (ignoring batch and other dimensions) two scalar outputs.
I like to think of a single scalar output of the ReLU as corresponding
to the axon of a neuron. The ReLU itself and the bias and the weight
of the Linear are spread in some fashion over the cell body and the
dendritic synaptic clefts of the two neurons associated with the second
Linear.

That covers the dendrites, cell body, and axon of our biological neuron.
But where, then, are the axon terminals of our neuron? I would say that
they are in the weight of some subsequent Linear layer.

So if you were to ask me what a neuron is in a neural network, the second-best
answer I could give would be to say that a single output of a non-linear
activation layer is a neuron – or, more precisely, is the axon of a neuron
(with the rest of the neuron being distributed over other pieces of the
network, as outlined above). (My best answer would be to say don’t take
the biological-neuron analogy too seriously.)

Coming back to your original question, where in a convolutional layer
is a neuron? This gets dicier still.

Consider early biological image processing: Illustratively speaking, in
the retina (or perhaps early in the visual cortex), there is some cluster
of neurons in the lower-left corner of the visual field that has (perhaps)
connections that implement an edge detector, and a similar “edge-detector”
cluster likely exists in the upper-right of the visual field, as well.

In a neural network, the upstream convolutions are thought to “learn” such
low-level features, so one imagines that some upstream convolution learns
edge-detection weights. This one convolutional kernel would perform edge
detection for the entire input image, including (among other locations)
the lower left and upper right.

Thus, the difference between the convolutional layer and the biological
system is that there is only one convolution kernel that gets slid over
(convolved with) the input image, while the retina has visual-processing
neurons across its entire back surface, and different sets of such neurons
perform early visual processing (e.g., edge detection) for different parts of
the visual field (e.g., the lower left vs. the upper right).
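To make that concrete, here is a sketch (the kernel values are just an illustrative Sobel-style edge detector, not learned weights):

import torch

# a single 3x3 vertical-edge kernel, in conv2d's [out, in, H, W] layout
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

image = torch.randn(1, 1, 32, 32)   # a one-channel toy "image"

# the same weights detect edges in the lower left, the upper right, and
# everywhere else -- one kernel, slid over all spatial locations
edges = torch.nn.functional.conv2d(image, kernel)
print(edges.shape)                  # torch.Size([1, 1, 30, 30])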

So if I had to identify a “neuron” in a convolutional layer, I would note that
the output of the convolutional layer (including its non-linear activation) is
a whole “image,” not just a kernel-sized patch, and then say that each scalar
element of this “image” is the axon of a neuron.

So far, so good. But where are that neuron’s dendrites? The best I can say
is that the convolution kernel (in part) represents a bunch of virtual sets of
dendrites, one set for each axon, as the kernel is slid over the input image.

Based on my discussion above, if we say that (an axon of) a neuron is a
single element of the output “image” tensor of the convolutional layer
(including its associated non-linear activation), then yes, that neuron only
depends on a local region of the input image.

As you slide the convolution kernel along the input image, it’s only one
position of the kernel – and hence only one local region of the input
image – that contributes to a particular single element of the output
“image” tensor.

Yes, the receptive field of a “neuron” (as I define it above) is just that
region of the input image that contributes to the single element of the
output tensor that I am calling the neuron.
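One quick way to see this locality (again just a sketch; the sizes are arbitrary):

import torch

conv = torch.nn.Conv2d(1, 1, kernel_size=3, bias=False)
x = torch.randn(1, 1, 8, 8)
y0 = conv(x)

# perturb a pixel far outside the 3x3 receptive field of output element [0, 0]
x2 = x.clone()
x2[0, 0, 7, 7] += 100.0
y1 = conv(x2)

print(torch.allclose(y0[0, 0, 0, 0], y1[0, 0, 0, 0]))  # True: outside its field
print(torch.allclose(y0[0, 0, 5, 5], y1[0, 0, 5, 5]))  # False: inside its field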

If we have, say, a 3x3 filter (convolution kernel), consider the kernel
position, as it is slid along, that corresponds to a particular element of
the output tensor. The 3x3 region of the input image that the filter is then
“on top of” is the only part of the input image (assuming no dilation) that
affects that element of the output tensor. That 3x3 region of the input image
is the “receptive field” of the neuron (that is, of that single element of
the output tensor). If we had, say, a 5x7 kernel, the receptive field of a
particular neuron would be a particular 5x7 region of the input image.
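In code (once more a sketch, with stride 1 and no padding or dilation), the output element at position (i, j) for a 3x3 kernel really is just the kernel dotted with the 3x3 patch it sits on top of:

import torch

conv = torch.nn.Conv2d(1, 1, kernel_size=3, bias=False)
x = torch.randn(1, 1, 8, 8)
y = conv(x)

i, j = 2, 4                                   # an arbitrary output position
patch = x[0, 0, i:i + 3, j:j + 3]             # that neuron's 3x3 receptive field
manual = (conv.weight[0, 0] * patch).sum()
print(torch.allclose(manual, y[0, 0, i, j]))  # True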

But, yeah, I’m not a big fan of saying that neural networks have well-defined
neurons in them, and even less of a fan when speaking of convolutional layers.

Best.

K. Frank


Hey @KFrank, thank you so much for such a detailed explanation. I do agree the actual-neuron analogy is a bit too much to understand at times; it is why I was having trouble in the first place. But if you go all in and start looking at everything in terms of neurons, it starts making a lot more sense. What really helped me was watching Andrej Karpathy’s lectures from CS231n.

Thank you so much once again @KFrank!