If I have a 100x100 grid and I apply a 3x3 convolution to output another 100x100 grid, I am propagating information to each cell from its neighbors. If I want to propagate information from each cell’s neighbors’ neighbors, I would apply another 3x3 convolution after the first one. If I want to propagate from N cells away, I would apply N 3x3 convolutional layers one after the other.
But what if I want to propagate information from every cell in the grid, and the grid has variable size? I can’t have a variable number of layers.
What if instead of N 3x3 convolutional layers, I applied the same 3x3 convolutional layer to its own output N times? That would have the same effect of propagating information to each cell from N cells away, and it can be done for variable N.
How would learning work with this setup? What could the drawbacks be to this approach? Should I somehow use a hidden state between the applications like in an RNN?
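The setup described above can be sketched roughly like this (a minimal sketch assuming PyTorch; the class name `RecurrentConv` and the ReLU between applications are my own choices, not anything standard):

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    """One 3x3 conv whose weights are shared across all N applications."""
    def __init__(self, channels, n_steps):
        super().__init__()
        # padding=1 keeps the HxW grid size constant across applications,
        # which works for any input size, not just 100x100
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = torch.relu(self.conv(x))
        return x

net = RecurrentConv(channels=8, n_steps=5)
out = net(torch.randn(1, 8, 100, 100))
print(out.shape)  # torch.Size([1, 8, 100, 100])
```

Since `n_steps` is just a loop bound, it can be set per input, e.g. proportional to the grid's side length.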
Each convolution (i.e., each set of filters) learns different features depending on where it sits in the network. The convolutions early in the network learn to identify low-level features (such as lines and points), while later convolutions learn to identify higher-level features (such as eyes and ears). If you simply reuse the same convolution N times, the parameters are shared across all N applications, so it would be hard (or impossible) for the network to learn those distinct features at different depths.
Remember, RNNs are built around the concepts of sequence and BPTT. CNNs are not, so there is no hidden state shared between convolutional layers. Hence simply using multiple CNN layers is the best approach.
Isn’t it structurally similar to an RNN, though, because we are feeding the output of the convolution back into itself over a number of steps in sequence? I thought PyTorch might even use BPTT for this kind of setup.
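As far as I can tell, PyTorch's autograd handles this automatically: the loop is unrolled in the computation graph, so one `backward()` call accumulates the shared weights' gradients over every application, which is effectively what BPTT does for an RNN. A small sketch to check this (assuming PyTorch; the step count of 4 is arbitrary):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
x = torch.randn(1, 1, 10, 10)

# Four weight-tied applications of the same conv
h = x
for _ in range(4):
    h = torch.tanh(conv(h))

# One backward pass flows through all four applications;
# conv.weight.grad sums the contribution from every step
h.sum().backward()
print(conv.weight.grad.shape)  # torch.Size([1, 1, 3, 3])
```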
If my input grid’s side length is variable and I want to communicate information to cell (i,j) from every other cell in the grid, or from 50% of the other cells, etc., then I cannot just use a fixed stack of convolution layers; I need a way to apply a variable number of convolutions.
CNNs process spatial information while RNNs process temporal information; while both have learnable parameters, their purpose and application are completely different. 1-D CNNs are sometimes applied to text data, but in most cases they cannot carry context information the way RNNs do (and better techniques are available now).
In most cases, you would pad or trim your input images to a common fixed size, so the dimensions are consistent before feeding them to your network.
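A minimal sketch of that preprocessing step (assuming PyTorch; the helper name `pad_or_trim` and the target size of 100 are my own, and I zero-pad on the bottom/right for simplicity):

```python
import torch
import torch.nn.functional as F

def pad_or_trim(x, target=100):
    """Force a (batch, channels, H, W) tensor to (batch, channels, target, target)."""
    # Trim any dimension that is too large
    x = x[:, :, :target, :target]
    _, _, h, w = x.shape
    # Zero-pad any dimension that is too small
    # (F.pad order for the last two dims: left, right, top, bottom)
    return F.pad(x, (0, target - w, 0, target - h))

print(pad_or_trim(torch.randn(1, 3, 120, 80)).shape)  # torch.Size([1, 3, 100, 100])
```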
Try implementing what you’re thinking and we can discuss more.