Say we have an intermediate layer of a neural network that receives two inputs:
- Output generated by a previous layer in the following format: N=2000 elements, Cin=10 channels, H=W=100 pixels
- Another input in the following format: N=2000 elements, Cin=1 channel, H=W=100 pixels
And we need to combine these two inputs into one and then apply a convolution to the combined data, which will have the following format: N=2000 elements, Cin = 10+1 = 11 channels, H=W=100 pixels
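In PyTorch-style NCHW terms (PyTorch is an assumption here; the question does not name a framework), the shapes would look like this. A small batch is used for illustration instead of the full N=2000:

```python
import torch

# Small batch for illustration; the actual data uses N = 2000.
x = torch.randn(4, 10, 100, 100)  # previous layer's output: (N, 10, H, W)
y = torch.randn(4, 1, 100, 100)   # second input:            (N, 1, H, W)

# Target after combining: (N, 11, H, W), to be fed into a
# convolution with in_channels = 10 + 1 = 11.
print(x.shape[1] + y.shape[1])  # 11 combined channels
```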
What is the most efficient way to do this?
I’ve considered several options, but all of them look bad:
1. In the given layer, construct a new tensor that holds, for each element, the 10 channels from the first input stream plus the 1 channel from the second. This requires allocating and copying a combined tensor on the fly, which does not seem very efficient.
2. Use 11 input channels in all previous operations, but somehow restrict those operations to only the first 10 channels, to avoid adding unused weights. Unfortunately, I haven’t found a way to implement this.
3. Use 11 input channels in all previous operations and do not restrict them from using the 11th channel. This creates redundant connections (weights) between neurons.
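For concreteness, here is what the first option would look like in PyTorch (a sketch under that assumption; the framework and the conv parameters like 16 output channels are illustrative, not from the question). The combination is a channel-wise concatenation:

```python
import torch
import torch.nn as nn

# Small batch for illustration; the actual data uses N = 2000.
x = torch.randn(4, 10, 100, 100)  # previous layer's output
y = torch.randn(4, 1, 100, 100)   # second input

# Option 1: concatenate along the channel dimension (dim=1),
# producing a (N, 11, H, W) tensor that a conv can consume.
combined = torch.cat([x, y], dim=1)

# Illustrative convolution over the 11 combined channels;
# out_channels=16 is an arbitrary example value.
conv = nn.Conv2d(in_channels=11, out_channels=16, kernel_size=3, padding=1)
out = conv(combined)

print(combined.shape)  # torch.Size([4, 11, 100, 100])
print(out.shape)       # torch.Size([4, 16, 100, 100])
```

The extra allocation this option complains about is the `torch.cat` copy, done once per forward pass.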
Do you have any ideas?