I am trying to write a binary classifier for pairs of images taken one after the other from a video. My frames are black and white and thus have only one channel. I've been using two Conv3d() layers with one channel, to moderate success (~85% accuracy), but it occurred to me that I could just be using Conv2d() layers with multiple channels instead.
What would be the advantages/disadvantages of each approach for this scenario? I'm inclined to stick with Conv3d() layers because they can convolve over the third dimension, but as I'm rather new to this I don't know how useful that actually is.
Note: I did come across this thread before making this post, but I was hoping someone could give a more in-depth answer, and I wasn't sure whether I should bump the thread, so I am making a new post.
If your use case is restricted to having just pairs of images as your
network input (as opposed to using more than two frames of the
video, say, 8 or 16), then Conv2d with in_channels = 2 and
kernel_size = (k, k) is essentially equivalent to Conv3d with
in_channels = 1 and kernel_size = (2, k, k) applied to a tensor with
a depth dimension of 2 (assuming that out_channels = 1).
With a depth dimension (and depth kernel_size) of 2, the kernel
spans the entire depth, so there is nothing to slide over, and you
get the same result as the 2-channel computation.
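You can verify this equivalence numerically. The sketch below (with an assumed kernel size k = 3 and a 32x32 input, just for illustration) copies the Conv2d weights into a Conv3d layer and checks that both produce the same output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
k = 3

# Conv2d treating the two frames as two input channels.
conv2d = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=k)

# Conv3d treating the two frames as a depth dimension of size 2.
conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=(2, k, k))

# Give both layers the same parameters: Conv2d weight has shape
# (1, 2, k, k); Conv3d expects (1, 1, 2, k, k) -- same numbers,
# with an extra singleton in_channels dimension.
with torch.no_grad():
    conv3d.weight.copy_(conv2d.weight.unsqueeze(1))
    conv3d.bias.copy_(conv2d.bias)

frames = torch.randn(4, 2, 32, 32)       # (batch, 2 frames, H, W)
out2d = conv2d(frames)                   # -> (4, 1, 30, 30)
out3d = conv3d(frames.unsqueeze(1))      # (4, 1, 2, 32, 32) -> (4, 1, 1, 30, 30)

# The depth output is 1, so squeezing it recovers the Conv2d result.
print(torch.allclose(out2d, out3d.squeeze(2), atol=1e-5))
```

This prints True: with the same weights, the two layers compute the same function on depth-2 input.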
Since they're equivalent in terms of the result, neither one gives you
a better network. To my mind, Conv2d is stylistically better, because
it better matches the way I think about what is going on.
A profound thanks to you for not resurrecting that zombie thread.
Exhuming old threads from their rightful interment just adds to the
noise.