I am trying to write a binary classifier for pairs of images taken one after the other from a video. My frames are in black and white and thus only have one channel. I’ve been using two Conv3d() layers with one channel to moderate success (~85% accuracy); however, it occurred to me that I could just be using Conv2d() layers with multiple channels instead.
What would be the advantages/disadvantages of using either one for this scenario? I’m thinking I should stick with Conv3d() layers due to the ability to run convolution in the third dimension, but as I’m rather new to this I don’t know how useful that actually is.
Note: I did come across this thread before making this post, but I was hoping someone could give a more in-depth answer, and I wasn’t sure whether or not I should bump the thread, so I am making a new post.
All help is appreciated, thanks in advance!
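For concreteness, here is a minimal sketch of the two alternatives I’m comparing (the 64x64 frame size, kernel sizes, and out_channels = 8 are just placeholders, not my real setup):

```python
import torch

# two consecutive grayscale frames, batch size 1, 64x64 pixels
frames = torch.randn(1, 2, 64, 64)  # (batch, frame, height, width)

# Option A: Conv3d with one channel; the frame axis becomes the depth dimension
conv3d = torch.nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(2, 3, 3))
out3d = conv3d(frames.unsqueeze(1))  # input (1, 1, 2, 64, 64)
print(out3d.shape)                   # (1, 8, 1, 62, 62)

# Option B: Conv2d treating the two frames as two input channels
conv2d = torch.nn.Conv2d(in_channels=2, out_channels=8, kernel_size=(3, 3))
out2d = conv2d(frames)               # input (1, 2, 64, 64)
print(out2d.shape)                   # (1, 8, 62, 62)
```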
If your use case is restricted to having just pairs of images as your network input (as opposed to using more than two frames of the video, say, 8 or 16), then Conv2d with in_channels = 2 and kernel_size = (k, k), and Conv3d with in_channels = 1, kernel_size = (2, k, k), and a tensor with a depth dimension of 2 passed in, are essentially equivalent (assuming that out_channels = 1).
With a depth dimension (and kernel_size) of 2, you don’t really have anything to convolve over in that dimension, so you get the same result as the Conv2d version.
Since they’re equivalent in terms of the result, neither one gives you a better network. To my mind, Conv2d is stylistically better, because it better fits the way I think about what is going on.
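As a quick numerical sanity check of that equivalence (my own sketch, with k = 3, out_channels = 1, and a made-up 7x7 input):

```python
import torch

torch.manual_seed(0)

# the two supposedly equivalent setups
conv2 = torch.nn.Conv2d(in_channels=2, out_channels=1, kernel_size=(3, 3), bias=False)
conv3 = torch.nn.Conv3d(in_channels=1, out_channels=1, kernel_size=(2, 3, 3), bias=False)

# copy the Conv2d weight, shape (1, 2, 3, 3), into the Conv3d weight, shape (1, 1, 2, 3, 3)
with torch.no_grad():
    conv3.weight.copy_(conv2.weight.unsqueeze(1))

pair = torch.randn(1, 2, 7, 7)            # two frames stacked as channels
r2 = conv2(pair)                          # shape (1, 1, 5, 5)
r3 = conv3(pair.unsqueeze(1)).squeeze(2)  # depth of 2 convolves down to 1, squeezed away
print(torch.allclose(r2, r3))
```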
A profound thanks to you for not resurrecting that zombie thread. Exhuming old threads from their rightful interment just adds to the clutter.
Hi K. Frank,
Thank you for your response!
If out_channels is greater than 1 for both Conv2d and Conv3d, are they still equivalent?
The shapes of the relevant weights suggest that they could do the same
thing, and a test computation shows that this is the case:
>>> import torch
>>> _ = torch.manual_seed (2022)
>>> conv2 = torch.nn.Conv2d (in_channels = 2, out_channels = 5, kernel_size = (3, 3), bias = False)
>>> conv3 = torch.nn.Conv3d (in_channels = 1, out_channels = 5, kernel_size = (2, 3, 3), bias = False)
>>> conv2.weight.shape
torch.Size([5, 2, 3, 3])
>>> conv3.weight.shape
torch.Size([5, 1, 2, 3, 3])
>>> with torch.no_grad():
...     _ = conv3.weight.copy_ (conv2.weight.unsqueeze (1))
...
>>> input2 = torch.randn (1, 2, 7, 7)
>>> input3 = input2.unsqueeze (1)
>>> result2 = conv2 (input2)
>>> result3 = conv3 (input3)
>>> torch.equal (result2, result3.squeeze (2))
True