Summary:

```
import torch
import torch.nn as nn

t1 = torch.zeros(8, 256, 64, 64)
t2 = torch.zeros(8, 256, 64, 64)

# Option 1: concatenate along the channel dimension, then conv2d
catted = torch.cat((t1, t2), dim=1)  # shape = (8, 512, 64, 64)
print(catted.shape)
conv2d = nn.Conv2d(512, 256, kernel_size=3, padding=1)
result_conv2d = conv2d(catted)  # shape = (8, 256, 64, 64)
print(result_conv2d.shape)

# Option 2: stack along a new depth dimension, then conv3d
stacked = torch.stack((t1, t2), dim=2)  # shape = (8, 256, 2, 64, 64)
print(stacked.shape)
conv3d = nn.Conv3d(256, 256, kernel_size=(2, 3, 3), padding=(0, 1, 1))
result_conv3d = conv3d(stacked)  # shape = (8, 256, 1, 64, 64)
print(result_conv3d.squeeze(2).shape)  # shape = (8, 256, 64, 64)
```

In this case, would conv3d be doing the same thing as conv2d?

Background:

I’m currently experimenting with fusion of data in conv nets, and was thinking that it might be sensible to include 3d convolutions in some places. However, after thinking some more about it, I began to question how exactly the 3d convolution would even work…

Let’s say I have 2 different but somewhat related images as input to 2 separate feature extraction branches, 1 image per branch.

Now after some convolutions, I end up with 2 tensors of sizes (N, C, H, W).

The standard procedure would now be to concatenate both tensors along the channel dimension C, so that we get 1 tensor of size (N, 2C, H, W), and then apply a 2d convolution.

For this example, let’s assume the 2d convolution has kernel_size=3 and padding=1, so that the input and output shapes stay the same (except for the channels).

However, it might be sensible to use a 3d convolution instead, since in theory a 3d convolution should extract some kind of information about how the 2 inputs correlate along the depth dimension.

So regarding the implementation:

I can simply torch.stack my 2 tensors with dim=2 to get 1 tensor of shape (N, C, D, H, W), where D = 2, which I could then feed to a conv3d.

But now, if I apply a 3d convolution that preserves the input shape (e.g. kernel_size=3 and padding=1), I still end up with a (N, C, D, H, W) tensor, whereas I would like to get a (N, C, H, W) tensor in the end.

My first idea was to do a conv3d with kernel_size=(D, 3, 3) and padding=(0, 1, 1), so that the depth dimension of the output has size 1 and we can then squeeze the resulting tensor.

However, isn’t a conv3d with kernel_size=(D, 3, 3) on the stacked tensor exactly the same as a conv2d with kernel_size=(3, 3) on the concatenated tensor? If the kernel’s D dimension is equal to the tensor’s D dimension and we have no padding along D, we cannot move the filter along the D dimension. The filter therefore only moves along the H and W dimensions, and it has exactly the same number of parameters as the conv2d filter (C_out × C × D × 3 × 3 versus C_out × 2C × 3 × 3, with D = 2). So it should be exactly the same, or not?
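
One way to check this numerically would be to copy the conv3d weights into a conv2d layer and compare the two outputs. The following is only a minimal sketch under my own assumptions: the sizes N, C, C_out, H, W are small arbitrary values chosen for illustration, and the weight rearrangement assumes that torch.cat along dim=1 orders the channels as all channels of t1 followed by all channels of t2, so the depth index has to become the slow index in the flattened channel dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, arbitrary sizes (assumptions for illustration only).
N, C, C_out, D, H, W = 2, 4, 5, 2, 8, 8
t1 = torch.randn(N, C, H, W)
t2 = torch.randn(N, C, H, W)

conv3d = nn.Conv3d(C, C_out, kernel_size=(D, 3, 3), padding=(0, 1, 1))
conv2d = nn.Conv2d(D * C, C_out, kernel_size=3, padding=1)

with torch.no_grad():
    # conv3d.weight has shape (C_out, C, D, 3, 3).
    # Move D in front of C, then flatten (D, C) into D*C input channels
    # so the channel order matches torch.cat((t1, t2), dim=1).
    w2d = conv3d.weight.permute(0, 2, 1, 3, 4).reshape(C_out, D * C, 3, 3)
    conv2d.weight.copy_(w2d)
    conv2d.bias.copy_(conv3d.bias)

    # Stack along a new depth dimension, convolve, drop the size-1 depth.
    out3d = conv3d(torch.stack((t1, t2), dim=2)).squeeze(2)  # (N, C_out, H, W)
    # Concatenate along channels and convolve in 2d.
    out2d = conv2d(torch.cat((t1, t2), dim=1))               # (N, C_out, H, W)

n_params_3d = sum(p.numel() for p in conv3d.parameters())
n_params_2d = sum(p.numel() for p in conv2d.parameters())
same_output = torch.allclose(out3d, out2d, atol=1e-5)

print(out3d.shape, out2d.shape)
print(n_params_3d, n_params_2d)
print(same_output)
```

If the outputs agree up to floating point tolerance and the parameter counts match, the two layers compute the same function for this configuration and differ only in how the weights are laid out in memory. Note that squeeze(2) is used rather than squeeze(), so that a batch size of 1 would not get squeezed away as well.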