How does one use 3D convolutions on standard 3 channel images?

I am trying to use 3D convolutions on the CIFAR-10 dataset (just for fun). The docs say the input is usually a 5D tensor (N,C,D,H,W). Am I really forced to pass 5-dimensional data?

The reason I am skeptical is that a 3D convolution simply means my kernel moves across 3 dimensions/directions. So technically I could have a 3D, 4D, 5D, or even 100D tensor, and it should all work as long as it's at least a 3D tensor. Is that not right?

I tried it real quick and it did give an error:

import torch


def conv3d_example():
    N,C,H,W = 1,3,7,7
    img = torch.randn(N,C,H,W)
    ##
    in_channels, out_channels = 1, 4
    kernel_size = (2,3,3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())

##
conv3d_example()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-29c73923cc64> in <module>
     15 
     16 ##
---> 17 conv3d_example()

<ipython-input-3-29c73923cc64> in conv3d_example()
     10     conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
     11     ##
---> 12     out = conv(img)
     13     print(out)
     14     print(out.size())

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    474                             self.dilation, self.groups)
    475         return F.conv3d(input, self.weight, self.bias, self.stride,
--> 476                         self.padding, self.dilation, self.groups)
    477 
    478 

RuntimeError: Expected 5-dimensional input for 5-dimensional weight 4 1 2 3, but got 4-dimensional input of size [1, 3, 7, 7] instead

cross posted:

Am I really forced to do this?

def conv3d_example_with5d_tensor():
    # Conv3d reads the dimensions in the order (N, C, D, H, W)
    N,C,D,H,W = 2,1,3,7,7
    img = torch.randn(N,C,D,H,W)
    ##
    in_channels, out_channels = 1, 4
    kernel_size = (2,3,3)
    conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
    ##
    out = conv(img)
    print(out)
    print(out.size())

##
#conv3d_example()
conv3d_example_with5d_tensor()

A 3-dimensional conv kernel uses all input channels and “moves” along the three spatial dimensions D, H, W in the standard setup.

How would you like the kernel to operate on the input in your first code snippet?
Since one spatial dimension is missing, I’m not sure how the kernel should be used in that case.
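For reference, here is a minimal sketch of two ways the 4D image from the first snippet could be given the missing dimension. Which one is "right" depends on what the kernel is supposed to do, so treat both as assumptions about the intent rather than the recommended approach. The variable names (vol_a, vol_b, conv_a, conv_b) are made up for illustration.

```python
import torch

N, C, H, W = 1, 3, 7, 7
img = torch.randn(N, C, H, W)

# Option A (assumption): treat the 3 color channels as the depth axis,
# so the volume has a single input channel: (N, 1, D=3, H, W).
vol_a = img.unsqueeze(1)
conv_a = torch.nn.Conv3d(in_channels=1, out_channels=4, kernel_size=(2, 3, 3))
out_a = conv_a(vol_a)
print(out_a.shape)  # torch.Size([1, 4, 2, 5, 5])

# Option B (assumption): keep the 3 color channels as channels and add a
# depth axis of size 1: (N, 3, D=1, H, W). The kernel depth must then be 1,
# which makes this behave like an ordinary 2D convolution.
vol_b = img.unsqueeze(2)
conv_b = torch.nn.Conv3d(in_channels=3, out_channels=4, kernel_size=(1, 3, 3))
out_b = conv_b(vol_b)
print(out_b.shape)  # torch.Size([1, 4, 1, 5, 5])
```

With option A the kernel actually slides across the color channels; with option B it mixes all channels at every position, as a 2D conv would.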

So the way I understand it, the kernel moves in 3 dimensions. I get that part. So I would have expected a 3D conv to work on any input as long as we specify the 3 dimensions to move along.

So the input tensor could be of size (A,B,C) or (A,B,C,D) or (A,B,C,D,E) or even (X1,X2,X3,…,X_100), and the 3D convolution should work as long as we specify which 3 dimensions to move along. Does what I am thinking make sense?

But instead the layer just blows up in my face with an error. It seems it's been hardcoded to expect 5-dimensional tensors no matter what. I’m fine with that (even though I might consider it an incomplete implementation), but it's not clear from the docs what is going on, especially because they have the paragraph starting “in the simplest case…”
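The "pick any 3 dimensions" behaviour described above can be emulated on top of Conv3d by rearranging the tensor first. The sketch below is only one possible reading of that idea: conv3d_over_dims is a hypothetical helper (not part of PyTorch), and it assumes every dimension that is not convolved over should be treated like a batch dimension with a single input channel.

```python
import torch


def conv3d_over_dims(x, conv, dims):
    """Hypothetical helper: slide `conv` over the three dims in `dims`.

    All other dims of `x` are folded into the batch, and a channel axis
    of size 1 is inserted, so `conv` must have in_channels=1.
    """
    # Move the three chosen dims to the end, keeping the others in order.
    x = torch.movedim(x, dims, (-3, -2, -1))
    lead = x.shape[:-3]
    # Fold the leading dims into one batch axis: (B, 1, D, H, W).
    x = x.reshape(-1, 1, *x.shape[-3:])
    y = conv(x)  # (B, C_out, D', H', W')
    # Restore the leading dims; the new channel axis follows them.
    return y.reshape(*lead, *y.shape[1:])


x = torch.randn(2, 5, 6, 7, 8)
conv = torch.nn.Conv3d(in_channels=1, out_channels=4, kernel_size=(2, 3, 3))
out = conv3d_over_dims(x, conv, dims=(1, 3, 4))
print(out.shape)  # torch.Size([2, 6, 4, 4, 5, 6])
```

The layer itself still only ever sees a 5D tensor; the helper just decides which axes play the roles of N, C, D, H, and W.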

Currently, I’m working with 3D convolutions and multiple input images. Compared to 2D convolution, 3D convolution also considers the relationship among adjacent images (e.g. video frames) along the time domain, which is the depth dimension. Five input images can be regarded as five video frames. For that reason, we need to expand the depth or channel dimensions so that features can be extracted among the images in the time domain, giving an input of shape [batch, channels (# of channels of each image), depth (# of frames), height, width].
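As a rough illustration of that layout, five image batches can be stacked along a new depth axis before going into Conv3d. The sizes below (batch of 2, 32x32 RGB images, 8 output channels) are arbitrary assumptions chosen for the example.

```python
import torch

# Five "frames", each a batch of RGB images with shape (N, C, H, W).
frames = [torch.randn(2, 3, 32, 32) for _ in range(5)]

# Stack along a new axis right after the channels: (N, C, D, H, W).
clip = torch.stack(frames, dim=2)
print(clip.shape)  # torch.Size([2, 3, 5, 32, 32])

# padding=1 with a 3x3x3 kernel keeps depth, height, and width unchanged.
conv = torch.nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
out = conv(clip)
print(out.shape)  # torch.Size([2, 8, 5, 32, 32])
```

Here each output value mixes information from up to three neighbouring frames, which is the temporal relationship a 2D conv cannot see.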
