Depthwise convolution using nn.functional.conv3d

Hello,

I’m implementing a CNN that aligns multiple video frames using a dynamic filter network (DFN).

In my implementation, the DFN estimates convolution filters of size [batch x 3 (depth, i.e. # of input frames) x filter_height x filter_width], and these estimated filters are used to align the three input frames via convolution.

Given a tensor of size [8 x 128 x 3 x 32 x 32], which is generated by nn.Conv3d from the three input frames, and the estimated filters of size [8 x 3 x 9 x 9], I tried to use the nn.functional.conv3d function to perform the depthwise convolution over the feature maps of each input frame:

import torch
from torch import nn

# estimated filters
filters = torch.unsqueeze(filters, dim=1)       # [8, 1, 3, 9, 9] 
filters = filters.repeat(1, 128, 1, 1, 1)       # [8, 128, 3, 9, 9] 
filters = filters.permute(1, 0, 2, 3, 4)       # [128, 8, 3, 9, 9] 

f_sh = filters.shape
filters = torch.reshape(filters, (1, f_sh[0] * f_sh[1], f_sh[2], f_sh[3], f_sh[4]))       # [1, 128*8, 3, 9, 9] 
filters = filters.permute(1, 0, 2, 3, 4)       # [128*8, 1, 3, 9, 9] 

# input tensor to be aligned
sh = x_in.shape        # [8, 128, 3, 32, 32]
x_in = torch.reshape(x_in, (1, sh[0] * sh[1], sh[2], sh[3], sh[4]))       # [1, 8*128, 3, 32, 32]
temp_out = nn.functional.conv3d(x_in, filters, stride=(1, 1, 1), padding=(0, 2, 2), groups=128*8)

It runs, but I want to know whether it’s correctly implemented or not.
Could you give me an idea of how to perform the depthwise convolution efficiently using nn.functional.conv3d?
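
For reference, the operation I intend can be written as an explicit (slow) loop over the batch. This is just a minimal sketch to pin down the semantics; the tensor contents here are random placeholders:

import torch
import torch.nn.functional as F

B, C, D, H, W = 8, 128, 3, 32, 32
x_in = torch.randn(B, C, D, H, W)      # feature maps of the three frames
filters = torch.randn(B, D, 9, 9)      # one estimated [3, 9, 9] filter per sample

# apply sample b's filter to each of its 128 feature maps
ref = torch.stack([
    F.conv3d(
        x_in[b].unsqueeze(1),                  # [128, 1, 3, 32, 32]
        filters[b].unsqueeze(0).unsqueeze(0),  # [1, 1, 3, 9, 9]
        padding=(0, 2, 2),
    ).squeeze(1)                               # -> [128, 1, 28, 28]
    for b in range(B)
])                                             # -> [8, 128, 1, 28, 28]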

Thank you.

I’m not sure if reshaping the batch dimension into the channels is the right approach here.
Currently you are using a separate filter for each channel in each sample (128*8 filters in total).
This would also tie the operation to a fixed batch size, which seems a bit unusual.
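
Also note that with groups=128*8, the g-th filter is applied to the g-th input channel, so the filter layout has to match the batch-major channel order created by x_in.reshape(1, 8*128, ...). A small sketch of a layout that would match (shapes taken from your post, variable names mine):

import torch

B, C, D = 8, 128, 3
filters = torch.randn(B, D, 9, 9)   # estimated filters, as in your post

# after x_in.reshape(1, B*C, D, 32, 32), input channel g belongs to
# sample g // C and feature map g % C, i.e. batch-major ordering
w = filters.unsqueeze(1)            # [8, 1, 3, 9, 9]
w = w.repeat(1, C, 1, 1, 1)         # [8, 128, 3, 9, 9]
w = w.reshape(B * C, 1, D, 9, 9)    # [1024, 1, 3, 9, 9], batch-major as well

Your permute to [128, 8, 3, 9, 9] before the reshape yields a channel-major layout instead, so most input channels would be convolved with another sample’s filter.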

Could you explain your use case a bit?

I’m referring to this paper, which performs image registration. I’m trying to apply the convolution to each feature map using the estimated dynamic filters, as illustrated in the following figure.

In this figure, N is the number of input images (in my case, input frames) and F is the number of feature maps for each input image.

Besides, I found the TensorFlow code released by the authors, and I’m implementing the registration part in PyTorch. In their code, they apply the convolution to the misaligned feature maps using the tf.nn.depthwise_conv2d function (since TensorFlow doesn’t provide a function for 3D depthwise convolution).
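
As far as I understand, the rough PyTorch counterpart of tf.nn.depthwise_conv2d is a grouped conv2d with groups equal to the number of input channels; a minimal sketch with example shapes:

import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)       # [N, C, H, W]
w = torch.randn(8, 1, 9, 9)         # [C * multiplier, 1, kH, kW], multiplier = 1
out = F.conv2d(x, w, padding=4, groups=8)   # each channel gets its own filter
print(out.shape)                    # torch.Size([1, 8, 32, 32])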

Compared with the authors’ code, I’m not sure whether my approach is right.

Thank you.