Feeding video input with Conv3d to a 3D ResNet

I have a question about using video input with Conv3d.
My input has shape [batch_size=128, channels=3, depth=32, height=64, width=64],
i.e., a batch of 128 RGB videos, each consisting of 32 frames of size 64 × 64.

I want to feed this input to a 3D ResNet (the 3D version of the ResNet model) and fine-tune it, but I am not sure how to arrange the 32 frames when feeding them to the network.
Does it matter whether I feed the input as:
first case: [batch_size=128, channels=3, depth=32, height=64, width=64]
or
second case: [batch_size=128, channels=3, height=64, width=64, depth=32]

The network accepts both cases, but I don't know which one is the right way.

If I feed it as in the second case, what will happen? Will the network fail to learn, or just learn less efficiently?

Thank you

As per the documentation for Conv3d, the input should be passed as [batch_size=128, channels=3, depth=32, height=64, width=64], i.e., your first case.
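For reference, you can confirm the expected layout with a quick shape check (a minimal sketch; out_channels=16 here is arbitrary):

import torch
import torch.nn as nn

# Conv3d expects input of shape [batch_size, channels, depth, height, width]
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3))

x = torch.randn(128, 3, 32, 64, 64)  # [batch, channels, depth, height, width]
out = conv(x)
print(out.shape)  # torch.Size([128, 16, 30, 62, 62])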

If you use the second layout anyway, the network might still learn something, but the effect on performance is hard to predict: the assumptions built into the network's design would no longer match how the input is organized. I would guess it mostly comes down to the kernel sizes, i.e., whether they were chosen with [batch_size, channels, depth, height, width] or [batch_size, channels, height, width, depth] in mind. A symmetric kernel such as (3, 3, 3) treats all three axes identically, but an asymmetric kernel (say, a temporal extent of 3 and spatial extents of 7) would end up convolving along the wrong axes.

Given that you are using a pretrained 3D ResNet model, it will have been trained with [batch_size, channels, depth, height, width], so you should stick to that layout for fine-tuning.
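For example, here is a minimal sketch assuming torchvision's r3d_18 as the pretrained 3D ResNet (the commented permute is only needed if your tensor happens to be in the second layout):

import torch
from torchvision.models.video import r3d_18

model = r3d_18(pretrained=True)  # trained on Kinetics-400 with [N, C, D, H, W] inputs

x = torch.randn(128, 3, 32, 64, 64)  # first case: [batch, channels, depth, height, width]

# If your data were in the second layout, [batch, channels, height, width, depth],
# reorder the axes before the forward pass:
# x = x.permute(0, 1, 4, 2, 3).contiguous()

out = model(x)
print(out.shape)  # torch.Size([128, 400]) -- Kinetics-400 class logits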

Thank you for the reply.
I have another question related to this issue.

I have some code in Keras, and I have a problem understanding the "same" padding.
My input shape to this layer is [128, 2048, 4, 2, 2].
This is my Keras code:
combined = Conv3D(128, (3, 3, 3), strides=1, padding='same')(combined)

And I want to do the same thing in PyTorch. This is what I did:
self.conv1 = nn.Conv3d(2048, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

Do these perform the same?

Yes, in my understanding they do the same thing. With a 3×3×3 kernel and stride 1, Keras' "same" padding pads each dimension by (3 - 1) / 2 = 1 on each side, which is exactly padding=(1, 1, 1) in PyTorch, so the output keeps the input's depth, height, and width.
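You can verify it with a quick shape check using the sizes from your layer (a minimal sketch):

import torch
import torch.nn as nn

conv = nn.Conv3d(2048, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

x = torch.randn(128, 2048, 4, 2, 2)
out = conv(x)
print(out.shape)  # torch.Size([128, 128, 4, 2, 2]) -- depth/height/width preserved

As a side note, PyTorch 1.9 and later also accept padding='same' directly in nn.Conv3d for stride-1 convolutions, which mirrors the Keras argument.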