Feeding video input with Conv3d to a 3D ResNet

I have a question about using video input with Conv3d.
My input has shape [batch_size=128, channels=3, depth=32, height=64, width=64],
i.e., a batch of 128 RGB videos, each consisting of 32 frames of size 64 × 64.

I want to feed this input to a 3D ResNet (the 3D version of the ResNet model) and fine-tune it, but I am not sure how to arrange the 32 frames when feeding them to the network.
Does it matter whether I feed the input as:
first case: [batch_size=128, channels=3, depth=32, height=64, width=64]
or
second case: [batch_size=128, channels=3, height=64, width=64, depth=32]

The network accepts both cases, but I don't know which one is the right way.

If I feed it as in the second case, what will happen? Will the network fail to learn, or just learn less efficiently?

Thank you

As per the documentation for Conv3d, the input should be passed as [batch_size=128, channels=3, depth=32, height=64, width=64], i.e., your first case.
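For reference, you can confirm the expected layout with a quick shape check (a minimal sketch; out_channels=16 here is arbitrary):

import torch
import torch.nn as nn

# Conv3d expects input of shape [batch_size, channels, depth, height, width]
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3))

x = torch.randn(128, 3, 32, 64, 64)  # [batch, channels, depth, height, width]
out = conv(x)
print(out.shape)  # torch.Size([128, 16, 30, 62, 62])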

If you use the second layout anyway, the network might still learn something, but the effect on performance is hard to predict: the assumptions built into the network's design would no longer match how the input is organized. I would guess it mostly comes down to the kernel sizes, i.e., whether they were chosen with [batch_size, channels, depth, height, width] or [batch_size, channels, height, width, depth] in mind. A symmetric kernel such as (3, 3, 3) treats all three axes identically, but an asymmetric kernel (say, a temporal extent of 3 and spatial extents of 7) would end up convolving along the wrong axes.

Given that you are using a pretrained 3D ResNet model, it will have been trained with [batch_size, channels, depth, height, width], so you should stick to that layout for fine-tuning.
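For example, here is a minimal sketch assuming torchvision's r3d_18 as the pretrained 3D ResNet (the commented permute is only needed if your tensor happens to be in the second layout):

import torch
from torchvision.models.video import r3d_18

model = r3d_18(pretrained=True)  # trained on Kinetics-400 with [N, C, D, H, W] inputs

x = torch.randn(128, 3, 32, 64, 64)  # first case: [batch, channels, depth, height, width]

# If your data were in the second layout, [batch, channels, height, width, depth],
# reorder the axes before the forward pass:
# x = x.permute(0, 1, 4, 2, 3).contiguous()

out = model(x)
print(out.shape)  # torch.Size([128, 400]) -- Kinetics-400 class logits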

Thank you for the reply.
I have another question related to this issue.

I have some code in Keras, and I have a problem understanding the "same" padding.
My input shape to this layer is [128, 2048, 4, 2, 2].
This is my Keras code:
combined = Conv3D(128, (3, 3, 3), strides=1, padding='same')(combined)

And I want to do the same thing in PyTorch. This is what I did:
self.conv1 = nn.Conv3d(2048, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

Do these perform the same?

Yes, in my understanding they do the same thing. With a 3×3×3 kernel and stride 1, Keras' "same" padding pads each dimension by (3 - 1) / 2 = 1 on each side, which is exactly padding=(1, 1, 1) in PyTorch, so the output keeps the input's depth, height, and width.
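You can verify it with a quick shape check using the sizes from your layer (a minimal sketch):

import torch
import torch.nn as nn

conv = nn.Conv3d(2048, 128, kernel_size=(3, 3, 3), stride=1, padding=(1, 1, 1))

x = torch.randn(128, 2048, 4, 2, 2)
out = conv(x)
print(out.shape)  # torch.Size([128, 128, 4, 2, 2]) -- depth/height/width preserved

As a side note, PyTorch 1.9 and later also accept padding='same' directly in nn.Conv3d for stride-1 convolutions, which mirrors the Keras argument.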