How to apply 3D convolution to stacked video frames

I am working on video classification for motion recognition.
I selected 10 frames from a video and applied optical flow to these sequential frames.
After that, I took the x and y components of each flow and stacked them together.
Finally, I have a 3D matrix with shape H x W x 20.
H = height
W = width
20 = flows from 10 frames (2 channels each, for the x and y components of the optical flow).
So I want to apply a 3D convolution layer to the matrix above.
When I look at the PyTorch documentation for 3D convolution, I see a 5-dimensional input of shape (N, C, D, H, W).
But my input is 4-dimensional, (N, C, H, W), with N samples.
So how can I apply a 3D convolution to my matrix?
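
For concreteness, here is a minimal sketch of the stacking step described above, assuming the flows come from OpenCV's Farneback method and that the clip provides 11 grayscale frames so that 10 consecutive flow fields yield 20 channels (the frame list and function name are hypothetical, not from the thread):

```python
import cv2
import numpy as np

def stack_flows(frames):
    """Stack the x and y components of consecutive dense optical flows.

    frames: list of grayscale uint8 images of shape (H, W);
            11 frames -> 10 flows -> an (H, W, 20) float32 array.
    """
    channels = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Farneback dense optical flow returns an (H, W, 2) float32 array
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # x component
        channels.append(flow[..., 1])  # y component
    return np.stack(channels, axis=-1)  # (H, W, 20)
```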

N is the batch size, i.e. the number of samples. In your case, a sample is not a single frame: your sample is H x W x 1 x 20 (C = 1, you have only one channel).

To map that to (N, C, D, H, W): C = 1, D = 20, H = H, and W = W. N depends on your batch size and on whether it fits in memory.
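
In code, that mapping might look like the following minimal sketch (the batch size, spatial size, and Conv3d hyperparameters are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

# A batch of stacked flows, each stored as H x W x 20
x = torch.randn(8, 224, 224, 20)           # (N, H, W, D)

# Rearrange to (N, C, D, H, W) with C = 1 and D = 20
x = x.permute(0, 3, 1, 2).unsqueeze(1)     # (8, 1, 20, 224, 224)

# The 3D convolution then slides over (D, H, W)
conv = nn.Conv3d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
out = conv(x)                               # (8, 16, 20, 224, 224)
```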

Thanks a lot for the reply.

I want to ask one more question.
As @ebarsoum mentioned, I have an H x W x 1 x 20 numpy uint8 array.
I want to implement a dataset class that inherits from torch.utils.data.Dataset.

In the __getitem__ method, I return this array after applying transforms like:

transforms.RandomCrop,
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]).

Applying the first three of them requires a PIL image.
How can I handle this?
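
One possible way around the PIL requirement is to apply the augmentations at the tensor level, so that all 20 channels are transformed consistently. Below is a minimal sketch under that assumption; the class name, crop size, and [0, 1] rescaling are hypothetical, and note that ColorJitter and the 3-channel ImageNet Normalize statistics listed above do not directly apply to a 20-channel flow stack:

```python
import torch
from torch.utils.data import Dataset

class FlowStackDataset(Dataset):
    """Hypothetical dataset over H x W x 1 x 20 uint8 flow stacks."""

    def __init__(self, stacks, labels, crop_size=224):
        self.stacks = stacks        # list of (H, W, 1, 20) uint8 arrays
        self.labels = labels
        self.crop_size = crop_size

    def __len__(self):
        return len(self.stacks)

    def __getitem__(self, idx):
        arr = self.stacks[idx].squeeze(2)            # (H, W, 20)
        x = torch.from_numpy(arr).permute(2, 0, 1)   # (20, H, W)
        x = x.float().div_(255.0)                    # uint8 -> [0, 1]

        # Random crop: one offset shared by all 20 channels
        h, w = x.shape[1], x.shape[2]
        top = torch.randint(0, h - self.crop_size + 1, (1,)).item()
        left = torch.randint(0, w - self.crop_size + 1, (1,)).item()
        x = x[:, top:top + self.crop_size, left:left + self.crop_size]

        # Random horizontal flip; for physically consistent flow the
        # x-component channels would also need their sign negated
        if torch.rand(1).item() < 0.5:
            x = torch.flip(x, dims=[2])

        # Add the channel dim Conv3d expects: (C=1, D=20, H, W)
        return x.unsqueeze(0), self.labels[idx]
```

Used with a DataLoader, each batch then comes out as (N, 1, 20, crop_size, crop_size), which matches the (N, C, D, H, W) mapping from the answer above.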