How to organize image sequences for a 3D CNN

I am working on the segmentation of a video of a person walking. My goal is to segment the body parts of the person.

I have used a U-Net model to perform segmentation on each individual frame. It works well for segmenting the person from the background, but not for segmenting their individual parts. I want to see if I can improve results by leveraging the temporal order of the images.

I’m wondering how to organize the images into sequences for a 3D CNN. Should I use the same number of images in each sequence, say five? Then I could have a batch of 50 sequences each containing five consecutive images. Or would it be better to randomly select the number of images for each sequence? Also, should I shuffle the sequences in the batch?
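
For concreteness, the fixed-length option I have in mind would look something like this (the array names and frame count are just placeholders I made up):

```python
import numpy as np

frames = np.random.rand(250, 256, 1024, 3)   # placeholder for 250 consecutive video frames
clips = frames.reshape(50, 5, 256, 1024, 3)  # 50 sequences of 5 consecutive frames each
```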

I don’t think you should shuffle the images if you want to leverage their temporal order. By shuffle I mean along the temporal (depth) axis: I would load each sample as consecutive frames [Id, Id+1, …], giving a batch of shape (batch, depth, 3, 256, 1024). But if you train with a batch size larger than one, I would shuffle along the batch (0) axis during training.
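
As a minimal sketch of what I mean (PyTorch assumed; the dataset class, tensor sizes, and frame loading are hypothetical), each clip keeps its frames in temporal order, and only the clips themselves are shuffled by the DataLoader along axis 0:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WalkClipDataset(Dataset):
    def __init__(self, frames, clip_len=5):
        # frames: (num_frames, 3, 256, 1024), already in temporal order
        self.frames = frames
        self.clip_len = clip_len

    def __len__(self):
        # number of sliding windows of length clip_len
        return self.frames.shape[0] - self.clip_len + 1

    def __getitem__(self, idx):
        # consecutive frames [idx, idx + clip_len) stay in order
        clip = self.frames[idx:idx + self.clip_len]   # (5, 3, 256, 1024)
        # PyTorch's Conv3d expects (C, D, H, W) per sample, hence the permute
        return clip.permute(1, 0, 2, 3)               # (3, 5, 256, 1024)

frames = torch.rand(200, 3, 256, 1024)                # dummy stand-in for the real video
loader = DataLoader(WalkClipDataset(frames), batch_size=8, shuffle=True)
```

With `shuffle=True` only the order of the clips (axis 0) changes; the frames inside each clip remain consecutive.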

Changing the 2D layers to 3D will increase the number of parameters by roughly a factor of 3 (a 3×3 kernel becomes 3×3×3). Make sure you have enough training data before you try it.
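
You can see the rough factor of 3 by comparing parameter counts directly (the channel sizes here are just an example):

```python
import torch.nn as nn

conv2d = nn.Conv2d(64, 64, kernel_size=3)
conv3d = nn.Conv3d(64, 64, kernel_size=3)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv2d), count(conv3d))   # 36928 vs 110656, roughly 3x more
```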

Also, look at Keypoint R-CNN first; it might be enough for your application.
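
For example, torchvision ships a Keypoint R-CNN pretrained on COCO person keypoints; a minimal sketch of running it on a single frame (assumes torchvision >= 0.13 for the `weights` argument):

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()
frame = torch.rand(3, 256, 1024)          # one RGB frame, values in [0, 1]
with torch.no_grad():
    out = model([frame])[0]               # dict with boxes, labels, scores, keypoints
print(out["keypoints"].shape)             # (num_detections, 17, 3) COCO keypoints
```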