I am trying to train a model in the biomedical domain with a rather specialized task (flow prediction).
My input consists of video clips and I would like to predict either a single image or a video. I have pixel wise ground truth for every frame of the video but the number of videos is very limited.
Architecture wise I am considering a CNN, RNN combination where the CNN provides a representation of the input frames for the RNN to learn about the temporal relationship between input frames.
Now my question is: What kind of CNN do I use and on what do I pretrain it? Since I am working with biomedical data I would assume image net as well as most other image data sets do not really help as the image content is very different. Are there data sets/tasks/networks I could use for this purpose?