Creating a Multi-Input Model for Streaming Object Detection

I’d like to create two CNN (or similar) streams and have these networks work in parallel. Each network would take one of the two streams, grab a frame, and feed it into the detection process; the two branches would join at the last fully connected layer, and the object detection predictions would then be made over the standard VOC classes.

There would be only one weights file for this.

I looked at doing this in YOLOv2 or v3 for speed, but I think my C is too rusty.

I checked the forums here, and there are some partial answers, but perhaps I’m not understanding how you actually set up two networks to run at the same time with all the special boxes, truth predictions, etc. that you find in a standard SSD or a PyTorch YOLOv2.

How would I approach this problem in PyTorch?

In general, torch.nn.Module.forward() doesn’t care if you glue two things together and provide them as a single input to the network, then split and concatenate them inside as you wish. For examples, see def net() in the DAWNBench CIFAR-10 leader’s code, or look up the U-Net architecture for splitting and concatenating in big style.
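A minimal sketch of that idea: two frames glued along the channel dimension form a single input, the forward pass splits them, runs each half through its own conv stream, and concatenates the features before one shared fully connected head. All layer sizes and the class name `TwoStreamNet` are illustrative placeholders, not a tuned detection architecture.

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Two parallel conv streams fused by concatenation before the head."""

    def __init__(self, num_classes=20):  # 20 = standard VOC classes
        super().__init__()

        def make_stream():
            # Tiny placeholder backbone; swap in any feature extractor.
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.stream_a = make_stream()
        self.stream_b = make_stream()
        # Shared head joining both streams: one module, one weights file.
        self.head = nn.Linear(32 + 32, num_classes)

    def forward(self, x):
        # x carries both frames stacked on the channel axis: (N, 6, H, W).
        a, b = x[:, :3], x[:, 3:]
        feats = torch.cat([self.stream_a(a), self.stream_b(b)], dim=1)
        return self.head(feats)

# Usage: one input tensor carrying both frames, one set of weights.
frames = torch.randn(2, 6, 64, 64)
out = TwoStreamNet()(frames)
print(out.shape)  # torch.Size([2, 20])
```

Because everything lives in one `nn.Module`, `state_dict()` gives you the single weights file you asked about, and both streams train jointly through the shared head.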