[Q] YoloV3/V4: Where to position a sequence model to capture the spatio-temporal signal in video data?

Dear Community,

I am currently working on turning a YoloV3/V4 network into a video object detector. That is, we want to add a simple sequence model, like an RNN or a transformer, to the architecture to model the temporal signal across frames. We are currently not sure where to place the sequence model within the architecture, which looks roughly as follows:

  1. EfficientNet backbone
  2. Neck consisting of Spatial Pyramid Pooling and a Path Aggregation Network
  3. A prediction head consisting of a linear convolution

In particular, we feel that this is a little tricky, since YoloV3/V4 makes predictions directly on the feature maps using a linear convolution. That is, unlike a typical sequence model, it does not make predictions on a flattened input.
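For reference, a minimal sketch of what we mean by the convolutional prediction head (PyTorch, with made-up anchor/class/channel counts; the numbers are illustrative, not our actual config):

```python
import torch
import torch.nn as nn

# Toy numbers for illustration only: 3 anchors, 80 classes, 256 neck channels.
num_anchors, num_classes, neck_channels = 3, 80, 256

# YOLO-style head: a 1x1 ("linear") convolution applied over the feature map,
# so every grid cell gets its own prediction vector.
head = nn.Conv2d(neck_channels, num_anchors * (5 + num_classes), kernel_size=1)

feat = torch.randn(1, neck_channels, 13, 13)  # one feature map from the neck
pred = head(feat)
# One (objectness, x, y, w, h) + class-score vector per anchor per grid cell.
print(pred.shape)  # torch.Size([1, 255, 13, 13])
```

Note that the output keeps the spatial grid layout, which is exactly what makes naively swapping in a flat sequence model awkward.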

There are a number of approaches that we think could work, but each comes with its own problems:

  1. Replace the prediction head with an RNN: flatten the output of the preceding layer, feed it to the RNN, and predict Nclasses x (class_prob, x, y, height, width). However, the output can grow quickly: with 1000 classes there are 1000 * 5 = 5000 predictions. Can RNNs handle outputs of this size? Loss adjustments may be necessary, but that seems doable.

  2. Add an RNN right after the backbone: flatten the output of the backbone and feed it to the RNN. Here the problem is how to adjust the output shape of the RNN for the layers deeper in the architecture, i.e., the neck. Furthermore, the network then only aggregates the spatio-temporal signal up to the backbone and not beyond, which may not be ideal.
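To make option 2 concrete, here is a rough sketch of the flatten -> RNN -> reshape round trip that would be needed so the neck still receives a (B, C, H, W) tensor. All sizes are toy assumptions; a real backbone map would make the flattened GRU enormous, which is part of the problem:

```python
import torch
import torch.nn as nn

# Toy sizes only; a realistic map (e.g. 256x13x13) flattened into a GRU
# would need hidden weights of ~43k x 43k, which is impractical.
B, T, C, H, W = 2, 4, 32, 4, 4
frames = torch.randn(B, T, C, H, W)  # per-frame backbone features

# hidden_size must equal C*H*W so the output can be reshaped for the neck.
gru = nn.GRU(input_size=C * H * W, hidden_size=C * H * W, batch_first=True)

x = frames.reshape(B, T, C * H * W)      # flatten each frame's feature map
out, _ = gru(x)                          # (B, T, C*H*W)
fused = out[:, -1].reshape(B, C, H, W)   # last time step, back to conv layout
print(fused.shape)  # torch.Size([2, 32, 4, 4])
```

Tying hidden_size to C*H*W is what makes the reshape work, but it also causes the parameter blow-up; a recurrence that operates on the feature map directly (e.g. a ConvLSTM-style cell) might sidestep the flattening altogether, though we have not tried that.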

If there is a more suitable approach, or a better place to put the RNN within the architecture, please let us know.

All the best,