For my video action recognition model, I am using I3D network as a feature extractor.
I am passing a clip of 64 RGB frames to the network and taking the output of one of the intermediate layers (i.e. Mixed_4f.b3b.conv3d) as my feature map.
In the original RGB frames, I know the bounding box coordinates of all the objects. Is it possible to get the corresponding bounding box locations in the feature map space?
Is there a 1-to-1 pixel mapping between the original frame and the feature map?
For example, my input is B x 64 x 3 x 400 x 400 [B x T x C x W x H] and my output is B x 512 x 16 x 25 x 25 [B x F x T/4 x W/16 x H/16].
Since the width and height have been reduced to 1/16, can I interpolate and estimate the coordinates of the objects in the feature map?
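Here is a rough sketch of what I had in mind, assuming an effective spatial stride of 16 (400 / 25) and boxes given as (x1, y1, x2, y2) in pixel coordinates (the stride value and the example boxes are just my assumptions):

```python
import torch

# Project boxes from the 400 x 400 input frames onto the 25 x 25 feature grid
# by scaling with the assumed spatial stride.
stride = 400.0 / 25.0          # = 16
boxes_px = torch.tensor([[50.0, 60.0, 200.0, 220.0],
                         [10.0, 10.0, 390.0, 390.0]])

boxes_feat = boxes_px / stride                     # scale into feature-map units
boxes_feat = boxes_feat.clamp(min=0.0, max=25.0)   # keep coordinates on the grid
print(boxes_feat)
```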
I wanted to do something similar. Yes, you can do that. There are two main operations for this: RoI pooling and RoI align. Basically, each bounding box is a region of interest (RoI) that is first projected onto the feature map; the two operations differ in how features are then computed for that RoI.
I highly recommend this video, which explains how both operations work (it starts around the 20th minute). RoI align is actually simpler and performs better than RoI pooling, since it doesn't snap the projected region of interest onto the grid cells. Both are implemented in torchvision.
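For example, torchvision.ops.roi_align does the projection for you via its spatial_scale argument. It expects a 4D input, so a rough sketch for your 5D I3D features would be to fold the temporal dimension into the batch dimension first (the shapes and the dummy box below are placeholders matching your example, not part of the I3D code):

```python
import torch
from torchvision.ops import roi_align

# Feature map from Mixed_4f: (B, 512, 16, 25, 25) for a (B, 3, 64, 400, 400) input,
# i.e. spatial stride 16.
B, C, T, H, W = 1, 512, 16, 25, 25
feats = torch.randn(B, C, T, H, W)

# roi_align expects 4D input (N, C, H, W), so fold time into the batch dimension.
feats_2d = feats.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)

# One dummy box in input-pixel coordinates (x1, y1, x2, y2),
# repeated for every (batch, time) slice.
box = torch.tensor([[50.0, 60.0, 200.0, 220.0]])
boxes = [box for _ in range(B * T)]   # list: one (N_i, 4) tensor per image

# spatial_scale=1/16 projects the pixel coordinates onto the feature-map grid.
roi_feats = roi_align(feats_2d, boxes, output_size=(7, 7),
                      spatial_scale=1.0 / 16, aligned=True)
print(roi_feats.shape)  # (B * T * 1, 512, 7, 7)
```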
Hi, I am trying to solve the same problem, but my question is: can we determine the action of each individual object bounding box, given its location and the I3D features of the full frame?