How to get the bounding boxes in feature map space?

For my video action recognition model, I am using an I3D network as a feature extractor.

I am passing a clip of 64 RGB frames to the network and taking the output of one of the intermediate layers (i.e., Mixed_4f.b3b.conv3d) as my feature map.

In the original RGB frames, I know the bounding box coordinates of all the objects. Is it possible to get the corresponding bounding box locations in the feature map space?

Is there a 1-to-1 pixel mapping between the original frame and the feature map?

For example, my input is B x 64 x 3 x 400 x 400 [B x T x C x W x H] and my output is B x 512 x 16 x 25 x 25 [B x F x T/4 x W/16 x H/16].

Since the width and height have been reduced to 1/16, can I interpolate and estimate the coordinates of the objects in the feature map?
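
In other words, something along these lines (a rough sketch, not tested; the (x1, y1, x2, y2) pixel box format, the helper name, and the stride of 16 are assumptions based on my shapes above):

```python
import torch

# Hypothetical helper: boxes are [N, 4] tensors in (x1, y1, x2, y2) pixel coords
def boxes_to_feature_space(boxes_px: torch.Tensor, stride: int = 16) -> torch.Tensor:
    """Divide pixel-space box coordinates by the network's spatial stride."""
    return boxes_px / stride

boxes_px = torch.tensor([[32.0, 48.0, 160.0, 240.0]])  # one box in the 400x400 frame
print(boxes_to_feature_space(boxes_px))                # tensor([[ 2.,  3., 10., 15.]])
```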

I wanted to do something similar. Yes, you can do that. There are two main operations for this: ROI-pooling and ROI-aligning. Basically, each bounding box is a certain region of interest (ROI), which is first projected onto the feature map. The two operations differ in how features are computed for a given ROI.

I highly recommend this video which explains how both operations work (it starts around the 20th minute). ROI-aligning is actually simpler and performs better than ROI-pooling, since it doesn’t snap the projected region of interest onto the grid cells. Both are implemented in torchvision.
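
Here is a minimal sketch of how you could apply torchvision's `roi_align` to a feature map shaped like yours. It expects 4D input, so the temporal dimension is folded into the batch here; the box values, output size, and one-box-per-frame setup are just placeholder assumptions:

```python
import torch
from torchvision.ops import roi_align

B, C, T, H, W = 2, 512, 16, 25, 25
feats = torch.randn(B, C, T, H, W)           # stand-in for your B x 512 x 16 x 25 x 25 features

# roi_align works on (N, C, H, W) tensors, so fold time into the batch: (B*T) x C x H x W
feats_2d = feats.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)

# One list entry per folded frame; boxes are (x1, y1, x2, y2) in the original
# 400x400 pixel coordinates -- spatial_scale=1/16 projects them onto the 25x25 map
boxes = [torch.tensor([[32.0, 48.0, 160.0, 240.0]]) for _ in range(B * T)]

pooled = roi_align(feats_2d, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, aligned=True)
print(pooled.shape)  # (total number of boxes) x 512 x 7 x 7
```

Passing the original pixel coordinates together with `spatial_scale` lets torchvision handle the projection onto the feature map, so you don't have to rescale the boxes yourself.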

I also recommend this Stackoverflow post that explains how ROI-align works in PyTorch.

Thanks for the answer @Niels_PyTorch. I just tried out ROI-align and it seems to be serving my purpose well.

I will update this thread with any new learnings/challenges with ROI-align.