To get the activation maps, you could use forward hooks as described here.
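A minimal sketch of the forward-hook approach, in case it helps. The toy `nn.Sequential` model, the layer names `conv1`/`conv2`, and the `save_activation` helper are made up for illustration and are not taken from your code; only `register_forward_hook` is the actual PyTorch API.

```python
import torch
import torch.nn as nn

# Hypothetical toy model, used only to demonstrate the hooks.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
)

# Filled by the hooks during the forward pass.
activations = {}

def save_activation(name):
    # Returns a hook that stores the layer output under the given name.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register one hook per layer of interest and keep the handles for cleanup.
handles = [
    model[0].register_forward_hook(save_activation("conv1")),
    model[2].register_forward_hook(save_activation("conv2")),
]

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    out = model(x)

for name, act in activations.items():
    print(name, tuple(act.shape))

# Remove the hooks once the activations are collected.
for handle in handles:
    handle.remove()
```

Note that the second map is spatially smaller than the first because of the stride of 2, so its indices no longer line up one-to-one with the input pixels.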
However, the pixel locations in these maps might not correspond directly to the output locations; whether they do depends on the architecture of your model.
E.g. convolution layers use filters with a specific kernel size, stride, dilation, etc., so you would have to calculate the receptive field of the output locations for each activation map.
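As a rough sketch of that calculation, the receptive field can be tracked layer by layer along one spatial axis. This assumes a plain chain of convolution or pooling layers, each described by a hypothetical `(kernel, stride, padding, dilation)` tuple; the function names are invented for illustration, and branches, skip connections, or transposed convolutions would need extra care.

```python
def receptive_field(layers):
    """Track the receptive field through a chain of layers along one axis.

    layers: iterable of (kernel, stride, padding, dilation) tuples.
    Returns (size, jump, offset):
      size   - number of input pixels seen by one output location
      jump   - distance in input pixels between neighboring output locations
      offset - left-most input index seen by output location 0
               (negative values fall into the padding)
    """
    size, jump, offset = 1, 1, 0
    for kernel, stride, padding, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        size += (effective_kernel - 1) * jump
        offset -= padding * jump
        jump *= stride
    return size, jump, offset


def input_span(index, size, jump, offset):
    """Inclusive range of input indices seen by the given output index."""
    start = offset + index * jump
    return start, start + size - 1


# Two 3x3 convolutions with padding 1, the second with stride 2.
layers = [(3, 1, 1, 1), (3, 2, 1, 1)]
size, jump, offset = receptive_field(layers)
print(size, jump, offset)                    # 5 2 -2
print(input_span(3, size, jump, offset))     # (4, 8)
```

For 2D activation maps you would apply the same calculation separately to the height and width axes, and clip the resulting span to the valid image range, since indices below 0 or beyond the image size refer to padding.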