I am going over various object detection architectures and came across the Feature Pyramid Network (FPN). At first glance it looks very similar to the UNet architecture, which made me wonder whether the two differ in any major way beyond implementation details meant to reduce memory footprint or inference/training time.
After doing some digging around, I came across this answer, which says:
The Feature Pyramid Network (FPN) looks a lot like the [U-net](https://vitalab.github.io/deep-learning/2017/02/27/unet.html). The main difference is that there is multiple prediction layers: one for each upsampling layer. Like the U-Net, the FPN has laterals connection between the bottom-up pyramid (left) and the top-down pyramid (right). But, where U-net only copy the features and append them, FPN apply a 1x1 convolution layer before adding them. This allows the bottom-up pyramid called “backbone” to be pretty much whatever you want.
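To check that I am reading the quote correctly, here is a minimal pure-Python sketch of the two merge styles as I understand them. Everything in it is my own assumption for illustration: the helper names (`conv1x1`, `upsample2x`, `unet_merge`, `fpn_merge`, `constant_map`), the tiny channel counts, and the use of nearest-neighbour upsampling on the UNet side too (the original UNet uses a learned up-convolution, which I have left out for simplicity).

```python
# Feature maps are plain nested lists shaped [channels][height][width].
# All names and sizes below are hypothetical, chosen only for illustration.

def conv1x1(x, weight):
    """1x1 convolution: mix channels at every pixel. weight is [out_ch][in_ch]."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weight
    ]

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of every channel."""
    return [
        [[ch[i // 2][j // 2] for j in range(2 * len(ch[0]))]
         for i in range(2 * len(ch))]
        for ch in x
    ]

def unet_merge(skip, top):
    """UNet style: upsample the coarse map, then concatenate along channels."""
    return skip + upsample2x(top)

def fpn_merge(skip, top, lateral_weight):
    """FPN style: project the skip with a 1x1 conv, then add element-wise."""
    lateral = conv1x1(skip, lateral_weight)
    up = upsample2x(top)
    return [
        [[a + b for a, b in zip(row_l, row_u)]
         for row_l, row_u in zip(ch_l, ch_u)]
        for ch_l, ch_u in zip(lateral, up)
    ]

def constant_map(channels, size, value=1.0):
    """Build a [channels][size][size] map filled with one value."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]

# Assumed sizes: encoder stage with 3 channels at 4x4,
# coarser top-down map with 2 channels at 2x2.
skip = constant_map(3, 4)
top = constant_map(2, 2)

unet_out = unet_merge(skip, top)                      # 3 + 2 = 5 channels
lateral_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]   # projects 3 -> 2 channels
fpn_out = fpn_merge(skip, top, lateral_weight)        # 2 channels

print(len(unet_out), len(fpn_out))
```

In this sketch the width of the concatenated UNet output depends on the encoder's channel count, while the width of the FPN output is set by the lateral weights alone.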
Why is it that the UNet doesn’t allow us to choose any “backbone” as the encoding portion of the architecture?
Thank you!