Why doesn't UNet allow any backbone?

I am currently going over various architectures for object detection and came across the Feature Pyramid Network (FPN), which at first glance looks incredibly similar to the UNet architecture. This made me wonder whether it has any major differences beyond implementation details that reduce the memory footprint or inference/training time.

After doing some digging around I came across this answer, which says:

The Feature Pyramid Network (FPN) looks a lot like the [U-net](https://vitalab.github.io/deep-learning/2017/02/27/unet.html). The main difference is that there are multiple prediction layers: one for each upsampling layer. Like the U-Net, the FPN has lateral connections between the bottom-up pyramid (left) and the top-down pyramid (right). But where U-Net only copies the features and appends them, FPN applies a 1x1 convolution layer before adding them. This allows the bottom-up pyramid, called the "backbone", to be pretty much whatever you want.

Why is it that the UNet doesn't allow us to choose any "backbone" as the encoding portion of the architecture?

Thank you!

I can't speak for the author of the linked answer, but would guess

But where U-Net only copies the features and appends them, FPN applies a 1x1 convolution layer before adding them. This allows the bottom-up pyramid, called the "backbone", to be pretty much whatever you want.

means that the FPN implementation could be more flexible, since the skip connections are processed (so you might be able to change the number of channels, spatial size, etc.), while the (original) UNet implementation might have just concatenated the skip activations and would thus be more shape-dependent.
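To make the shape argument concrete, here is a minimal NumPy sketch (all shapes and names are made up for illustration) of why an FPN-style 1x1 lateral connection decouples the decoder from the backbone's channel count, while a UNet-style concatenation does not:

```python
import numpy as np

# Hypothetical feature maps from two different backbones: same spatial
# size (8x8) but different channel counts, e.g. different ResNet stages.
feat_a = np.random.randn(256, 8, 8)   # backbone A produces C=256
feat_b = np.random.randn(512, 8, 8)   # backbone B produces C=512
top_down = np.random.randn(64, 8, 8)  # top-down/decoder feature, C=64

def conv1x1(x, out_channels):
    """A 1x1 convolution is just a per-pixel linear map over channels:
    (C_in, H, W) -> (C_out, H, W) via a (C_out, C_in) weight matrix."""
    w = np.random.default_rng(0).standard_normal((out_channels, x.shape[0]))
    return np.einsum('oc,chw->ohw', w, x)

# FPN-style lateral: project each backbone feature to the decoder's
# channel count, then add. The decoder never sees the backbone's C_in.
merged_a = conv1x1(feat_a, 64) + top_down  # works for C=256
merged_b = conv1x1(feat_b, 64) + top_down  # also works for C=512

# UNet-style skip: concatenate along the channel axis. The decoder conv
# consuming this must be built for exactly C_skip + C_decoder channels,
# so swapping the backbone changes the decoder's expected input shape.
concat_a = np.concatenate([feat_a, top_down], axis=0)  # (320, 8, 8)
concat_b = np.concatenate([feat_b, top_down], axis=0)  # (576, 8, 8)
```

In both FPN branches the merged tensor is (64, 8, 8) regardless of the backbone, whereas the concatenated tensors have backbone-dependent channel counts, which is (as I read the quoted answer) the shape dependence that makes the original UNet less backbone-agnostic.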
In any case, that's just my interpretation of this answer, so you might want to follow up with the author.