This is more of a deep learning question than a PyTorch question. I am observing a weird phenomenon when training semantic segmentation networks. In short, my training program gives (roughly) the same final IoU scores no matter what backbone I use. I have tried ResNet (34, 50, 101), ResNeXt-50, WideResNet-50, WiderResNet-38, Xception, etc. This is unexpected, because a stronger backbone should give better performance when used for transfer learning, right?

I have observed that in the first few epochs stronger backbones do improve faster. However, they also reach a plateau sooner, so the weaker backbones eventually catch up. To give more details, all models are trained with the exact same configuration.

I’m trying to figure out what hyperparameter is causing this saturation. The one I suspect most is the batch size. For budget reasons, I can only use a batch size of 2. I know that most SoTA semantic segmentation networks are trained with a batch size of 8 or greater. However, a batch size of 2 has given better-than-SoTA results for low-complexity backbones (ResNet18, ResNet34). For ResNet34, the IoU score increases by only 0.2 when the batch size goes from 2 to 4. I can’t test how batch size affects the stronger backbones because of insufficient memory.

The loss function I use is the sum of categorical cross entropy and log Jaccard.
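A loss of this form can be sketched as follows, in plain Python rather than PyTorch so that the arithmetic is easy to follow. The details are assumptions made for illustration and may differ from the actual training code: the Jaccard term is taken as the negative log of a soft (probability-weighted) Jaccard index, computed per class and averaged over classes, and the name `ce_plus_log_jaccard` and the `eps` smoothing constant are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ce_plus_log_jaccard(logits, targets, eps=1e-7):
    """Cross entropy plus negative log soft Jaccard (assumed form).

    logits:  one list of class scores per pixel
    targets: one ground-truth class index per pixel
    """
    num_classes = len(logits[0])
    probs = [softmax(pixel) for pixel in logits]

    # Categorical cross entropy, averaged over pixels.
    ce = -sum(math.log(max(p[t], eps)) for p, t in zip(probs, targets))
    ce /= len(targets)

    # Soft Jaccard per class: intersection / union on probabilities.
    log_jaccard = 0.0
    for c in range(num_classes):
        inter = sum(p[c] for p, t in zip(probs, targets) if t == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for t in targets if t == c)
        union = pred_mass + true_mass - inter
        log_jaccard += -math.log((inter + eps) / (union + eps))
    log_jaccard /= num_classes  # assumption: mean over classes

    return ce + log_jaccard


# Two pixels, two classes: confident-correct vs. uninformative predictions.
confident = ce_plus_log_jaccard([[10.0, -10.0], [-10.0, 10.0]], [0, 1])
uniform = ce_plus_log_jaccard([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(confident, uniform)
```

In this sketch both terms approach zero as the predictions become confident and correct, so the Jaccard term adds a penalty that tracks the IoU metric directly on top of the per-pixel cross entropy.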
Edit 1: I forgot to mention I had to cast models to half() because of insufficient memory.