I’ve reimplemented a Caffe network in PyTorch, training with identical data splits, augmentation, loss weights, and learning parameters. Yet while I can get decent results, my network is still not nearly as good as the original Caffe model. The only way I can get somewhat close is by adding a learning rate decay, which the original paper doesn’t use; the authors train with plain Adam plus weight decay. When I do the same, my results are noticeably worse.
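For reference, here is a minimal sketch of the optimizer setup I described, using `torch.optim.Adam`'s built-in `weight_decay`. The model and the `lr`/`weight_decay` values are placeholders, not the paper's actual hyperparameters:

```python
import torch
import torch.nn as nn

# Placeholder model; my real network mirrors the Caffe architecture.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

# Plain Adam with weight decay, as in the paper. Note that
# torch.optim.Adam folds weight_decay into the gradient (an L2 penalty),
# which is not the same as the decoupled decay of torch.optim.AdamW.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

criterion = nn.MSELoss()
x, y = torch.randn(4, 10), torch.randn(4, 1)

# One training step.
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

One thing I'm unsure about is whether Caffe's `weight_decay` semantics under its Adam solver match PyTorch's L2-penalty formulation at all.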
What could I be missing here?