Following the instructions in the DeepLab v2 repository, I was able to replicate the results given in the paper (namely, 65.85% mIOU on the validation set when trained on the train_aug set).
However, with the same setup in PyTorch, I can only get an under-performing net (~64% mean IOU). I have checked my code multiple times for errors and found none.
My question is: has anybody been able to replicate the stated performance in PyTorch?
If so, could you share snippets of your code?
There are usually subtle differences in input preprocessing that will affect the learning rate to be used, etc. Have you checked that the inputs to the network are preprocessed exactly the same way?
Yes. I am absolutely sure that the data ingestion is exactly the same as in Caffe. To make sure of this, I also trained a PyTorch model using the Caffe DeepLab model's data-layer output as input, so the input to both networks was identical.
Another concern might be the initialization. To keep things as similar as possible, I even converted Caffe DeepLab's 'init.caffemodel' (provided by Liang-Chieh Chen) to a PyTorch-compatible OrderedDict for initialization.
To use the 'poly' learning rate policy correctly, I am following the advice on this forum itself.
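For context, the 'poly' policy from the DeepLab paper decays the base learning rate as (1 - iter/max_iter)^power with power = 0.9. A minimal, framework-free sketch (the function name is mine, not the thread's actual code):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy from the DeepLab paper: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - float(cur_iter) / max_iter) ** power

# In a PyTorch training loop you would set this on each param group, e.g.:
#   for group in optimizer.param_groups:
#       group['lr'] = poly_lr(2.5e-4, it, 20000)

print(poly_lr(2.5e-4, 0, 20000))      # full base lr at iteration 0
print(poly_lr(2.5e-4, 10000, 20000))  # decayed lr at the halfway point
```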
Parameters like batch size, learning rate, weight decay, and momentum are also the same.
I was hoping to hear whether anyone else has been able to get the same performance. It's only a few hours of training with an easy setup.
Hmmm, that looks like a very well done analysis. I have to say that Caffe and PyTorch have slightly different formulations of momentum, especially around what momentum decay means. Other than that, at this point I can't think of much else, just evidence that we've trained a lot of networks on classification and detection and that they do match their original accuracy.
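To make that difference concrete, here is a small numerical sketch (my own illustration; check each framework's solver code for the authoritative update rules). Caffe folds the learning rate into the velocity, while PyTorch applies it at step time; the two coincide under a constant learning rate but diverge when the rate changes mid-training, as it does with the 'poly' schedule:

```python
# Caffe SGD:   v = mu * v - lr_t * g;   w = w + v
# PyTorch SGD: b = mu * b + g;          w = w - lr_t * b
def caffe_step(w, v, g, lr, mu=0.9):
    v = mu * v - lr * g
    return w + v, v

def pytorch_step(w, b, g, lr, mu=0.9):
    b = mu * b + g
    return w - lr * b, b

w1 = w2 = 1.0
v = b = 0.0
grads = [0.5, 0.5, 0.5]
lrs = [0.1, 0.05, 0.025]  # a decaying schedule, as with 'poly'
for g, lr in zip(grads, lrs):
    w1, v = caffe_step(w1, v, g, lr)
    w2, b = pytorch_step(w2, b, g, lr)
print(w1, w2)  # the two weights drift apart under a decaying lr
```

With a constant lr the two trajectories are identical; with a schedule, past gradients carry the old lr in Caffe but the current lr in PyTorch.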
Yes. I am doing the same transformation that the Caffe version does in its 'Image Seg Data' layer. Without any external data augmentation (by, say, rotation), the Caffe version is able to reach the said performance, so I have restricted myself to the same transforms as well.
I think you mean different learning rate for different layers.
Yes, I use a different learning rate for the fc8_${exp} layer (ten times the rate of the other layers), as given in the paper. The weight decay is kept at 0.0005, also as given in the paper.
You can see that the learning rate and weight decay for the bias differ from the learning rate and weight decay for the weights. Moreover, the bias is removed from fc8, etc.
If you are using the above code in Caffe, you will get different results from the PyTorch code unless you make the same changes on the PyTorch side.
Although, only the bias_filler has been set to type: 'constant' and value: 0, implying they initialize it with 0. They do have a bias for the last layer as well.
But I regret that I did not read the 'prototxt' files as closely as I should have. I found two things that I wasn't doing:
The learning rate for the bias is twice the learning rate of the corresponding weights in all layers.
The weight decay for all bias parameters is kept at 0.
I wish such details were not left out of the papers. There is no reason for them to be less important than some of the other implementation details that they share.
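For what it's worth, the two prototxt settings above can be mirrored in PyTorch with per-parameter options. This is a sketch under my own helper name, not the thread's actual code:

```python
# Split parameters into groups so biases get 2x the learning rate and zero
# weight decay, matching the DeepLab prototxt's lr_mult/decay_mult settings.
def make_param_groups(named_params, base_lr=2.5e-4, weight_decay=5e-4):
    weights, biases = [], []
    for name, p in named_params:
        (biases if name.endswith('bias') else weights).append(p)
    return [
        {'params': weights, 'lr': base_lr, 'weight_decay': weight_decay},
        {'params': biases, 'lr': 2 * base_lr, 'weight_decay': 0.0},
    ]

# Usage with a real model:
#   optimizer = torch.optim.SGD(make_param_groups(model.named_parameters()),
#                               lr=2.5e-4, momentum=0.9)
```

Passing a list of dicts like this to `torch.optim.SGD` gives each group its own lr and weight decay while sharing the other hyperparameters.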
I will soon add the github repository link.
I have to clean parts of the code.
Additionally, I think I may have found my error, thanks to @gaurav_pandey.
I hope to upload the code by Monday. Although I could not get the exact performance stated in the paper (even after incorporating @gaurav_pandey's suggestion), I will upload it with the best performance I could get.
The repository put up by @isht7 should work fine, as the results posted on that repository were verified by @isht7. A few minor bugs exist, which @isht7 is currently solving. I am sure @isht7 would be able to inform you better regarding deeplab-resnet.
Hi, I am the owner of the repo you have mentioned. There is a performance difference of about 3.25% mean IOU between the PyTorch and Caffe implementations (with the PyTorch implementation being worse: PyTorch's mean IOU is 71.13% and Caffe's is 74.39%). I am not sure why this difference occurs. There are some subtle differences (mentioned here in the readme), but I don't think they would cause a gap of this size.
I am creating a new optimizer after every 10 iterations (iter_size is 10 in the Caffe implementation), which causes the momentum to be lost. To prevent this, I tried this also, but it gave worse results. @smth, what do you think might be causing the difference? Thanks!
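As one data point on the iter_size issue: a freshly constructed optimizer starts with empty momentum buffers, and the effect of that can be simulated without any framework code (a sketch; the numbers and names are my own):

```python
# Why recreating the optimizer every iter_size steps hurts: the reset
# discards the momentum buffer's running average of past gradients.
mu, lr = 0.9, 0.1
grads = [0.5] * 6  # toy constant gradients

def run(reset_every=None):
    w, buf = 1.0, 0.0
    for i, g in enumerate(grads):
        if reset_every and i % reset_every == 0:
            buf = 0.0            # what a freshly constructed optimizer does
        buf = mu * buf + g       # PyTorch-style SGD momentum update
        w -= lr * buf
    return w

print(run())               # momentum buffer preserved across all steps
print(run(reset_every=3))  # buffer reset every 3 steps: smaller effective steps
```

A common workaround is to keep a single optimizer for the whole run and accumulate gradients instead: call loss.backward() every iteration (gradients add up in .grad), and call optimizer.step() followed by optimizer.zero_grad() only once per iter_size iterations.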