Following the instructions in the DeepLab v2 repository, I was able to replicate the results given in the paper (namely, 65.85% mIOU on the validation set when trained on the train_aug set).
However, with the same setup in PyTorch, I can only get an under-performing net (~64% mean IOU). I have checked my code multiple times for errors and found none.
My question is: has anybody been able to replicate the stated performance in PyTorch?
If so, could you share snippets of your code?
There are usually subtle differences in input preprocessing that will affect the learning rate to be used, etc. Have you checked that the inputs to the network are preprocessed exactly the same way?
Yes. I am absolutely sure that the data ingestion is exactly the same as in Caffe. To make sure of this, I also trained a PyTorch model using the Caffe DeepLab model's data-layer output as input, so the input to both networks was identical.
Another concern might be the initialization. To keep things as similar as possible, I even converted Caffe DeepLab's 'init.caffemodel' (provided by Liang-Chieh Chen) to a PyTorch-compatible OrderedDict for initialization.
To use the 'poly' learning rate policy correctly, I am following the advice on this forum itself.
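For context, the 'poly' policy from the DeepLab paper decays the base learning rate as (1 - iter/max_iter)^power with power = 0.9. A minimal, framework-free sketch (the function name is mine, not the thread's actual code):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy from the DeepLab paper: lr = base_lr * (1 - iter/max_iter)^power."""
    return base_lr * (1.0 - float(cur_iter) / max_iter) ** power

# In a PyTorch training loop you would set this on each param group, e.g.:
#   for group in optimizer.param_groups:
#       group['lr'] = poly_lr(2.5e-4, it, 20000)

print(poly_lr(2.5e-4, 0, 20000))      # full base lr at iteration 0
print(poly_lr(2.5e-4, 10000, 20000))  # decayed lr at the halfway point
```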
Parameters like batch size, learning rate, weight decay, and momentum are also the same.
I was hoping to hear whether anyone else has been able to get the same performance. It's only a few hours of training with an easy setup.
Hmmm, that looks like a very well done analysis. I have to say that Caffe and PyTorch have slightly different formulations of momentum, especially around what momentum decay means. Other than that, at this point I can't think of much else, just evidence that we've trained a lot of networks on classification and detection and that they do match their original accuracy.
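To make that difference concrete, here is a small numerical sketch (my own illustration; check each framework's solver code for the authoritative update rules). Caffe folds the learning rate into the velocity, while PyTorch applies it at step time; the two coincide under a constant learning rate but diverge when the rate changes mid-training, as it does with the 'poly' schedule:

```python
# Caffe SGD:   v = mu * v - lr_t * g;   w = w + v
# PyTorch SGD: b = mu * b + g;          w = w - lr_t * b
def caffe_step(w, v, g, lr, mu=0.9):
    v = mu * v - lr * g
    return w + v, v

def pytorch_step(w, b, g, lr, mu=0.9):
    b = mu * b + g
    return w - lr * b, b

w1 = w2 = 1.0
v = b = 0.0
grads = [0.5, 0.5, 0.5]
lrs = [0.1, 0.05, 0.025]  # a decaying schedule, as with 'poly'
for g, lr in zip(grads, lrs):
    w1, v = caffe_step(w1, v, g, lr)
    w2, b = pytorch_step(w2, b, g, lr)
print(w1, w2)  # the two weights drift apart under a decaying lr
```

With a constant lr the two trajectories are identical; with a schedule, past gradients carry the old lr in Caffe but the current lr in PyTorch.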
Yes. I am doing the same transformation that the Caffe version does in its 'Image Seg Data' layer. Without any external data augmentation (by, say, rotation), the Caffe version is able to reach the said performance, so I have restricted myself to the same transforms as well.
I think you mean different learning rate for different layers.
Yes, I use a different learning rate for the fc8_${exp} layer (ten times the rate of the other layers), as given in the paper. The weight decay is kept at 0.0005, also as given in the paper.
You can see that the learning rate and weight decay for the bias differ from the learning rate and weight decay for the weights. Moreover, the bias is removed from fc8, etc.
If you are using the above code in Caffe, you will get different results from the PyTorch code unless you make the same changes on the PyTorch side.
Although, only the bias_filler has been set to type: 'constant' and value: 0, implying they initialize it with 0. They do have a bias for the last layer as well.
But I regret that I did not read the 'prototxt' files as closely as I should have. I found two things that I wasn't doing:
The learning rate for the bias is twice the learning rate of the corresponding weights in all layers.
The weight decay for all bias parameters is kept at 0.
I wish such details were not left out of the papers. There is no reason for them to be less important than some of the other implementation details that they share.
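For what it's worth, the two prototxt settings above can be mirrored in PyTorch with per-parameter options. This is a sketch under my own helper name, not the thread's actual code:

```python
# Split parameters into groups so biases get 2x the learning rate and zero
# weight decay, matching the DeepLab prototxt's lr_mult/decay_mult settings.
def make_param_groups(named_params, base_lr=2.5e-4, weight_decay=5e-4):
    weights, biases = [], []
    for name, p in named_params:
        (biases if name.endswith('bias') else weights).append(p)
    return [
        {'params': weights, 'lr': base_lr, 'weight_decay': weight_decay},
        {'params': biases, 'lr': 2 * base_lr, 'weight_decay': 0.0},
    ]

# Usage with a real model:
#   optimizer = torch.optim.SGD(make_param_groups(model.named_parameters()),
#                               lr=2.5e-4, momentum=0.9)
```

Passing a list of dicts like this to `torch.optim.SGD` gives each group its own lr and weight decay while sharing the other hyperparameters.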
I will soon add the github repository link.
I have to clean parts of the code.
Additionally, I think I may have found my error, thanks to @gaurav_pandey.
I hope to upload the code by Monday. Although I could not get the exact performance stated in the paper (even after incorporating @gaurav_pandey's suggestion), I will upload it with the best performance I could get.
The repository put up by @isht7 should work fine, as the results posted on that repository were verified by @isht7. A few minor bugs exist, which @isht7 is currently solving. I am sure @isht7 would be able to inform you better regarding deeplab-resnet.
Hi, I am the owner of the repo you have mentioned. There is a performance difference of about 3.25% mean IOU between the PyTorch and Caffe implementations (with the PyTorch implementation being worse: PyTorch's mean IOU is 71.13% and Caffe's is 74.39%). I am not sure why this difference occurs. There are some subtle differences (mentioned here in the readme), but I don't think they would cause a gap of this size.
I am creating a new optimizer after every 10 iterations (iter_size is 10 in the Caffe implementation), which causes the momentum to be lost. To prevent this, I tried this also, but it gave worse results. @smth, what do you think might be causing the difference? Thanks!
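As one data point on the iter_size issue: a freshly constructed optimizer starts with empty momentum buffers, and the effect of that can be simulated without any framework code (a sketch; the numbers and names are my own):

```python
# Why recreating the optimizer every iter_size steps hurts: the reset
# discards the momentum buffer's running average of past gradients.
mu, lr = 0.9, 0.1
grads = [0.5] * 6  # toy constant gradients

def run(reset_every=None):
    w, buf = 1.0, 0.0
    for i, g in enumerate(grads):
        if reset_every and i % reset_every == 0:
            buf = 0.0            # what a freshly constructed optimizer does
        buf = mu * buf + g       # PyTorch-style SGD momentum update
        w -= lr * buf
    return w

print(run())               # momentum buffer preserved across all steps
print(run(reset_every=3))  # buffer reset every 3 steps: smaller effective steps
```

A common workaround is to keep a single optimizer for the whole run and accumulate gradients instead: call loss.backward() every iteration (gradients add up in .grad), and call optimizer.step() followed by optimizer.zero_grad() only once per iter_size iterations.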