Opacus with large clipping norm performs differently from normal network

Jun_Wan · November 13, 2021, 7:59pm

Hi, I compared two tests:

resnet20 on cifar10 with privacy-engine, the clipping norm is set to 10M. This should be equivalent to not doing clipping at all.
resnet20 on cifar10 without privacy-engine (noise-multiplier is set as 0), with exactly the same parameters as example 1.

Test 2 soon reached 92% accuracy while test 1 struggled to reach 85%. I then wrote another test where we created two models for the two above tests and made them train on the same data (with the same trainloader) simultaneously. Model 1 has optimizer 1 which is attached to a privacy-engine, while model 2 has optimizer 2 which is just a normal SGD optimizer.

The code looks something like this:

loss1.backward()
loss2.backward()

optimizer1.step()
optimizer2.step()

Before we call the step() functions, the param.grad are exactly the same between the two models. However, after we called the step() functions, there are approximately a 3% difference between the param.grad of the two models.

Is this because pytorch’s default way of computing gradient is different from opacus even when the clipping value is 10 million? Or is it because of accuracy loss during opacus computations?

ilyamironov · November 15, 2021, 6:28am

Are you sure you use the same exact network in both cases? The “canonical” Resnet20 includes batch normalization, which is incompatible with DP-SGD.

Jun_Wan · November 15, 2021, 9:41pm

Yes. We called the opacus’ convert_batchnorm_modules() function for both models. All the batchnorm layers are converted to groupnorm layers.

We also use model2.load_state_dict(copy.deepcopy(model1.state_dict())) at the beginning to make sure they start with the same network parameters.

karthikprasad · November 26, 2021, 11:32pm

Hi @Jun_Wan, is it possible your noise multiplier was non-zero in your first case? If yes, do you mind sharing a notebook with the above issue? This will help us debug further.

ashkan_software · December 7, 2021, 11:08pm

Hi @Jun_Wan. Has your issue been resolved? If not, do you mind sharing a notebook so we can look into this?

Jun_Wan · December 8, 2021, 11:19pm

Hi, sorry. For some reasons, the previous notification emails went to the trash folder. I just noticed them today.

I posted our test code on GitHub - junwan0224/test-code. We used the compare.py file compare the gradients. One model has privacy engine while the other does not. We feed them the exact same data, but the resulted gradients are different.

For a previous question: yes, our noise multiplier is set as zero. Please let me know if there is any other question. Thank you for helping!

ashkan_software · January 3, 2022, 7:36pm

Hi @Jun_Wan

Thanks for sharing your code and thanks again for using Opacus!

I looked at your code and I suspect the issue might be because of the learning rate. It seems that the learning rate for the 2 optimizers are different:

on line 155 you define a LR scheduler for optimizer1 but not for optimizer2. So technically the two learnings are going through different training surfaces. You wrote

and this kind of shows that, I think.

I know the test is for resnet20, but in case you use other resnets, make sure to also include optimizer2 on line 169.
lines 224-228 seem to only be for optimizer1

Last but not least, we have released Opacus 1.0 recently. So, when you have time, please migrate to Opacus 1.0 for more features: Release Opacus v1.0.0 · pytorch/opacus · GitHub

Please let us know if that didn’t solve the issue and we can investigate further.