Opacus with a large clipping norm performs differently from a normal network

Hi, I compared two tests:

  1. ResNet20 on CIFAR-10 with the privacy engine attached and the clipping norm set to 10M. This should be equivalent to not clipping at all (a rough setup sketch follows the list).

  2. ResNet20 on CIFAR-10 without the privacy engine (noise multiplier set to 0), with exactly the same parameters as test 1.
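
Roughly, the setup looked like the sketch below (batch size, learning rate, and alphas are placeholders, this uses the pre-1.0 PrivacyEngine API, and make_resnet20() just stands in for our model constructor):

import torch
from opacus import PrivacyEngine

# Test 1: ResNet20 + PrivacyEngine, huge clipping norm, zero noise
model1 = make_resnet20()  # placeholder for our ResNet20 constructor
optimizer1 = torch.optim.SGD(model1.parameters(), lr=0.1, momentum=0.9)

privacy_engine = PrivacyEngine(
    model1,
    batch_size=128,
    sample_size=50_000,           # CIFAR-10 training set size
    alphas=range(2, 32),
    noise_multiplier=0.0,         # no noise added
    max_grad_norm=10_000_000.0,   # the 10M clipping norm
)
privacy_engine.attach(optimizer1)

# Test 2: same model and optimizer settings, no privacy engine
model2 = make_resnet20()
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1, momentum=0.9)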

Test 2 quickly reached 92% accuracy while test 1 struggled to reach 85%. I then wrote another test in which we created two models, one for each of the tests above, and trained them on the same data (with the same trainloader) simultaneously. Model 1 has optimizer 1, which is attached to a privacy engine, while model 2 has optimizer 2, which is just a normal SGD optimizer.

The code looks something like this:

loss1.backward()
loss2.backward()

optimizer1.step()
optimizer2.step()

Before we call the step() functions, the param.grad values are exactly the same between the two models. However, after we call the step() functions, there is approximately a 3% difference between the param.grad values of the two models.
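
For reference, the per-layer difference was measured with something like the helper below (the function name and the exact metric are just illustrative):

# compare param.grad between the two models, per parameter tensor
def relative_grad_diff(m1, m2):
    out = {}
    for (name, p1), (_, p2) in zip(m1.named_parameters(), m2.named_parameters()):
        if p1.grad is None or p2.grad is None:
            continue
        # L2 norm of the difference, relative to model 2's gradient norm
        rel = (p1.grad - p2.grad).norm() / p2.grad.norm().clamp_min(1e-12)
        out[name] = rel.item()
    return out

print(relative_grad_diff(model1, model2))  # ~0 before step(), ~3% after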

Is this because PyTorch’s default way of computing gradients is different from Opacus’s, even when the clipping value is 10 million? Or is it because of numerical accuracy loss during Opacus’s computations?

Are you sure you are using exactly the same network in both cases? The “canonical” ResNet20 includes batch normalization, which is incompatible with DP-SGD.

Yes. We called Opacus’s convert_batchnorm_modules() function on both models, so all the BatchNorm layers are converted to GroupNorm layers.

We also call model2.load_state_dict(copy.deepcopy(model1.state_dict())) at the beginning to make sure both models start with the same network parameters.
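
Concretely, the initialization looks something like this (assuming the pre-1.0 import path for the BatchNorm conversion utility):

import copy
from opacus.utils.module_modification import convert_batchnorm_modules

# replace every BatchNorm layer with GroupNorm in both models
model1 = convert_batchnorm_modules(model1)
model2 = convert_batchnorm_modules(model2)

# start both models from identical weights
model2.load_state_dict(copy.deepcopy(model1.state_dict()))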

Hi @Jun_Wan, is it possible your noise multiplier was non-zero in your first case? If yes, do you mind sharing a notebook with the above issue? This will help us debug further.