Text Classification tutorial without frozen layers

Hi,

I tried the tutorial notebook on Text Classification. It works well. However, I don’t understand why, if I don’t freeze any layers, there is a problem in the training step. More specifically:

/usr/local/lib/python3.7/dist-packages/opacus/optimizers/optimizer.py in clip_and_accumulate(self)
    397             g.view(len(g), -1).norm(2, dim=-1) for g in self.grad_samples
    398         ]
--> 399         per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)
    400         per_sample_clip_factor = (self.max_grad_norm / (per_sample_norms + 1e-6)).clamp(
    401             max=1.0

RuntimeError: stack expects each tensor to be equal size, but got [8] at entry 0 and [1] at entry 
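For reference, this is what I mean by not freezing anything: the error goes away if I keep, for example, the embedding layers frozen (a minimal sketch, assuming the tutorial’s BERT model; the exact layers frozen in the tutorial may differ):

# Freeze the embedding layers; parameters with requires_grad=False are skipped
# by Opacus, so their grad_sample never reaches clip_and_accumulate()
for p in model.bert.embeddings.parameters():
    p.requires_grad = False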

Any idea? Thanks

Hello @long21wt

Thank you for reporting this. This is likely a bug in our tutorial. Do you mind sending us your full stack trace, reproducing the issue with our template Colab, and posting the link here?
Please paste your Colab link here. Remember: SET IT TO PUBLIC :slight_smile:

Thank you. Here is the link:

As far as I know, it seems like you would need to modify the forward method of BERT (see lxuechen/private-transformers: make differentially private training of transformers easy (github.com)).
Also, RoBERTa works out of the box with Opacus in other experiments.
Best

Thanks for creating this. We are looking into this!

Hi,

After a while, I’m back to this issue. By printing the shapes of the model’s per-sample gradients:

# After loss.backward(), each parameter handled by Opacus carries a per-sample
# gradient in p.grad_sample; print its shape for every parameter
for n, p in model.named_parameters():
    print("{:50s} {}".format(n, list(p.grad_sample.shape) if hasattr(p, "grad_sample") else None))

I found that position_embeddings is what breaks the optimizer: its grad_sample has a batch dimension of 1 instead of 7, unlike the other parameters. Do you have any idea how to fix this?

_module.bert.embeddings.word_embeddings.weight     [7, 28996, 768]
_module.bert.embeddings.position_embeddings.weight [1, 512, 768]
_module.bert.embeddings.token_type_embeddings.weight [7, 2, 768]
_module.bert.embeddings.LayerNorm.weight           [7, 768]
_module.bert.embeddings.LayerNorm.bias             [7, 768]
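
One workaround I was thinking of trying (just a sketch, not verified): pass explicit per-example position_ids, so the position embedding layer sees an input with the full batch dimension instead of the shared [1, 512] buffer:

import torch

# Expand position_ids to [batch_size, seq_len]; input_ids / attention_mask /
# labels come from the training loop as in the tutorial
batch_size, seq_len = input_ids.shape
position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0).expand(batch_size, seq_len)
outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    position_ids=position_ids,
    labels=labels,
)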

I will try to take a look. In the meantime, I believe that functorch can alleviate the issue because it computes per-sample gradients in a different way (using the “no_op” version of the grad sample module, see e.g. https://github.com/pytorch/opacus/blob/main/examples/cifar10.py)
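
For example, something like this (a rough sketch; the grad_sample_mode argument and its accepted values depend on your Opacus version, so please check the linked cifar10 example for the exact API):

from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
# model / optimizer / train_loader as in the tutorial; the noise_multiplier and
# max_grad_norm values here are placeholders
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="functorch",  # assumption: supported in recent Opacus versions
)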