Text Classification tutorial without frozen layers


I tried the tutorial notebook on Text Classification. It works well. However, I don’t understand why training fails if I don’t freeze any layers. More specifically:

/usr/local/lib/python3.7/dist-packages/opacus/optimizers/optimizer.py in clip_and_accumulate(self)
    397             g.view(len(g), -1).norm(2, dim=-1) for g in self.grad_samples
    398         ]
--> 399         per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)
    400         per_sample_clip_factor = (self.max_grad_norm / (per_sample_norms + 1e-6)).clamp(
    401             max=1.0

RuntimeError: stack expects each tensor to be equal size, but got [8] at entry 0 and [1] at entry 
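A minimal sketch of what goes wrong inside clip_and_accumulate (the shapes here are assumptions mirroring the error message, not taken from the actual model): one parameter’s grad_sample has batch dimension 8 while another has batch dimension 1, so the per-parameter norm tensors cannot be stacked.

```python
import torch

# Two fake per-sample gradients: one with batch dim 8, one with batch dim 1.
grad_samples = [torch.randn(8, 768), torch.randn(1, 512, 768)]

# Same computation as Opacus' clip_and_accumulate: one L2 norm per sample,
# per parameter.
per_param_norms = [g.view(len(g), -1).norm(2, dim=-1) for g in grad_samples]
print([tuple(n.shape) for n in per_param_norms])  # [(8,), (1,)]

# Stacking fails because the per-parameter norm vectors have different lengths.
try:
    torch.stack(per_param_norms, dim=1)
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size ...
```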

Any idea? Thanks

Hello @long21wt

Thank you for reporting this. This is likely a bug in our tutorial. Do you mind sending us the full stack trace, along with a Colab based on our template, and posting the link here?
Please paste your Colab link here. Remember: SET IT TO PUBLIC :slight_smile:

Thank you. Here is the link:

As far as I know, it seems you would need to modify the forward method of BERT (see lxuechen/private-transformers: make differentially private training of transformers easy (github.com)).
RoBERTa, on the other hand, works out of the box with Opacus in other experiments.

Thanks for creating this. We are looking into this!


After a while, I’m back to this issue. By printing the per-sample gradient shape of each of the model’s parameters:

# Print each parameter's grad_sample shape (None if no per-sample gradient was recorded)
for n, p in model.named_parameters():
    print("{:50s} {}".format(n, list(p.grad_sample.shape) if hasattr(p, "grad_sample") else None))

I found that position_embeddings is what causes the problem for the optimizer. Do you have any idea how to fix this?

_module.bert.embeddings.word_embeddings.weight     [7, 28996, 768]
_module.bert.embeddings.position_embeddings.weight [1, 512, 768]
_module.bert.embeddings.token_type_embeddings.weight [7, 2, 768]
_module.bert.embeddings.LayerNorm.weight           [7, 768]
_module.bert.embeddings.LayerNorm.bias             [7, 768]
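A hypothetical illustration of why position_embeddings ends up with batch dimension 1 while the other parameters get the full batch size: BERT builds position_ids with shape [1, seq_len] and relies on broadcasting to share them across the batch, so hooks that infer the batch size from a layer’s input see batch size 1 for that embedding. The shapes below are taken from the output above; the two nn.Embedding layers are stand-ins, not the actual BERT modules.

```python
import torch
import torch.nn as nn

batch_size, seq_len, dim, vocab = 7, 512, 768, 28996

word_emb = nn.Embedding(vocab, dim)   # stand-in for word_embeddings
pos_emb = nn.Embedding(seq_len, dim)  # stand-in for position_embeddings

input_ids = torch.randint(0, vocab, (batch_size, seq_len))  # [7, 512] -- per-sample
position_ids = torch.arange(seq_len).unsqueeze(0)           # [1, 512] -- shared!

# position embeddings are broadcast over the batch dimension, so the
# position_embeddings layer only ever sees an input of batch size 1.
out = word_emb(input_ids) + pos_emb(position_ids)
print(pos_emb(position_ids).shape)  # torch.Size([1, 512, 768])
print(out.shape)                    # torch.Size([7, 512, 768])
```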

I will try to take a look. In the meantime, I believe functorch can alleviate the issue, because it computes per-sample gradients in a different way (using the “no_op” version of the grad sample module; see e.g. opacus/cifar10.py at main · pytorch/opacus · GitHub).
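For reference, a minimal sketch of functorch-style per-sample gradients, written against torch.func (where the functorch API now lives in recent PyTorch). This is not Opacus’ actual “no_op” code path, just the underlying idea: vmap a per-example loss gradient over the batch instead of relying on module hooks, so broadcast inputs don’t confuse the batch dimension.

```python
import torch
from torch.func import functional_call, grad, vmap

# Toy model; the shapes here are arbitrary, for illustration only.
model = torch.nn.Linear(4, 2)
params = {k: v.detach() for k, v in model.named_parameters()}

def loss_fn(params, x, y):
    # Compute the loss for a single example (hence the unsqueeze).
    out = functional_call(model, params, (x.unsqueeze(0),))
    return torch.nn.functional.cross_entropy(out, y.unsqueeze(0))

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# grad differentiates w.r.t. params; vmap maps it over the batch dimension
# of x and y, yielding one gradient per sample for every parameter.
per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)
print(per_sample_grads["weight"].shape)  # torch.Size([8, 2, 4])
print(per_sample_grads["bias"].shape)    # torch.Size([8, 2])
```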