No gradient in layers text classification tutorial

Hi!

I was running the text classification tutorial exactly as in opacus/building_text_classifier.ipynb at master · pytorch/opacus · GitHub, but I get the following error when I try to train:

AttributeError: The following layers do not have gradients: ['module.bert.encoder.layer.11.attention.self.query.weight', 'module.bert.encoder.layer.11.attention.self.query.bias', 'module.bert.encoder.layer.11.attention.self.key.weight', 'module.bert.encoder.layer.11.attention.self.key.bias', 'module.bert.encoder.layer.11.attention.self.value.weight', 'module.bert.encoder.layer.11.attention.self.value.bias', 'module.bert.encoder.layer.11.attention.output.dense.weight', 'module.bert.encoder.layer.11.attention.output.dense.bias', 'module.bert.encoder.layer.11.attention.output.LayerNorm.weight', 'module.bert.encoder.layer.11.attention.output.LayerNorm.bias', 'module.bert.encoder.layer.11.intermediate.dense.weight', 'module.bert.encoder.layer.11.intermediate.dense.bias', 'module.bert.encoder.layer.11.output.dense.weight', 'module.bert.encoder.layer.11.output.dense.bias', 'module.bert.encoder.layer.11.output.LayerNorm.weight', 'module.bert.encoder.layer.11.output.LayerNorm.bias', 'module.bert.pooler.dense.weight', 'module.bert.pooler.dense.bias', 'module.classifier.weight', 'module.classifier.bias']. Are you sure they were included in the backward pass?

Could someone help me understand why this is happening?
I’m on Ubuntu and am using Python 3.8.5.

cheers!

Based on cell 8, it seems you are freezing some layers and training only others:

trainable_layers = [model.bert.encoder.layer[-1], model.bert.pooler, model.classifier]
total_params = 0
trainable_params = 0

# freeze everything first
for p in model.parameters():
    p.requires_grad = False
    total_params += p.numel()

# then unfreeze only the selected layers
for layer in trainable_layers:
    for p in layer.parameters():
        p.requires_grad = True
        trainable_params += p.numel()

print(f"Total parameters count: {total_params}") # ~108M
print(f"Trainable parameters count: {trainable_params}") # ~7M

so I would assume that the frozen parameters do not have valid gradients.
However, I’m unsure where this message is raised from and whether it is actually an error, so could you explain the issue a bit more?
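
In the meantime, a quick way to check which trainable parameters actually received a gradient after a backward pass, using plain PyTorch (the forward/backward call here is just illustrative and assumes a dict-style batch from the training loop):

loss = model(**batch).loss   # any forward pass with labels from the training loop
loss.backward()
no_grad = [name for name, p in model.named_parameters()
           if p.requires_grad and p.grad is None]
print(no_grad)  # trainable parameters that did not receive a .grad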

Hmm, that makes sense. The issue arises when virtual_step() is called:
… in
optimizer.virtual_step()
… line 282, in virtual_step
self.privacy_engine.virtual_step()
… line 435, in virtual_step
self.clipper.clip_and_accumulate()
… line 179, in clip_and_accumulate
named_params=self._named_grad_samples(),
… line 263, in _named_grad_samples
which is where the error is thrown.

I’m unsure what virtual_step() does, but I assume it comes from a third-party library.
Do you know whether this method expects all .grad attributes to be set, and if so, could you filter the frozen parameters out when passing them to the optimizer?
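
Something like this minimal sketch is what I have in mind (assuming a standard torch.optim optimizer; AdamW and the learning rate are just placeholders):

from torch import optim

# hand only the unfrozen parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.AdamW(trainable, lr=1e-5)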

Hi @anna_l!
Thanks for your question and for taking an interest in Opacus.

I’d need some more info to be able to help, as I wasn’t able to reproduce the issue in my setup.

  • Can you please share which versions of transformers and opacus you are using?
  • Does the error happen on the first training iteration or later?

To comment on some of the discussion points above:

  • virtual_step() is a method defined in Opacus’s PrivacyEngine. It is a way to simulate large batches without a heavy memory footprint (see the sketch after this list).
  • In our tutorial we indeed freeze some layers, as correctly pointed out. However, the error above lists trainable layers as not having gradients, which is not what should happen (e.g. bert.encoder.layer.11 is bert.encoder.layer[-1], which we do train).
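
Here is a rough sketch of how the tutorial uses virtual_step() for gradient accumulation (the loop and batch handling are simplified, and names like N_ACCUMULATION_STEPS are illustrative; it assumes the PrivacyEngine has already been attached to optimizer):

for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss
    loss.backward()                  # computes per-sample gradients via Opacus hooks
    if (step + 1) % N_ACCUMULATION_STEPS == 0:
        optimizer.step()             # clip, add noise, and apply the accumulated update
        optimizer.zero_grad()
    else:
        optimizer.virtual_step()     # clip and accumulate, but defer the weight update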

Hi @ffuuugor, pardon the slow reply. It happens on the first training iteration, and my transformers version is 4.6.1.

Hey
Sorry, but I’m still having trouble reproducing the issue.
I’ve tried multiple package versions (opacus 0.13, 0.14, master), but none produce the error you’ve described.

Can you maybe share a Colab notebook with the error to help find the reason?

PS: While investigating this we found and fixed a pretty bad memory inefficiency, so thanks for pointing us in that direction :slight_smile: