Hi,
I have been working on an application where I need to partially freeze layers, which I do by masking the gradients of particular weights to zero in each epoch.
Because of this I removed gradient clipping while training a vit-b-16 model on ImageNet, since it always changed the gradients that need to stay 0. It turned out that the gradients were exploding and I received a NaN loss. How can I avoid this, i.e. keep the gradients from exploding while still freezing them? Is it possible to selectively apply gradient clipping to particular weights?
Yes, you can pass a subset to the clipping operator:
import itertools

import torch
import torchvision.models as models

model = models.resnet18()
# clip only the gradients of conv1 and fc
params_for_grad_clipping = itertools.chain(model.conv1.parameters(), model.fc.parameters())

x = torch.randn(1, 3, 224, 224)
out = model(x)
out.mean().backward()

print(model.conv1.weight.grad.abs().sum(), model.fc.weight.grad.abs().sum())
# tensor(42.4067) tensor(420.8640)

torch.nn.utils.clip_grad_value_(params_for_grad_clipping, clip_value=1e-7)
print(model.conv1.weight.grad.abs().sum(), model.fc.weight.grad.abs().sum())
# tensor(0.0009) tensor(0.0512)
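Note that clip_grad_value_ only clamps the gradients of the parameters you pass in (here conv1 and fc); the gradients of all other parameters are left untouched.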
Hi Patrick,
Thanks a lot for the reply, but let me rephrase my question: I need to partially freeze each layer. Below is the code snippet:
with torch.no_grad():
    # zero out the gradients of the frozen slices only
    layer.weight.grad[freeze_dim0, freeze_dim1] = 0
    if layer.bias is not None:
        layer.bias.grad[freeze_dim0] = 0
So I am freezing specific parts of the weights, not the whole weight tensor. Can I apply gradient clipping only to the remaining weights, whose gradients are still non-zero?
I don’t think this would be directly possible without workarounds such as creating custom trainable and frozen weights in custom modules.
Given that gradient clipping should not change gradients with a value of zero, you should be able to just apply it to the entire parameter.
If you want to compute e.g. the norm of the trainable part only, you could use torch.nn.utils.clip_grads_with_norm_ and pass the pre-calculated total_norm to this call.
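A minimal sketch of what that could look like (the toy layer, the max_norm value, and the freeze_* index tensors are placeholders standing in for your setup, and clip_grads_with_norm_ needs a recent PyTorch release):

import torch
import torch.nn as nn

# toy stand-in for a partially frozen layer
layer = nn.Linear(16, 16)
layer(torch.randn(4, 16)).mean().backward()

# placeholder indices for the frozen entries
freeze_dim0 = torch.tensor([0, 1])
freeze_dim1 = torch.tensor([2, 3])

with torch.no_grad():
    # zero the gradients of the frozen entries, as in the snippet above
    layer.weight.grad[freeze_dim0, freeze_dim1] = 0.0

# norm of the trainable part only: the zeroed entries contribute nothing to it
total_norm = layer.weight.grad.norm(2)

# scale the gradients with max_norm / (total_norm + eps); the zeros stay zero
torch.nn.utils.clip_grads_with_norm_([layer.weight], max_norm=1.0, total_norm=total_norm)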
Thanks! Got it, it works this way!