ValueError: Attempting to unscale FP16 gradients

Hello all, I am trying to train an LSTM in half precision. The LSTM takes an encoded input from a pre-trained autoencoder (not trained in fp16). I am using torch.amp instead of apex and scaling the losses as suggested in the documentation.

Here is my training loop -

    def train_model(self, model, dataloader, num_epochs):
        model.cuda()
        least_loss = 5
        model.train()
        optimizer = torch.optim.Adam(model.parameters(), lr =1e-5)
        scaler = amp.GradScaler()
        training_loss = []
        for i in range(0, num_epochs + 1):
            st = time.time()
            training_acc = 0
            epoch_loss = 0
            for _, (x, y) in enumerate(dataloader):
                optimizer.zero_grad()
                sst = time.time()
                x = x.float().half().cuda()
                x, out = self.autoencoder(x)
                x = x.permute(0,2, 1)
                model.init_Hidden()
                y = y.cuda()
                output = model(x)
                loss = self.criterion(output, y)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()

I call my model as -

lstm =lstm(features=1024, hidden_size=512, sequence_length=313, autoencoder=model).half().cuda()

I am getting the following error -

ValueError: Attempting to unscale FP16 gradients.

Could someone please tell me why this is happening?
TIA

You shouldn’t call half manually on the model or data.
Could you remove the half call here: x = x.float().half().cuda() and rerun your script?

@ptrblck thanks for replying
I thought we had to convert the model to half by calling model.half() for fp16 training. (I am using torch.amp from the 1.5 nightly builds.) Also, if .half() is called on only the data or only the model, it raises an error saying that the weight type and input type should be the same (as one of them is half).

I tried running the script without calling .half() and CUDA ran out of memory. With .half(), the model did not run out of memory but raised the same unscaling error at the scaler.step(optimizer) line.
I also ran a similar training loop and got the same error (I explicitly called model.half() and data.half()).

torch.cuda.amp.autocast will use mixed-precision training and cast necessary tensors under the hood for you.
From the docs:

When entering an autocast-enabled region, Tensors may be any type. You should not call .half() on your model(s) or inputs when using autocasting.
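As an illustration of that advice, here is a minimal sketch of a training step where neither the model nor the inputs are cast with .half(). The toy nn.Linear model, the random data, and the use_cuda switch are assumptions made for the example and are not taken from the original post; with enabled=False both autocast and GradScaler become no-ops, so the sketch also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data (not from the original post).
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = nn.Linear(16, 4).to(device)            # parameters stay float32, no .half()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# With enabled=False the scaler is a pass-through, so this also runs without a GPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16, device=device)          # float32 input, no .half()
y = torch.randint(0, 4, (8,), device=device)

losses = []
for _ in range(3):
    optimizer.zero_grad()
    # Only the forward pass and the loss computation run under autocast.
    with torch.cuda.amp.autocast(enabled=use_cuda):
        output = model(x)
        loss = criterion(output, y)
    # backward() and the optimizer step happen outside the autocast region.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())
```

Because the parameters are float32, their gradients are float32 as well, which is what scaler.step() expects when it unscales them.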

Thanks a lot @ptrblck . works like a charm

I get this error too:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-4d12a5af1b3f> in <module>
----> 1 trainer.run_epoch()

~/ccai/github/dev_omni/omnigan/omnigan/trainer.py in run_epoch(self)
    655                     param.requires_grad = True
    656 
--> 657                 self.update_D(multi_domain_batch)
    658 
    659             # -------------------------------

~/ccai/github/dev_omni/omnigan/omnigan/trainer.py in update_D(self, multi_domain_batch, verbose)
   1133                 d_loss = self.get_D_loss(multi_domain_batch, verbose)
   1134             self.grad_scaler_d.scale(d_loss).backward()
-> 1135             self.grad_scaler_d.step(self.d_opt)
   1136             self.grad_scaler_d.update()
   1137         else:

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in step(self, optimizer, *args, **kwargs)
    287 
    288         if optimizer_state["stage"] is OptState.READY:
--> 289             self.unscale_(optimizer)
    290 
    291         assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in unscale_(self, optimizer)
    238         found_inf = torch.full((1,), 0.0, dtype=torch.float32, device=self._scale.device)
    239 
--> 240         optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
    241         optimizer_state["stage"] = OptState.UNSCALED
    242 

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    185                 if param.grad is not None:
    186                     if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 187                         raise ValueError("Attempting to unscale FP16 gradients.")
    188                     else:
    189                         torch._amp_non_finite_check_and_unscale_(param.grad,

ValueError: Attempting to unscale FP16 gradients.

The piece of code yielding this error is:

with autocast():
    d_loss = self.get_D_loss(multi_domain_batch, verbose)
self.grad_scaler_d.scale(d_loss).backward()
self.grad_scaler_d.step(self.d_opt)
self.grad_scaler_d.update()

I’m using PyTorch 1.6 and not calling half() on anything. Maybe this context can help: I’m training a GAN model, and the exact same procedure on the generator’s loss, optimizer and scaler works without error.

The generator’s and discriminator’s optimizers are Adam optimizers from torch.optim, and grad_scaler_d and grad_scaler_g are GradScaler() instances from torch.cuda.amp. @ptrblck where do I start debugging beyond looking for .half() calls?
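One possible starting point, sketched under assumptions: the traceback above shows that the error is raised when param.grad.dtype is torch.float16, so listing which parameters handed to the optimizer are stored in float16 (or already carry float16 gradients) can narrow the search. The helper name find_fp16_params and the toy modules below are hypothetical and not part of PyTorch or of the code in this thread.

```python
import torch
import torch.nn as nn


def find_fp16_params(model, optimizer):
    """Return names of optimizer parameters that are float16 or have float16 grads.

    Hypothetical debugging helper: it mirrors the dtype check shown in the
    traceback, but reports names instead of raising.
    """
    names = {id(p): n for n, p in model.named_parameters()}
    offenders = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            grad_is_fp16 = p.grad is not None and p.grad.dtype == torch.float16
            if p.dtype == torch.float16 or grad_is_fp16:
                offenders.append(names.get(id(p), "<not in model>"))
    return offenders


# Toy usage: one float32 module and one that was cast with .half().
good = nn.Linear(4, 2)
bad = nn.Linear(4, 2).half()
good_report = find_fp16_params(good, torch.optim.SGD(good.parameters(), lr=0.1))
bad_report = find_fp16_params(bad, torch.optim.SGD(bad.parameters(), lr=0.1))
print(good_report)  # []
print(bad_report)   # ['weight', 'bias']
```

Calling such a helper right before scaler.step(optimizer) would show whether some submodule was cast to float16 somewhere other than an explicit .half() call, for example via .to(torch.float16) or when loading a checkpoint.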

Could you post the model definitions and the general workflow, i.e. how the losses are calculated, which optimizers are used etc. so that we could help debugging?

It’s quite complex (here) so I can’t really paste it all, but for some reason the culprit seems to be changing requires_grad back and forth for the discriminator:

# ------------------------------
# -----  Update Generator  -----
# ------------------------------
if self.d_opt is not None:
    for param in self.D.parameters():
        # continue
        param.requires_grad = False

self.update_G(batch)

# ----------------------------------
# -----  Update Discriminator  -----
# ----------------------------------

# unfreeze params of the discriminator
for param in self.D.parameters():
    # continue
    param.requires_grad = True

self.update_D(batch)

The error disappears if I uncomment the continue statements or, equivalently, if I comment out the two for loops that toggle requires_grad.

Both self.update_X(batch) methods, where X is either the generator (g) or the discriminator (d), are structured as:

with autocast():
    x_loss = self.get_x_loss(batch)
self.grad_scaler_x.scale(x_loss).backward()
self.grad_scaler_x.step(self.x_opt)
self.grad_scaler_x.update()

In both cases x_opt is a regular torch.optim.Adam(X.parameters())

I cannot reproduce the issue using the DCGAN example and setting requires_grad=False for the parameters of netD in the update step of the generator.

Hmm this is so weird. I’m going to try and keep digging. I’ll get back to you, hopefully with a reproducible culprit. Thank you

@ptrblck Is there any reason why this error would suddenly appear when using code that worked locally (on an RTX3050 GPU) on an Azure Data Science VM with a T4 GPU?
Local versions:
Torch 1.10.0
Cuda 11.4

Azure versions:
Torch 1.10.0
Cuda 11.5

No, I don’t think there would be any reason as I also wasn’t able to reproduce the issue as described in the previous posts.

Can you help me figure out if there are any issues with my implementation specifically?

Model:

class BertClassifier(nn.Module):
    """
    Class defining the classifier model with a BERT encoder and a single fully connected classifier layer.
    """
    def __init__(self, dropout=0.5, num_labels=24):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, num_labels)
        self.relu = nn.ReLU()
        self.best_score = 0

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        output = self.relu(self.linear(self.dropout(pooled_output)))

        return output

Helper objects:

device = torch.device("cuda" if use_cuda else "cpu")
criterion = nn.CrossEntropyLoss().cuda() if use_cuda else nn.CrossEntropyLoss()
# Set eps to 1e-04 to use float16
optimizer = Adam(model.parameters(), lr=learning_rate, eps=1e-04)
# Use scaler to use mixed precision (float16 and float32)
scaler = torch.cuda.amp.GradScaler()
# Use scheduler to reduce learning rate gradually
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
if use_cuda:
    # use float16 to reduce GPU memory load
    model = model.cuda().to(dtype=torch.float16)

Training steps:

def forward_pass(auxiliaries, inputs, label):
    device, criterion, optimizer, scaler, model, _ = auxiliaries
    label = label.to(device)
    mask = inputs['attention_mask'].to(device)
    input_id = inputs['input_ids'].squeeze(1).to(device)
    with torch.cuda.amp.autocast():
        output = model(input_id, mask)
        loss = criterion(output, label)
    return loss

def backward_pass(auxiliaries, batch_loss):
    _, _, optimizer, scaler, model, _ = auxiliaries

    scaler.scale(batch_loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

def train_loop(auxiliaries, train_dataloader):
    for train_input, train_label in tqdm(train_dataloader):
        batch_loss = forward_pass(auxiliaries, train_input, train_label)
        backward_pass(auxiliaries, batch_loss)

Error trace:

Traceback (most recent call last):
  File "/bert_extraction/bert_extraction_main.py", line 27, in <module>
    ranked_train = train(model, df_train, df_val, ENV, label_converter)
  File "/bert_extraction/train_test.py", line 221, in train
    train_results = train_loop(auxiliaries, train_dataloader)
  File "/bert_extraction/train_test.py", line 140, in train_loop
    backward_pass(auxiliaries, batch_loss)
  File "/bert_extraction/train_test.py", line 119, in backward_pass
    scaler.step(optimizer)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 334, in step
    self.unscale_(optimizer)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 279, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 207, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

If you are using amp, you are not supposed to call model.half() or model.to(torch.float16); let autocast perform the casting instead.

Maybe this is a rookie question, but if I don’t call model.to(torch.float16), how do I indicate at all that I want to be using mixed precision? Or does the amp module do that automatically whenever it is possible?

Additionally, why would this then work locally? For multiple experiments throughout multiple weeks, even.

I tried it, and when I remove the model.to(torch.float16) call, I get the ValueError: Attempting to unscale FP16 gradients. error locally, instead. So it appears there is something else going wrong still.

As described before, amp.autocast will perform the casting. The AMP Recipe as well as the AMP tutorial might be a good starting point to see how this utility is used.
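To make "autocast will perform the casting" concrete, here is a small sketch that inspects dtypes. It assumes PyTorch 1.10 or newer for the device-agnostic torch.autocast context manager (the thread itself uses torch.cuda.amp.autocast), and it falls back to bfloat16 on CPU so it can run without a GPU; the toy layer and data are hypothetical.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
# Assumption: float16 on CUDA, bfloat16 as the CPU fallback for this demo.
low_precision = torch.float16 if use_cuda else torch.bfloat16

model = nn.Linear(16, 4).to(device_type)   # never cast to half manually
x = torch.randn(8, 16, device=device_type)

# Inside the region, eligible ops such as linear run in the lower precision.
with torch.autocast(device_type=device_type, dtype=low_precision):
    out = model(x)

out.float().sum().backward()

print(model.weight.dtype)       # the stored parameter keeps its float32 dtype
print(out.dtype)                # the activation is in the lower precision
print(model.weight.grad.dtype)  # the gradient matches the parameter dtype
```

Entering the autocast region is what opts the forward pass into mixed precision; the stored weights and their gradients remain float32, which is why GradScaler can unscale them.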

I don’t know, but feel free to post a minimal, executable code snippet reproducing the issue.

Thanks for the help. Your suggestion of removing the model.to(torch.float16) solved the issue on the VM, and the local issue that “resulted from the change” was actually caused by another change made in an attempt to fix the first issue. So all is good now.

I have no explanation why the model.to(torch.float16) call doesn’t break anything locally, and allows the model to train on mixed precision normally, but I currently have no interest in diving too deep into that.