ValueError: Attempting to unscale FP16 gradients

a_d · May 15, 2020, 1:00pm

Hello all, I am trying to train an LSTM in half precision. The LSTM takes an encoded input from a pre-trained autoencoder (not trained in fp16). I am using torch.amp instead of apex and scaling the losses as suggested in the documentation.

Here is my training loop -

    def train_model(self, model, dataloader, num_epochs):
        model.cuda()
        least_loss = 5
        model.train()
        optimizer = torch.optim.Adam(model.parameters(), lr =1e-5)
        scaler = amp.GradScaler()
        training_loss = []
        for i in range(0, num_epochs + 1):
            st = time.time()
            training_acc = 0
            epoch_loss = 0
            for _, (x, y) in enumerate(dataloader):
                optimizer.zero_grad()
                sst = time.time()
                x = x.float().half().cuda()
                x, out = self.autoencoder(x)
                x = x.permute(0,2, 1)
                model.init_Hidden()
                y = y.cuda()
                output = model(x)
                loss = self.criterion(output, y)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()

I call my model as -

lstm =lstm(features=1024, hidden_size=512, sequence_length=313, autoencoder=model).half().cuda()

I am getting the following error -

ValueError: Attempting to unscale FP16 gradients.

Could someone please tell me why this is happening?
TIA

ptrblck · May 17, 2020, 5:21am

You shouldn’t call half manually on the model or data.
Could you remove the half call here: x = x.float().half().cuda() and rerun your script?

a_d · May 17, 2020, 7:23am

@ptrblck thanks for replying
I thought we had to convert the model to half precision by calling model.half() for fp16 training (I am using torch.amp from the 1.5 nightly builds). Also, if .half() is called on only the data or only the model, it raises an error saying the weight type and input type should be the same (as one of them is half).

I tried running the script without calling .half() and CUDA ran out of memory. With .half(), the model did not run out of memory but raised the same unscaling error at the scaler.step(optimizer) line.
I also ran a similar training loop and got the same error (there I explicitly called model.half() and data.half()).

ptrblck · May 17, 2020, 7:32am

torch.cuda.amp.autocast will use mixed-precision training and cast necessary tensors under the hood for you.
From the docs:

When entering an autocast-enabled region, Tensors may be any type. You should not call .half() on your model(s) or inputs when using autocasting.
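A minimal sketch of the pattern the docs describe, assuming a toy nn.Linear model and random data (both are hypothetical stand-ins for the LSTM and dataloader above). The parameters and inputs stay in float32, and only the forward pass and loss run under autocast. The CPU fallback with the scaler disabled is an assumption added so the snippet also runs without a GPU.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Hypothetical stand-in model; note there is no .half() call anywhere.
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# With enabled=False the scaler is a no-op: scale() returns the loss
# unchanged and step() simply calls optimizer.step().
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 16, device=device)         # inputs stay float32
y = torch.randint(0, 4, (8,), device=device)  # class indices

optimizer.zero_grad()
# Only the forward pass and the loss computation run under autocast.
with torch.autocast(device_type=device.type, enabled=use_cuda):
    output = model(x)
    loss = criterion(output, y)

# Backward and optimizer step happen outside the autocast block.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Gradients have the dtype of the parameters they belong to.
grad_dtypes = {p.grad.dtype for p in model.parameters()}
print(grad_dtypes)
```

Because the parameters are float32, their gradients are float32 as well, which is what the scaler expects when it unscales them.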

a_d · May 19, 2020, 1:00pm

Thanks a lot @ptrblck. Works like a charm.

vict0rsch · November 7, 2020, 8:23pm

I get this error too:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-4d12a5af1b3f> in <module>
----> 1 trainer.run_epoch()

~/ccai/github/dev_omni/omnigan/omnigan/trainer.py in run_epoch(self)
    655                     param.requires_grad = True
    656 
--> 657                 self.update_D(multi_domain_batch)
    658 
    659             # -------------------------------

~/ccai/github/dev_omni/omnigan/omnigan/trainer.py in update_D(self, multi_domain_batch, verbose)
   1133                 d_loss = self.get_D_loss(multi_domain_batch, verbose)
   1134             self.grad_scaler_d.scale(d_loss).backward()
-> 1135             self.grad_scaler_d.step(self.d_opt)
   1136             self.grad_scaler_d.update()
   1137         else:

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in step(self, optimizer, *args, **kwargs)
    287 
    288         if optimizer_state["stage"] is OptState.READY:
--> 289             self.unscale_(optimizer)
    290 
    291         assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in unscale_(self, optimizer)
    238         found_inf = torch.full((1,), 0.0, dtype=torch.float32, device=self._scale.device)
    239 
--> 240         optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
    241         optimizer_state["stage"] = OptState.UNSCALED
    242 

~/.conda/envs/omnienv/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    185                 if param.grad is not None:
    186                     if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 187                         raise ValueError("Attempting to unscale FP16 gradients.")
    188                     else:
    189                         torch._amp_non_finite_check_and_unscale_(param.grad,

ValueError: Attempting to unscale FP16 gradients.

The piece of code yielding this error is:

with autocast():
    d_loss = self.get_D_loss(multi_domain_batch, verbose)
self.grad_scaler_d.scale(d_loss).backward()
self.grad_scaler_d.step(self.d_opt)
self.grad_scaler_d.update()

I’m using pytorch 1.6 and not calling half() on anything. Maybe this context can help: I’m training a GAN model and the exact same procedure on the generator’s loss, optimizer and scaler works without error.

Generator and Discriminator’s optimizers are Adam optimizers from torch.optim, and grad_scaler_d and grad_scaler_g are GradScaler() instances from torch.cuda.amp. @ptrblck where do I start debugging beyond looking for .half() calls?

ptrblck · November 7, 2020, 9:25pm

Could you post the model definitions and the general workflow, i.e. how the losses are calculated, which optimizers are used etc. so that we could help debugging?
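One possible starting point beyond searching for .half() calls: the check in _unscale_grads_ shown in the traceback above fires on param.grad.dtype == torch.float16, and a gradient has the same dtype as its parameter, so listing the float16 parameters narrows down which submodule was converted. A minimal sketch, where fp16_parameter_names and the two-part toy model are hypothetical and only for illustration:

```python
import torch
import torch.nn as nn

def fp16_parameter_names(model: nn.Module) -> list:
    """Return the names of parameters stored in float16."""
    return [
        name
        for name, param in model.named_parameters()
        if param.dtype == torch.float16
    ]

# Hypothetical model in which one submodule was converted to half by accident.
model = nn.ModuleDict({"encoder": nn.Linear(8, 8), "head": nn.Linear(8, 2)})
model["head"].half()

offenders = fp16_parameter_names(model)
print(offenders)
```

Running the helper on the discriminator right before grad_scaler_d.step(self.d_opt) would show whether any of its parameters ended up in float16.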

vict0rsch · November 8, 2020, 12:01am

It’s quite complex (here), so I can’t really paste it all, but for some reason the culprit seems to be changing requires_grad back and forth for the discriminator:

# ------------------------------
# -----  Update Generator  -----
# ------------------------------
if self.d_opt is not None:
    for param in self.D.parameters():
        # continue
        param.requires_grad = False

self.update_G(batch)

# ----------------------------------
# -----  Update Discriminator  -----
# ----------------------------------

# unfreeze params of the discriminator
for param in self.D.parameters():
    # continue
    param.requires_grad = True

self.update_D(batch)

The error disappears if I uncomment continue or, equivalently, if I comment out the 2 for loops that toggle requires_grad.

Both self.update_X(batch) methods, X being either the generator (g) or the discriminator (d), are structured as:

with autocast():
    x_loss = self.get_x_loss(batch)
self.grad_scaler_x.scale(x_loss).backward()
self.grad_scaler_x.step(self.x_opt)
self.grad_scaler_x.update()

In both cases x_opt is a regular torch.optim.Adam(X.parameters())

ptrblck · November 9, 2020, 1:38am

I cannot reproduce the issue using the DCGAN example and setting requires_grad=False for the parameters of netD in the update step of the generator.

vict0rsch · November 9, 2020, 5:35pm

Hmm this is so weird. I’m going to try and keep digging. I’ll get back to you, hopefully with a reproducible culprit. Thank you

Kroshtan · February 2, 2022, 1:35pm

@ptrblck Is there any reason why this error would suddenly appear when running code that worked locally (on an RTX 3050 GPU) on an Azure Data Science VM with a T4 GPU?
Local versions:
Torch 1.10.0
Cuda 11.4

Azure versions:
Torch 1.10.0
Cuda 11.5

ptrblck · February 2, 2022, 6:50pm

No, I don’t think there would be any reason as I also wasn’t able to reproduce the issue as described in the previous posts.

Kroshtan · February 3, 2022, 8:35am

Can you help me figure out if there are any issues with my implementation specifically?

Model:

class BertClassifier(nn.Module):
    """
    Class defining the classifier model with a BERT encoder and a single fully connected classifier layer.
    """
    def __init__(self, dropout=0.5, num_labels=24):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, num_labels)
        self.relu = nn.ReLU()
        self.best_score = 0

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        output = self.relu(self.linear(self.dropout(pooled_output)))

        return output

Helper objects:

device = torch.device("cuda" if use_cuda else "cpu")
criterion = nn.CrossEntropyLoss().cuda() if use_cuda else nn.CrossEntropyLoss()
# Set eps to 1e-04 to use float16
optimizer = Adam(model.parameters(), lr=learning_rate, eps=1e-04)
# Use scaler to use mixed precision (float16 and float32)
scaler = torch.cuda.amp.GradScaler()
# Use scheduler to reduce learning rate gradually
scheduler = ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
if use_cuda:
    # use float16 to reduce GPU memory load
    model = model.cuda().to(dtype=torch.float16)

Training steps:

def forward_pass(auxiliaries, inputs, label):
    device, criterion, optimizer, scaler, model, _ = auxiliaries
    label = label.to(device)
    mask = inputs['attention_mask'].to(device)
    input_id = inputs['input_ids'].squeeze(1).to(device)
    with torch.cuda.amp.autocast():
        output = model(input_id, mask)
        loss = criterion(output, label)
    return loss

def backward_pass(auxiliaries, batch_loss):
    _, _, optimizer, scaler, model, _ = auxiliaries

    scaler.scale(batch_loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

def train_loop(auxiliaries, train_dataloader):
    for train_input, train_label in tqdm(train_dataloader):
        batch_loss = forward_pass(auxiliaries, train_input, train_label)
        backward_pass(auxiliaries, batch_loss)

Error trace:

Traceback (most recent call last):
  File "/bert_extraction/bert_extraction_main.py", line 27, in <module>
    ranked_train = train(model, df_train, df_val, ENV, label_converter)
  File "/bert_extraction/train_test.py", line 221, in train
    train_results = train_loop(auxiliaries, train_dataloader)
  File "/bert_extraction/train_test.py", line 140, in train_loop
    backward_pass(auxiliaries, batch_loss)
  File "/bert_extraction/train_test.py", line 119, in backward_pass
    scaler.step(optimizer)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 334, in step
    self.unscale_(optimizer)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 279, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/anaconda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 207, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

ptrblck · February 3, 2022, 8:41am

If you are using amp, you are not supposed to call model.half() or model.to(torch.float16); let autocast perform the casting instead.

Kroshtan · February 3, 2022, 8:45am

Maybe this is a rookie question, but if I don’t call model.to(torch.float16), how do I indicate at all that I want to be using mixed precision? Or does the amp module do that automatically whenever it is possible?

Additionally, why would this then work locally? For multiple experiments throughout multiple weeks, even.

I tried it, and when I remove the model.to(torch.float16) call, I get the ValueError: Attempting to unscale FP16 gradients. error locally, instead. So it appears there is something else going wrong still.

ptrblck · February 3, 2022, 8:51am

As described before, amp.autocast will perform the casting. The AMP Recipe as well as the AMP tutorial might be a good starting point to see how this utility is used.

I don’t know, but feel free to post a minimal, executable code snippet reproducing the issue.
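To illustrate the casting that autocast performs, here is a minimal sketch with a toy nn.Linear layer (hypothetical, not the BertClassifier above). The layer is never converted; entering the autocast region is what opts the forward pass into mixed precision. Using bfloat16 on CPU is an assumption made only so the snippet runs without a GPU.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device_type = "cuda" if use_cuda else "cpu"
# float16 is the usual autocast dtype on CUDA; bfloat16 is used here
# for the CPU fallback.
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

layer = nn.Linear(8, 4).to(device_type)  # parameters are created in float32
x = torch.randn(2, 8, device=device_type)

with torch.autocast(device_type=device_type, dtype=amp_dtype):
    out = layer(x)  # the matmul runs in the lower-precision dtype

print(layer.weight.dtype)  # dtype of the stored parameter
print(out.dtype)           # dtype of the activation produced under autocast
```

The stored weights remain float32 while the activation comes out in the lower-precision dtype, so the optimizer and GradScaler only ever see float32 parameters and gradients.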

Kroshtan · February 3, 2022, 9:30am

Thanks for the help. Your suggestion of removing the model.to(torch.float16) solved the issue on the VM, and the local issue that “resulted from the change” was actually caused by another change made in an attempt to fix the first issue. So all is good now.

I have no explanation why the model.to(torch.float16) call doesn’t break anything locally, and allows the model to train on mixed precision normally, but I currently have no interest in diving too deep into that.

rajat_sarkar · January 22, 2024, 7:41pm

I utilized amp.autocast without invoking model.half(), and while my code was executing, I encountered a problem with the training loss being NaN due to the use of amp.GradScaler(). Can you please guide me in resolving this NaN issue? Currently, my code is operating with float16 through the amp library, but I am experiencing NaN training loss. Initially, I was running a float32 model and encountered a CUDA memory error. Hence, I decided to switch to float16.

ptrblck · January 22, 2024, 7:43pm

The GradScaler scales the gradients and not the forward activations, and thus cannot create NaNs in the output or loss of your model.
You could try to narrow down which operation is causing the NaNs via forward hooks or e.g. by printing stats about intermediate activations. If you cannot isolate the operation causing NaNs, you could try to use bfloat16, which has the same range as float32, assuming an activation overflows.
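A minimal sketch of the forward-hook approach, assuming a toy model with a deliberately faulty layer. The names add_nonfinite_hooks and DivideByZero are hypothetical and exist only for this illustration. The last lines show the range difference between float16 and bfloat16 on a single value.

```python
import torch
import torch.nn as nn

def add_nonfinite_hooks(model: nn.Module, report: list) -> list:
    """Register forward hooks that record modules whose output has NaN/Inf."""
    handles = []
    for name, module in model.named_modules():
        # Bind the module name as a default argument so each hook keeps its own.
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                report.append(name)
        handles.append(module.register_forward_hook(hook))
    return handles

class DivideByZero(nn.Module):
    # Hypothetical faulty layer, used only to trigger the hook.
    def forward(self, x):
        return x / torch.zeros_like(x)

model = nn.Sequential(nn.Linear(4, 4), DivideByZero(), nn.ReLU())
report = []
handles = add_nonfinite_hooks(model, report)
model(torch.ones(1, 4))
for handle in handles:
    handle.remove()  # detach the hooks once the check is done

# Hooks fire in execution order, so the first entry names the module
# where the non-finite values originate.
print(report)

# Range check: 70000 exceeds the float16 maximum but fits in bfloat16.
big = torch.tensor(70000.0)
fp16_overflows = torch.isinf(big.to(torch.float16)).item()
bf16_overflows = torch.isinf(big.to(torch.bfloat16)).item()
print(fp16_overflows, bf16_overflows)
```

Later entries in the report only show how far the bad values propagated, so the first one is the place to start looking.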

Shivani2 · April 10, 2024, 9:07am

# Train and validation loop
def train_and_validate(model, train_loader, val_loader, criterion, optimizer, num_epochs):
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    scaler = torch.cuda.amp.GradScaler()  # Initialize GradScaler for mixed precision training

    model.to(device)

    for epoch in range(num_epochs):
        # Training loop
        model.train()
        running_loss = 0.0
        correct_predictions = 0
        total_samples = 0

        for inputs, labels in tqdm(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Perform forward pass and loss computation under autocast for automatic mixed precision
            with torch.cuda.amp.autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)

            # Backward pass on the scaled loss and optimizer step, both outside the autocast block
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

            running_loss += loss.item()

            _, predicted = torch.max(outputs, 1)
            correct_predictions += (predicted == labels).sum().item()
            total_samples += labels.size(0)

        epoch_loss = running_loss / len(train_loader)
        epoch_accuracy = correct_predictions / total_samples

        #print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {epoch_loss:.4f}, Train Accuracy: {epoch_accuracy:.4f}")

        # Validation loop
        model.eval()
        running_loss = 0.0
        correct_predictions = 0
        total_samples = 0

        for inputs, labels in tqdm(val_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            with torch.cuda.amp.autocast():
                outputs = model(inputs)
                loss = criterion(outputs, labels)

            running_loss += loss.item()

            _, predicted = torch.max(outputs, 1)
            correct_predictions += (predicted == labels).sum().item()
            total_samples += labels.size(0)

        validation_loss = running_loss / len(val_loader)
        validation_accuracy = correct_predictions / total_samples

        #print(f"Validation Loss: {validation_loss:.4f}, Validation Accuracy: {validation_accuracy:.4f}")

        info = "[Epoch {}/{}]: train-loss = {:0.6f} | train-acc = {:0.3f} | val-loss = {:0.6f} | val-acc = {:0.3f}"

        print(info.format(epoch+1, num_epochs, epoch_loss, epoch_accuracy, validation_loss, validation_accuracy))

        torch.save(model.state_dict(), '/content/drive/MyDrive/AdakeData/cropped_data/checkpoints_weights/checkpoints_resnet_FP16/checkpoint_resnet_FP16_{}'.format(epoch + 1))

    torch.save(model.state_dict(), '/content/drive/MyDrive/AdakeData/cropped_data/checkpoints_weights/checkpoints_resnet_FP16/resnet_FP16_weights')

if __name__ == "__main__":
    train_loader, val_loader, test_loader = dataloader_custom_data()

    model = ResNet56()
    model = model.half()
    # Move the model to the appropriate device
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.00001)

    num_epochs = 1000

    train_and_validate(model, train_loader, val_loader, criterion, optimizer, num_epochs)

This is my training code. I am getting this error, which I don't know how to solve:

Image shape of a random sample image : (1, 256, 256)

Training Set:   55 images
Validation Set:   16 images
Test Set:       8 images
  0%|          | 0/28 [00:02<?, ?it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-e943bea6e81d> in <cell line: 1>()
     13     num_epochs = 1000
     14 
---> 15     train_and_validate(model, train_loader, val_loader, criterion, optimizer, num_epochs)

3 frames
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py in _unscale_grads_(self, optimizer, inv_scale, found_inf, allow_fp16)
    256                         continue
    257                     if (not allow_fp16) and param.grad.dtype == torch.float16:
--> 258                         raise ValueError("Attempting to unscale FP16 gradients.")
    259                     if param.grad.is_sparse:
    260                         # is_coalesced() == False means the sparse grad has values with duplicate indices.

ValueError: Attempting to unscale FP16 gradients.
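Going by the earlier replies in this thread, the model = model.half() call in the __main__ block above is the likely trigger: it turns the parameters, and therefore their gradients, into float16, which GradScaler refuses to unscale. Deleting that line is the simplest fix. As a minimal sketch of a guard, assuming a small convolutional stand-in for ResNet56() (ensure_fp32_parameters is a hypothetical helper):

```python
import torch
import torch.nn as nn

def ensure_fp32_parameters(model: nn.Module) -> nn.Module:
    """Cast the model back to float32 if any parameter is stored in float16."""
    if any(p.dtype == torch.float16 for p in model.parameters()):
        model.float()
    return model

# Hypothetical stand-in for ResNet56(), converted with .half() as in the post.
model = nn.Sequential(nn.Conv2d(1, 4, kernel_size=3), nn.ReLU()).half()

# Restore float32 parameters before handing them to the optimizer.
model = ensure_fp32_parameters(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

param_dtypes = {p.dtype for p in model.parameters()}
print(param_dtypes)
```

With float32 parameters, the torch.cuda.amp.autocast() blocks already present in train_and_validate handle the float16 casting during the forward pass.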