Self.scaler.step(self.d_optimizer): AssertionError: No inf checks were recorded for this optimizer

I am new to PyTorch and am trying to reduce GPU memory consumption. What I am trying to do is update the weights manually: I first compute the new gradient values and then update the weights as follows:

grads = torch.autograd.grad(
    d_loss, weights.values(), create_graph=True, allow_unused=True
)
weights = OrderedDict(
    (name, param - grad) if grad is not None else (name, param)
    for ((name, param), grad) in zip(weights.items(), grads)
)

The problem here is that I need to perform the weight update without gradient tracking, as keeping the computation graph for the update increases GPU memory usage without any need for it.

So I changed the second statement as follows:

with torch.no_grad():
    for ((name, param), grad) in zip(weights.items(), grads):
        if grad is not None:
            param -= grad

However, it gives me the following message:

assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer

What I do not understand is why this even occurs. I just tried to perform the updates in-place without gradient tracking. Any help?

I don’t understand this claim.
There shouldn’t be any difference in GPU memory usage if you calculate the gradients explicitly via torch.autograd.grad or if you allow the backward() call to store these gradients in the .grad attributes of the trainable parameters. In the end you need to store the gradients somewhere, so could you explain which operation uses GPU memory without the need for it?
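
As a minimal illustration (with a hypothetical toy model, not your actual code), both paths end up materializing one gradient tensor per parameter on the device:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)  # hypothetical toy model
x = torch.randn(32, 1024, device=device)

# Variant A: gradients returned explicitly as a tuple of tensors
grads = torch.autograd.grad(model(x).sum(), model.parameters())

# Variant B: gradients stored in the .grad attributes by backward()
model(x).sum().backward()

# Either way, one gradient tensor per parameter has to live in GPU memory.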

You are most likely trying to use torch.cuda.amp and call scaler.step(optimizer) somewhere, which internally will call optimizer.step() and fail, as that step isn't needed in your approach.
Since you are already using a manual approach of directly updating the parameters (instead of letting the optimizer do it), you might also want to consider using a manual loss scaling approach.
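
For reference, a rough sketch of such a manual loss scaling approach built around torch.autograd.grad could look like this (the static scale value and the skip-on-non-finite check are illustrative assumptions; d_loss and weights are the names from your snippet):

scale = 2.0 ** 16  # assumed static loss scale

# scale the loss before computing the gradients
scaled_grads = torch.autograd.grad(
    scale * d_loss, weights.values(), create_graph=True, allow_unused=True
)

# unscale the gradients and check them before applying the manual update
grads = [g / scale if g is not None else None for g in scaled_grads]
is_finite = all(torch.isfinite(g).all() for g in grads if g is not None)

if is_finite:
    with torch.no_grad():
        for (name, param), grad in zip(weights.items(), grads):
            if grad is not None:
                param -= grad
# otherwise skip this update (and lower the scale if you implement dynamic scaling)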


Thank you for your reply.
I mean that the line that updates the weights leads to a huge increase in GPU memory consumption. Since the update does not need gradient tracking, I wrapped it in torch.no_grad().
Regarding updating the parameters, this is the snippet of my code:


for outer_step in range(self.outer_steps):
    for inner_step in range(self.n_inner_steps):
        # forward
        d_losses, _, bert_outputs = model.forward_with_params(
            .....
        )
        d_loss = d_losses.mean()

        # backward
        grads = torch.autograd.grad(
            d_loss, weights.values(), create_graph=True, allow_unused=True
        )

        # update parameters (SGD)
        with torch.no_grad():
            for ((name, param), grad) in zip(weights.items(), grads):
                if grad is not None:
                    param -= grad

        del grads
        del d_loss
        torch.cuda.empty_cache()

        if self.accum_loss or (inner_step + 1) == self.n_inner_steps:
            # loss of updated parameters (after each gradient step)
            with amp.autocast(dtype=self.dtype, enabled=self.use_amp):
                after_losses, after_logits, _ = model.forward_with_params(
                    input_ids=input_ids.to(self.device),
                    attention_mask=attention_mask.to(self.device),
                    labels=labels.to(self.device),
                    weights=weights,
                )
                after_loss = after_losses.mean()
                loss += after_loss

    cur_after_loss += after_loss.item() * batch_size
    cur_after_correct += (
        after_logits.cpu().argmax(1).eq(labels).sum().item()
    )

    self.d_optimizer.zero_grad()
    # backward
    self.scaler.scale(loss).backward()
    # unscale gradients (for gradient clipping)
    self.scaler.unscale_(self.d_optimizer)
    # gradient clipping
    torch.nn.utils.clip_grad_norm_(
        self.optimize_param_list, self.max_grad_norm
    )
    self.scaler.step(self.d_optimizer)
    self.scaler.update()
    self.d_scheduler.step()

What is interesting here is that when I replace the weight-update snippet with the one shown in the original post, it works.

Which optimizer are you using? If the optimizer itself uses running stats (such as Adam), an increase in memory would be expected, since the optimizer needs to allocate and track these stats. In your manual update you would skip these running stats and would most likely fall back to a plain SGD update rule.
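
As a rough back-of-the-envelope estimate (assuming fp32 optimizer states), Adam keeps two extra buffers (exp_avg and exp_avg_sq) per trainable parameter, so you can approximate the additional memory from the parameter count:

def adam_state_bytes(model):
    # exp_avg + exp_avg_sq: two extra fp32 tensors per trainable parameter
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return 2 * 4 * n_params  # ~8 bytes of optimizer state per parameter

# e.g. roughly 8 GB of states for one billion trainable parameters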


It is apparent that Adam is memory hungry, but would it reserve more than 20 GB of GPU memory?

It would depend on the parameter count, but 20 GB sounds too large for the majority of models, so could you share a minimal, executable code snippet which shows this 20 GB increase?
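
One way to measure it (a minimal sketch, assuming a CUDA device and a hypothetical toy model) is to compare torch.cuda.memory_allocated() before and after the first optimizer.step(), which is when Adam lazily allocates its state buffers:

import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096).to(device)  # hypothetical toy model
optimizer = torch.optim.Adam(model.parameters())

before = torch.cuda.memory_allocated(device)
model(torch.randn(8, 4096, device=device)).sum().backward()
optimizer.step()  # Adam allocates exp_avg / exp_avg_sq here
after = torch.cuda.memory_allocated(device)

# the difference includes the gradients as well as Adam's state buffers
print(f"increase: {(after - before) / 1024**2:.1f} MiB")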

The GPU memory increase was indeed due to Adam. I guess you were right. Thank you!