RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 1]], which is output 0 of TanhBackward, is at version 1; expected version 0 instead

This is taken from here: GitHub - davda54/sam: SAM: Sharpness-Aware Minimization (PyTorch).
You are right, I fixed the first one, but the second method is still giving an error:

# --------------- Definition of LLOSS ------------
if method == 'lloss':
    base_optimizer = torch.optim.SGD
    optim_module   = SAM(models['module'].parameters(), base_optimizer, lr=LR,
                         momentum=MOMENTUM, weight_decay=WDECAY)
    sched_module   = lr_scheduler.MultiStepLR(optim_module, milestones=MILESTONES)

    optimizers = {'backbone': optim_backbone, 'module': optim_module}
    schedulers = {'backbone': sched_backbone, 'module': sched_module}
            
# ----------------- SAM Optimizer -------------------

# first forward/backward pass for the backbone
loss = criterion(models['backbone'](inputs)[0], labels)
loss.backward(retain_graph=True)
optimizers['backbone'].first_step(zero_grad=True)

# second forward pass for the backbone
criterion(models['backbone'](inputs)[0], labels)
optimizers['backbone'].second_step(zero_grad=True)

# ----------------- SAM Optimizer for LLOSS Method -------------------
if method == 'lloss':
    # optimizers['module'].step()
    loss = criterion(models['backbone'](inputs)[0], labels)
    loss.backward(retain_graph=True)
    optimizers['module'].first_step(zero_grad=True)

    criterion(models['backbone'](inputs)[0], labels)
    optimizers['module'].second_step(zero_grad=True)

ERROR

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 100]], which is output 0 of AsStridedBackward0, is at version 3; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Using retain_graph=True will not fix the issue, as it would only keep the intermediate forward activations alive. The main issue might still be the same: the step() method updates the parameters, which would be needed for the next backward call.
In this case you could either recompute the forward pass, so that the activations are created with the already updated parameters, or update the parameters only after all gradients have been computed.
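
For example, a minimal sketch of the first option applied to the SAM pattern above (assuming the same models, criterion, inputs, and optimizers as in your snippet): each backward call gets its own fresh forward pass, so no stale activations are reused and retain_graph is not needed.

# first SAM step: forward, backward, then perturb the parameters
loss = criterion(models['backbone'](inputs)[0], labels)
loss.backward()
optimizers['backbone'].first_step(zero_grad=True)

# recompute the forward pass with the perturbed parameters before the second backward
criterion(models['backbone'](inputs)[0], labels).backward()
optimizers['backbone'].second_step(zero_grad=True)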

Thank you so much for your comments. Can you share any related tutorials with me?

I don’t know if there is a good tutorial, but this code snippet shows why this approach is mathematically wrong:

# setup
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# forward pass
x = torch.randn(1, 10)
out = model(x)

# loss calculation
loss = criterion(out, torch.rand_like(out))

# gradient calculation using the intermediate forward activations from the 
# previous forward pass (a0) and the current parameter set (p0)
loss.backward(retain_graph=True)

# update parameters to new set p1
optimizer.step()

# gradient calculation using the stale activations (a0) and the new parameter
# set p1, which will not work as it's mathematically wrong
loss.backward()
# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 10]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
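
And a minimal sketch of the working alternative for this toy example: recompute the forward pass after the update, so the second backward uses activations that match the new parameter set p1.

# fresh forward pass with the updated parameters p1 creates new activations a1
out = model(x)
loss = criterion(out, torch.rand_like(out))

# activations and parameters now belong to the same version, so this works
loss.backward()
optimizer.step()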

Hi, I’m facing a similar issue:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 10000]], which is output 0 of SoftmaxBackward0, is at version 10; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

#Snippet-1 is the head part of the model
class HeadPart(nn.Module):

    def __init__(self, num_features, num_classes):
        
        super().__init__()
        self.num_classes = tuple(num_classes)
        for num_class in self.num_classes:
            assert num_class > 0

        self.heads = nn.ModuleList(
            # [nn.Linear(num_features, num_class) for num_class in self.num_classes]  # Line-A
            [nn.Sequential(nn.Linear(num_features, num_class), nn.Softmax(1)) for num_class in self.num_classes]  # Line-B
        )

    def forward(self, x):
        return [head(x) for head in self.heads]

The code was working fine! But when I 1) updated the output-layer values of the model (according to the requirements of the task, but with the same shape) and 2) replaced Line-A with Line-B, the above-mentioned runtime error occurred at Line-C of snippet-2 below.

#Snippet-2
class NativeScalerWithGradNormCount:
    state_dict_key = "amp_scaler"

    def __init__(self):
        self._scaler = torch.cuda.amp.GradScaler()

    def __call__(
        self,
        loss,
        optimizer,
        clip_grad=None,
        parameters=None,
        create_graph=False,
        update_grad=True,
    ):
        self._scaler.scale(loss).backward(create_graph=create_graph)  # Line-C
        if update_grad:
            if clip_grad is not None:
                assert parameters is not None
                self._scaler.unscale_(
                    optimizer
                )  # unscale the gradients of optimizer's assigned params in-place
                norm = torch.nn.utils.clip_grad_norm_(parameters, clip_grad)
            else:
                self._scaler.unscale_(optimizer)
                norm = ampscaler_get_grad_norm(parameters)
            self._scaler.step(optimizer)
            self._scaler.update()
        else:
            norm = None
        return norm

# snippet-3
def updated_output(hierarchy, outputs):
    for level in range(len(hierarchy)):
        for p in hierarchy[level].keys():
            outputs[level + 1][:, hierarchy[level][p]] = (outputs[level + 1][:, hierarchy[level][p]]).mul_(outputs[level][:, [p]])
        outputs[level + 1] = normalize_output(outputs[level + 1])
    return outputs

A few observations:

  1. The code works with Line-A of snippet-1, even though I update the output-layer values of the model (snippet-3).
  2. The code works with Line-B of snippet-1 only if I don't update the output-layer values of the model (snippet-3).
  3. But if I use Line-B of snippet-1 and also update the output-layer values of the model (snippet-3), then the above error occurs.

What I want is to use Line-B of snippet-1 while also updating the output values.

If the error/issue (the in-place op) is with snippet-3, what is an efficient way to resolve it?

Need help to resolve this.
Thanks!!

Hi @John5
Not sure I clearly understand what is going on in your code. When you say lines 1 and 2, you mean A and B, right?
So adding a softmax and doing in-place modifications of your output causes an error that is not raised when you don't have it, is that right?

I can't really tell what's going on without a closer look at the code (we don't see where the last snippet is used, what the shapes of the tensors are, etc.), but in general you should avoid in-place modifications of your tensors when you're using autograd. In the last snippet I would create an empty tensor (torch.empty_like(tensor)) and populate it like you do in the loop, rather than modifying the input tensor in place.
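
For example, a minimal sketch of that idea applied to snippet-3 (assuming every column of outputs[level+1] is written inside the loop; if not, start from outputs[level+1].clone() instead of an empty tensor):

def updated_output(hierarchy, outputs):
    for level in range(len(hierarchy)):
        # collect the scaled values in a fresh tensor instead of mutating the
        # softmax output in place
        new_out = torch.empty_like(outputs[level + 1])
        for p in hierarchy[level].keys():
            cols = hierarchy[level][p]
            new_out[:, cols] = outputs[level + 1][:, cols] * outputs[level][:, [p]]
        outputs[level + 1] = normalize_output(new_out)
    return outputs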
Hope that helps!

Thank you @vmoens for your reply!!!
Yes, lines 1 and 2 mean Lines A and B (now corrected).
Using torch.empty_like(tensor) solves my problem. Thanks!!!


Hey, I have encountered a similar error, and I am using the Hugging Face Accelerator for DDP.
Please help me with a solution. Below is my code:

from accelerate import Accelerator
from transformers import AutoModel
from transformers.optimization import get_linear_schedule_with_warmup
from tqdm.auto import tqdm
import torch

model = AutoModel.from_pretrained('BAAI/bge-large-en')

def mean_pool(token_embeds, attention_mask):
    # reshape attention_mask to cover the embedding dimension
    in_mask = attention_mask.unsqueeze(-1).expand(
        token_embeds.size()
    ).float()
    # perform mean-pooling but exclude padding tokens (specified by in_mask)
    pool = torch.sum(token_embeds * in_mask, 1) / torch.clamp(
        in_mask.sum(1), min=1e-9
    )
    return pool

accelerator = Accelerator()
device = accelerator.device
model.to(device)

cos_sim = torch.nn.CosineSimilarity()
loss_func = torch.nn.CrossEntropyLoss()
scale = 20.0  # we multiply the similarity score by this scale value

# move layers to device
cos_sim.to(device)
loss_func.to(device)

# initialize Adam optimizer
optim = torch.optim.Adam(model.parameters(), lr=2e-5)

# set up warmup for the first ~10% of steps
total_steps = int(len(anchors) / batch_size)
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(
    optim, num_warmup_steps=warmup_steps,
    num_training_steps=total_steps - warmup_steps)

model, optim, loader, scheduler = accelerator.prepare(
    model, optim, loader, scheduler)

epochs = 10  # 1 epoch should be enough, increase if wanted

for epoch in range(epochs):
    model.train()  # make sure the model is in training mode
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # zero all gradients on each new step
        optim.zero_grad()
        # prepare batches and move all to the active device
        anchor_ids = batch['anchor']['input_ids'].to(device)
        anchor_mask = batch['anchor']['attention_mask'].to(device)
        pos_ids = batch['positive']['input_ids'].to(device)
        pos_mask = batch['positive']['attention_mask'].to(device)
        # extract token embeddings from BERT
        a = model(anchor_ids, attention_mask=anchor_mask)[0]  # all token embeddings
        p = model(pos_ids, attention_mask=pos_mask)[0]
        # get the mean pooled vectors
        a = mean_pool(a, anchor_mask)
        p = mean_pool(p, pos_mask)
        # calculate the cosine similarities
        scores = torch.stack([
            cos_sim(a_i.reshape(1, a_i.shape[0]), p) for a_i in a
        ])
        # get label(s) - we could define this before if confident of consistent batch sizes
        labels = torch.tensor(range(len(scores)), dtype=torch.long, device=scores.device)
        # and now calculate the loss
        loss = loss_func(scores * scale, labels)
        # using loss, calculate gradients and then optimize
        accelerator.backward(loss)
        # loss.backward()
        optim.step()
        # update the learning rate scheduler
        scheduler.step()
        # update the tqdm progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())


Your code is unfortunately neither properly formatted nor executable and I didn’t see any obvious issues by skimming through it.

Hi, I also have a similar problem with the same kind of error message. I am very new to this. Here is the code:

Your help will be appreciated!

Your code is unreadable, so shorten it down and post a minimal and executable code snippet to reproduce the issue by wrapping it into three backticks ``` instead of a screenshot.