RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1280, 5]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
I get the error above with the following code:
def train_model_sam(model, epoch, train_loader, val_loader, optimizer, criterion, scheduler):
    model.train()
    losses = AverageMeter()
    accs = AverageMeter()
    tk = tqdm(train_loader, total=len(train_loader), position=0, leave=True)
    for idx, (imgs, labels) in enumerate(tk):
        imgs_train, labels_train = imgs.cuda(), labels.cuda().long()
        output_train = model(imgs_train)
        #loss = criterion(output_train, labels_train)
        # first forward-backward pass
        optimizer.zero_grad()
        loss = criterion(output_train, labels_train)  # use this loss for any training statistics
        loss.backward(retain_graph=True)
        optimizer.first_step(zero_grad=True)
        # second forward-backward pass
        optimizer.zero_grad()
        criterion(output_train, labels_train).backward()
        optimizer.second_step(zero_grad=True)
        #optimizer.zero_grad()
        #loss.backward()
        #optimizer.step()
        accs.update((output_train.argmax(1) == labels_train).sum().item() / imgs_train.size(0), imgs_train.size(0))
        losses.update(loss.item(), imgs_train.size(0))
        tk.set_postfix(loss=losses.avg, acc=accs.avg)
    return losses.avg
Most likely, your first optimizer step (optimizer.first_step) is changing some of the weights in place. The second backward then cannot run, because it reuses the graph built by the original forward pass, and that graph needs the original values of those weights.
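To illustrate the failure mode, here is a minimal sketch that reproduces the same kind of error on CPU. The tiny two-layer model, the random data, and the manual `add_` update are assumptions made for the example; the in-place update only stands in for what an optimizer step does to the weights.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model: two layers, so the last layer's weight is saved
# by autograd to compute the gradient with respect to that layer's input.
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 3),
)
criterion = torch.nn.CrossEntropyLoss()
x = torch.randn(4, 5)
labels = torch.tensor([0, 1, 2, 0])

output = model(x)                      # forward pass builds the graph
loss = criterion(output, labels)
loss.backward(retain_graph=True)       # first backward succeeds

# Stand-in for an optimizer step: update every parameter in place.
with torch.no_grad():
    for p in model.parameters():
        p.add_(0.1)

raised = False
message = ""
try:
    # Second backward through the OLD graph, whose saved weights have changed.
    criterion(output, labels).backward()
except RuntimeError as err:
    raised = True
    message = str(err)

print(raised)
print(message)
```

The second `backward()` goes through the graph from the first forward pass, so autograd notices that a saved weight changed after it was recorded and refuses to compute a gradient from it.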
You should either delay the first step until all the backward passes have been done, or redo the forward pass with the changed weights before doing the second backward.
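The second option (redoing the forward pass) could look roughly like the sketch below. `TinySAM` is a hypothetical, simplified stand-in written only for this example, not the optimizer from the repository you mention; it just exposes the same `first_step` / `second_step` method names used in your code. The toy model and data are assumptions as well.

```python
import torch


class TinySAM:
    """Hypothetical minimal SAM-style wrapper (illustration only)."""

    def __init__(self, params, base_optimizer, rho=0.05):
        self.params = list(params)
        self.base_optimizer = base_optimizer
        self.rho = rho
        self.perturbations = {}

    def zero_grad(self):
        self.base_optimizer.zero_grad()

    @torch.no_grad()
    def first_step(self, zero_grad=False):
        # Move the weights in place to w + e(w) along the gradient direction.
        grads = [p.grad for p in self.params if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
        scale = self.rho / (grad_norm + 1e-12)
        self.perturbations = {}
        for p in self.params:
            if p.grad is None:
                continue
            e_w = p.grad * scale
            p.add_(e_w)
            self.perturbations[p] = e_w
        if zero_grad:
            self.zero_grad()

    @torch.no_grad()
    def second_step(self, zero_grad=False):
        # Restore the original weights, then apply the base optimizer update
        # using the gradients computed at the perturbed point.
        for p, e_w in self.perturbations.items():
            p.sub_(e_w)
        self.base_optimizer.step()
        if zero_grad:
            self.zero_grad()


def sam_train_step(model, optimizer, criterion, imgs, labels):
    # First forward-backward pass at the current weights.
    output = model(imgs)
    loss = criterion(output, labels)
    optimizer.zero_grad()
    loss.backward()                      # no retain_graph needed
    optimizer.first_step(zero_grad=True)

    # Second forward-backward pass: recompute the forward with the
    # perturbed weights instead of reusing the old output.
    criterion(model(imgs), labels).backward()
    optimizer.second_step(zero_grad=True)
    return loss.item(), output.detach()


torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3)
)
criterion = torch.nn.CrossEntropyLoss()
optimizer = TinySAM(model.parameters(),
                    torch.optim.SGD(model.parameters(), lr=0.1))

imgs = torch.randn(4, 5)
labels = torch.tensor([0, 1, 2, 0])
weight_before = model[0].weight.detach().clone()
loss_value, output = sam_train_step(model, optimizer, criterion, imgs, labels)
print(loss_value)
```

The first-pass `loss` and `output` are still available for the training statistics, and since nothing backpropagates through the old graph after the weights change, `retain_graph=True` is no longer needed.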
Thank you albabD,
but did you visit the GitHub repository I provided?
I am trying to use the SAM optimizer they provide,
so can you suggest some code modification?
I changed it like this, but I guess this is not what they are asking for.
I did not, but the error is in the code sample you shared.
Note that these in-place modification checks were buggy in old versions of PyTorch and were fixed recently. So if this repo has not been updated in a while, that might be why they share this code even though it doesn’t work.