RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Trying to use SAM (the Sharpness-Aware Minimization optimizer).

torch version: 1.7.0+cu110
Ubuntu: 18.04

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1280, 5]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
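
As the hint suggests, anomaly detection is enabled with a single line before training starts; a minimal sketch:

import torch

# With anomaly detection enabled, the failing backward also prints a
# traceback of the forward operation that produced the offending tensor.
torch.autograd.set_detect_anomaly(True)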

I get the following issue with this code:

def train_model_sam(model, epoch, train_loader, val_loader, optimizer, criterion, scheduler):
    model.train()

    losses = AverageMeter()
    accs = AverageMeter()

    tk = tqdm(train_loader, total=len(train_loader), position=0, leave=True)
    for idx, (imgs, labels) in enumerate(tk):
        imgs_train, labels_train = imgs.cuda(), labels.cuda().long()
        output_train = model(imgs_train)

        # first forward-backward pass
        optimizer.zero_grad()
        loss = criterion(output_train, labels_train)  # use this loss for any training statistics
        loss.backward(retain_graph=True)
        optimizer.first_step(zero_grad=True)

        # second forward-backward pass on the same graph; this backward
        # is where the RuntimeError is raised
        optimizer.zero_grad()
        criterion(output_train, labels_train).backward()
        optimizer.second_step(zero_grad=True)

        accs.update((output_train.argmax(1) == labels_train).sum().item() / imgs_train.size(0), imgs_train.size(0))
        losses.update(loss.item(), imgs_train.size(0))

        tk.set_postfix(loss=losses.avg, acc=accs.avg)

    return losses.avg

Hi,

What is most likely happening is that your first optimizer step is actually changing some of the weights in place, and so the second backward cannot run because it needs the original values of these weights.
You should either delay the first step until after all the backward passes have been done, or re-do the forward pass for the weights that have been changed before doing the second backward.
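
A minimal sketch of that second option, assuming a SAM-style optimizer that exposes first_step/second_step as in the code above:

# first pass: compute the gradient at the current weights, then let
# first_step perturb the weights in place
loss = criterion(model(imgs_train), labels_train)
loss.backward()
optimizer.first_step(zero_grad=True)

# second pass: re-run the forward so the new graph is built from the
# perturbed weights, then compute the gradients for the actual update
criterion(model(imgs_train), labels_train).backward()
optimizer.second_step(zero_grad=True)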

Thank you albanD,
but did you visit the GitHub repo I provided?
I am trying to use the SAM optimizer as they provide it,
so can you suggest some code modification?

I changed it like this, but I guess this is not what they are asking for:

def train_model_sam(model, epoch, train_loader, val_loader, optimizer, criterion, scheduler):
    model.train()

    losses = AverageMeter()
    accs = AverageMeter()

    tk = tqdm(train_loader, total=len(train_loader), position=0, leave=True)
    for idx, (imgs, labels) in enumerate(tk):
        imgs_train, labels_train = imgs.cuda(), labels.cuda().long()
        output_train = model(imgs_train)

        # first forward-backward pass (zeroing is handled by
        # first_step/second_step via zero_grad=True)
        loss = criterion(output_train, labels_train)  # use this loss for any training statistics
        loss.backward(retain_graph=True)
        optimizer.first_step(zero_grad=True)

        # second forward-backward pass: re-run the forward so the graph
        # reflects the weights perturbed by first_step
        output_train = model(imgs_train)
        loss = criterion(output_train, labels_train)
        loss.backward()
        optimizer.second_step(zero_grad=True)

        accs.update((output_train.argmax(1) == labels_train).sum().item() / imgs_train.size(0), imgs_train.size(0))
        losses.update(loss.item(), imgs_train.size(0))

        tk.set_postfix(loss=losses.avg, acc=accs.avg)

    return losses.avg

I did not, but the error is in the code sample you shared 🙂

> I am trying to use the SAM optimizer as they provide it

Note that these checks were buggy in old versions of PyTorch and were fixed recently. So if that repo has not been updated in a while, that might be why they share this code even though it doesn't work.
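
If in doubt, it is easy to check which build is installed; a minimal sketch:

import torch

# The versioning checks behind this error were fixed in more recent
# releases, so an old build may behave differently here.
print(torch.__version__)  # e.g. '1.7.0+cu110'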

Your updated code should run fine, no?