[Solved][Pytorch1.5] RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Could you explain your use case and in particular why you are using retain_graph=True, as this usually yields these kinds of errors (and is often used as a workaround for another error)?

Hi ptrblck, many thanks for your help here. I have solved this bug; the working snippet is below.


else:
    optimizer.zero_grad()
    loss.backward(retain_graph=True)  # first backward: parameter gradients (keep the graph)
    optimizer.step()
    train_batch.grad.zero_()          # clear the stale input gradient
    loss.backward()                   # second backward: gradient w.r.t. the input batch
    grads = train_batch.grad

Hi guys, I met a problem with loss.backward(), as you can see here:
File “train.py”, line 360, in train
loss_adv.backward(retain_graph=True)
File “/usr/local/lib/python3.7/dist-packages/torch/_tensor.py”, line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File “/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py”, line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 7]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

My code is


I use PyTorch 1.12.1 in Google Colab.
Can anyone help me to solve this problem? Thank you very much.
@ptrblck @albanD can you help me?

Could you also check why retain_graph is used in your code?

When I don’t use retain_graph=True, I get this error:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

In that case, try to fix this issue, as it seems your computation graph is growing in each iteration, such that the backward pass would try to compute the gradients for multiple iterations.
This could happen e.g. if the input to your model somehow depends on the output from the previous iteration.
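
For reference, a minimal sketch of that failure mode and the usual fix (an illustrative toy model, not the poster's code): detach whatever is carried over to the next iteration so the graph does not keep growing.

import torch
import torch.nn as nn

# Toy setup where the next input depends on the previous output.
model = nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

prev_out = torch.zeros(8, 4)
for step in range(5):
    x = torch.randn(8, 4) + prev_out          # input built from last output
    out = model(x)
    loss = criterion(out, torch.zeros_like(out))

    optimizer.zero_grad()
    loss.backward()                            # no retain_graph needed
    optimizer.step()

    prev_out = out.detach()                    # cut the graph between iterations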

Try moving all the optimizer steps to the very end, after all the backward calls have completed.
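
A minimal sketch of that pattern, assuming two losses that share parts of the same graph (the names are illustrative):

import torch
import torch.nn as nn

net_a = nn.Linear(4, 4)
net_b = nn.Linear(4, 1)
opt_a = torch.optim.SGD(net_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(net_b.parameters(), lr=0.1)

x = torch.randn(8, 4)
feat = net_a(x)
loss_a = feat.pow(2).mean()
loss_b = net_b(feat).mean()

opt_a.zero_grad()
opt_b.zero_grad()
loss_a.backward(retain_graph=True)  # graph is reused by loss_b below
loss_b.backward()                   # all backwards finish first...
opt_a.step()                        # ...then all optimizer steps
opt_b.step()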

See these two similar issues:


Now it works well when I move all the optimizer steps after all the backward calls. Thank you for your suggestion.

Hi raharth,
In case you or anyone else are still struggling with this problem, I would like to post the solution I just figured out: intentionally disable version checking by adding saved_tensors_hooks. This works because, according to the source of the version checking, it is implemented as an unpack hook and is skipped if any other hooks are defined.

A minimal demo would be

import torch

a = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, 1)

def pack_hook(x):
    print("Packing")
    return x

def unpack_hook(x):
    print("Unpacking")
    return x

# if True:  # use this line instead of the hooks to see the version-check error
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    c = a * b
    d = c.sum()
    b[0][0] = 1.  # in-place change to a tensor that was saved for backward
    d.backward()
print(a.grad)

In my case, I have one thread that is using the model being trained by another thread. I want to avoid copying the model between threads since it’s time-consuming for my task, and only one thread needs to be accurate.

I’m facing the same problem. I tried the solutions suggested above, but they didn’t work. I have a number N of agents, and each agent owns an independent actor and critic. Each agent has different states according to the label given to it.
###############

all_agents = []

all_agents.append(Agent(actor_dims, critic_dims))

for agent_idx, agent in enumerate(all_agents):
    i = agent.agent_label
    critic_value_ = agent.target_critic.forward(states_[i], new_actions_cluster[i]).flatten()

    critic_value = agent.critic.forward(states[i], old_actions_cluster[i]).flatten()

    target = rewards[:, agent_idx] + agent.gamma * critic_value_

    critic_loss = F.mse_loss(critic_value.float(), target.float())

    agent.critic.optimizer.zero_grad()
    critic_loss.backward(retain_graph=True)

    actor_loss = agent.critic.forward(states[i], mu_cluster[i]).flatten()
    actor_loss = -(T.mean(actor_loss))

    agent.actor.optimizer.zero_grad()
    actor_loss.backward()

    agent.critic.optimizer.step()
    agent.actor.optimizer.step()
 
#################################

[W …\torch\csrc\autograd\python_anomaly_mode.cpp:85] Warning: Error detected in AddmmBackward. No forward pass information available. Enable detect anomaly during forward pass for more information. (function _print_stack)
Traceback (most recent call last):

allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 3]], which is output 0 of TBackward, is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I’m facing the same problem. Could you help me please?

Same as before: [Solved][Pytorch1.5] RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation - #37 by ptrblck

Could you explain why retain_graph=True is used?

When I remove retain_graph=True, it gives another error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

I modified the code as follows, and it is working, but I’m not sure whether this approach is correct:

all_agents = []

all_agents.append(Agent(actor_dims, critic_dims))

for agent_idx, agent in enumerate(all_agents):
    i = agent.agent_label
    critic_value_ = agent.target_critic.forward(states_[i], new_actions_cluster[i]).flatten()

    critic_value = agent.critic.forward(states[i], old_actions_cluster[i]).flatten()

    target = rewards[:, agent_idx] + agent.gamma * critic_value_

    agent.critic_loss = F.mse_loss(critic_value.float(), target.float())
    agent.critic_loss.backward(retain_graph=True)

for agent_idx, agent in enumerate(all_agents):
    agent.critic.optimizer.zero_grad()
for agent_idx, agent in enumerate(all_agents):
    agent.critic.optimizer.zero_grad()

for agent_idx, agent in enumerate(all_agents):
    i = agent.agent_label
    agent.actor_loss = agent.critic.forward(states[i], mu_cluster[i], typ).flatten()
    agent.actor_loss = -T.mean(agent.actor_loss)
    agent.actor_loss.backward(retain_graph=True)

for agent_idx, agent in enumerate(all_agents):
    agent.actor.optimizer.step()
    # agent.actor.optimizer.zero_grad()

for agent_idx, agent in enumerate(all_agents):
    agent.actor.optimizer.zero_grad()

I met this error when I was doing PPO (Proximal Policy Optimization). I solved it by defining a target network and a main network. At the beginning, the target network has the same parameter values as the main network. During training, the target network parameters are assigned to the main network every fixed number of time steps. The details can be found in this code: https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO_colab.ipynb
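
For anyone looking for the shape of that pattern, here is a minimal sketch (an illustrative policy module, not the linked repository's classes): keep a frozen copy of the network for acting/evaluation and sync its weights from the trained network every fixed number of updates, so the rollout side never shares a graph with the update side.

import copy
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

policy_old = copy.deepcopy(policy)              # frozen copy used for rollouts
for p in policy_old.parameters():
    p.requires_grad_(False)                     # never part of the backward graph

# ... train `policy` as usual, then every K updates sync the copy:
policy_old.load_state_dict(policy.state_dict())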

I am facing the same problem: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1024]] is at version 12; expected version 11 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

for epoch_idx in range(n_epochs):
    len_dataloader = min(len(Source_loader), len(train_loader))
    dl_source_iter = iter(Source_loader)
    dl_target_iter = iter(train_loader)
    model_G.train()
    model_c.train()
    train_loss1 = 0.0
    train_loss2 = 0.0
    train_loss3 = 0.0
    Source_accuracy = 0.0
    Domain_accuracy = 0.0
    for batch_idx in range(len_dataloader):
        img_s, y_s = next(dl_source_iter)
        img_t, y_t = next(dl_target_iter)
        batch_size = len(y_s)
        batch_size1 = len(y_t)

        feat_h = feat_l[0:batch_size, :, :, :]
        feat_h_kl = feat_h.reshape(-1, 1024)
        y_s = y_s.unsqueeze(1)
        y_t = y_t.unsqueeze(1)

        # a = np.random.normal(0, 1, (len(y_s)*14, 256)).astype("float32")
        # zn = Variable(torch.tensor(np.random.normal(0, 1, (len(y_s)*14, 1024)).astype("float32")))
        img_s = Variable(img_s)
        img_t = Variable(img_t)

        # UDA by Backpropagation

        opt_g.zero_grad()
        opt_c.zero_grad()
        feat_s = model_G(img_s)
        output_s = model_c.classifier(feat_s)
        output_loss = loss_fn_class(output_s, y_s)

        feat_s_kl = feat_s.view(-1, 1024)

        loss_kld_s = F.kl_div(F.log_softmax(feat_s_kl), F.softmax(feat_h_kl))

        loss1 = output_loss + loss_kld_s
        loss1.backward(retain_graph=True)
        opt_g.step()
        opt_c.step()
        opt_g.zero_grad()
        opt_c.zero_grad()

        # # loss_s = loss_fn_class(output_s, y_s)
        # X_t_train, X_t_domain, y_t_train, y_t_domain1 = train_test_split(img_t, y_t,
        #                                     random_state=104,
        #                                     test_size=0.70,
        #                                     shuffle=True)
        # feat_t_output = model_G(X_t_train)

        feat_t_output = model_G(img_t)
        output_t = model_c.classifier(feat_t_output)
        loss_output = loss_fn_class(output_t, y_t)
        feat_t_kl = feat_t_output.view(-1, 1024)

        loss_kld_t = F.kl_div(F.log_softmax(feat_t_kl), F.softmax(feat_h_kl))
        # feat_zn_recon = model_G.decode(feat_h)
        # feat_t_recon = model_G(img_t, is_deconv=True)
        # feat_h1 = model_G(zn)

        # loss_dal = criterionDAL(feat_t_recon, feat_zn_recon)

        loss2 = loss_output + loss_kld_t
        loss2.backward()
        opt_g.step()
        opt_c.step()
        opt_g.zero_grad()
        opt_c.zero_grad()

@ptrblck and @albanD, can you please help me with the two-step loss calculation here?

As already asked in this thread: could you explain why retain_graph=True is used in your code, as it is often applied as a workaround for another error, which then causes the invalid inplace operation.

I have a similar issue and haven’t been able to debug it. I need retain_graph=True because I am training a MobileVOD model. It combines a MobileNet base net with a bottleneck LSTM, which basically introduces a temporal element to object detection. I’ll paste my code below; any help is much appreciated.

Training…

"""Script for training the MobileVOD with 1 Bottleneck Bottleneck LSTM layers. As in mobilenet, here we use depthwise seperable convolutions 
for reducing the computation without affecting accuracy much. Model is trained on Imagenet VID 2015 dataset.
Here we unroll LSTM for 10 steps and gives 10 consecutive frames of video as input.
Few global variables defined here are explained:
Global Variables
----------------
args : dict
	Has all the options for changing various variables of the model as well as hyper-parameters for training.
dataset : VIDDataset (torch.utils.data.Dataset, For more info see datasets/vid_dataset.py)
optimizer : optim.RMSprop
scheduler : CosineAnnealingLR, MultiStepLR (torch.optim.lr_scheduler)
config : mobilenetv1_ssd_config (See config/mobilenetv1_ssd_config.py for more info, where you can change input size and ssd priors)
loss : MultiboxLoss (See network/multibox_loss.py for more info)
"""
import argparse
import os
import logging
import sys
import itertools

import torch
from torch.utils.data import DataLoader, ConcatDataset
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

from torch.utils.tensorboard import SummaryWriter

from utils.misc import str2bool, Timer, store_labels
from network.mvod_bottleneck_lstm1 import MobileVOD, SSD, MobileNetV1, MatchPrior
from datasets.vid_dataset_new import VIDDataset
from network.multibox_loss import MultiboxLoss
from config import mobilenetv1_ssd_config
from dataloaders.data_preprocessing import TrainAugmentation, TestTransform

parser = argparse.ArgumentParser(
	description='Mobile Video Object Detection (Bottleneck LSTM) Training With Pytorch')

parser.add_argument('--datasets', help='Dataset directory path')
parser.add_argument('--cache_path', help='Cache directory path')
parser.add_argument('--freeze_net', action='store_true',
					help="Freeze all the layers except the prediction head.")
parser.add_argument('--width_mult', default=1.0, type=float,
					help='Width Multiplier')

# Params for SGD
parser.add_argument('--lr', '--learning-rate', default=0.0003, type=float,
					help='initial learning rate')
parser.add_argument('--momentum', default=0.9, type=float,
					help='Momentum value for optim')
parser.add_argument('--weight_decay', default=5e-4, type=float,
					help='Weight decay for SGD')
parser.add_argument('--gamma', default=0.1, type=float,
					help='Gamma update for SGD')
parser.add_argument('--base_net_lr', default=None, type=float,
					help='initial learning rate for base net.')
parser.add_argument('--ssd_lr', default=None, type=float,
					help='initial learning rate for the layers not in base net and prediction heads.')


# Params for loading pretrained basenet or checkpoints.
parser.add_argument('--pretrained', help='Pre-trained model')
parser.add_argument('--resume', default=None, type=str,
					help='Checkpoint state_dict file to resume training from')

# Scheduler
parser.add_argument('--scheduler', default="multi-step", type=str,
					help="Scheduler for SGD. It can one of multi-step and cosine")

# Params for Multi-step Scheduler
parser.add_argument('--milestones', default="80,100", type=str,
					help="milestones for MultiStepLR")

# Params for Cosine Annealing
parser.add_argument('--t_max', default=120, type=float,
					help='T_max value for Cosine Annealing Scheduler.')

# Train params
parser.add_argument('--batch_size', default=1, type=int,
					help='Batch size for training')
parser.add_argument('--num_epochs', default=200, type=int,
					help='the number of epochs')
# this was originally 4, set to 0 - https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206
parser.add_argument('--num_workers', default=0, type=int,
					help='Number of workers used in dataloading')
parser.add_argument('--validation_epochs', default=5, type=int,
					help='the number of epochs')
parser.add_argument('--debug_steps', default=100, type=int,
					help='Set the debug log output frequency.')
parser.add_argument('--sequence_length', default=10, type=int,
					help='sequence_length of video to unfold')
parser.add_argument('--use_cuda', default=True, type=str2bool,
					help='Use CUDA to train model')

parser.add_argument('--checkpoint_folder', default='models/',
					help='Directory for saving checkpoint models')


logging.basicConfig(stream=sys.stdout, level=logging.INFO,
					format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
args = parser.parse_args()
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() and args.use_cuda else "cpu")
print('DEVICE',DEVICE)

# tensorboard
writer = SummaryWriter()

# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 1, 1]] is at version 5; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
torch.autograd.set_detect_anomaly(True)

if args.use_cuda and torch.cuda.is_available():
	torch.backends.cudnn.benchmark = True
	logging.info("Use Cuda.")

def train(loader, net, criterion, optimizer, device, debug_steps=100, epoch=-1, sequence_length=10):
	""" Train model
	Arguments:
		net : object of MobileVOD class
		loader : validation data loader object
		criterion : Loss function to use
		device : device on which computation is done
		optimizer : optimizer to optimize model
		debug_steps : number of steps after which model needs to debug
		sequence_length : unroll length of model
		epoch : current epoch number
	"""
	net.train(True)
	running_loss = 0.0
	running_regression_loss = 0.0
	running_classification_loss = 0.0
	for i, data in enumerate(loader):
		images, boxes, labels = data
		for image, box, label in zip(images, boxes, labels):
			image = image.to(device)
			box = box.to(device)
			label = label.to(device)

			optimizer.zero_grad()
			confidence, locations = net(image)
			regression_loss, classification_loss = criterion(confidence, locations, label, box)  # TODO CHANGE BOXES
			loss = regression_loss + classification_loss
			loss.backward(retain_graph=True)
			optimizer.step()

			running_loss += loss.item()
			running_regression_loss += regression_loss.item()
			running_classification_loss += classification_loss.item()
		net.detach_hidden()
		if i and i % debug_steps == 0:
			avg_loss = running_loss / (debug_steps*sequence_length)
			avg_reg_loss = running_regression_loss / (debug_steps*sequence_length)
			avg_clf_loss = running_classification_loss / (debug_steps*sequence_length)
			logging.info(
				f"Epoch: {epoch}, Step: {i}, " +
				f"Average Loss: {avg_loss:.4f}, " +
				f"Average Regression Loss {avg_reg_loss:.4f}, " +
				f"Average Classification Loss: {avg_clf_loss:.4f}"
			)
			running_loss = 0.0
			running_regression_loss = 0.0
			running_classification_loss = 0.0
	net.detach_hidden()


def val(loader, net, criterion, device):
	""" Validate model
	Arguments:
		net : object of MobileVOD class
		loader : validation data loader object
		criterion : Loss function to use
		device : device on which computation is done
	Returns:
		loss, regression loss, classification loss
	"""
	net.eval()
	running_loss = 0.0
	running_regression_loss = 0.0
	running_classification_loss = 0.0
	num = 0
	for _, data in enumerate(loader):
		images, boxes, labels = data
		for image, box, label in zip (images, boxes, labels):
			image = image.to(device)
			box = box.to(device)
			label = label.to(device)
			num += 1

			with torch.no_grad():
				confidence, locations = net(image)
				regression_loss, classification_loss = criterion(confidence, locations, label, box)
				loss = regression_loss + classification_loss

			running_loss += loss.item()
			running_regression_loss += regression_loss.item()
			running_classification_loss += classification_loss.item()
		net.detach_hidden()
	return running_loss / num, running_regression_loss / num, running_classification_loss / num

def initialize_model(net):
	""" Loads learned weights from pretrained checkpoint model
	Arguments:
		net : object of MobileVOD
	"""
	if args.pretrained:
		logging.info("Loading weights from pretrained netwok")
		pretrained_net_dict = torch.load(args.pretrained)
		model_dict = net.state_dict()
		# 1. filter out unnecessary keys
		pretrained_dict = {k: v for k, v in pretrained_net_dict.items() if k in model_dict and model_dict[k].shape == pretrained_net_dict[k].shape}
		# 2. overwrite entries in the existing state dict
		model_dict.update(pretrained_dict)
		net.load_state_dict(model_dict)

if __name__ == '__main__':
	timer = Timer()

	logging.info(args)
	config = mobilenetv1_ssd_config	#config file for priors etc.
	train_transform = TrainAugmentation(config.image_size, config.image_mean, config.image_std)
	target_transform = MatchPrior(config.priors, config.center_variance,
								  config.size_variance, 0.5)

	test_transform = TestTransform(config.image_size, config.image_mean, config.image_std)

	logging.info("Prepare training datasets.")
	train_dataset = VIDDataset(args.datasets, args.cache_path, transform=train_transform,
								 target_transform=target_transform, batch_size=args.batch_size)
	label_file = os.path.join("models/", "vid-model-labels.txt")
	store_labels(label_file, train_dataset._classes_names)
	num_classes = len(train_dataset._classes_names)
	logging.info(f"Stored labels into file {label_file}.")
	logging.info("Train dataset size: {}".format(len(train_dataset)))
	train_loader = DataLoader(train_dataset, args.batch_size,
							  num_workers=args.num_workers,
							  shuffle=True)
	# logging.info("Prepare Validation datasets.")
	# val_dataset = VIDDataset(args.datasets, args.cache_path, transform=test_transform,
	# 							 target_transform=target_transform, is_val=True)
	# logging.info(val_dataset)
	# logging.info("validation dataset size: {}".format(len(val_dataset)))

	# val_loader = DataLoader(val_dataset, args.batch_size,
	# 						num_workers=args.num_workers,
	# 						shuffle=False)

	logging.info("Build network.")
	pred_enc = MobileNetV1(num_classes=num_classes, alpha = args.width_mult)
	pred_dec = SSD(num_classes=num_classes, batch_size = args.batch_size, alpha = args.width_mult, is_test=False)
	if args.resume is None:
		net = MobileVOD(pred_enc, pred_dec)
		initialize_model(net)
	else:
		net = MobileVOD(pred_enc, pred_dec)
		print("Updating weights from resume model")
		net.load_state_dict(
			torch.load(args.resume,
					   map_location=lambda storage, loc: storage))

	min_loss = -10000.0
	last_epoch = -1

	base_net_lr = args.base_net_lr if args.base_net_lr is not None else args.lr
	ssd_lr = args.ssd_lr if args.ssd_lr is not None else args.lr
	if args.freeze_net:
		logging.info("Freeze net.")
		for param in pred_enc.parameters():
			param.requires_grad = False
		net.pred_decoder.conv13.requires_grad = False

	net.to(DEVICE)

	criterion = MultiboxLoss(config.priors, iou_threshold=0.5, neg_pos_ratio=10,
							 center_variance=0.1, size_variance=0.2, device=DEVICE)
	optimizer = torch.optim.RMSprop([{'params': [param for name, param in net.pred_encoder.named_parameters()], 'lr': base_net_lr},
		{'params': [param for name, param in net.pred_decoder.named_parameters()], 'lr': ssd_lr},], lr=args.lr,
								weight_decay=args.weight_decay, momentum=args.momentum)
	logging.info(f"Learning rate: {args.lr}, Base net learning rate: {base_net_lr}, "
				 + f"Extra Layers learning rate: {ssd_lr}.")

	# if args.scheduler == 'multi-step':
	# 	logging.info("Uses MultiStepLR scheduler.")
	# 	milestones = [int(v.strip()) for v in args.milestones.split(",")]
	# 	scheduler = MultiStepLR(optimizer, milestones=milestones,
	# 												 gamma=0.1, last_epoch=last_epoch)
	# elif args.scheduler == 'cosine':
	# 	logging.info("Uses CosineAnnealingLR scheduler.")
	# 	scheduler = CosineAnnealingLR(optimizer, args.t_max, last_epoch=last_epoch)
	# else:
	# 	logging.fatal(f"Unsupported Scheduler: {args.scheduler}.")
	# 	parser.print_help(sys.stderr)
	# 	sys.exit(1)

	print('net', net)

	output_path = os.path.join(args.checkpoint_folder, f"lstm1")
	if not os.path.exists(output_path):
		os.makedirs(os.path.join(output_path))
	logging.info(f"Start training from epoch {last_epoch + 1}.")
	for epoch in range(last_epoch + 1, args.num_epochs):
		#scheduler.step()
		train(train_loader, net, criterion, optimizer,
			  device=DEVICE, debug_steps=args.debug_steps, epoch=epoch, sequence_length=args.sequence_length)
		
		if epoch % args.validation_epochs == 0 or epoch == args.num_epochs - 1:
			val_loss, val_regression_loss, val_classification_loss = val(val_loader, net, criterion, DEVICE)
			logging.info(
				f"Epoch: {epoch}, " +
				f"Validation Loss: {val_loss:.4f}, " +
				f"Validation Regression Loss {val_regression_loss:.4f}, " +
				f"Validation Classification Loss: {val_classification_loss:.4f}"
			)
			model_path = os.path.join(output_path, f"WM-{args.width_mult}-Epoch-{epoch}.pth")
			torch.save(net.state_dict(), model_path)
			logging.info(f"Saved model {model_path}")

			# log to tensorboard
			writer.add_scalar("val_loss/train", val_loss, epoch)
			writer.add_scalar("val_regression_loss/train", val_regression_loss, epoch)
			writer.add_scalar("val_classification_loss/train", val_classification_loss, epoch)
			writer.add_scalar("Learning rate", args.lr, epoch)
			writer.add_scalar("Base net learning rate", base_net_lr, epoch)
			writer.add_scalar("Extra Layers learning rate", ssd_lr, epoch)

Network…

#!/usr/bin/python3
"""Script for creating basenet with one Bottleneck LSTM layer after conv 13.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from typing import List, Tuple
from utils import box_utils
from collections import namedtuple
from collections import OrderedDict
from torch.autograd import Variable
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import logging


def SeperableConv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0):
	"""Replace Conv2d with a depthwise Conv2d and Pointwise Conv2d.
	Arguments:
		in_channels : number of channels of input
		out_channels : number of channels of output
		kernel_size : kernel size for depthwise convolution
		stride : stride for depthwise convolution
		padding : padding for depthwise convolution
	Returns:
		object of class torch.nn.Sequential
	"""
	return nn.Sequential(
		nn.Conv2d(in_channels=int(in_channels), out_channels=int(in_channels), kernel_size=kernel_size,
			   groups=int(in_channels), stride=stride, padding=padding),
		nn.ReLU6(),
		nn.Conv2d(in_channels=int(in_channels), out_channels=int(out_channels), kernel_size=1),
	)

def conv_bn(inp, oup, stride):
	"""3x3 conv with batchnorm and relu
	Arguments:
		inp : number of channels of input
		oup : number of channels of output
		stride : stride for depthwise convolution
	Returns:
		object of class torch.nn.Sequential
	"""
	return nn.Sequential(
				nn.Conv2d(int(inp), int(oup), 3, stride, 1, bias=False),
				nn.BatchNorm2d(int(oup)),
				nn.ReLU6(inplace=True)
			)
def conv_dw(inp, oup, stride):
	"""Replace Conv2d with a depthwise Conv2d and Pointwise Conv2d having batchnorm and relu layers in between.
	Here kernel size is fixed at 3.
	Arguments:
		inp : number of channels of input
		oup : number of channels of output
		stride : stride for depthwise convolution
	Returns:
		object of class torch.nn.Sequential
	"""
	return nn.Sequential(
				nn.Conv2d(int(inp), int(inp), 3, stride, 1, groups=int(inp), bias=False),
				nn.BatchNorm2d(int(inp)),
				nn.ReLU6(inplace=True),

				nn.Conv2d(int(inp), int(oup), 1, 1, 0, bias=False),
				nn.BatchNorm2d(int(oup)),
				nn.ReLU6(inplace=True),
			)
class MatchPrior(object):
	"""Matches priors based on the SSD prior config
	Arguments:
		center_form_priors : priors generated based on specs and image size in config file
		center_variance : a float used to change the scale of center
		size_variance : a float used to change the scale of size
		iou_threshold : a float value of the threshold of IOU
	"""
	def __init__(self, center_form_priors, center_variance, size_variance, iou_threshold):
		self.center_form_priors = center_form_priors
		self.corner_form_priors = box_utils.center_form_to_corner_form(center_form_priors)
		self.center_variance = center_variance
		self.size_variance = size_variance
		self.iou_threshold = iou_threshold

	def __call__(self, gt_boxes, gt_labels):
		"""
		Arguments:
			gt_boxes : ground truth boxes
			gt_labels : ground truth labels
		Returns:
			locations of form (batch_size, num_priors, 4) and labels
		"""
		if type(gt_boxes) is np.ndarray:
			gt_boxes = torch.from_numpy(gt_boxes)
		if type(gt_labels) is np.ndarray:
			gt_labels = torch.from_numpy(gt_labels)
		boxes, labels = box_utils.assign_priors(gt_boxes, gt_labels,
												self.corner_form_priors, self.iou_threshold)
		boxes = box_utils.corner_form_to_center_form(boxes)
		locations = box_utils.convert_boxes_to_locations(boxes, self.center_form_priors, self.center_variance, self.size_variance)
		return locations, labels

class BottleneckLSTMCell(nn.Module):
	""" Creates a LSTM layer cell
	Arguments:
		input_channels : variable used to contain value of number of channels in input
		hidden_channels : variable used to contain value of number of channels in the hidden state of LSTM cell
	"""
	def __init__(self, input_channels, hidden_channels):
		super(BottleneckLSTMCell, self).__init__()

		assert hidden_channels % 2 == 0

		self.input_channels = int(input_channels)
		self.hidden_channels = int(hidden_channels)
		self.num_features = 4
		self.W = nn.Conv2d(in_channels=self.input_channels, out_channels=self.input_channels, kernel_size=3, groups=self.input_channels, stride=1, padding=1)
		self.Wy  = nn.Conv2d(int(self.input_channels+self.hidden_channels), self.hidden_channels, kernel_size=1)
		self.Wi  = nn.Conv2d(self.hidden_channels, self.hidden_channels, 3, 1, 1, groups=self.hidden_channels, bias=False)  
		self.Wbi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbf = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbc = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbo = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.relu = nn.ReLU6()
		# self.Wci = None
		# self.Wcf = None
		# self.Wco = None
		logging.info("Initializing weights of lstm")
		self._initialize_weights()

	def _initialize_weights(self):
		"""
		Returns:
			initialized weights of the model
		"""
		for m in self.modules():
			if isinstance(m, nn.Conv2d):
				nn.init.xavier_uniform_(m.weight)
				if m.bias is not None:
					m.bias.data.zero_()
			elif isinstance(m, nn.BatchNorm2d):
				m.weight.data.fill_(1)
				m.bias.data.zero_()
			
	def forward(self, x, h, c):  # implemented as in the paper; the only difference is that Wbi, Wbf, Wbc & Wbo are computed all together in the paper
		"""
		Arguments:
			x : input tensor
			h : hidden state tensor
			c : cell state tensor
		Returns:
			output tensor after LSTM cell 
		"""
		x = self.W(x)
		y = torch.cat((x, h),1) #concatenate input and hidden layers
		i = self.Wy(y) #reduce to hidden layer size
		b = self.Wi(i)	#depth wise 3*3
		ci = torch.sigmoid(self.Wbi(b))
		cf = torch.sigmoid(self.Wbf(b))
		cc = cf * c + ci * self.relu(self.Wbc(b))
		co = torch.sigmoid(self.Wbo(b))
		ch = co * self.relu(cc)
		return ch, cc

	def init_hidden(self, batch_size, hidden, shape):
		"""
		Arguments:
			batch_size : an int variable having value of batch size while training
			hidden : an int variable having value of number of channels in hidden state
			shape : an array containing shape of the hidden and cell state 
		Returns:
			cell state and hidden state
		"""
		# if self.Wci is None:
		# 	self.Wci = Variable(torch.zeros(1, hidden, shape[0], shape[1])).cuda()
		# 	self.Wcf = Variable(torch.zeros(1, hidden, shape[0], shape[1])).cuda()
		# 	self.Wco = Variable(torch.zeros(1, hidden, shape[0], shape[1])).cuda()
		# else:
		# 	assert shape[0] == self.Wci.size()[2], 'Input Height Mismatched!'
		# 	assert shape[1] == self.Wci.size()[3], 'Input Width Mismatched!'
		return (Variable(torch.zeros(batch_size, hidden, shape[0], shape[1])).cuda(),
				Variable(torch.zeros(batch_size, hidden, shape[0], shape[1])).cuda()
				)

class BottleneckLSTM(nn.Module):
	def __init__(self, input_channels, hidden_channels, height, width, batch_size):
		""" Creates Bottleneck LSTM layer
		Arguments:
			input_channels : variable having value of number of channels of input to this layer
			hidden_channels : variable having value of number of channels of hidden state of this layer
			height : an int variable having value of height of the input
			width : an int variable having value of width of the input
			batch_size : an int variable having value of batch_size of the input
		Returns:
			Output tensor of LSTM layer
		"""
		super(BottleneckLSTM, self).__init__()
		self.input_channels = int(input_channels)
		self.hidden_channels = int(hidden_channels)
		self.cell = BottleneckLSTMCell(self.input_channels, self.hidden_channels)
		(h, c) = self.cell.init_hidden(batch_size, hidden=self.hidden_channels, shape=(height, width))
		self.hidden_state = h
		self.cell_state = c

	def forward(self, input):
		new_h, new_c = self.cell(input, self.hidden_state, self.cell_state)
		self.hidden_state = new_h
		self.cell_state = new_c
		return self.hidden_state

def crop_like(x, target):
	"""
	Arguments:
		x : a tensor whose shape has to be cropped
		target : a tensor whose shape has to assert on x
	Returns:
		x having same shape as target
	"""
	if x.size()[2:] == target.size()[2:]:
		return x
	else:
		height = target.size()[2]
		width = target.size()[3]
		crop_h = torch.FloatTensor([x.size()[2]]).sub(height).div(-2)
		crop_w = torch.FloatTensor([x.size()[3]]).sub(width).div(-2)
	# fixed indexing for PyTorch 0.4
	return F.pad(x, [int(crop_w.ceil()[0]), int(crop_w.floor()[0]), int(crop_h.ceil()[0]), int(crop_h.floor()[0])])


class MobileNetV1(nn.Module):
	def __init__(self, num_classes=1024, alpha=1):
		"""torch.nn.module for mobilenetv1 upto conv12
		Arguments:
			num_classes : an int variable having value of total number of classes
			alpha : a float used as width multiplier for channels of model
		"""
		super(MobileNetV1, self).__init__()
		# upto conv 12
		self.model = nn.Sequential(
			conv_bn(3, 32*alpha, 2),
			conv_dw(32*alpha, 64*alpha, 1),
			conv_dw(64*alpha, 128*alpha, 2),
			conv_dw(128*alpha, 128*alpha, 1),
			conv_dw(128*alpha, 256*alpha, 2),
			conv_dw(256*alpha, 256*alpha, 1),
			conv_dw(256*alpha, 512*alpha, 2),
			conv_dw(512*alpha, 512*alpha, 1),
			conv_dw(512*alpha, 512*alpha, 1),
			conv_dw(512*alpha, 512*alpha, 1),
			conv_dw(512*alpha, 512*alpha, 1),
			conv_dw(512*alpha, 512*alpha, 1),
			)
		logging.info("Initializing weights of base net")
		self._initialize_weights()
		#self.fc = nn.Linear(1024, num_classes)
	def _initialize_weights(self):
		"""
		Returns:
			initialized weights of the model
		"""
		for m in self.modules():
			if isinstance(m, nn.Conv2d):
				nn.init.xavier_uniform_(m.weight)
				if m.bias is not None:
					m.bias.data.zero_()
			elif isinstance(m, nn.BatchNorm2d):
				m.weight.data.fill_(1)
				m.bias.data.zero_()
			
	def forward(self, x):
		"""
		Arguments:
			x : a tensor which is used as input for the model
		Returns:
			a tensor which is output of the model 
		"""
		x = self.model(x)
		return x


class SSD(nn.Module):
	def __init__(self,num_classes, batch_size, alpha = 1, is_test=False, config = None, device = None):
		"""
		Arguments:
			num_classes : an int variable having value of total number of classes
			batch_size : an int variable having value of batch size
			alpha : a float used as width multiplier for channels of model
			is_test : a bool used to make the model ready for testing
			config : a dict containing all the configuration parameters 
		"""
		super(SSD, self).__init__()
		# Decoder
		self.is_test = is_test
		self.config = config
		self.num_classes = num_classes
		if device:
			self.device = device
		else:
			self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
		if is_test:
			self.config = config
			self.priors = config.priors.to(self.device)
		self.conv13 = conv_dw(512*alpha, 1024*alpha, 2) #not using conv14 as mentioned in paper
		self.bottleneck_lstm1 = BottleneckLSTM(input_channels=1024*alpha, hidden_channels=256*alpha, height=10, width=10, batch_size=batch_size)
		self.fmaps_1 = nn.Sequential(	
			nn.Conv2d(in_channels=int(256*alpha), out_channels=int(128*alpha), kernel_size=1),
			nn.ReLU6(inplace=True),
			SeperableConv2d(in_channels=128*alpha, out_channels=256*alpha, kernel_size=3, stride=2, padding=1),
		)
		self.fmaps_2 = nn.Sequential(	
			nn.Conv2d(in_channels=int(256*alpha), out_channels=int(64*alpha), kernel_size=1),
			nn.ReLU6(inplace=True),
			SeperableConv2d(in_channels=64*alpha, out_channels=128*alpha, kernel_size=3, stride=2, padding=1),
		)
		self.fmaps_3 = nn.Sequential(	
			nn.Conv2d(in_channels=int(128*alpha), out_channels=int(64*alpha), kernel_size=1),
			nn.ReLU6(inplace=True),
			SeperableConv2d(in_channels=64*alpha, out_channels=128*alpha, kernel_size=3, stride=2, padding=1),
		)
		self.fmaps_4 = nn.Sequential(	
			nn.Conv2d(in_channels=int(128*alpha), out_channels=int(32*alpha), kernel_size=1),
			nn.ReLU6(inplace=True),
			SeperableConv2d(in_channels=32*alpha, out_channels=64*alpha, kernel_size=3, stride=2, padding=1),
		)
		self.regression_headers = nn.ModuleList([
		SeperableConv2d(in_channels=512*alpha, out_channels=6 * 4, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=256*alpha, out_channels=6 * 4, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=256*alpha, out_channels=6 * 4, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=128*alpha, out_channels=6 * 4, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=128*alpha, out_channels=6 * 4, kernel_size=3, padding=1),
		nn.Conv2d(in_channels=int(64*alpha), out_channels=6 * 4, kernel_size=1),
		])

		self.classification_headers = nn.ModuleList([
		SeperableConv2d(in_channels=512*alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=256*alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=256*alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=128*alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
		SeperableConv2d(in_channels=128*alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
		nn.Conv2d(in_channels=int(64*alpha), out_channels=6 * num_classes, kernel_size=1),
		])

		logging.info("Initializing weights of SSD")
		self._initialize_weights()

	def _initialize_weights(self):
		"""
		Returns:
			initialized weights of the model
		"""
		for m in self.modules():
			if isinstance(m, nn.Conv2d):
				nn.init.xavier_uniform_(m.weight)
				if m.bias is not None:
					m.bias.data.zero_()
			elif isinstance(m, nn.BatchNorm2d):
				m.weight.data.fill_(1)
				m.bias.data.zero_()
			
	def compute_header(self, i, x): #ssd method to calculate headers
		"""
		Arguments:
			i : an int used to use particular classification and regression layer
			x : a tensor used as input to layers
		Returns:
			locations and confidences of the predictions
		"""
		confidence = self.classification_headers[i](x)
		confidence = confidence.permute(0, 2, 3, 1).contiguous()
		confidence = confidence.view(confidence.size(0), -1, self.num_classes)

		location = self.regression_headers[i](x)
		location = location.permute(0, 2, 3, 1).contiguous()
		location = location.view(location.size(0), -1, 4)

		return confidence, location

	def forward(self, x):
		"""
		Arguments:
			x : a tensor which is used as input for the model
		Returns:
			confidences and locations of predictions made by model during training
			or
			confidences and boxes of predictions made by model during testing
		"""
		confidences = []
		locations = []
		header_index=0
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		x = self.conv13(x)
		x = self.bottleneck_lstm1(x)
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		x = self.fmaps_1(x)
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		x = self.fmaps_2(x)
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		x = self.fmaps_3(x)
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		x = self.fmaps_4(x)
		confidence, location = self.compute_header(header_index, x)
		header_index += 1
		confidences.append(confidence)
		locations.append(location)
		confidences = torch.cat(confidences, 1)
		locations = torch.cat(locations, 1)
		
		if self.is_test: #while testing convert locations to boxes
			confidences = F.softmax(confidences, dim=2)
			boxes = box_utils.convert_locations_to_boxes(
				locations, self.priors, self.config.center_variance, self.config.size_variance
			)
			boxes = box_utils.center_form_to_corner_form(boxes)
			return confidences, boxes
		else:
			return confidences, locations

class MobileVOD(nn.Module):
	"""
		Module to join encoder and decoder of predictor model
	"""
	def __init__(self, pred_enc, pred_dec):
		"""
		Arguments:
			pred_enc : an object of MobilenetV1 class
			pred_dec : an object of SSD class
		"""
		super(MobileVOD, self).__init__()
		self.pred_encoder = pred_enc
		self.pred_decoder = pred_dec
		

	def forward(self, seq):
		"""
		Arguments:
			seq : a tensor used as input to the model  
		Returns:
			confidences and locations of predictions made by model
		"""
		x = self.pred_encoder(seq)
		confidences, locations = self.pred_decoder(x)
		return confidences , locations

	def detach_hidden(self):
		"""
		Detaches hidden state and cell state of all the LSTM layers from the graph
		"""
		self.pred_decoder.bottleneck_lstm1.hidden_state.detach_()
		self.pred_decoder.bottleneck_lstm1.cell_state.detach_()
		

My error…

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 1, 1]] is at version 5; expected version 4 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Process finished with exit code 1

Which references…

2023-04-18 12:22:57,045 - root - INFO - Start training from epoch 0.
/home/steven/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnConvolutionBackward0. Traceback of forward call that caused the error:

Thank you so much, it works

Thank you very much for providing the solution.
I was having the same issue and tried different fixes, such as replacing the in-place operations with out-of-place ones and enabling anomaly detection (no exceptions were raised).
It turned out the issue was in the ordering of the backward calls:

Rearranged from this:

def backward(self, unet_loss, dis_loss):
    dis_loss.backward(retain_graph = True)
    self.dis_optimizer.step()

    unet_loss.backward()
    self.unet_optimizer.step()

To this (which solved the issue):

def backward(self, unet_loss, dis_loss):
    dis_loss.backward(retain_graph=True)
    unet_loss.backward()

    self.dis_optimizer.step()
    self.unet_optimizer.step()
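
If I understand the version-counter error correctly, the ordering matters because optimizer.step() updates parameters in place; stepping dis_optimizer before unet_loss.backward() therefore modifies discriminator weights that the second backward still needs (when unet_loss contains a term that flows through the discriminator). A minimal sketch of that failure mode, with illustrative module names rather than the actual models:

import torch
import torch.nn as nn

unet = nn.Linear(4, 4)                    # stands in for the U-Net / generator
dis = nn.Linear(4, 1)                     # stands in for the discriminator
opt_unet = torch.optim.SGD(unet.parameters(), lr=0.1)
opt_dis = torch.optim.SGD(dis.parameters(), lr=0.1)

x = torch.randn(8, 4)
out = unet(x)
dis_loss = dis(out).mean()
unet_loss = -dis(out).mean()              # adversarial term flows back through dis

dis_loss.backward(retain_graph=True)
# opt_dis.step()                          # stepping here would bump the version of
                                          # dis.weight and make the next backward fail
unet_loss.backward()                      # needs dis's original weights
opt_dis.step()
opt_unet.step()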