Autograd backward() call not updating loss

Hi, I’m working on translating some style transfer Torch code to PyTorch and I’m running into some issues, probably because I’m not using autograd correctly. I’m able to run all the way through building my network as well as the optimization steps, but the loss never decreases (it just prints the same value for every iteration). I’m not particularly experienced with Torch and even less so with PyTorch, so chances are I’m missing something obvious.

I’ve built up my network (a frozen vgg19) into an nn.Sequential that looks like this:

Net
Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): StyleLoss(
    (gram): GramMatrix()
    (mse): MSELoss()
  )
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): ReLU()
  (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU()
  (8): StyleLoss(
    (gram): GramMatrix()
    (mse): MSELoss()
  )
  (9): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (10): ReLU()
  (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (12): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (13): ReLU()
  (14): StyleLoss(
    (gram): GramMatrix()
    (mse): MSELoss()
  )
  (15): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (16): ReLU()
  (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (18): ReLU()
  (19): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (20): ReLU()
  (21): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (22): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (23): ReLU()
  (24): StyleLoss(
    (gram): GramMatrix()
    (mse): MSELoss()
  )
  (25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (26): ReLU()
  (27): ContentLoss(
    (mse): MSELoss()
  )
  (28): ReLU()
  (29): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (30): ReLU()
  (31): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (33): ReLU()
  (34): StyleLoss(
    (gram): GramMatrix()
    (mse): MSELoss()
  )
)

With ContentLoss and StyleLoss defined as follows:

ContentLoss
class ContentLoss(torch.nn.Module):
	def __init__(self, strength, target, normalize):
		super(ContentLoss, self).__init__()
		self.strength = strength
		self.target = target
		self.normalize = normalize
		self.loss = 0
		self.mse = torch.nn.MSELoss()

	def forward(self, input):
		print(input.shape)
		print(self.target.shape)
		print(input.nelement())
		print(self.target.nelement())
		if input.nelement() == self.target.nelement():
			self.loss = self.mse.forward(input, self.target) * self.strength
		else:
			print('WARNING: Skipping content loss')
		output = input
		return output

	def backward(self, input, grad_output):
		if input.nelement() == self.target.nelement():
			grad_input = self.mse.backward(input, self.target)
		if self.normalize:
			grad_input.div(torch.norm(grad_input, 1) + 1e-8)
		grad_input.mul(self.strength)
		grad_input.add(grad_output)
		return grad_input
Style Loss
class GramMatrix(torch.nn.Module):
	def forward(self, input):
		a, b, c, d = input.shape  # a=batch size(=1)
		features = input.contiguous().view(a * b, c * d)  # resize F_XL into \hat F_XL
		G = torch.mm(features, features.t()).float()  # compute the gram product
		return G.div(a * b * c * d)

class StyleLoss(torch.nn.Module):
	def __init__(self, strength, target, normalize):
		super(StyleLoss, self).__init__()
		self.normalize = normalize
		self.strength = strength
		self.target = target
		self.loss = 0
		self.gram = GramMatrix()

		self.G = None
		self.mse = torch.nn.MSELoss()

	def forward(self, input):
		self.G = self.gram.forward(input)
		self.G.div(input.nelement())
		self.loss = self.mse.forward(self.G, self.target)
		self.loss = self.loss * self.strength
		output = input
		return output

	def backward(self, input, grad_output):
		dG = self.mse.backward(self.G, self.target)
		dG.div(input.nelement())
		grad_input = self.gram.backward(input, dG)
		if self.normalize:
			grad_input.div(torch.norm(grad_input, 1) + 1e-8)
		grad_input.mul(self.strength)
		grad_input.add(grad_output)
		return grad_input

And then finally I’m trying to run my optimization like this:

y = net.forward(img)
dy = torch.zeros(y.shape)

def closure():
	# optimizer.zero_grad()
	net.forward(img) #Torch code uses x for img here
	torch.autograd.backward(img, dy) # and here calls net.backwards instead of autograd
	loss = 0
	for mod in content_losses: loss += mod.loss
	for mod in temporal_losses: loss += mod.loss
	for mod in style_losses: loss += mod.loss
	# loss.requires_grad_(True)
	# loss.backward()
	print(loss.item())
	return loss

# Run optimization.
optimizer = torch.optim.LBFGS([img.requires_grad_()], lr=args.learning_rate, max_iter=args.num_iterations, tolerance_change=args.tol_loss_relative)
for iter in range(args.num_iterations): # this for loop is weird to me as I thought LBFGS handled this internally with the max_iter parameter...
	optimizer.step(closure)

img_out = np.moveaxis(img.detach().squeeze().numpy(),0,-1)
skimage.io.imsave(args.img_filename.format(0, 1), img_out)

The Torch code I’m following passes the image into the closure as x and uses that in place of the img variable inside the closure (see the comments in closure()); however, I wasn’t able to get that working with step(), since I needed to give it a reference to the function.

I’ve also tried calling backward() on the loss value directly in closure(), as can be seen in the commented-out lines near the bottom.

Either way, the loss prints the same value for every iteration and the final optimized picture does not look stylized (which I think is because effectively only one iteration of optimization is being run).

How can I make sure that the image is optimized correctly using the style & content losses I’ve defined?

I’m not quite sure how the xxx_losses are calculated.
From your model definition it looks like you’ve embedded the losses directly in the model, but I’m not sure where the variables come from.

Also, as a quick side note: you should call your model directly (model(input)) to perform the forward pass instead of model.forward(input), as this will make sure all hooks are properly handled.
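
Just to illustrate the difference, here is a minimal sketch (made-up module and hook, nothing from your code):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.register_forward_hook(lambda module, inp, out: print('forward hook ran'))

x = torch.randn(1, 4)
out1 = model(x)          # __call__ dispatches the registered hooks -> prints 'forward hook ran'
out2 = model.forward(x)  # calling forward() directly silently skips the hooks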


Thanks for the reply!

I’m building my network up in two steps that loop through the vgg model (although I’d like to be able to do this for many different models eventually), inserting layers and losses. First I insert all the layers from vgg (which are frozen after insertion) along with the StyleLosses, as these do not change throughout optimization.

First net building loop
content_layers = args.content_layers.split(",")
style_layers = args.style_layers.split(",")

next_content_i, next_style_i, next_temporal_i, current_layer_index = 0, 0, 0, 0
losses_indices, losses_type, style_losses = [], [], []  # collected as loss layers are earmarked below

cnn = torchvision.models.vgg19(pretrained=True)
net = torch.nn.Sequential()

block = 1
conv = 1
for i, layer in enumerate(cnn.features):
	if next_content_i < len(content_layers) or next_style_i < len(style_layers):
		name = 'uhhh'
		if isinstance(layer, torch.nn.Conv2d):
			name = 'conv'+str(block)+'_'+str(conv)
		elif isinstance(layer, torch.nn.ReLU):
			name = 'relu'+str(block)+'_'+str(conv)
			conv += 1
			layer = torch.nn.ReLU(inplace=False)
		elif isinstance(layer, torch.nn.MaxPool2d):
			name = 'pool'+str(block)
			if args.pooling == 'avg':
				assert(layer.padW == 0 and layer.padH == 0)
				kW, kH = layer.kW, layer.kH
				dW, dH = layer.dW, layer.dH
				avg_pool_layer = torch.nn.SpatialAveragePooling(kW, kH, dW, dH)
				net.add_module("pooling_"+str(block), avg_pool_layer)
			else:
				net.add_module("pooling_"+str(block), layer)
			block += 1
			conv = 1
		else:
			continue
		net.add_module(name, layer)
		current_layer_index = current_layer_index + 1
		if next_content_i < len(content_layers) and name == content_layers[next_content_i]:
			print("Earmarking content loss "+str(next_content_i+1)+": "+name)
			losses_indices.append(current_layer_index)
			losses_type.append('content_'+name)
			next_content_i = next_content_i + 1
		if next_style_i < len(style_layers) and name == style_layers[next_style_i]:
			print("Setting up style loss "+str(next_style_i+1)+": "+name)
			gram = GramMatrix()
			target = []
			for s in range(len(style_images)):
				target_features = net.forward(style_images[s]).detach()
				target_i = gram.forward(target_features)
				target_i.div(target_features.nelement())
				target_i.mul(style_blend_weights[s])
				if s == 0:
					target = target_i
				else:
					target.append(target_i)
			loss_module = StyleLoss(args.style_weight, target, args.normalize_gradients)
			net.add_module("style_loss_"+name, loss_module)
			current_layer_index = current_layer_index + 1
			style_losses.append(loss_module)
			next_style_i = next_style_i + 1

del cnn
for module in net.children():
    if isinstance(module, torch.nn.Conv2d):
        # remove these, not used, but uses gpu memory
        module.gradWeight = None
        module.gradBias = None

for param in net.parameters():
    param.requires_grad = False

Afterwards I want to loop through many different content images, so I insert the content loss into the network, run the optimization, remove it, and repeat for each image.

Content loss insertion
content_losses, prev_plus_flow_losses = [], []
additional_layers = 0

for i in range(len(losses_indices)):
	if losses_type[i].startswith('content'):
		content_loss = get_content_loss_module(net, losses_indices[i] + additional_layers, content_image, args)
		# insert content
		index = losses_indices[i]+additional_layers
		new_modules = list(net.children())[:index] + [content_loss] + list(net.children())[index+1:]
		net = torch.nn.Sequential(*new_modules)

		additional_layers = additional_layers + 1
		content_losses.append(content_loss)

def get_content_loss_module(net, layer_idx, target_img, args):
	tmpNet = torch.nn.Sequential()
	for i, layer in enumerate(net.children()):
		if i == layer_idx: break
		tmpNet.add_module("temp"+str(i), layer)
	target = tmpNet.forward(target_img)
	loss_module = ContentLoss(args.content_weight, target, args.normalize_gradients)
	return loss_module

run_optimization(args, net, content_losses, style_losses, temporal_losses, img, mod_idx, run)

for i,l in enumerate(losses_type):
    if l.startswith('content'):
        additional_layers = additional_layers - 1
        index = losses_indices[i]+additional_layers
        new_modules = list(net.children())[:index] + list(net.children())[index+1:]
        net = torch.nn.Sequential(*new_modules)

Regarding replacing model.forward() calls: I’ve tried this in the optimization part but I’m still running into the same problem: the printed loss value never changes. Is the model(input) call better in all situations? Should I replace all model.forward() calls throughout, or only in the optimization closure?

N.B. there are some references to temporal losses throughout the code; these are basically slightly modified content losses but are not being inserted yet, so temporal_losses is just an empty list. They should eventually be handled in a similar way to the content_losses: inserted for each individual content image.

What kind of error do you get if you call loss.backward() in your closure?

You should always use the direct model call instead of forward() unless there is a reason to avoid hooks being called (which is probably a very unusual use case).

I don’t get an error; the loss just never changes and it saves the same picture over and over again.

I’ve decided to rewrite the code without the custom losses inserted into the network and just calculate them in place (this is a bit more PyTorch-y?). I’ll report back in a few days…

Alright, that didn’t turn out to be so hard, but I’m still running into the exact same problem! The loss is the same every single time. To me it seems like the backward() call just isn’t updating anything in the pastiche image, even though I’ve set it to requires_grad.

I’ve now changed the building of the network to just insert the modules as read from the vgg19 model without any losses. I’ve wrapped it in a custom Sequential class that returns a list of features to use for calculating the losses when optimizing.

build_net()
content_layers = args.content_layers.split(",")
style_layers = args.style_layers.split(",")

modules_dict = OrderedDict()
feature_list = []

next_content_i, next_style_i = 0, 0

block = 1
conv = 1
for i, layer in enumerate(cnn.features):
    if next_content_i < len(content_layers) or next_style_i < len(style_layers):
        name = 'uhhh'
        if isinstance(layer, torch.nn.Conv2d):
            name = 'conv'+str(block)+'_'+str(conv)
        elif isinstance(layer, torch.nn.ReLU):
            name = 'relu'+str(block)+'_'+str(conv)
            layer = torch.nn.ReLU(inplace=False)
            conv += 1
        elif isinstance(layer, torch.nn.MaxPool2d):
            name = 'pool_'+str(block)
            if args.pooling == 'avg':
                assert(layer.padW == 0 and layer.padH == 0)
                kW, kH = layer.kW, layer.kH
                dW, dH = layer.dW, layer.dH
                avg_pool_layer = torch.nn.SpatialAveragePooling(kW, kH, dW, dH)
                layer = avg_pool_layer
                block += 1
                conv = 1
        else:
            continue

        modules_dict[name] = layer

        if next_content_i < len(content_layers) and name == content_layers[next_content_i]:
            print("Earmarking content loss "+str(next_content_i+1)+": "+name)
            feature_list.append('content_'+name)
            next_content_i = next_content_i + 1

        if next_style_i < len(style_layers) and name == style_layers[next_style_i]:
            print("Earmarking style loss "+str(next_style_i+1)+": "+name)
            feature_list.append('style_'+name)
            next_style_i = next_style_i + 1

return SelectiveSequential(feature_list, modules_dict)

class SelectiveSequential(torch.nn.Module):
	def __init__(self, to_select, modules_dict):
		super(SelectiveSequential, self).__init__()
		for key, module in modules_dict.items():
			self.add_module(key, module)
		self._to_select = to_select

	def forward(self, x, layers_to_return):
		layers = [[] for i in range(len(layers_to_return))]
		for name, module in self._modules.iteritems():
			x = module(x)
			for i,l in enumerate(layers_to_return):
				if l+"_"+name in self._to_select:
					layers[i].append(x)
		return layers if len(layers_to_return) > 1 else layers[0]

def gram_matrix(input):
	a, b, c, d = input.shape
	features = input.contiguous().view(a * b, c * d)
	G = torch.mm(features, features.t()).float()
	return G.div(a * b * c * d)

My optimization now looks like this:

cnn = torchvision.models.vgg19(pretrained=True)
net = utils.build_net(args, cnn)
del cnn
for module in net.children():
    if isinstance(module, torch.nn.Conv2d):
        # remove these, not used, but uses gpu memory
        module.gradWeight = None
        module.gradBias = None
for param in net.parameters():
    param.requires_grad = False

style_images = utils.get_style_images(args, 1)
file_name = args.content_filename.format(mod_idx)
content_image = skimage.io.imread(file_name)
content_image = skimage.transform.rescale(content_image, scale_current)
content_image = utils.match_color(content_image, np.moveaxis(style_images[0].cpu().squeeze().numpy(), 0, -1))
content_image = torch.from_numpy(content_image.reshape(1, content_image.shape[2], content_image.shape[0], content_image.shape[1])).float()
img = content_image.clone()

# get content & flow weighted features
content_features = net(img, ['content'])
f_xc_c = []
for m in range(len(content_features)):
    f_xc_c.append(torch.autograd.Variable(content_features[m].data, requires_grad=False))

# get style features
gram_style = None
for i, style_image in enumerate(style_images):
    style_features = net(style_image, ['style'])
    target_i = [utils.gram_matrix(y) for y in style_features]
    target_i = [t / style_features[n].nelement() for n,t in enumerate(target_i)]
    target_i = [t * style_blend_weights[i] for t in target_i]
    if i == 0:
        gram_style = target_i
    else:
        gram_style = [sum(x) for x in zip(gram_style, target_i)]

# init optimizer
pastiche = torch.autograd.Variable(img.data, requires_grad=True)
optimizer = torch.optim.Adam([pastiche], lr=args.learning_rate)
mse_loss = torch.nn.MSELoss()

# optimize the images
for e in range(args.num_iterations):
    optimizer.zero_grad()
    c_feat, s_feat = net(pastiche, ['content','style'])
    content_loss = 0.
    for m in range(len(c_feat)):
        content_loss += args.content_weight * mse_loss(c_feat[m], f_xc_c[m])

    style_loss = 0.
    for m in range(len(s_feat)):
        gram_y = utils.gram_matrix(s_feat[m])
        gram_s = torch.autograd.Variable(gram_style[m].data, requires_grad=False)
        style_loss += args.style_weight * mse_loss(gram_y, gram_s)

    total_loss = content_loss + style_loss
    loss = torch.autograd.Variable(total_loss.data, requires_grad=True)
    loss.backward()
    optimizer.step()
    print(total_loss.data.cpu().numpy())

How can I get pastiche to be updated? Aren’t the calls to loss.backward() and/or optimizer.step() supposed to handle this?

Yes, the backward call and the optimizer are supposed to update your input image.
However, one condition is that you don’t detach the computation graph.
This happens, e.g., if you use some numpy operations on your data, which autograd cannot trace. It also happens if you re-wrap variables, as seems to be the case for your loss.
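
As a quick illustration of the numpy case (made-up names, not from your code), a single round trip through numpy is enough to break the graph:

import torch

x = torch.randn(4, requires_grad=True)
y = x * 2                          # still attached to the graph (y.grad_fn is set)
y_np = y.detach().numpy()          # once you leave torch, autograd cannot trace the ops
z = torch.from_numpy(y_np * 3.0)   # back to a tensor, but z.grad_fn is None
print(y.requires_grad, z.requires_grad)  # True False
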
Currently you are using loss = Variable(total_loss.data, requires_grad=True), which cuts the graph right at the end so that no gradients will be calculated.
Have a look at this small code example:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
x = torch.randn(1, 10, requires_grad=True)
y = torch.randn(1, 2)

output = model(x)
loss = ((output - y)**2).mean()
loss.backward()
print(model.weight.grad)  # populated
print(x.grad)             # populated

# Now let's re-wrap the loss
model = nn.Linear(10, 2)
x = torch.randn(1, 10, requires_grad=True)
y = torch.randn(1, 2)

output = model(x)
loss = ((output - y)**2).mean()
det_loss = torch.tensor(loss.data, requires_grad=True)  # re-wrapping cuts the graph here
det_loss.backward()
print(model.weight.grad)  # None -- the gradient never reaches the model
print(x.grad)             # None

Could you try to fix this and then check the gradients in your input image?
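
E.g. something like this right after the backward call (just a sketch, using your variable names):

total_loss.backward()
print(pastiche.grad)                   # None means the graph never reached the input image
if pastiche.grad is not None:
    print(pastiche.grad.abs().sum())   # should be non-zero and change between iterations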

Also as a small side note, Variables and tensors were merged in 0.4.0, so you should update to the current stable or preview version. You will find install instructions here.

Hmm, ok, I did have some stray numpy operations in gram_matrix() and a few other places, which I’ve now translated to torch. I had added the Variable wrap because I was getting ‘RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn’, which I’m still getting even though I think I’ve replaced all numpy operations and have pastiche set to requires_grad = True.
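
As far as I understand it, that error just means the final loss has no grad_fn at all; this tiny case reproduces it for me:

import torch

a = torch.randn(3)       # requires_grad defaults to False
loss = (a ** 2).sum()    # loss.grad_fn is None since nothing in the graph requires grad
loss.backward()          # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn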

Can you spot what detaches the computation graph?

def gram_matrix(input):
	a, b, c, d = input.size()
	features = input.view(a * b, c * d)
	G = torch.mm(features, features.t())
	return G.div(a * b * c * d)

class SelectiveSequential(torch.nn.Module):
        ...
	def forward(self, x, layers_to_return):
		layers = [[] for i in range(len(layers_to_return))]
		for name, module in self._modules.iteritems():
			x = module(x)
			for i,l in enumerate(layers_to_return):
				if l+"_"+name in self._to_select:
					layers[i].append(x)
		return layers if len(layers_to_return) > 1 else layers[0]

style_images = utils.get_style_images(args, scale_current)
style_images = [s.to(device) for s in style_images]

# get style features
gram_style = None
for i, style_image in enumerate(style_images):
	style_features = net(style_image, ['style'])
	target_i = [utils.gram_matrix(y) for y in style_features]
	target_i = [t.div(style_features[n].numel()) for n,t in enumerate(target_i)]
	target_i = [t.mul(style_blend_weights[i]) for t in target_i]
	if i == 0:
		gram_style = target_i
	else:
		gram_style = [sum(x) for x in zip(gram_style, target_i)]

file_name = args.content_filename.format(mod_idx)
content_image = skimage.io.imread(file_name)
content_image = skimage.transform.rescale(content_image, scale_current)
content_image = utils.match_color(content_image, np.moveaxis(style_images[0].cpu().squeeze().numpy(),0,-1)) #TODO hist match multiple styles
content_image = torch.from_numpy(content_image.reshape(1, content_image.shape[2],content_image.shape[0],content_image.shape[1])).float()
content_image = content_image.to(device)

# get content features
content_features = net(content_image, ['content'])

init_image = content_image
pastiche = init_image.to(device)
pastiche.requires_grad = True

optimizer = torch.optim.Adam([pastiche], lr=args.learning_rate)
mse_loss = torch.nn.MSELoss()

# optimize the images
for e in range(args.num_iterations):
	optimizer.zero_grad()
	c_feat, s_feat = net(pastiche, ['content','style'])
	content_loss = 0.
	for m in range(len(c_feat)):
		content_loss += args.content_weight * mse_loss(c_feat[m], content_features[m])

	style_loss = 0.
	for m in range(len(s_feat)):
		gram_y = utils.gram_matrix(s_feat[m])
		gram_s = gram_style[m]
		style_loss += args.style_weight * mse_loss(gram_y, gram_s)

	total_loss = content_loss + style_loss
	total_loss.backward()
	optimizer.step()
	print(total_loss.data.cpu().numpy())

# save frame
img_out = pastiche.clone().squeeze().mul(255).cpu().clamp(0, 255).numpy()
img_out = img_out.transpose(1, 2, 0).astype('uint8')
skimage.io.imsave(args.img_filename.format(run, mod_idx), img_out)
if run == args.passes_per_scale - 1 and mod_idx == 1:
	skimage.io.imsave(args.img_filename.format(max(*img.shape), mod_idx), img_out)

I’m using pytorch==0.4, so I think Variable is just an alias for Tensor, but I’ve removed all Variables anyway. Some other parts of my pipeline rely on pytorch<=0.4, so I’d like to get it working on this version.

Thanks so much for all your help!

Could you point me to the line throwing the RuntimeError?
I can’t see any obvious errors skimming through your code.

It happens on total_loss.backward().

I’ve tried rewriting the loop into a closure and using LBFGS, and it still says there is no gradient when calling backward() on the loss.

Could you please print the content of requires_grad for gram_y, gram_s, c_feat[m] and content_features[m]?
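
E.g. with a quick check like this inside your optimization loop (just a sketch using your variable names):

c_feat, s_feat = net(pastiche, ['content', 'style'])
print([f.requires_grad for f in c_feat])
print([f.requires_grad for f in content_features])
for m in range(len(s_feat)):
    gram_y = utils.gram_matrix(s_feat[m])
    gram_s = gram_style[m]
    print(gram_y.requires_grad, gram_s.requires_grad)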

All of those print False. The only variable that prints True is pastiche, which I’ve set explicitly. Is a variable’s requires_grad property supposed to propagate to variables made from it?

Usually yes (as long as you don’t use torch.no_grad or something similar). Could you provide a gist with a minimal working example? This would be helpful, since we would be able to debug it ourselves.
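
A minimal sketch of both cases:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
print(y.requires_grad)    # True -- requires_grad propagates through differentiable ops

with torch.no_grad():
    z = x * 2
print(z.requires_grad)    # False -- anything created under no_grad is detached from the graph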

I found it… One of the scripts from a different part of my pipeline that was imported but wasn’t being used at all had a torch.no_grad in it facepalm
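
For anyone who hits the same thing: in my case it was a stray torch.no_grad usage, but any global grad-mode toggle that gets executed at import time has the same effect, roughly like this:

import torch
torch.set_grad_enabled(False)   # e.g. run at import time by an unrelated helper module

x = torch.randn(3, requires_grad=True)
y = x * 2
print(y.requires_grad)          # False -> loss.backward() later raises the 'does not require grad' error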

In any case, thanks for all the help and sorry for the wild goose chase!