Trying to backward through the graph a second time even when create_graph and retain_graph are True

Hi, I am trying to implement WGAN-GP (in a conditional probability setting). When I use inplace=True in the ReLU activation layers, I get the error “RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.” during my Critic training.

I have come across a similar post here, Freeing buffer strange behavior, where setting inplace=False in the ReLU activation solved the problem. It “solved” my problem too, but my model became very slow and the output values dropped drastically (I do not think they are correct).

At the time the aforementioned post was asked, the poster was using PyTorch 0.4.1, but I am using PyTorch 1.1.0. The strange thing is that this exact code was working fine for a few days, and then this error started popping up. Could I please have some suggestions on what I can do, or any leads on what might trigger this behaviour? Please let me know if I need to provide more information.

This is my critic training code:

def train_disc_net(bx, by, gen_net, disc_net, disc_net_optmzr):

    # Zeroing out gradients
    disc_net_optmzr.zero_grad()
    gen_net.zero_grad()
    disc_net.zero_grad()

    # Reset requires_grad
    for p in disc_net.parameters():
        p.requires_grad = True

    # Training sign convention
    one = torch.ones(bx.shape[0], 1).cuda()
    neg_one = -1 * torch.ones(bx.shape[0], 1).cuda()

    # True data
    dval_true = disc_net(by, bx)
    dval_true.backward(neg_one)

    # Generated data
    by_gen = gen_net(bx)  # Generated data
    dval_gen = disc_net(by_gen.detach(), bx)
    dval_gen.backward(one)

    # Wasserstein distance
    was_dist = dval_true.mean() - dval_gen.mean()

    # Get drift regularization
    d_drift_reg = dval_true**2 + dval_gen**2
    d_drift_reg = 10e-9 * d_drift_reg.mean()
    d_drift_reg.backward(one)

    # Train with gradient regularization to enforce the Lipschitz-1 constraint
    grad_reg = 10 * get_grad_reg(by, by_gen.detach(), bx, disc_net)
    grad_reg.backward()  # <----- this line gives the error

    # Objective function
    d_cost = -was_dist + grad_reg  # + d_drift_reg

    # Update the networks
    disc_net_optmzr.step()

    return d_cost, was_dist, grad_reg

This is my Lipschitz constraining code:

def get_grad_reg(by, by_gen, bx, disc_net):

    # Mixing real and fake inputs in a random fashion
    epsilon = torch.FloatTensor(by.shape[0], 1, 1, 1, 1).uniform_(0.0, 1.0).cuda()
    by_hat = epsilon * by + (1 - epsilon) * by_gen
    by_hat = torch.autograd.Variable(by_hat, requires_grad=True)

    # Get output
    d_hat = disc_net(by_hat, bx.detach())

    # I concatenate the inputs of disc_net inside the disc_net function and I want
    # to compute the gradients with respect to this concatenated input
    d_in = disc_net.x_xyt

    # Getting gradient regularization
    grad = torch.autograd.grad(outputs=d_hat, inputs=d_in,
                               grad_outputs=torch.ones(d_hat.size()).cuda(),
                               retain_graph=True, create_graph=True)[0]
    grad_norm = torch.sqrt(1e-8 + torch.sum(grad**2, dim=(1, 2, 3, 4)))
    one = torch.ones(grad_norm.shape).cuda()
    grad_reg = (grad_norm - one)**2
    grad_reg = grad_reg.mean()

    return grad_reg
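
For context, the disc_net.x_xyt attribute used above is set inside the critic's forward pass, roughly like the hypothetical sketch below (the actual layers of my critic are not shown here; only the concatenate-and-store pattern matters):

import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, 16, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(16, 1)

    def forward(self, by, bx):
        # Concatenate the two inputs along the channel dimension and keep a
        # handle to the concatenated tensor so gradients can be taken w.r.t. it
        self.x_xyt = torch.cat([by, bx], dim=1)
        h = torch.relu(self.conv(self.x_xyt))
        h = self.pool(h).view(h.size(0), -1)
        return self.fc(h)

With this pattern, torch.autograd.grad(outputs=d_hat, inputs=disc_net.x_xyt, ...) works, since x_xyt is a non-leaf tensor that is part of the same graph as d_hat.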

I am sorry to bump this question. I believe I have sufficiently searched for similar issues on the internet, but I was unsuccessful, and it is currently a blocker for me. I share resources with other people and wanted to know whether an upgrade of the software requirements is needed. I do not have a computer science background, so I do not know whether I might unintentionally break other people's code on the server by upgrading (and if I cannot upgrade by myself, I will have to convince the server admin that this is my problem).

I would recommend updating to the latest released version (1.7.0), as it ships with bug fixes and new features.

You might have accidentally changed something in the code, as it shouldn’t break “by itself” :wink:

The inplace error is raised because a tensor that is needed for the gradient calculation was overridden, as described here.

That being said, the model behavior and convergence shouldn’t change when using the inplace or out-of-place ReLU.
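
Here is a minimal sketch of that overwrite failure mode, unrelated to your model, using sigmoid (whose backward needs its own output) followed by an inplace relu; depending on the PyTorch version, the error message may mention the inplace operation explicitly:

import torch

x = torch.randn(4, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward needs its own output
torch.relu_(y)         # the inplace relu overwrites that output

# Raises a RuntimeError during backward, because a tensor needed for the
# gradient computation was modified in place
y.sum().backward()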

Thank you so much. I debugged the problem, and the output was different because of the layer normalization layers I coincidentally started using when I set inplace=False. Since I had found that there was a related bug before, I started attributing the change in output to that. Thank you for clearing that up.