How to find the best solution to a backward error

This code shows the training process for one batch (I omitted everything before it, since it's unnecessary):

for j in range(critic_policy(epoch)):
    output = netC(train_full)

    # loss term built from the Cramer critic outputs
    generator_loss = torch.mean(cramer_critic(train_full, generated_full_2) * w_full * w_x_2 -
                                cramer_critic(generated_full_1, generated_full_2) * w_x_1 * w_x_2)

    # gradient penalty on interpolates between real and generated samples
    alpha = torch.empty(train_full.shape[0], 1, device=device).normal_(0.0, 1.0)
    interpolates = alpha * train_full + (1.0 - alpha) * generated_full_1
    disc_interpolates = cramer_critic(interpolates, generated_full_2)
    gradients = grad(outputs=disc_interpolates, inputs=interpolates,
                     grad_outputs=torch.ones_like(disc_interpolates))[0]
    slopes = torch.norm(torch.reshape(gradients, (list(gradients[0].shape)[0], -1)), dim=1)
    gradient_penalty = torch.mean(torch.pow(torch.max(torch.abs(slopes) - 1,
                                                      torch.zeros(8, device=device)), 2))

    critic_loss = lambda_pt(epoch) * gradient_penalty - generator_loss
    critic_loss.backward()
    optC.step()
    optC.zero_grad()

After one successful iteration, it raises this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-2654469844e0> in <module>
     32 
     33             critic_loss = lambda_pt(epoch) * gradient_penalty - generator_loss
---> 34             critic_loss.backward()
     35             optC.step()
     36             optC.zero_grad()

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    164                 products. Defaults to ``False``.
    165         """
--> 166         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    167 
    168     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

All the suggestions I found are about hidden states, but netC is just 5 linear layers, so I don't think that applies (I'm new to PyTorch, so I may be wrong about this). The idea of using retain_graph=True doesn't look good either, because I read that it makes training slower and I'd like to avoid that. But if it is necessary, please help me implement it (for which iteration should I do it?).

Hi,

If you want to be able to backward through your first call to .grad (to get gradients for your gradient penalty), you need to give it create_graph=True. See the doc for more details.
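In the snippet above that would be (a minimal sketch, reusing the same names):

    gradients = grad(outputs=disc_interpolates, inputs=interpolates,
                     grad_outputs=torch.ones_like(disc_interpolates),
                     create_graph=True)[0]  # also build a graph for the gradients themselves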


Thank you for the help, it works now. But now CUDA is out of memory. Maybe after adding critic_loss.backward(create_graph=True), optC.zero_grad() doesn't clear everything that was computed (I was thinking of gradients and gradient_penalty). The error looks like:

CUDA out of memory. Tried to allocate 50.00 MiB (GPU 1; 11.17 GiB total capacity; 2.37 GiB already allocated; 18.06 MiB free; 60.63 MiB cached)

I found one solution: decrease the batch size.
My batch_size was 1e5, and after decreasing it to 10 (just to check the assumption) it works, but I want to understand why this error happened.

You don’t want it for critic_loss.backward()! You want it for grad(outputs=disc_interpolates, inputs=interpolates, grad_outputs=torch.ones_like(disc_interpolates), create_graph=True)[0] only.

But computing the gradient of a gradient requires some extra memory, so this is expected to need more than without create_graph. Reducing the batch_size is the right way to go here.
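For reference, here is a sketch of the critic update from the first post with create_graph=True only on the grad call (same variable names; as an assumption on my side, the gradients are reshaped with gradients.shape[0] as the batch dimension and the hard-coded torch.zeros(8, ...) is replaced with a clamp):

    for j in range(critic_policy(epoch)):
        generator_loss = torch.mean(cramer_critic(train_full, generated_full_2) * w_full * w_x_2 -
                                    cramer_critic(generated_full_1, generated_full_2) * w_x_1 * w_x_2)

        # gradient penalty: only this grad call needs create_graph=True
        alpha = torch.empty(train_full.shape[0], 1, device=device).normal_(0.0, 1.0)
        interpolates = alpha * train_full + (1.0 - alpha) * generated_full_1
        disc_interpolates = cramer_critic(interpolates, generated_full_2)
        gradients = grad(outputs=disc_interpolates, inputs=interpolates,
                         grad_outputs=torch.ones_like(disc_interpolates),
                         create_graph=True)[0]
        slopes = torch.norm(gradients.reshape(gradients.shape[0], -1), dim=1)
        gradient_penalty = torch.mean(torch.clamp(slopes - 1.0, min=0.0) ** 2)

        critic_loss = lambda_pt(epoch) * gradient_penalty - generator_loss
        critic_loss.backward()   # plain backward, no create_graph / retain_graph
        optC.step()
        optC.zero_grad()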


Strange thing: the code only works when I call critic_loss.backward(create_graph=True). Using grad(outputs=disc_interpolates, inputs=interpolates, grad_outputs=torch.ones_like(disc_interpolates), create_graph=True)[0] alone doesn't work and still raises Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

That means that you have another issue :)

Can you give the code that shows how you get train_full and generated_full_X and w_x_X?
In particular:

  • Do they require gradients?
  • Do you compute them in a differentiable manner outside of the loop?
for i, data in enumerate(all_dataloader):
    train_full, w_full, train_x_1, w_x_1, train_x_2, w_x_2 = data
    train_full = train_full.to(device)
    w_full = w_full.to(device)
    train_x_1 = train_x_1.to(device)
    w_x_1 = w_x_1.to(device)
    train_x_2 = train_x_2.to(device)
    w_x_2 = w_x_2.to(device)

    noise_1 = torch.empty(train_x_1.shape[0], LATENT_DIMENSIONS, device=device).normal_(mean=0, std=1.0)
    noise_2 = torch.empty(train_x_2.shape[0], LATENT_DIMENSIONS, device=device).normal_(mean=0, std=1.0)
    generated_y_1 = netG(torch.cat((noise_1, train_x_1), dim=1))
    generated_full_1 = torch.cat((generated_y_1, train_x_1), dim=1)
    generated_y_2 = netG(torch.cat((noise_2, train_x_2), dim=1))
    generated_full_2 = torch.cat((generated_y_2, train_x_2), dim=1)

All of them are just tensors from a Dataset class, defined like:

    def __init__(self, dataset, batch_size):
        self.dataset = torch.Tensor(dataset)

all_data = torch.utils.data.TensorDataset(train_full_.dataset, w_full_.dataset, train_x_1_.dataset,
                                          w_x_1_.dataset, train_x_2_.dataset, w_x_2_.dataset)
all_dataloader = DataLoader(all_data, batch_size=BATCH_SIZE, shuffle=True)

That's all about them, there is no other computation.

How are these loops nested? Like this:

for i, data in enumerate(all_dataloader):
  # Generate all train_full, generated_y etc
  for j in range(critic_policy(epoch)):
    # The code from your first post

If so, what most likely happens is that netG has some parameters that require gradients, so generated_y_X requires gradients as well.
But that part of the history is shared between the iterations of the inner loop.
This means that on the second iteration, calling backward goes through netG's part of the graph a second time, and since its buffers were already freed during the first iteration, you get the error you see.
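A tiny standalone example of that mechanism (toy tensors, purely for illustration):

    import torch

    w = torch.randn(3, requires_grad=True)  # plays the role of netG's parameters
    shared = (w * 2).sum()                   # built once, like generated_full_X outside the inner loop

    for step in range(2):
        loss = shared * (step + 1)           # each iteration reuses the same shared graph
        loss.backward()                       # the second call raises "Trying to backward
                                              # through the graph a second time"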

If you don’t want gradients for netG, you can simply do:

    with torch.no_grad():
        # no autograd graph is recorded for netG's forward pass here
        generated_y_1 = netG(torch.cat((noise_1, train_x_1), dim=1))
        generated_full_1 = torch.cat((generated_y_1, train_x_1), dim=1)
        generated_y_2 = netG(torch.cat((noise_2, train_x_2), dim=1))
        generated_full_2 = torch.cat((generated_y_2, train_x_2), dim=1)

That will disable autograd for the part where you don't need gradients and remove the shared part of the graph between iterations.
It will also reduce the memory usage of your overall code a lot ;) because you won't store all the information needed to compute gradients for netG twice, even though you don't want to compute them at all.
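Another common pattern that also cuts the shared part of the graph is to .detach() the generator outputs. Note that this still runs netG's forward pass with autograd enabled, so the torch.no_grad() version above is the better choice for memory:

    # the detached tensors do not require grad, so the critic's backward never reaches netG
    generated_y_1 = netG(torch.cat((noise_1, train_x_1), dim=1)).detach()
    generated_full_1 = torch.cat((generated_y_1, train_x_1), dim=1)
    generated_y_2 = netG(torch.cat((noise_2, train_x_2), dim=1)).detach()
    generated_full_2 = torch.cat((generated_y_2, train_x_2), dim=1)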
