RuntimeError when backpropagating through the graph a second time

Hi,

I am implementing a GNN that uses Chebyshev polynomials. I have a for loop that applies the recurrence relation between Chebyshev polynomials:

# for k = 0, T_0(L) x = x
Tx = T0_x = graphs.clone()

if self.cheby_K > 1:  # T_1(L) x = L x
    T1_x = laplacians @ graphs
    Tx = torch.cat((T0_x, T1_x), dim=-1)

# T_k(L) x = 2 * L * T_{k-1}(L) x - T_{k-2}(L) x
for k in range(2, self.cheby_K):
    Tk_x = 2 * laplacians @ T1_x - T0_x
    Tx = torch.cat((Tx, Tk_x), dim=-1)
    T1_x, T0_x = Tk_x, T1_x
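
For reference, a minimal, self-contained sketch of the same recurrence on dummy tensors (the shapes, the cheby_K value, and the tensor names here are illustrative only, not taken from my actual model); stacking along the last dimension should give a (1, N, cheby_K * F) tensor:

import torch

cheby_K = 4                        # assumed polynomial order, for illustration
N, F = 8, 32                       # assumed number of nodes and feature size
graphs = torch.randn(1, N, F)      # dummy node features
laplacians = torch.randn(1, N, N)  # dummy (scaled) Laplacian

# T_0(L) x = x
Tx = T0_x = graphs.clone()

if cheby_K > 1:  # T_1(L) x = L x
    T1_x = laplacians @ graphs
    Tx = torch.cat((T0_x, T1_x), dim=-1)

# T_k(L) x = 2 * L * T_{k-1}(L) x - T_{k-2}(L) x
for k in range(2, cheby_K):
    Tk_x = 2 * laplacians @ T1_x - T0_x
    Tx = torch.cat((Tx, Tk_x), dim=-1)
    T1_x, T0_x = Tk_x, T1_x

print(Tx.shape)  # expected: torch.Size([1, 8, 128]) = (1, N, cheby_K * F)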

When I test my network with the following snippet:

with torch.autograd.set_detect_anomaly(True):
    with torch.enable_grad():

        graphs = torch.unsqueeze(nodes, 0).contiguous().to(device)
        laplacians = torch.unsqueeze(laplacian, 0).to(device)

        for i in range(0, 100):

            optimizer.zero_grad()

            logits, out_nodes, out_laplacians = gnn(graphs, laplacians)

            loss = criterion(logits, labels)
            print('epoch: {:d}, loss: {:.4f}'.format(i, loss))

            probs = torch.softmax(logits, dim=-1)
            print('predictions:\n', probs.detach().to(torch.float16).cpu().numpy())

            loss.backward()
            optimizer.step()

I receive the following error message:

“RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.”

According to the error, I must specify retain_graph=True when calling .backward() in the loop above, but then the training code raises another RuntimeError:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [8, 32]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I don't use any hidden state. Also, if I set self.cheby_K=1, the training runs without an error.
Could you help me figure out what I am doing wrong?

Thanks

Edit: typo

Isn't the line responsible for the error pointed out in the full error message? If so, please post it.

The following is the whole error message when I don't use retain_graph=True in .backward():

C:\Users\I3ase\miniconda3\lib\site-packages\torch\autograd\__init__.py:145: UserWarning: Error detected in ReluBackward0. Traceback of forward call that caused the error:
File “C:\Users\I3ase\miniconda3\lib\runpy.py”, line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File “C:\Users\I3ase\miniconda3\lib\runpy.py”, line 87, in _run_code
exec(code, run_globals)
File "C:\Users\I3ase\miniconda3\lib\site-packages\spyder_kernels\console_main
.py", line 23, in
start.main()
File “C:\Users\I3ase\miniconda3\lib\site-packages\spyder_kernels\console\start.py”, line 296, in main
kernel.start()
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\kernelapp.py”, line 612, in start
self.io_loop.start()
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\platform\asyncio.py”, line 199, in start
self.asyncio_loop.run_forever()
File “C:\Users\I3ase\miniconda3\lib\asyncio\base_events.py”, line 596, in run_forever
self._run_once()
File “C:\Users\I3ase\miniconda3\lib\asyncio\base_events.py”, line 1890, in _run_once
handle._run()
File “C:\Users\I3ase\miniconda3\lib\asyncio\events.py”, line 80, in _run
self._context.run(self._callback, *self._args)
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\ioloop.py”, line 688, in <lambda>
lambda f: self._run_callback(functools.partial(callback, future))
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\ioloop.py”, line 741, in _run_callback
ret = callback()
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\gen.py”, line 814, in inner
self.ctx_run(self.run)
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\gen.py”, line 775, in run
yielded = self.gen.send(value)
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\kernelbase.py”, line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\gen.py”, line 234, in wrapper
yielded = ctx_run(next, result)
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\kernelbase.py”, line 268, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\gen.py”, line 234, in wrapper
yielded = ctx_run(next, result)
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\kernelbase.py”, line 543, in execute_request
self.do_execute(
File “C:\Users\I3ase\miniconda3\lib\site-packages\tornado\gen.py”, line 234, in wrapper
yielded = ctx_run(next, result)
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\ipkernel.py”, line 306, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File “C:\Users\I3ase\miniconda3\lib\site-packages\ipykernel\zmqshell.py”, line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 2894, in run_cell
result = self._run_cell(
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 2940, in _run_cell
return runner(coro)
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\async_helpers.py”, line 68, in pseudo_sync_runner
coro.send(None)
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 3165, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 3357, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File “C:\Users\I3ase\miniconda3\lib\site-packages\IPython\core\interactiveshell.py”, line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File “”, line 1, in <module>
runfile(‘C:/Users/I3ase/Documents/projects/agcn/unit_test_sgc_ll.py’, wdir=‘C:/Users/I3ase/Documents/projects/agcn’)
File “C:\Users\I3ase\miniconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”, line 565, in runfile
exec_code(file_code, filename, ns_globals, ns_locals,
File “C:\Users\I3ase\miniconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py”, line 453, in exec_code
exec(compiled, ns_globals, ns_locals)
File “C:\Users\I3ase\Documents\projects\agcn\unit_test_sgc_ll.py”, line 96, in <module>
logits, out_nodes, out_laplacians = gnn(graphs,laplacians)
File “C:\Users\I3ase\miniconda3\lib\site-packages\torch\nn\modules\module.py”, line 889, in _call_impl
result = self.forward(*input, **kwargs)
File “C:\Users\I3ase\Documents\projects\agcn\unit_test_sgc_ll.py”, line 24, in forward
graphs2, laplacians2 = self.layer2(graphs1, laplacians1)
File “C:\Users\I3ase\miniconda3\lib\site-packages\torch\nn\modules\module.py”, line 889, in _call_impl
result = self.forward(*input, **kwargs)
File “C:\Users\I3ase\Documents\projects\agcn\sgc_ll.py”, line 183, in forward
out_graphs = F.relu(res_graph + sc_graph)
File “C:\Users\I3ase\miniconda3\lib\site-packages\torch\nn\functional.py”, line 1206, in relu
result = torch.relu(input)
(Triggered internally at …\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward(
Traceback (most recent call last):

File “C:\Users\I3ase\Documents\projects\agcn\unit_test_sgc_ll.py”, line 106, in <module>
loss.backward()

File “C:\Users\I3ase\miniconda3\lib\site-packages\torch\tensor.py”, line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File “C:\Users\I3ase\miniconda3\lib\site-packages\torch\autograd\__init__.py”, line 145, in backward
Variable._execution_engine.run_backward(

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

In addition, DGL uses a similar loop in its ChebNet implementation.

Hi,

I have solved the problem. I changed the name of the updated Laplacian in the network; I had given both that tensor and the input tensor the same name, ‘laplacians’. In addition, I changed the in-place addition-assignment to an out-of-place addition, i.e.,
laplacians += res_laplacians → u_laplacians = laplacians + res_laplacians
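
In case it helps someone else, here is a toy sketch of why the in-place update broke training. This is not my actual layer: w stands in for a learnable weight and lap for the input Laplacian that gets updated each forward pass.

import torch

w = torch.randn(3, 3, requires_grad=True)   # stand-in for a learnable weight
lap = torch.eye(3)                           # stand-in for the input Laplacian
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    optimizer.zero_grad()
    res_lap = lap @ w          # depends on w, so it carries an autograd graph

    # Problematic version: lap was saved for the matmul's backward, so mutating
    # it in place breaks that gradient (the "inplace operation" error), and the
    # mutated lap re-enters the next iteration still attached to the previous,
    # already freed graph (the "backward a second time" error).
    # lap += res_lap

    # Working version: out-of-place addition under a new name leaves the input
    # untouched and creates a fresh tensor each iteration.
    u_lap = lap + res_lap

    loss = u_lap.sum()
    loss.backward()
    optimizer.step()
    print(step, loss.item())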