How to visualize the Backward (and perhaps DoubleBackward) pass of a variable?

Hi All,

I was just wondering if it’s possible to visualize the backward pass (including the shapes of the Tensors in the computation). I ask because I’ve been writing my own custom autograd Function with a custom Backward and a custom DoubleBackward, and when I run my network with it I get a size-mismatch error. So, clearly, I’ve miscalculated the shape of some Tensor somewhere.

For example, I get the following error when running my code with my custom function:

[W python_anomaly_mode.cpp:60] Warning: Error detected in CustomDeterminantBackward. Traceback of forward call that caused the error:
  File "file.py", line 307, in <module>
    myLocal = calc_local_energy(myNet, myX)
  File "file.py", line 236, in calc_local_energy
    kinetic = calc_kinetic_energy(Net, x)
  File "file.py", line 214, in calc_kinetic_energy
    psi_walker = Net(x_walker) 
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "file.py", line 180, in forward
    Psi = self.slater_det(A)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "file.py", line 152, in forward
    return CustomDeterminant.apply(A)
 (function print_stack)
Traceback (most recent call last):
  File "file.py", line 318, in <module>
    loss.backward()
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  
RuntimeError: The size of tensor a (2500) must match the size of tensor b (4) at non-singleton dimension 2

The significance of size (2500) and size (4) is that 2500 is my batch size and 4 is the number of input nodes. So, I pass a Tensor of size (2500, 4) into a feed-forward-like network that returns a single value.

I would assume the error is somewhere within my CustomDeterminantBackward class (as PyTorch states in the python_anomaly_mode warning at the top of the trace), although not much information is given as to where precisely this mismatch in dims occurs between Tensors a and b. Is there an easy way to track the shapes of the Tensors as they are used within the backward pass?

Thanks in advance, and hopefully this question makes sense! :slight_smile:


Hi,

Since the error shows that the failure happens in your custom Function, I think the simplest approach is to print the sizes directly in your manual backward.
In general, for the backward, the grad you get as input will have the same size as the output of the forward, and you must return Tensors with the same size as the inputs of the forward.
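To make that contract concrete, here is a minimal sketch (a toy Function with made-up shapes, not the actual CustomDeterminant) that prints the sizes inside backward:

import torch

class ScaledSum(torch.autograd.Function):
  # Toy op: [B, N] -> [B]; computes (2 * x).sum(dim=1)
  @staticmethod
  def forward(ctx, x):
    ctx.x_shape = x.shape # remember the input shape for backward
    return (2 * x).sum(dim=1)

  @staticmethod
  def backward(ctx, grad_output):
    print("grad_output:", grad_output.shape) # matches the forward *output*: [B]
    grad_x = 2 * grad_output.unsqueeze(1).expand(*ctx.x_shape)
    print("grad_x:", grad_x.shape) # must match the forward *input*: [B, N]
    return grad_x

x = torch.randn(5, 3, requires_grad=True)
ScaledSum.apply(x).sum().backward() # prints torch.Size([5]) then torch.Size([5, 3])

Alternatively, calling register_hook(lambda g: print(g.shape)) on any intermediate Tensor prints the shape of the gradient flowing through it during the backward pass.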


Hi @albanD,

Thank you for the quick response! So, I’ve checked the shapes in the forward and backward methods of my custom autograd Function, and they’re both [1, 4, 4].

I did have a look at the torchviz library to compare torch.det with my custom_det function. My custom_det function seems to add a Tensor of shape [25, 4, 4] (corresponding to the number of matrices within a given batch) to the graph, which might be causing the error? Whereas DetBackward doesn’t add anything (unless it does something simpler behind the scenes?).
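For reference, a torchviz comparison like this is presumably produced with something along these lines (torch.det here is the builtin side; the custom determinant would be swapped in for the second graph):

import torch
from torchviz import make_dot

A = torch.randn(25, 4, 4, requires_grad=True)
out = torch.det(A).sum() # or the custom determinant, for the second graph
make_dot(out).render("det_graph", format="png") # writes det_graph.png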

Thank you for the help!

If you’re computing a determinant, shouldn’t the output be of a different size? If so, the backward sizes will also be different.

The extra Tensor from your custom function is there because torchviz is able to see what you saved for backward, while it cannot see this for the builtin functions (implemented in C++). So that’s not surprising to me.

Also, you can try to set TORCH_SHOW_CPP_STACKTRACES=1 to get more information about where the error happens in the backward pass.


Sorry, I should’ve made this a bit clearer: the output of the custom_det function is [B, ] for an input of [B, N, N]. What I mentioned above were the size of the input matrix, the size of the 1st-order gradient w.r.t. the input matrix (from the first backward method), and the size of the 2nd-order gradient w.r.t. the input matrix (from the second backward method).

Also, I think I’ve just realised that the Tensor of [25, 4, 4] could be the grad_output which is returned in the DoubleBackward method?

Also, how exactly do I set TORCH_SHOW_CPP_STACKTRACES=1? I saw from this question that you state it needs to be enabled before importing torch. Would it be enabled via something like this?

import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = "1" # must be set before torch is imported
import torch

Thank you

Edit: I’ve added these lines to the top of my code, and nothing has changed with the torchviz diagram. (So I’ve probably implemented it wrong?)

Yes, this will work.
It won’t change the torchviz diagram; it will change the stack trace you see when it crashes.

Sorry, I should’ve made this a bit clearer: the output of the custom_det function is [B, ] for an input of [B, N, N].

So that means that the backward should expect a Tensor of size [B, ] as input and should return a Tensor of size [B, N, N]. Is that what you’re doing?
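For instance, a quick numerical sanity check of both levels (a generic sketch using torch.det as a stand-in; you would substitute CustomDeterminant.apply) is gradcheck / gradgradcheck, which should both print True for well-conditioned double-precision inputs:

import torch
from torch.autograd import gradcheck, gradgradcheck

A = torch.randn(2, 4, 4, dtype=torch.double, requires_grad=True)
print(gradcheck(torch.det, (A,)))     # exercises the backward
print(gradgradcheck(torch.det, (A,))) # exercises the double backward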

Also, I think I’ve just realised that the Tensor of [25, 4, 4] could be the grad_output which is returned in the DoubleBackward method?

No, that would most likely be the input that you save for backward in your custom Function.


I’ve just run it again with TORCH_SHOW_CPP_STACKTRACES=1 enabled, and the stack trace is quite extensive:

RuntimeError: The size of tensor a (25) must match the size of tensor b (4) at non-singleton dimension 2
Exception raised from infer_size at /opt/conda/conda-bld/pytorch_1595629427478/work/aten/src/ATen/ExpandUtils.cpp:24 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f63bd7bc77d in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::infer_size(c10::ArrayRef<long>, c10::ArrayRef<long>) + 0x4b8 (0x7f63f46d4a28 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::compute_shape(at::TensorIteratorConfig const&) + 0x10c (0x7f63f4af6e6c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x55 (0x7f63f4af8435 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f63f4af8abd in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x14a (0x7f63f4af8c6a in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x47 (0x7f63f4836317 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe543d9 (0x7f63f4d6b3d9 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x7b1990 (0x7f63f46c8990 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f63f4eb0c7c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::mul(at::Tensor const&, at::Tensor const&) + 0x4b (0x7f63f4e01c8b in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2c71dc8 (0x7f63f6b88dc8 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x7b1990 (0x7f63f46c8990 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f63f4eb0c7c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::Tensor::mul(at::Tensor const&) const + 0x4b (0x7f63f4f971cb in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1bc737 (0x7f63faf9d737 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1bcf66 (0x7f63faf9df66 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: THPFunction_apply(_object*, _object*) + 0x8b5 (0x7f63fb3156f5 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #34: torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x183 (0x7f63fb314033 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #35: <unknown function> + 0x30d1017 (0x7f63f6fe8017 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #36: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f63f6fe3860 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #37: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f63f6fe4401 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #38: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x37c (0x7f63f6fe1b1c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #39: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x3c (0x7f63fb30bdcc in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #40: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x803 (0x7f63f6fe0e53 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #41: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x4e (0x7f63fb30bbbe in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #42: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xa29 (0x7f63fb30c889 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #60: __libc_start_main + 0xe7 (0x7f6417d47bf7 in /lib/x86_64-linux-gnu/libc.so.6)

So, my loss contains one term which depends on the Laplacian of the output (along with some other terms which don’t require any custom autograd functions). What I do is take my batch of inputs (x) of shape [25, 4] and pass them one by one through the function to get the Laplacian, along the lines of

lap = torch.zeros(x.shape[0]) # tensor to store the laplacian for each batch element
for i, xi in enumerate(x): # iterate over the batch
  hess = torch.autograd.functional.hessian(Net, xi.unsqueeze(0), create_graph=True) # calculate the hessian
  lap[i] = torch.diagonal(hess.view(x.shape[1], x.shape[1])).sum() # sum over the diagonal to get the laplacian

So, it does this Laplacian calculation for each input. I then combine this with some other loss values, and sum over everything to get a single value representing the loss for a given batch of inputs. However, when I call loss.backward() I get the error from above.

RuntimeError: The size of tensor a (2500) must match the size of tensor b (4) at non-singleton dimension 2

Because my function is a determinant, my backward is the cofactor matrix (by definition), and the DoubleBackward is the derivative of the cofactor w.r.t. the input matrix. So, that reduced sum should get rid of the need for the size of [B, ]?

Unless I messed something up within my Backward or DoubleBackward definitions; but the gradients returned by those functions have the same shape as the input of the forward method, so it seems the shapes are fine? Granted, I could have (and most likely have) done something wrong in there, but if the shapes were correct and only the calculations were wrong, I would just get incorrect gradients that never hit a local minimum of my loss function. I shouldn’t get a mismatch error, no?
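For reference, the identity behind that backward is, for invertible A, d det(A)/dA = det(A) * A^{-T}, i.e. the cofactor matrix. A batched sketch (invertible case only, checked against autograd):

import torch

def det_backward(A, grad_output):
  # A: [B, N, N]; grad_output: [B], the shape of the forward output
  cofactor = torch.det(A)[:, None, None] * A.inverse().transpose(-2, -1)
  return grad_output[:, None, None] * cofactor # [B, N, N], the shape of the forward input

A = torch.randn(25, 4, 4, dtype=torch.double, requires_grad=True)
g = torch.ones(25, dtype=torch.double)
auto_grad, = torch.autograd.grad(torch.det(A), A, grad_outputs=g)
print(torch.allclose(auto_grad, det_backward(A.detach(), g))) # True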

Given the stack trace, it seems that the error comes from a mul op whose input Tensors have different sizes.

So you’re most likely doing a computation with things that have the wrong size in your custom backward, or returning something that does not have the expected size (but that case should raise a nice error).
Could you get a small code sample of 30/40 lines that reproduces the crash?


Indeed I can; I’ll write that up and post it here as soon as possible! Thank you for all the help! :slight_smile:

Hello again!

I’ve tried my best to cut it down to the fewest lines that still reproduce the crash, and I think it’s a bit too large to copy and paste within this box. So, I’ve uploaded it to GitHub (CustomDeterminant/pytorch_custom_determinant_example.py at main · AlphaBetaGamma96/CustomDeterminant · GitHub)

The reason I can’t get it down to around 50 lines (or so) is that, within my derivation of the backward and DoubleBackward, I use some custom functions to get the required matrices.

Because I use a Singular Value Decomposition, there’s a chance that a diagonal element of my diagonal matrix, Sigma, is near 0, which when inverted overflows and/or returns a NaN. So, I wrote some custom functions to circumvent that. The rest is the definition of the custom function (along with the backward and DoubleBackward) and an nn.Module for PyTorch’s, and my own, determinant (which could probably be done in 1 line now that I think of it). I’ve also written out the shapes of the Tensors as they propagate through the custom functions and network, so it’s clear what each line is doing!

(A stack trace of the error is attached below)

Traceback (most recent call last):
  File "pytorch_custom_determinant_example.py", line 195, in <module>
    loss.backward()                 #backprop
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: The size of tensor a (25) must match the size of tensor b (4) at non-singleton dimension 2
Exception raised from infer_size at /opt/conda/conda-bld/pytorch_1595629427478/work/aten/src/ATen/ExpandUtils.cpp:24 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f01b59e377d in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::infer_size(c10::ArrayRef<long>, c10::ArrayRef<long>) + 0x4b8 (0x7f01ec8fba28 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #2: at::TensorIterator::compute_shape(at::TensorIteratorConfig const&) + 0x10c (0x7f01ecd1de6c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorIterator::build(at::TensorIteratorConfig&) + 0x55 (0x7f01ecd1f435 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::TensorIterator::TensorIterator(at::TensorIteratorConfig&) + 0xdd (0x7f01ecd1fabd in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&, bool) + 0x14a (0x7f01ecd1fc6a in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::mul(at::Tensor const&, at::Tensor const&) + 0x47 (0x7f01eca5d317 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe543d9 (0x7f01ecf923d9 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x7b1990 (0x7f01ec8ef990 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f01ed0d7c7c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: at::mul(at::Tensor const&, at::Tensor const&) + 0x4b (0x7f01ed028c8b in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x2c71dc8 (0x7f01eedafdc8 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x7b1990 (0x7f01ec8ef990 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, at::Tensor const&, at::Tensor const&) const + 0xbc (0x7f01ed0d7c7c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::Tensor::mul(at::Tensor const&) const + 0x4b (0x7f01ed1be1cb in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x1bc737 (0x7f01f31c4737 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1bcf66 (0x7f01f31c4f66 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: THPFunction_apply(_object*, _object*) + 0x8b5 (0x7f01f353c6f5 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #34: torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x183 (0x7f01f353b033 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #35: <unknown function> + 0x30d1017 (0x7f01ef20f017 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #36: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f01ef20a860 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #37: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f01ef20b401 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #38: torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x37c (0x7f01ef208b1c in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #39: torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>) + 0x3c (0x7f01f3532dcc in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #40: torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x803 (0x7f01ef207e53 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #41: torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge> > const&) + 0x4e (0x7f01f3532bbe in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #42: THPEngine_run_backward(THPEngine*, _object*, _object*) + 0xa29 (0x7f01f3533889 in /home/user/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #60: __libc_start_main + 0xe7 (0x7f020ff2ebf7 in /lib/x86_64-linux-gnu/libc.so.6)

Thank you for all the help! :slight_smile:

Thanks for the code.
Given that the error is a mismatch at dimension 2, where one size is 25 and the other is 4, I guess one is the batch size and the other is N.
I think you can try to add a bunch of prints for the sizes of the Tensors in this block: CustomDeterminant/pytorch_custom_determinant_example.py at d1926d4572d0ddcfb0b1b915c2000b9a393652c6 · AlphaBetaGamma96/CustomDeterminant · GitHub
That way you can make sure they have the sizes you expect. And in particular, the inputs for each multiplication op.

I’ve double-checked the sizes of the Tensors and they match the values shown in the comments. For reference, those sizes are

A:  torch.Size([1, 4, 4]) Detbar:  torch.Size([1])
U:  torch.Size([1, 4, 4])  S:  torch.Size([1, 4]) V:  torch.Size([1, 4, 4])
M:  torch.Size([1, 4, 4]) Cbar:  torch.Size([1, 4, 4])
diag_M:  torch.Size([1, 4])
rho_matrix:  torch.Size([1, 4, 4])
mask_off_diagonal:  torch.Size([1, 4, 4])
diag_Xi:  torch.Size([1, 4])
masked_presum_Xi:  torch.Size([1, 4, 4])
non_diag_Xi:  torch.Size([1, 4, 4])
Xi:  torch.Size([1, 4, 4])
Abar:  torch.Size([1, 4, 4])
DetBarBar:  torch.Size([1, 4, 4])

And in particular, the inputs for each multiplication op.

One line that could be an issue is

diag_Xi = torch.sum((diag_M * rho_matrix)*mask_off_diagonal, dim=-1) #diag_Xi [1, N]

as it’s ([1, 4] * [1, 4, 4]) * [1, 4, 4]. What I wanted with this line is to take the diagonal of M, element-wise multiply it along each row of rho_matrix, zero out the diagonal elements, and then sum over the columns of each matrix within the batch (which in this case is a batch of 1).
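For what it’s worth, here is a tiny sketch of what that first product broadcasts to (made-up values, shapes as above). A [1, N] Tensor lines up with the last dimension of a [1, N, N] Tensor, so it scales entries per column index; scaling each row i by diag_M[0, i] instead would need a trailing unsqueeze:

import torch

diag_M = torch.arange(4.).reshape(1, 4) # [1, N]
rho = torch.ones(1, 4, 4)               # [1, N, N]
print((diag_M * rho)[0, 0])                  # tensor([0., 1., 2., 3.]): row entries scaled per column
print((diag_M.unsqueeze(-1) * rho)[0, :, 0]) # tensor([0., 1., 2., 3.]): row i scaled by diag_M[0, i]

Which of the two matches the intended maths depends on the derivation.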

or maybe

non_diag_Xi = (-M*rho_matrix)*(1 - torch.eye(M.shape[-1])) #non_diag_Xi = [1, N ,N]

as the 1 - torch.eye(M.shape[-1]) Tensor is [N, N].

I did reshape them to the same Tensor shape for element-wise multiplication but I got the same error.

Also, out of curiosity, how were you able to tell from the stack trace that it’s a mul op causing the error?

EDIT: One bit that’s confusing me is that I’m taking the gradient of the loss, which is a scalar, and the first loss term contains the Laplacian, which only takes a single input (as that’s a restriction of torch.autograd.functional.hessian). So why would it crash saying there’s a mismatch between size 25 and size 4? Surely it would only see size 1 and size 4? Unless the crash is actually happening in the first backward call or the forward call, where it would see the entire batch? Or, perhaps, could there be a mismatch in the forward method that results in the wrong dimension being summed over, so the batch dimension remains?

In the stack, you see THPFunction_apply, which means it is entering the Python implementation of a custom Function.
Then you see <omitting python frames>, which means that some Python code is being run.
And just above that, at::Tensor::mul, which means that the Python code called into the mul operator.

Given that the Python stack trace points to loss.backward() in your file “pytorch_custom_determinant_example.py”, line 195, I think it is during the second backward that the failure happens? Or is this a single backprop?
Also, I don’t think it is linked to the functional hessian function given that stack trace, right?

A silly question, but are you reading the THPFunction_apply, <omitting python frames>, at::Tensor::mul sequence backwards, i.e. from frame #24 to frame #14?

Also, one thing I tried was running the code with the batch size set to the same value as N, to “trick” the backward pass into working, and I tracked the sizes of the Tensors during this. This is what I got

(1st order) U, S, V, G:  torch.Size([1, 4, 4]) torch.Size([1, 4]) torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) Abar:  torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) U, S, V, G:  torch.Size([1, 4, 4]) torch.Size([1, 4]) torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) Abar:  torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) U, S, V, G:  torch.Size([1, 4, 4]) torch.Size([1, 4]) torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) Abar:  torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) U, S, V, G:  torch.Size([1, 4, 4]) torch.Size([1, 4]) torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) Abar:  torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(2nd order) Abar, DetBarBar:  torch.Size([1, 4, 4]) torch.Size([1, 4, 4])
(1st order) U, S, V, G:  torch.Size([4, 4, 4]) torch.Size([4, 4]) torch.Size([4, 4, 4]) torch.Size([4, 4, 4])
(1st order) Abar:  torch.Size([4, 4, 4])

where U, S, V are the outputs of torch.svd and G = torch.diag_embed(gamma(S)). The (1st order) and (2nd order) lines correspond to CustomDeterminantBackward.forward() and CustomDeterminantBackward.backward() respectively. Weirdly enough, it seems that the 4th call to CustomDeterminantBackward.backward() messes up what’s seen in CustomDeterminantBackward.forward(), but only on the final item in the batch. This also seems to happen in CustomDeterminant.backward() too.

Yes.
And these frames, contrary to the Python ones, are in reverse order: the line below is the top-level function, and the line above is the function that was called.

Weirdly enough, it seems that the 4th call to CustomDeterminantBackward.backward() messes up what’s seen in CustomDeterminantBackward.forward(), but only on the final item in the batch.

It is indeed surprising.
I am wondering: why is the batch size never present in these sizes that you print? Is there an outer loop that extracts each element of the batch? Maybe that one is doing something funky?

Within the calc_loss function, there’s an enumerate over x in order to calculate the hessian. So, I take each element in x, then unsqueeze(0) to add back the batch dimension. Maybe this could be an issue? Even though this is perfectly fine with torch.det:

  y = Net(x)
  laplacian = torch.zeros(x.shape[0])
  for i, xi in enumerate(x):
    yi = Net(xi.unsqueeze(0))
    hess = hessian(Net, xi.unsqueeze(0), create_graph=False)
    laplacian[i] = hess.view(x.shape[1], x.shape[1]).diagonal(offset=0).sum()
  loss1 = (laplacian/y)

EDIT: I’ve been messing around with printing out the shapes of the Tensors and I’ve just had an idea. Within my loss function I have loss1, which is

  y = Net(x)
  laplacian = torch.zeros(x.shape[0])
  for i, xi in enumerate(x):
    yi = Net(xi.unsqueeze(0))
    hess = hessian(Net, xi.unsqueeze(0), create_graph=True)
    laplacian[i] = hess.view(x.shape[1], x.shape[1]).diagonal(offset=0).sum()
  loss1 = (laplacian/y) #calculate loss1 (including laplacian term)

each element of laplacian is a single value, whereas y was created with the full batch. Could this be the source of the problem? Because y is going to have the batch dimension involved within its backward pass, but each element of laplacian won’t. If I change the code above to,

  loss1 = torch.zeros(x.shape[0])
  for i, xi in enumerate(x):
    yi = Net(xi.unsqueeze(0))
    hess = hessian(Net, xi.unsqueeze(0), create_graph=False)
    lap = hess.view(x.shape[1], x.shape[1]).diagonal(offset=0).sum()
    loss1[i] = lap/yi

and set B to, say, 10, I can run the code and the shapes seem to be OK. Could this be the problem?

Doing the unsqueeze is OK, but maybe it leads to some unexpected broadcasting for you?
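A minimal illustration of that kind of silent broadcast (the [B, 1] shape for y is an assumption, based on the network returning a single value per sample):

import torch

laplacian = torch.ones(25) # [B], one scalar per batch element
y = torch.ones(25, 1)      # [B, 1], assumed shape of the batched network output
print((laplacian / y).shape)             # torch.Size([25, 25]): silently broadcast
print((laplacian / y.squeeze(-1)).shape) # torch.Size([25]): the intended elementwise division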


Sorry, I’ve just added an edit to my reply above. I think the issue could be the fact that the elements within laplacian have their batch as 1 while y has its batch as B? That could surely result in an error? If I make y have a batch of 1, like in my edit, it gives the correct shape; but if I put B = 1000 I get a recursion error, which I assume is a different problem entirely and nothing to do with my custom function?

EDIT: I think that might be the problem, as A is stored within ctx in the forward method of CustomDeterminant. So the y = Net(x) line will have a stored A of size [B, N, N], but the A stored within the Net of the laplacian Tensor will have size [1, N, N]. That surely would cause a mismatch in dimensions?

Ho, if one has a batch of 1 and the other B, then it will broadcast, which is not what you want to do here for sure. This is most likely the issue indeed.


So, it seems that my custom function is completely fine and it was the loss function that was causing the issue?

Also, the “recursion error” I got was due to make_dot from torchviz trying to build a graph for a network that calls its hessian 1000 times… so it makes sense that it errors!

So, it seems the error was entirely due to my loss function not accommodating the fact that my Laplacian is calculated with a batch of 1!

Thank you ever so much for putting up with me! Now that I look back on it, it was kinda sitting there looking me right in the face, hiding in plain sight! :smiley:

But that just raises another question: why does PyTorch’s det function handle that fine, then?