Hi,

I modify the gradients of the network inside a function that begins with torch::autograd::GradMode::set_enabled(false). Then, when I call optimizer->step(), I get a dimension-mismatch error. Without calling the function, everything works fine.
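For context, the function has roughly this structure (a sketch only: modify_gradients and new_grad are placeholder names, and the actual computation of the new gradient values is elided):

```cpp
#include <torch/torch.h>

// Sketch of the gradient-modifying function; the real update rule is elided
// and new_grad is a placeholder for the tensor actually computed.
void modify_gradients(torch::nn::Module& net) {
    torch::autograd::GradMode::set_enabled(false);  // stop autograd from recording
    for (auto& p : net.parameters()) {
        if (!p.grad().defined()) {
            continue;  // skip parameters that have no gradient yet
        }
        torch::Tensor new_grad = p.grad().clone();  // placeholder for the real computation
        p.grad().copy_(new_grad);                   // write the new values in place
    }
    torch::autograd::GradMode::set_enabled(true);   // restore autograd
}
```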
The error is below (note that my network's layer sizes are {10, 32, 64, 64, 64, 32, 12}):
terminate called after throwing an instance of 'c10::Error'
what(): The size of tensor a (32) must match the size of tensor b (31) at non-singleton dimension 0 (infer_size at /pytorch/aten/src/ATen/ExpandUtils.cpp:23)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f5f4ab877d1 in /opt/libtorch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f5f4ab8710a in /opt/libtorch/lib/libc10.so)
frame #2: at::infer_size(c10::ArrayRef<long>, c10::ArrayRef<long>) + 0x487 (0x7f5f3db0ebe7 in /opt/libtorch/lib/libcaffe2.so)
frame #3: at::TensorIterator::compute_shape() + 0x85 (0x7f5f3dd23045 in /opt/libtorch/lib/libcaffe2.so)
frame #4: at::TensorIterator::Builder::build() + 0x2f (0x7f5f3dd24aaf in /opt/libtorch/lib/libcaffe2.so)
frame #5: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x31f (0x7f5f3dd2594f in /opt/libtorch/lib/libcaffe2.so)
frame #6: at::native::add_out(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Scalar) + 0x21d (0x7f5f3db85afd in /opt/libtorch/lib/libcaffe2.so)
frame #7: at::TypeDefault::add_(at::Tensor&, at::Tensor const&, c10::Scalar) const + 0x68 (0x7f5f3e0025a8 in /opt/libtorch/lib/libcaffe2.so)
frame #8: torch::autograd::VariableType::add_(at::Tensor&, at::Tensor const&, c10::Scalar) const + 0x3ac (0x7f5f4b3bee3c in /opt/libtorch/lib/libtorch.so.1)
frame #9: torch::optim::Adam::step() + 0x27b (0x7f5f4b9af34b in /opt/libtorch/lib/libtorch.so.1)
frame #10: /net/ge.unx.sas.com/vol/vol110/u11/aforoo/RL/rllib/cmake-build-debug/rllib() [0x475662]
frame #11: /net/ge.unx.sas.com/vol/vol110/u11/aforoo/RL/rllib/cmake-build-debug/rllib() [0x46fa47]
frame #12: /net/ge.unx.sas.com/vol/vol110/u11/aforoo/RL/rllib/cmake-build-debug/rllib() [0x46a541]
frame #13: /net/ge.unx.sas.com/vol/vol110/u11/aforoo/RL/rllib/cmake-build-debug/rllib() [0x41ad3d]
frame #14: __libc_start_main + 0xf5 (0x7f5f0a1723d5 in /lib64/libc.so.6)
frame #15: /net/ge.unx.sas.com/vol/vol110/u11/aforoo/RL/rllib/cmake-build-debug/rllib() [0x419dc9]
Signal: SIGABRT (Aborted)
I printed the number of weight and gradient elements in each layer, before and after calling the function, and they are equal:
(gdb) p shape_w0
$3 = std::vector of length 12, capacity 16 = {320, 32, 2048, 64, 4096, 64, 4096, 64, 2048, 32, 384, 12}
(gdb) p shape_g0
$4 = std::vector of length 12, capacity 16 = {320, 32, 2048, 64, 4096, 64, 4096, 64, 2048, 32, 384, 12}
(gdb) p shape_w1
$5 = std::vector of length 12, capacity 16 = {320, 32, 2048, 64, 4096, 64, 4096, 64, 2048, 32, 384, 12}
(gdb) p shape_g1
$6 = std::vector of length 12, capacity 16 = {320, 32, 2048, 64, 4096, 64, 4096, 64, 2048, 32, 384, 12}
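For reference, shape_w0/shape_g0 (before) and shape_w1/shape_g1 (after) were collected roughly like this (a sketch; net stands for my module holder):

```cpp
// Sketch of how the shape vectors above were gathered: one entry per
// parameter tensor, holding its total element count.
std::vector<long> shape_w, shape_g;
for (const auto& p : net->parameters()) {
    shape_w.push_back(p.numel());
    if (p.grad().defined()) {
        shape_g.push_back(p.grad().numel());
    }
}
```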
I would appreciate any help or comments.
Afshin