Detect anomaly in C++

facug91 · June 26, 2019, 12:27pm

I’m getting a -nan loss during training, so I searched this forum for a solution. I found this discussion, where the Anomaly Detection module is recommended. But I’m working in C++ and couldn’t find out whether it’s available there, or find any documentation. If someone has an example or documentation, I’d appreciate it.
Thanks in advance.

tom · June 26, 2019, 12:49pm

Have you tried torch::autograd::AnomalyMode::set_enabled(true); from torch/csrc/autograd/anomaly_mode.h?

Best regards

Thomas

facug91 · June 26, 2019, 1:39pm

I tried that, but I’d forgotten to add the include path. Thank you, Thomas.
Anyway, now the forward pass is throwing a SIGSEGV, and I don’t know why.

pascal.soveaux · April 27, 2020, 11:09am

Hello,

Just to let you know that it does not work for me either.

I got the following error: Exception in thread pool task: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDAFloatType [122, 512]], which is output 0 of SelectBackward, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Every time I enable AnomalyMode with v1.5, I get a SIGSEGV in my module when I call x.to(device). I do not have enough time to debug this right now, and it is not a critical issue on my side. I will look into it as soon as I have time, but I would like to know whether this is working for anyone.

Thank you

Pascal

DuaneNielsen · October 2, 2020, 5:05pm

//
// Created by duane on 10/2/20.
//

#include <iostream>
#include <torch/torch.h>

int main() {
    // Enabling anomaly mode is the only difference from the working version.
    torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    x.backward();
    std::cout << x.grad() << std::endl;
}

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I suppose you need to do a bit more than just set anomaly mode.

DuaneNielsen · October 2, 2020, 5:10pm

This works OK, in case anyone was wondering about my environment.

//
// Created by duane on 10/2/20.
//

#include <iostream>
#include <torch/torch.h>

int main() {
    // Anomaly mode left disabled: the program runs to completion.
    //torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    x.backward();
    std::cout << x.grad() << std::endl;
}

1
[ CPUFloatType{1} ]

tom · October 2, 2020, 6:08pm

@DuaneNielsen Thank you! In particular for posting the examples here, too.

So indeed, the anomaly detection doesn’t work in C++. Mea culpa.
What happens is that the “anomaly detection” used in the C++ stack has nulled-out functions (i.e. it’s just a placeholder). When you flip the switch as described, the engine calls such a null function. Boom.
In Python, you have the PythonEngine, a relatively small subclass of the (Autograd) Engine that, crucially, makes make_anomaly_metadata return a PyAnomalyMetadata instance as its AnomalyMetadata instance. The Engine then calls the metadata’s store_stack method (upon execution) and its print_stack method (upon error). One would need to amend the AnomalyMetadata class (or a subclass) to implement these, likely using the c10 function get_backtrace (which is conveniently there) to attach a C++ backtrace.
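
To make the mechanism above concrete, here is a minimal, self-contained sketch of the pattern: an engine with a virtual factory for metadata objects, a placeholder metadata class that records nothing, and a subclass that records where a node was created and reports it on error. This is only an illustration and not the PyTorch code. All class names (Metadata, TracingMetadata, Node, Engine, TracingEngine) are hypothetical, and a plain call-site string stands in for the real backtrace.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical placeholder metadata: both hooks exist but do nothing useful.
struct Metadata {
    virtual ~Metadata() = default;
    // Called when a node is created during the forward pass.
    virtual void store_stack(const std::string& /*call_site*/) {}
    // Called when the node's backward step fails; returns the report text.
    virtual std::string print_stack(const std::string& /*node_name*/) const {
        return "";
    }
};

// Hypothetical subclass that actually records and reports the call site.
struct TracingMetadata : Metadata {
    void store_stack(const std::string& call_site) override {
        call_site_ = call_site;
    }
    std::string print_stack(const std::string& node_name) const override {
        return "Error detected in " + node_name + ". Forward call: " + call_site_;
    }
private:
    std::string call_site_;
};

// A graph node owns the metadata that the engine's factory handed out.
struct Node {
    std::string name;
    bool fails_in_backward;
    std::unique_ptr<Metadata> metadata;
};

class Engine {
public:
    virtual ~Engine() = default;
    // Subclasses override this factory to choose the metadata type.
    virtual std::unique_ptr<Metadata> make_anomaly_metadata() const {
        return std::make_unique<Metadata>();
    }
    // Forward pass: create a node and let its metadata record the call site.
    Node make_node(std::string name, bool fails, const std::string& call_site) const {
        Node node{std::move(name), fails, make_anomaly_metadata()};
        assert(node.metadata && "the factory must not return null");
        node.metadata->store_stack(call_site);
        return node;
    }
    // Backward pass: visit nodes in reverse and report the first failure.
    std::string run_backward(const std::vector<Node>& nodes) const {
        for (auto it = nodes.rbegin(); it != nodes.rend(); ++it) {
            if (it->fails_in_backward) {
                return it->metadata->print_stack(it->name);
            }
        }
        return "";
    }
};

class TracingEngine : public Engine {
public:
    std::unique_ptr<Metadata> make_anomaly_metadata() const override {
        return std::make_unique<TracingMetadata>();
    }
};
```

With the base Engine, a failing node produces an empty report. With TracingEngine, the same node reports the call site stored at creation time, which is the role a real backtrace would play.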

Maybe I should do a live-coding session on fixing it.

Best regards

Thomas

DuaneNielsen · October 2, 2020, 7:12pm

@tom If you could, I think it would be pretty awesome.

In my case, I would use it to resolve backward pass problems in Erwin’s https://github.com/google-research/tiny-differentiable-simulator

I suspect some tooling would be helpful in this endeavor. I’ll go at it the old-fashioned way for a bit, and if it’s too hard, then perhaps I’ll see if I can hack my own checks.

Thanks for explaining the code, appreciated.

tom · October 6, 2020, 8:36pm

anomaly_cpp.cpp

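#include <iostream>
#include <torch/torch.h>

int main() {
    torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    auto y = x * x;
    auto z = y * y;  // the backward of this multiplication needs y
    y += 1;          // in-place change to y after it was saved for backward
    z.backward();    // fails because y is no longer at the saved version
    std::cout << x.grad() << std::endl;
}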

$ g++ -g anomaly_cpp.cpp  -I ../pytorch/build/lib.linux-x86_64-3.8/torch/include/torch/csrc/api/include/ -I ../pytorch/build/lib.linux-x86_64-3.8/torch/include/ -L ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/   -ltorch -lc10 -ltorch_cpu

$ LD_LIBRARY_PATH=../pytorch/build/lib.linux-x86_64-3.8/torch/lib/ ../scripts/a.out  
[W anomaly_mode.cpp:24] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
frame #0: torch::autograd::AnomalyMetadata::store_stack() + 0x22 (0x7fd3f57f5a32 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #1: <unknown function> + 0x1e3a3e9 (0x7fd3f54bb3e9 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x1f1f612 (0x7fd3f55a0612 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xded642 (0x7fd3f446e642 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x15dd2aa (0x7fd3f4c5e2aa in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #5: at::Tensor::mul(at::Tensor const&) const + 0x9d (0x7fd3f4c6104d in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4fbe (0x563f08413fbe in ../scripts/a.out)
frame #7: <unknown function> + 0x50e0 (0x563f084140e0 in ../scripts/a.out)
frame #8: __libc_start_main + 0xea (0x7fd3f329acca in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x467a (0x563f0841367a in ../scripts/a.out)
 (function _print_stack)
terminate called after throwing an instance of 'std::runtime_error'
  what():  one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [1]], which is output 0 of MulBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Aborted

Hehehe. Now I need to add a fix for skipping the PyTorch frames, see if I get a line number for my frame, and upload the video.
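
For readers wondering about the “is at version 1; expected version 0” part of the message above: it comes from a version counter check on values saved for the backward pass. The following is a toy, self-contained sketch of that idea, mirroring the y * y followed by y += 1 example. It is not PyTorch’s implementation. ToyTensor, SavedValue, SquareBackward and demo_inplace_error are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Toy tensor: copies share one state, so an in-place change is visible to all.
struct ToyTensor {
    struct State {
        double value;
        unsigned version;
    };
    std::shared_ptr<State> state;

    explicit ToyTensor(double v) : state(std::make_shared<State>(State{v, 0})) {}
    double value() const { return state->value; }
    // In-place update: changes the value and bumps the version counter.
    void add_(double d) {
        state->value += d;
        ++state->version;
    }
};

// A value saved for backward remembers the version it saw when it was saved.
struct SavedValue {
    ToyTensor tensor;
    unsigned expected_version;

    explicit SavedValue(const ToyTensor& t)
        : tensor(t), expected_version(t.state->version) {
        assert(tensor.state && "cannot save an uninitialised tensor");
    }
    // Refuse to hand out the value if it was modified after being saved.
    double unpack() const {
        if (tensor.state->version != expected_version) {
            throw std::runtime_error(
                "saved value is at version " +
                std::to_string(tensor.state->version) + "; expected version " +
                std::to_string(expected_version) + " instead");
        }
        return tensor.value();
    }
};

// Backward of z = y * y: dz/dy = 2 * y, so it needs the saved y.
struct SquareBackward {
    SavedValue saved_input;
    explicit SquareBackward(const ToyTensor& input) : saved_input(input) {}
    double apply(double grad_output) const {
        return grad_output * 2.0 * saved_input.unpack();
    }
};

// Mirrors the example: y is changed in place after z = y * y saved it.
std::string demo_inplace_error() {
    ToyTensor x(5.0);
    ToyTensor y(x.value() * x.value());
    SquareBackward z_grad_fn(y);  // saves y at version 0
    y.add_(1.0);                  // y is now at version 1
    try {
        z_grad_fn.apply(1.0);
        return "no error";
    } catch (const std::runtime_error& e) {
        return e.what();
    }
}
```

Calling demo_inplace_error() returns the version mismatch message, while the same backward step without the add_ call returns the gradient normally. The check only says that the saved value changed, not where, which is why a trace of the forward call is needed on top of it.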

DuaneNielsen · October 6, 2020, 10:06pm

Awesome! Can’t wait to give this a try!

tom · October 28, 2020, 3:22pm

So here is the video and a link to the PR: https://lernapparat.de/pytorch-cpp-anomaly/

Best regards

Thomas