I’m getting -nan loss in my training, so I searched this forum for a solution. I found this discussion where the Anomaly Detection module is recommended. But I’m working in C++, and I couldn’t find out whether it’s available there, or any documentation for it. If someone has an example or documentation, I’d appreciate it.
Thanks in advance.
Have you tried torch::autograd::AnomalyMode::set_enabled(true); from torch/csrc/autograd/anomaly_mode.h?
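Something along these lines (a minimal sketch; the explicit anomaly_mode.h include may be redundant if torch/torch.h already pulls it in on your build):
#include <torch/torch.h>
#include <torch/csrc/autograd/anomaly_mode.h>

int main() {
    // C++ counterpart of torch.autograd.set_detect_anomaly(True):
    // enable it once, before the forward/backward passes you want to trace.
    torch::autograd::AnomalyMode::set_enabled(true);

    // ... build the model and run training as usual ...
    return 0;
}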
Best regards
Thomas
I tried that, but I’d forgotten to add the include path. Thank you, Thomas.
Anyway, now the forward pass is throwing a SIGSEGV, and I don’t know why.
Hello,
Just to let you know that it does not work for me either.
I got this error: Exception in thread pool task: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDAFloatType [122, 512]], which is output 0 of SelectBackward, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Every time I enable AnomalyMode with v1.5, I get a SIGSEGV in my module when I call x.to(device). I do not have enough time to debug this right now, and it is not a critical issue on my side. I will take a look into it as soon as I have time, but I would like to know if this is working for anyone.
Thank you
Pascal
//
// Created by duane on 10/2/20.
//
#include <torch/torch.h>
#include <iostream>

int main(int argc, char *argv[]) {
    // Enabling anomaly mode is the only difference from the working example below.
    torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    x.backward();
    std::cout << x.grad() << std::endl;
}
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
I suppose you need to do a bit more than just set anomaly mode.
This works OK, in case anyone was wondering about my environment.
//
// Created by duane on 10/2/20.
//
#include <torch/torch.h>
#include <iostream>

int main(int argc, char *argv[]) {
    // torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    x.backward();
    std::cout << x.grad() << std::endl;
}
1
[ CPUFloatType{1} ]
@DuaneNielsen Thank you! In particular for posting the examples here, too.
So indeed, the anomaly detection doesn’t work in C++. Mea culpa.
What happens is that the “anomaly detection” used in the C++ stack has nulled-out functions (i.e. it’s just a placeholder). When you flip the switch as described, the engine ends up calling such a null function. Boom.
In Python, you have the PythonEngine, a relatively small subclass of the (autograd) Engine, which crucially makes make_anomaly_metadata return a PyAnomalyMetadata instance as its AnomalyMetadata. The Engine then calls the metadata’s store_stack method (upon execution) and its print_stack method (upon error). One would need to amend the AnomalyMetadata class (or a subclass) to implement these, likely using the c10 function get_backtrace (which is conveniently there) to attach a C++ backtrace.
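To make the “conveniently there” part concrete, here is a minimal standalone sketch of grabbing a C++ backtrace with c10::get_backtrace from c10/util/Backtrace.h; the argument names below follow the defaulted signature and are an assumption about your LibTorch version, so treat it as a sketch rather than the actual fix:
#include <iostream>
#include <string>
#include <c10/util/Backtrace.h>

int main() {
    // Capture the current C++ call stack as a formatted string.
    // A store_stack() implementation would do this when a Node is created
    // during the forward pass and stash the string on the anomaly metadata,
    // so print_stack() can emit it when backward() hits an error.
    std::string stack = c10::get_backtrace(/*frames_to_skip=*/0,
                                           /*maximum_number_of_frames=*/32);
    std::cout << stack << std::endl;
    return 0;
}
Link against c10 (and torch_cpu) as in the compile line further down.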
Maybe I should make a live coding thing of fixing it.
Best regards
Thomas
@tom If you could, I think it would be pretty awesome.
In my case, I would use it to resolve backward pass problems in Erwin’s https://github.com/google-research/tiny-differentiable-simulator
I suspect some tooling would be helpful in this endeavor. I’ll go at it the old fashioned way for a bit, and if it’s too hard, then perhaps I’ll see if I can hack my own checks.
Thanks for explaining the code, appreciated.
anomaly_cpp.cpp
#include <torch/torch.h>
#include <iostream>

int main(int argc, char *argv[]) {
    torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    auto y = x * x;
    auto z = y * y;
    // In-place modification of y after it was saved for z's backward:
    // exactly the kind of error anomaly mode should point at.
    y += 1;
    z.backward();
    std::cout << x.grad() << std::endl;
}
$ g++ -g anomaly_cpp.cpp -I ../pytorch/build/lib.linux-x86_64-3.8/torch/include/torch/csrc/api/include/ -I ../pytorch/build/lib.linux-x86_64-3.8/torch/include/ -L ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/ -ltorch -lc10 -ltorch_cpu
$ LD_LIBRARY_PATH=../pytorch/build/lib.linux-x86_64-3.8/torch/lib/ ../scripts/a.out
[W anomaly_mode.cpp:24] Warning: Error detected in MulBackward0. Traceback of forward call that caused the error:
frame #0: torch::autograd::AnomalyMetadata::store_stack() + 0x22 (0x7fd3f57f5a32 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #1: <unknown function> + 0x1e3a3e9 (0x7fd3f54bb3e9 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x1f1f612 (0x7fd3f55a0612 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xded642 (0x7fd3f446e642 in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x15dd2aa (0x7fd3f4c5e2aa in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #5: at::Tensor::mul(at::Tensor const&) const + 0x9d (0x7fd3f4c6104d in ../pytorch/build/lib.linux-x86_64-3.8/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x4fbe (0x563f08413fbe in ../scripts/a.out)
frame #7: <unknown function> + 0x50e0 (0x563f084140e0 in ../scripts/a.out)
frame #8: __libc_start_main + 0xea (0x7fd3f329acca in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x467a (0x563f0841367a in ../scripts/a.out)
(function _print_stack)
terminate called after throwing an instance of 'std::runtime_error'
what(): one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [1]], which is output 0 of MulBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Aborted
Hehehe. Now I need to add a fix for skipping the PyTorch frames, see if I get a line number for my own frame, and upload the video.
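As an aside (a sketch based on the toy example above, not part of the PR): once the backtrace points at the in-place op, changing the modification of y to an out-of-place one lets the backward pass go through:
#include <torch/torch.h>
#include <iostream>

int main(int argc, char *argv[]) {
    torch::autograd::AnomalyMode::set_enabled(true);
    auto x = torch::tensor({5.0}, torch::requires_grad());
    auto y = x * x;
    auto z = y * y;
    // Out-of-place add: y is rebound to a new tensor, so the tensors saved
    // for z's MulBackward0 keep their expected version counters.
    y = y + 1;
    z.backward();
    std::cout << x.grad() << std::endl;  // z = x^4, so dz/dx = 4*x^3 = 500
}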
Awesome! Can’t wait to give this a try!
So here is the video and a link to the PR: https://lernapparat.de/pytorch-cpp-anomaly/
Best regards
Thomas