I am trying to port ResNet to the PyTorch C++ API and train it on MNIST. My training code is taken from the PyTorch example linked in the documentation.
While running my code I receive one of two errors when calling loss.backward(). My understanding is that these errors are probably red herrings and that there is some underlying error in my network that I haven't discovered.
When I run my code directly I receive:
terminate called after throwing an instance of 'std::runtime_error'
what(): cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:258
When I step through my code with a debugger I receive:
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=24 error=59 : device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorMath.cu:24
To debug this, it's recommended that I use torch.cuda.synchronize() so the error surfaces at the offending call instead of being deferred to a later operation.
My question is: how can I call torch.cuda.synchronize() (or its equivalent) from C++ for PyTorch? I see the underlying call here but don't know how to call it from C++.
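For context, here is a sketch of the kind of synchronization I'm hoping to trigger, assuming a CUDA-enabled libtorch build. cudaDeviceSynchronize() is the plain CUDA runtime call, and at::cuda::getCurrentCUDAStream() comes from ATen's CUDA headers; I'm not certain either is the intended way to do this from C++:

```cpp
// Sketch only: requires a CUDA-enabled libtorch build and a GPU.
#include <cuda_runtime.h>            // CUDA runtime API
#include <ATen/cuda/CUDAContext.h>   // at::cuda::getCurrentCUDAStream
#include <stdexcept>

void sync_for_debugging() {
  // Option 1: block until all queued kernels on the device have finished,
  // so any deferred device-side assert is reported at this point.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(err));
  }

  // Option 2: synchronize only the current PyTorch CUDA stream.
  at::cuda::getCurrentCUDAStream().synchronize();
}
```

I'm also aware that setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, which might achieve the same effect without code changes, but I'd still like to know the proper C++ call.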
If you’re interested in a repro, I have made the code available at: https://gist.github.com/JoshVarty/143aa35c0efc25d29d18ac523fbb597c
It's possible I'm doing something very obviously wrong, as I'm not super familiar with PyTorch or C++.