PyTorch Errors with Titan RTX [Windows 10]

I get errors after an arbitrary number of iterations when training my attention model, either in the encoder or during the backward pass of the loss. I've tried everything I can think of, but it just won't work on this machine, and the issue doesn't occur on my two other machines. With "pytorch=1.4.0=py3.7_cuda101_cudnn7_0", training crashed after 100 iterations. With "pytorch=1.2.0=py3.7_cuda100_cudnn7_1", it worked for a while, but after I changed my code slightly it crashed with the following error:

“Exception has occurred: RuntimeError
CUDA error: unspecified launch failure
THCudaCheck FAIL file=…\aten\src\THC\THCCachingHostAllocator.cpp line=296 error=4 : unspecified launch failure”

Adding the following lines stops it from crashing:
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

However, that severely slows down training. I'm at a loss as to how to fix this error and haven't found a good solution by searching. Can someone who has encountered this issue, or seen a post describing a fix, please help?

  • I updated my GPU drivers to the current release; was I not supposed to do this?
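Since the CUDA_LAUNCH_BLOCKING workaround above slows training, one option is to turn it on only for debugging runs. With blocking launches, CUDA errors are reported at the kernel that actually failed rather than at some later call, so the traceback is more useful for a bug report. The following is a minimal sketch, not a fix: the helper name enable_blocking_launches and the DEBUG_CUDA switch are hypothetical names chosen for illustration.

```python
import os


def enable_blocking_launches(debug: bool) -> bool:
    """Set or clear CUDA_LAUNCH_BLOCKING for this process.

    Hypothetical helper: call it before CUDA is initialized (in practice,
    before importing torch), otherwise the setting has no effect.
    Returns True if blocking launches are now requested.
    """
    if debug:
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        # Remove the variable so kernels launch asynchronously again.
        os.environ.pop("CUDA_LAUNCH_BLOCKING", None)
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


# DEBUG_CUDA is an assumed, user-defined switch, not a PyTorch or CUDA variable.
debug_mode = os.environ.get("DEBUG_CUDA", "0") == "1"
enable_blocking_launches(debug_mode)

# Import torch only after this point so the setting is picked up.
```

Running once with DEBUG_CUDA=1 to capture an accurate traceback, then switching back for normal training, avoids paying the slowdown on every run.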

Would you please open a new issue at https://github.com/pytorch/pytorch/issues and post the code and your specs there?
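To gather the specs for the issue, something along these lines may help. It is a rough sketch that prints the OS, Python, PyTorch, CUDA, and cuDNN versions plus the GPU name; the function name collect_specs is made up for this example, and the GPU driver version is not included and has to be added by hand.

```python
import platform
import sys


def collect_specs() -> dict:
    """Collect basic environment details for a bug report (illustrative helper)."""
    specs = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch
    except ImportError:
        # Still return the OS and Python details if PyTorch is missing.
        specs["torch"] = "not installed"
        return specs

    specs["torch"] = torch.__version__
    specs["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built with
    specs["cudnn"] = torch.backends.cudnn.version()
    specs["cuda_available"] = torch.cuda.is_available()
    if specs["cuda_available"]:
        specs["gpu"] = torch.cuda.get_device_name(0)
    return specs


if __name__ == "__main__":
    for key, value in collect_specs().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own environment script, which can be run with "python -m torch.utils.collect_env" and produces a more complete report.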