I’m writing some custom code (for a deep RL project) and I’m hitting a device-side assert. I was wondering if it’s technically possible to clear the error flag and retrieve the tensors stored in GPU memory, regardless of the fact that some of them may hold incorrect values.
Once the assert is triggered, the CUDA context is corrupt and executing further CUDA operations will re-raise errors. I would not recommend trying to read any of the data, as it could also be corrupt.
What I don’t understand is how data not involved in the operations that caused the error can be corrupted. I would guess that only the memory locations accessed by the kernel calls involved in the error can be corrupted, not the other tensors in device memory.
Even if some tensors are corrupted, retrieving them at the moment of the error would greatly simplify debugging, especially if the error is a rare logic error. Also, being able to access the tensors and selectively save the parts that are “safe” would mean not wasting days of training when an error shows up late in a run.
I agree that it’s not advisable in general for non-advanced users, but is it possible to turn off the warning and still retrieve the data? If PyTorch doesn’t natively support it, is it possible to bypass it and call cudaMemcpy to retrieve the contents of the GPU memory some other way?
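To make it concrete, this is roughly what I’d like to try (a hypothetical sketch against the CUDA runtime API, not working code; `dev_buf` stands in for a tensor’s `data_ptr()`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float host_buf[256];
    float *dev_buf = nullptr;  // pretend this is a tensor's data_ptr()
    cudaMalloc(&dev_buf, sizeof(host_buf));

    // ... some kernel triggers a device-side assert here ...

    // Read the error flag, hoping to clear it:
    cudaError_t err = cudaGetLastError();
    printf("last error: %s\n", cudaGetErrorString(err));

    // Then try to copy the memory out regardless:
    err = cudaMemcpy(host_buf, dev_buf, sizeof(host_buf),
                     cudaMemcpyDeviceToHost);
    printf("memcpy result: %s\n", cudaGetErrorString(err));
    // Does this cudaMemcpy have any chance of succeeding,
    // or will it always fail after the assert?
    return 0;
}
```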
No, there is no way to recover from a “sticky” device-side assert as it’s corrupting the CUDA context.
Non-sticky errors, such as OOMs raised via cudaMalloc, are more forgiving, and you can continue execution by e.g. allocating less memory. However, once a sticky device-side assert is triggered, the CUDA context is in an invalid state. It would be similar to trying to read any data from the host after a segfault was triggered.
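The difference can be sketched like this (assuming the standard CUDA runtime behavior; the kernel and sizes are just illustrative):

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void bad_kernel(const int *p) {
    // Device-side assert: fires if any element is negative.
    assert(p[threadIdx.x] >= 0);
}

int main() {
    // Non-sticky: a failed cudaMalloc leaves the context usable.
    void *huge = nullptr;
    cudaMalloc(&huge, (size_t)1 << 60);  // absurd size -> allocation error
    cudaGetLastError();                  // reading the flag clears a non-sticky error
    // Smaller allocations and kernel launches still work at this point.

    // Sticky: once bad_kernel's assert fires on the device, every
    // subsequent runtime call (cudaMemcpy, kernel launches, ...) keeps
    // returning an error; cudaGetLastError cannot un-corrupt the context.
    // Only tearing the context down (e.g. ending the process, or
    // cudaDeviceReset, which also destroys all device allocations)
    // gets you back to a clean state -- so the data is gone either way.
    return 0;
}
```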