I have been receiving a strange CUDA error that I couldn’t place without a lot of searching:
Traceback (most recent call last):
  File "/panfs/roc/groups/13/suo-yang/dikem003/DimensionReductionNLE/auto_ode/AETrainingConditionNum.py", line 241, in <module>
    batch_loss = reconstruction_loss + stiffness_loss*EPOCH_SCALER
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
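As an aside, one way that flag can be set (a minimal sketch; the same thing can be done by prefixing the launch command in the shell) is before torch is first imported, so it takes effect before the CUDA context is created:

import os

# Must be set before the first CUDA call (i.e. before importing torch) so that
# kernel launches are synchronous and the reported stack trace points at the
# line that actually triggered the error.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch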
The line reported in the stack trace is pretty benign, but after some searching I found this forum post, which suggests that Python exceptions can cause issues with memory deallocation in CUDA.
In my objective function I have to invert a matrix, and since I choose the data stochastically from each batch I can't be sure the matrix is invertible, so I wrap the operation in a try…except block in case it fails. It looks like this:
for batch_num, start_samp in zip(range(num_batches), start_points):
    try:
        start_idx = start_samp
        end_idx = start_samp + self.sample_size
        # slice offset matrices from the latent dynamics
        Y1 = torch.transpose(latent_batches[batch_num, start_idx:end_idx - 1, :], 0, 1)
        Y2 = torch.transpose(latent_batches[batch_num, start_idx + 1:end_idx, :], 0, 1)
        # find linearized Jacobian ODE dynamics
        D = (Y2 @ torch.transpose(Y1, 0, 1)) @ torch.linalg.inv(Y1 @ torch.transpose(Y1, 0, 1))
        # find condition number of estimated dynamics matrix
        cond = torch.linalg.cond(D, p=2)
        # add calculated condition number to accumulator
        condition_number_penalty = torch.add(cond, condition_number_penalty)
    except RuntimeError as err:
        print('singmatrix')
Am I right in thinking this error is a potential memory-deallocation problem within CUDA caused by this exception? If so, is there something I can do directly (while keeping the exception handling I have) to fix the issue?
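One way the diagnosis could be tested while keeping the exception handling, I think, is to log the real error and force a synchronization inside the handler, so an asynchronous CUDA failure would surface there rather than at a later, unrelated line. A rough sketch (cond_penalty_step is just a hypothetical refactoring of one iteration of the loop above):

import torch

def cond_penalty_step(Y1, Y2, penalty):
    """One accumulation step of the condition-number penalty, with diagnostics."""
    try:
        D = (Y2 @ Y1.T) @ torch.linalg.inv(Y1 @ Y1.T)
        return torch.add(torch.linalg.cond(D, p=2), penalty)
    except RuntimeError as err:
        # Print the actual exception so a CUDA failure is distinguishable
        # from a genuine singular-matrix error.
        print(f'matrix inversion failed: {err}')
        # Synchronizing here surfaces any pending asynchronous CUDA error
        # at this point rather than at a later line in the training loop.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return penalty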
One other potential cause might be the case where every iteration in training hits the exception, so that I am just passing a torch.tensor([0]) to my loss function. That never causes an issue on the CPU, but could that behavior cause a problem when training on the GPU, due to some CUDA optimization of the computation graph?
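If that turns out to matter, a minimal sketch of the kind of initialization I could switch to (the shapes below are hypothetical stand-ins for latent_batches) would keep the fallback accumulator on the same device and dtype as the data instead of a bare torch.tensor([0]):

import torch

# Hypothetical shapes standing in for latent_batches from the snippet above.
latent_batches = torch.randn(4, 32, 8,
                             device='cuda' if torch.cuda.is_available() else 'cpu')

# Start the accumulator on the same device and dtype as the data it will be
# added to, so the value handed to the loss is a float tensor on the GPU even
# when every sample falls into the except branch.
condition_number_penalty = torch.zeros((), device=latent_batches.device,
                                       dtype=latent_batches.dtype)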
One thing I was thinking of doing is making a copy of the matrix I want to invert, detached from the computation graph, and performing a check on it to make sure the matrix is invertible. Before I spend too much time on potential solutions, though, I would like to make sure I have diagnosed the problem correctly.
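A rough sketch of the check I have in mind (gram_is_invertible is just a hypothetical helper name; Y1 and Y2 are the matrices from the snippet above):

import torch

def gram_is_invertible(Y1):
    """Check, on a detached copy, whether Y1 @ Y1.T is numerically invertible."""
    with torch.no_grad():
        # The detached Gram matrix carries no autograd history, so the check
        # cannot interact with the computation graph used for the loss.
        G = (Y1 @ Y1.T).detach()
        return bool(torch.linalg.matrix_rank(G, hermitian=True) == G.shape[0])

# e.g. inside the loop above:
#   if gram_is_invertible(Y1):
#       D = (Y2 @ Y1.T) @ torch.linalg.inv(Y1 @ Y1.T)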
Edit: I forgot to mention, the reasons I think the exception handling is the cause of the problem are:
- typically a job fails right AFTER a few of the print statements from the except block appear
- other custom loss functions I use, which are treated identically but without the try…except and matrix inversion, work fine