RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that’s current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.
This error pops up while trying to train a transformer model from scratch in Colab. It runs fine on CPU, though!
I was working on machine translation, both via a Seq2Seq LSTM encoder-decoder model and via a Seq2Seq Transformer architecture.
I swapped the Transformer model for the Seq2Seq LSTM encoder-decoder model (keeping the same dataloaders) to check whether the error was caused by dimensionality mismatches in the I/O of the layers inside the model. It turns out the Seq2Seq LSTM encoder-decoder fails too. Note that the Seq2Seq model previously trained fine on Colab.
I haven’t been able to get it running on Colab yet!
Colab has lately been throwing the following error too, restricting pip installs within the runtime:
NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
I’m unsure if you are still running into the same issue or a new one using the latest nightly binaries.
In any case, could you rerun the code with CUDA_LAUNCH_BLOCKING=1 and check the stacktrace to see what exactly is failing?
I believe that was related to a version mismatch between CUDA and cuDNN on the host instance (Colab).
The persistent error log remains:
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.
P.S.: To ensure the DataLoader’s multiprocessing was not the cause, I also set num_workers=1.
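For what it’s worth, num_workers=1 still spawns one worker process; num_workers=0 is the setting that disables multiprocessing entirely and loads data in the main process. A minimal sketch with a toy stand-in dataset (the real translation dataloader would go here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the translation token batches.
ds = TensorDataset(torch.arange(10).unsqueeze(1))

# num_workers=0 keeps all loading in the main process, so worker-process
# issues are ruled out completely; num_workers=1 still forks one worker.
loader = DataLoader(ds, batch_size=2, num_workers=0)
batches = [b[0] for b in loader]  # 5 batches of shape (2, 1)
```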
I also tried torch.backends.cudnn.benchmark = True.
Thanks for your patience! I had already set the environment variable using the following command:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
One observation: with or without the aforementioned snippet setting the environment variable, the following training script of a simple sequential model in PyTorch runs fine.
import math
import os
import torch

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)  # shape (2000, 3): x, x^2, x^3

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
model = model.to(device)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-6
for t in range(2000):
    xx = xx.to(device)
    y = y.to(device)
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
But when I run the model with my data loader for the aforementioned task, neither its training loop starts, nor does the simple sequential model’s training loop quoted above; both yield the following debug log. It gets resolved when I restart the runtime and run the simple sequential model.
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I am still looking into what part of my setup (the Seq2Seq model or the dataloader) is breaking!
If you run snippets 1 and 2, you’ll see they run perfectly!
Reduced the batch size to 1.
However, when I pass in the data loader for the English and German language tokens (the original ones from the dataset), they run for an hour or so before the error quoted above is generated.
Unfortunately, your current code does not reproduce the issue, so I won’t be able to debug it.
I doubt it and would claim that your input tensors might contain vocabulary indices which are out of bounds for e.g. your embedding layer.
The error reporting might be broken if you are using torch==1.13.1 and the blocking launches env variable might be set too late so that you are running into CUDA errors which look random.
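A cheap way to test this hypothesis is to validate each batch on the CPU before the forward pass; the helper name `check_indices` below is illustrative, not part of any existing API:

```python
import torch

vocab_size = 100
emb = torch.nn.Embedding(vocab_size, 16)

def check_indices(batch, num_embeddings):
    # Fail fast on the host with a readable message, instead of hitting a
    # device-side assert that poisons the CUDA context until a restart.
    assert 0 <= batch.min().item() and batch.max().item() < num_embeddings, (
        f"index out of range: min={batch.min().item()}, "
        f"max={batch.max().item()}, num_embeddings={num_embeddings}")

ids = torch.tensor([0, 5, 99])  # all within [0, vocab_size)
check_indices(ids, vocab_size)
out = emb(ids)                  # shape: (3, 16)
```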
Thanks! I think I missed the <OOV> token while building the vocabulary, assuming that just <START> and <END> should suffice. I did that on the assumption that, since the vocabulary was built on the entire data frame content, no <OOV> token would be encountered during training, though it may be encountered during inference.
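For reference, here is a minimal vocabulary builder that reserves the special tokens up front so unseen words always map to a valid index; the function name and the exact special-token set are illustrative, not taken from my actual code:

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    # Reserve special tokens first; <OOV> guarantees any unseen word
    # still maps to a valid embedding index instead of going out of bounds.
    specials = ["<PAD>", "<START>", "<END>", "<OOV>"]
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = specials + sorted(tok for tok, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab([["hello", "world"], ["hello"]])
oov = stoi["<OOV>"]
encode = lambda sent: [stoi.get(tok, oov) for tok in sent]
```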
This may well be the cause of the highlighted error! I will recheck the code.
The explanation looks intuitive enough, especially given how the code breaks at ~19k steps of epoch 0!
Sure, let me know once you’ve isolated the issue.
For the sake of clarity: the error reporting in torch==1.13.1 was not properly capturing all CUDA asserts and was thus producing error messages that are not really helpful.
Since you are seeing seemingly “random” CUDA errors, it seems you are also running into such a case.
The current nightly binaries would fix it, so you could also consider updating, which should then yield a proper error message again.
CUDA_LAUNCH_BLOCKING=1 is used for debugging and disables asynchronous kernel launches.
I would not recommend setting it inside your Python script; instead, properly export this environment variable in your terminal.
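Concretely, exporting in the shell before launching Python guarantees the variable is set before the CUDA context is created; setting it via os.environ inside the script can be too late if CUDA was already initialized. A quick sanity check:

```shell
export CUDA_LAUNCH_BLOCKING=1
# Verify a child Python process sees the variable:
python3 -c 'import os; print(os.environ.get("CUDA_LAUNCH_BLOCKING"))'  # prints: 1
# Then launch the training script from the same shell session.
```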