or
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.
This error pops up while I try to train a transformer model from scratch in Colab. It runs on the CPU, though!
I am not familiar with Colab, but could you post some potentially relevant code as well, so it's easier for us to diagnose the cause of your issue?
Your code seems to be running into a sticky error, which wasn't raised properly. Could you update to the latest nightly binary and rerun your code to check if the error message improves, please?
Actually, I was working on machine translation, both via a Seq2Seq LSTM encoder-decoder model and via a Seq2Seq Transformer architecture.
I swapped the Transformer model with the Seq2Seq LSTM encoder-decoder model (keeping the same dataloaders) to check whether the error is caused by dimensionality mismatches in the inputs/outputs of layers inside the model. It turns out the Seq2Seq LSTM encoder-decoder fails too. To be noted: the Seq2Seq model trained fine on Colab previously.
I haven't been able to get it running on Colab yet!
Colab has lately been throwing the following error too, restricting pip installs within the runtime:
NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
I'm unsure if you are still running into the same issue or a new one using the latest nightly binaries.
In any case, could you rerun the code with CUDA_LAUNCH_BLOCKING=1 and check the stacktrace to see what exactly is failing?
I believe this was related to a version mismatch between CUDA and cuDNN on the host instance (Colab).
The persistent error log remains:
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.
P.S.: To make sure the DataLoader's multiprocessing was not the cause, I also set num_workers=1.
Also tried with torch.backends.cudnn.benchmark = True
Since you are still getting random error messages, the env variable was most likely not set correctly.
Could you post a minimal and executable code snippet to reproduce the issue, please?
Thanks for your patience. I already set the environment variable using the following commands:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
One observation: with or without the aforementioned snippet for setting the environment variable, when I run the following training script of a simple sequential model in PyTorch, it runs fine.
import torch
import math
import os

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
model = model.to(device)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-6

xx = xx.to(device)
y = y.to(device)

for t in range(2000):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()
    loss.backward()

    # Update the weights with plain gradient descent using learning_rate.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
But when I run the model with my data loader for the aforementioned task first, neither its training loop starts, nor does the training loop of the simple sequential model quoted above; both yield the following debug log. It gets resolved when I restart the runtime and run the simple sequential model.
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
I am still looking into what in my setup (the Seq2Seq model or the dataloader) is breaking!
I changed the transformer model to a simpler version, a snippet of which is quoted below:
import torch
import torch.nn as nn

INPUT_DIM = 358    # len(eng_vocab.stoi)
OUTPUT_DIM = 617   # len(ger_vocab.stoi)
dim_model = 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.english_embedding = nn.Embedding(INPUT_DIM, dim_model)
        self.german_embedding = nn.Embedding(OUTPUT_DIM, dim_model)
        self.transformer = nn.Transformer(d_model=dim_model,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          dropout=0.5, dim_feedforward=2048)
        self.fc1 = nn.Linear(dim_model, OUTPUT_DIM)

    def forward(self, inputs, targets):
        x = self.english_embedding(inputs)
        y = self.german_embedding(targets)
        tgt_mask = torch.triu(torch.ones(targets.size(0), targets.size(0)), diagonal=1).bool().to(device)
        out = self.transformer(x, y, tgt_mask=tgt_mask)
        out = self.fc1(out.permute(1, 0, 2))  # (batch, sequence, feature)
        return out.permute(1, 0, 2).reshape(-1, OUTPUT_DIM)  # (sequence, batch, feature)

model = Net().to(device)
I passed in random tensors of the size yielded by the data loader generator function, where input_seq = english_language_tokens and output_seq = german_language_tokens. A sketch of that test is quoted below.
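A minimal sketch of that random-tensor test (the shapes below are placeholders rather than the real loader's sizes; it reuses INPUT_DIM, OUTPUT_DIM, model and device from the snippet above):

src_len, tgt_len, batch_size = 10, 12, 32   # placeholder shapes

# Random token indices, kept strictly below the vocabulary sizes.
inputs = torch.randint(0, INPUT_DIM, (src_len, batch_size), device=device)
targets = torch.randint(0, OUTPUT_DIM, (tgt_len, batch_size), device=device)

out = model(inputs, targets)
print(out.shape)   # (tgt_len * batch_size, OUTPUT_DIM)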
If you run snippets 1 and 2, you'll see it runs perfectly!
I reduced the batch size to 1.
However, when I pass in the data loader for the English and German language tokens (the original ones from the dataset), it runs for an hour or so before the error whose log is quoted below is generated.
Unfortunately, your current code does not reproduce the issue, so I won't be able to debug it.
I doubt it and would claim that your input tensors might contain vocabulary indices which are out of bounds for, e.g., your embeddings.
The error reporting might be broken if you are using torch==1.13.1, and the blocking-launches env variable might be set too late, so you are running into CUDA errors which look random.
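As a small, standalone illustration of this failure mode (not your code; the sizes here are made up):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=358, embedding_dim=256).cuda()

# Indices in [0, 357] are fine.
ok = torch.randint(0, 358, (10,), device='cuda')
print(emb(ok).shape)

# An index >= num_embeddings triggers a device-side assert. Because kernel
# launches are asynchronous, the error may only surface at a later, unrelated
# call unless CUDA_LAUNCH_BLOCKING=1 is set.
bad = torch.tensor([400], device='cuda')
out = emb(bad)
torch.cuda.synchronize()   # forces the assert to be reported here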
Thanks! I think I missed the <OOV> token while building the vocabulary, assuming that just <START> and <END> should suffice. I did that under the assumption that the vocabulary was built on the entire data frame's content, hence no <OOV> token would be encountered during training, though it may be encountered during inference. A sketch of the fallback I have in mind is quoted below.
This may be one of the errors causing the highlighted failure! I will recheck the code.
The explanation looks intuitive enough, especially given how the code breaks at ~19k steps of epoch 0!
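A minimal sketch of the <OOV> fallback I plan to add (placeholder tokens; the real eng_vocab/ger_vocab objects differ):

special_tokens = ['<PAD>', '<START>', '<END>', '<OOV>']
stoi = {tok: i for i, tok in enumerate(special_tokens)}
for tok in ['hello', 'world']:              # placeholder for the corpus vocabulary
    stoi.setdefault(tok, len(stoi))

oov_idx = stoi['<OOV>']

def encode(tokens):
    # Unknown tokens map to <OOV> instead of producing out-of-range indices.
    return [stoi.get(tok, oov_idx) for tok in tokens]

print(encode(['hello', 'unseen_word', 'world']))   # -> [4, 3, 5]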
Sure, let me know once youâve isolated the issue.
For the sake of clarity: the error reporting on torch==1.13.1 was not properly capturing all CUDA asserts and was thus producing error messages which are not really helpful.
Since you are seeing seemingly "random" CUDA errors, it seems you are also running into such a case.
The current nightly binaries would fix it, so you could also consider updating, which should then yield a proper error message again.
CUDA_LAUNCH_BLOCKING=1 is used for debugging and disables asynchronous kernel launches.
I would not recommend setting it inside your Python script, but rather exporting this environment variable in your terminal before launching the script (e.g. CUDA_LAUNCH_BLOCKING=1 python script.py, where script.py stands for your training script).