Transformer training instability - CUDA error

Hi, I just started working on a new workstation with an RTX 4080 Super, and I’m training a transformer with Optuna to find the best hyperparameters (the search loop is roughly like the sketch below).
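For context, this is a trimmed-down sketch of the setup, not the real code: `build_transformer`, `evaluate`, `train_loader` and `n_epochs` are placeholders for things defined elsewhere in the notebook, and the actual objective is much longer.

```python
import optuna
import torch
import torch.nn as nn

device = torch.device("cuda")

def objective(trial):
    # Hyperparameters searched by Optuna (names and ranges are illustrative).
    d_model = trial.suggest_categorical("d_model", [128, 256, 512])
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)

    model = build_transformer(d_model).to(device)   # hypothetical builder
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(n_epochs):                   # n_epochs defined elsewhere
        for x, y in train_loader:                   # train_loader defined elsewhere
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                         # <-- the RuntimeError is raised here
            optimizer.step()

    return evaluate(model)                          # hypothetical validation metric

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```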

During training, something strange happens: the first epoch runs fine, but during the second one I get this error:

Traceback (most recent call last):

Cell In[65], line 1
study.optimize(objective, n_trials=50)

File ~\.conda\envs\torch\lib\site-packages\optuna\study\study.py:451 in optimize
_optimize(

File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:62 in _optimize
_optimize_sequential(

File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:159 in _optimize_sequential
frozen_trial = _run_trial(study, func, catch)

File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:247 in _run_trial
raise func_err

File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:196 in _run_trial
value_or_values = func(trial)

Cell In[62], line 145 in objective
loss.backward()

File ~\.conda\envs\torch\lib\site-packages\torch\_tensor.py:522 in backward
torch.autograd.backward(

File ~\.conda\envs\torch\lib\site-packages\torch\autograd\__init__.py:266 in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)


The strange part is that I’m running the same code on two other PCs (one with an A4000 and one with an RTX 2070) and there are no problems at all.

To keep the GPU as free as possible, I clear the cache every now and then (roughly as in the sketch below).
I’m also pretty sure memory shouldn’t be the issue: before starting the Optuna optimization, I tried the maximum model dimensions together with the maximum batch size and there were no problems.
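The periodic cleanup is essentially this (a minimal sketch, assuming the usual gc.collect() plus torch.cuda.empty_cache() pattern; where exactly it is called in the loop shouldn’t matter here):

```python
import gc
import torch

def clear_gpu_cache():
    # Collect unreachable Python objects first, then ask PyTorch to release
    # its cached (but currently unused) GPU blocks back to the driver.
    gc.collect()
    torch.cuda.empty_cache()
```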

Virtual environment specifics:
Python 3.10.14
torch 2.2.2
CUDA 12.1
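For reference, these versions come from a quick check along the following lines (standard Python/PyTorch attributes):

```python
import sys
import torch

print(sys.version)                    # 3.10.14
print(torch.__version__)              # 2.2.2
print(torch.version.cuda)             # 12.1
print(torch.cuda.get_device_name(0))  # should report the RTX 4080 Super
```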

UPDATE:

Just to be sure, I ran this simple line and got another error:
t = torch.tensor([1,2], device=device)
Traceback (most recent call last):

Cell In[18], line 1
t = torch.tensor([1,2], device=device)

RuntimeError: CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

After restarting the environment, the line works again. It seems that at some point during training, moving tensors to the GPU starts failing, but I don’t understand why.
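If it helps narrow things down, I can rerun with synchronous kernel launches so the traceback points at the operation that actually fails (as far as I understand, CUDA errors can be reported asynchronously, so the cublasSgemm call in the backward pass may not be the real culprit). Something along these lines, set before any CUDA work:

```python
import os

# Force synchronous kernel launches so a CUDA error surfaces at the exact
# call that caused it. Must be set before CUDA is initialized, i.e. before
# the first tensor is moved to the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

device = torch.device("cuda")
```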