Hi, I just started working on a new workstation with an RTX 4080 Super, and I'm training a transformer with Optuna to find the best hyperparameters.
During training something strange happens: the first epoch runs fine, but during the second one I get this error:
Traceback (most recent call last):
  Cell In[65], line 1
    study.optimize(objective, n_trials=50)
  File ~\.conda\envs\torch\lib\site-packages\optuna\study\study.py:451 in optimize
    _optimize(
  File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:62 in _optimize
    _optimize_sequential(
  File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:159 in _optimize_sequential
    frozen_trial = _run_trial(study, func, catch)
  File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:247 in _run_trial
    raise func_err
  File ~\.conda\envs\torch\lib\site-packages\optuna\study\_optimize.py:196 in _run_trial
    value_or_values = func(trial)
  Cell In[62], line 145 in objective
    loss.backward()
  File ~\.conda\envs\torch\lib\site-packages\torch\_tensor.py:522 in backward
    torch.autograd.backward(
  File ~\.conda\envs\torch\lib\site-packages\torch\autograd\__init__.py:266 in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
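For context, the objective function follows the usual Optuna pattern; a heavily simplified sketch (not my exact code, with dummy data and an illustrative search space) looks like this:

import optuna
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def objective(trial):
    # Illustrative hyperparameters -- the real search space is larger
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    d_model = trial.suggest_categorical("d_model", [128, 256, 512])

    # Dummy sequence data standing in for my real dataset
    x = torch.randn(512, 16, d_model)
    y = torch.randn(512, 16, d_model)
    loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
        num_layers=2,
    ).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(5):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()  # this is the call that crashes during the second epoch
            optimizer.step()
    return loss.item()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)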
The strange part is that I'm running the same code on two other PCs (one with an A4000 and one with a GTX 2070) with no problems at all.
To keep the GPU memory as free as possible, I clear the cache every now and then (roughly as in the snippet below).
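By "clearing the cache" I mean something along these lines, called every now and then during training and between trials (sketch):

import gc
import torch

def free_gpu_memory():
    # empty_cache() only releases cached blocks whose tensors are no longer referenced,
    # so drop the Python references (del model, optimizer, ...) and collect first
    gc.collect()
    torch.cuda.empty_cache()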
I'm also fairly sure memory isn't the issue: before starting the Optuna optimization, I tried the maximum model size together with the maximum batch size and there were no problems.
Virtual environment specifics:
Python 3.10.14
torch 2.2.2
CUDA 12.1
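A quick sanity check from inside the environment confirms what PyTorch itself sees (sketch):

import torch

print(torch.__version__)              # 2.2.2
print(torch.version.cuda)             # 12.1 (the CUDA version this torch build was compiled against)
print(torch.cuda.is_available())      # True
print(torch.cuda.get_device_name(0))  # the 4080 Super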
UPDATE:
Just to be sure, I ran this simple line and got another error:
t = torch.tensor([1,2], device=device)
Traceback (most recent call last):
  Cell In[18], line 1
    t = torch.tensor([1,2], device=device)
RuntimeError: CUDA error: unspecified launch failure
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I restarted the environment and the line works again. It seems that at some point during training, moving tensors to the GPU starts failing, but I don't understand why.
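Since CUDA errors are reported asynchronously, my next step will probably be to rerun with synchronous kernel launches so the traceback points at the operation that actually fails; something like this at the very top of the session (sketch):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous launches; must be set before any CUDA work

import torch

device = torch.device("cuda")
t = torch.tensor([1, 2], device=device)  # errors should now surface at the call that really triggers them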