In my trainer, if I set accumulate_grad_batches=n with n>1, training fails (with n=1 it trains perfectly, so the problem shouldn't come from my model/implementation). I really can't find the cause of the trouble. Maybe one of you has hit the same error before me and found a solution?
venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7f2624d4a9f0> returned NULL without setting an error
Epoch 0: 0%| | 2/2864 [00:07<2:49:55, 3.56s/it, v_num=5]
In case it might be related to the hardware, I'm running on 2 GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:1B:00.0 Off | Off |
| 30% 38C P8 31W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:1C:00.0 Off | Off |
| 30% 33C P8 24W / 300W | 5MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
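For reference, here is my understanding of what accumulate_grad_batches=n should be doing under the hood, written as a plain-PyTorch sketch (the model, data, and loss here are toy stand-ins, not my actual code; the assumption is that Lightning scales each loss by 1/n and only steps the optimizer every n batches):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
n = 2  # corresponds to accumulate_grad_batches=n
batches = [torch.randn(8, 4) for _ in range(n)]

# Accumulate gradients over n small batches, scaling each loss by 1/n;
# .backward() sums into .grad across iterations because we never zero it.
for x in batches:
    (model(x).pow(2).mean() / n).backward()
acc_grad = model.weight.grad.clone()

# Equivalent single large batch: the accumulated gradient should match.
model.zero_grad()
model(torch.cat(batches)).pow(2).mean().backward()
big_grad = model.weight.grad.clone()
```

With n=1 this degenerates to a plain backward/step per batch, which matches the fact that my run only breaks for n>1.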