Pytest + RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Hi all,

Is anyone aware of a problem when running pytest and getting the error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The tests are for use within a large training pipeline (e.g. testing complete model training for a couple of epochs, retraining of models, etc.), so I'm not sure I can reduce it to a simple snippet demonstrating the issue.
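For reference, a minimal snippet (purely illustrative, not taken from the pipeline in question) that produces this exact message is calling backward() on a value computed entirely from tensors that don't require gradients:

```python
import torch

# Illustrative only: a result built solely from tensors with
# requires_grad=False (the default) has no grad_fn, so backward()
# raises the RuntimeError quoted above.
x = torch.randn(4, 3)     # requires_grad defaults to False
loss = (x * 2).sum()      # loss.grad_fn is None
try:
    loss.backward()
except RuntimeError as err:
    print(err)            # "element 0 of tensors does not require grad ..."
```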

Basically, if I run the entire test suite using:

pytest tests/

then some of the tests will fail.

If I run those failing tests at the module level:

pytest tests/modulename/module/test_module.py

then they pass.

The tests all pass on my local machine (I have to exclude some, as my machine can't handle the overhead), and they pass when I run them individually on the remote machine. It's pretty annoying, as it's interfering with our deployment procedure and I know the tests pass.

Local environment (windows):

NVIDIA-SMI 497.29       Driver Version: 497.29       CUDA Version: 11.5     

pytorch                   1.11.0          py3.8_cuda11.3_cudnn8_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.11.0               py38_cu113    pytorch
torchio                   0.18.43                  pypi_0    pypi
torchsummary              1.5.1                    pypi_0    pypi
torchvision               0.12.0               py38_cu113    pytorch

Remote environment (ubuntu):

NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4   

pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.11.0               py38_cu113    pytorch
torchio                   0.18.43                  pypi_0    pypi
torchsummary              1.5.1                    pypi_0    pypi
torchvision               0.12.0               py38_cu113    pytorch

Could you check whether you've disabled gradient computation in your test scripts?
It would also be worth checking whether a previous test disabled it without proper cleanup.
Maybe try shuffling the tests, or excluding some of the earlier tests, to see if the problematic test still fails.

There’s no torch.no_grad() set in the tests.

There were a few tests that fed random tensors into models. I initially thought that might have been the cause, so I set requires_grad=True everywhere possible.

When this issue started, it was just one module's tests (call it module_A) that were failing. I couldn't decipher the cause then (including after a driver and environment reinstall), so the workaround was to disable those module tests and run them locally for the time being. Last week the same error started with another module's tests that had definitely been running fine before (locally, remotely, and on our action-runner tests). I noticed they only seemed to fail when running the entire test suite (as in the original post), not when run per module.

I’ve re-written our test scripts to simply run as:

pytest tests/module_A
pytest tests/module_B
# etc.

… and now the originally failing tests pass (so hooray).

My immediate problem is resolved, but I hadn't seen anyone posting about this particular issue before, so there you go.

I wouldn't check for no_grad() usages, as those are local (and your local tests seem to work), but rather for global switches such as torch.set_grad_enabled (or torch.enable_grad / torch.no_grad used without proper scoping).
If any test disables gradient calculation globally, it should also reset it to the default/previous setting. If that's not the case, other tests could of course fail.
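One way to make that reset automatic (a sketch; the fixture name is made up) is a generator that snapshots the global grad mode before each test and restores it afterwards. Registered in conftest.py with @pytest.fixture(autouse=True), it would wrap every test in the suite:

```python
import torch

def restore_grad_mode():
    # Hypothetical conftest.py fixture body: decorate with
    # @pytest.fixture(autouse=True) so it runs around every test,
    # snapshotting the global grad mode and restoring it on teardown.
    prev = torch.is_grad_enabled()
    yield
    torch.set_grad_enabled(prev)

# Manual demonstration of the setup/teardown behaviour:
guard = restore_grad_mode()
next(guard)                       # setup: snapshot the current mode
torch.set_grad_enabled(False)     # a careless test flips the switch
next(guard, None)                 # teardown: previous mode restored
print(torch.is_grad_enabled())    # True again (the default)
```

With a guard like this in place, a test that forgets its own cleanup can no longer poison the tests that run after it in the same process.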