CUDA issues during test-runs of distributed training loops within Pytest framework

yutanagano · August 5, 2022, 6:32am

Hello,

I have a relatively complex training loop system, and I want to make sure it continues to behave as expected as I keep developing it. That’s why I maintain unit tests and integration tests alongside my PyTorch code (the test code is written using the Pytest framework).

Most of my testing code consists of simple unit tests, but the last part is a set of integration tests that execute a mock training run of sorts and ensure that the PyTorch model, training logs, etc. are saved correctly. I have:

One test that mock-runs a loop on the CPU
A second test that mock-runs a loop on the GPU
A third that mock-runs a distributed loop using 2 GPUs (via DistributedDataParallel)

When I run my test suite, all the tests pass except for that final one that tests whether the distributed loop works. The test fails with the first-spawned torch.multiprocessing process throwing the error:

E       RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

However, if I run ONLY the test function which mock-runs a distributed training run, then it passes.

The distributed test fails only when it runs alongside other tests: either the whole test suite, or even just one other test plus the distributed test.

Interestingly, running the distributed test twice in a row within the same Pytest invocation still works fine. Both instances of the distributed test pass.

My question here is: is there something about how PyTorch manages GPU resources that I am not taking into account in my testing code? For example, do I need to somehow ‘free’ GPU resources after each unit test that makes use of the GPUs? The unstable nature of my testing suite makes me think that I am probably not following best practices for this kind of situation.

Thank you for your help in advance!

Yanli_Zhao · August 9, 2022, 11:58am

Right, each unit test should be a separate ‘binary’ program, and its processes should be completely destroyed along with the GPU resources. Please refer to ‘pytorch/distributed_test.py at master · pytorch/pytorch · GitHub’.

yutanagano · September 1, 2022, 3:12pm

Dear Yanli,

Sorry for the late response.
Understood, I have restructured the tests so that each training loop is executed as its own process. It seems to have done the trick! Thank you.
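
For anyone landing here later, one way such a restructuring could look is sketched below. It is not necessarily the exact code used in this project. The helper name run_in_fresh_interpreter and the placeholder source string are hypothetical; in a real suite the string would import and call the actual training entry point. The idea is simply that each mock training run starts in a brand-new Python interpreter, so any CUDA state it creates disappears when that interpreter exits and never lives in the Pytest process itself.

```python
import subprocess
import sys


def run_in_fresh_interpreter(source, timeout=600.0):
    """Run Python source in a brand-new interpreter (hypothetical helper).

    The child gets its own process, so any GPU state it creates is
    released when it exits, independent of the pytest process.
    """
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep stdout/stderr for the failure message
        text=True,
        timeout=timeout,
    )


def test_distributed_loop_in_own_process():
    # Placeholder source (assumption): replace with code that imports and
    # runs the real mock training loop.
    source = "print('mock distributed run finished')"
    result = run_in_fresh_interpreter(source)
    # A non-zero exit code means the child raised or exited with an error.
    assert result.returncode == 0, result.stderr
```

The CPU, single-GPU, and distributed tests can each call the same helper with a different source string, so none of them initializes CUDA inside the process that runs the rest of the suite.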