Running multiple networks in succession crashes unit testing

pytorcher · June 3, 2021, 4:09pm

I have a system that uses a few different open source networks for various applications. I’m just starting to write unit tests with nose. My first plan was to test everything with one large test, but that is causing crashes. If I comment things out, all 3 of my networks pass their individual tests. But if they are all called in succession in the same script, the first one passes and then the second one will crash on loss.backward() with either a “RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR” or a Segmentation Fault, depending on the run. Any thoughts on what might be causing this? Should I just write 3 separate tests?

eqy · June 3, 2021, 5:40pm

Can you keep an eye on the memory usage between the tests (e.g., via nvidia-smi)? It might be that something is kept in memory from each of the tests, causing an OOM when all three are run in succession.

pytorcher · June 3, 2021, 6:31pm

That could be it… Beforehand it is
1502MiB / 4043MiB
and if I stop in the middle it's
2209MiB / 4043MiB.
Is there a way to clear that up? I do already have a call to torch.cuda.empty_cache(). I also remember reading that when PyTorch frees GPU memory, nvidia-smi will still show it as used even though future PyTorch code can reuse it.
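
A minimal sketch of how one might tell cached memory apart from memory still held by live tensors, assuming PyTorch 1.4 or later. torch.cuda.memory_allocated() counts bytes occupied by live tensors, while torch.cuda.memory_reserved() counts what the caching allocator is holding, which is closer to the nvidia-smi figure (nvidia-smi also includes the CUDA context). The helper names cuda_memory_report and release_cached_memory are made up for illustration, and the import guard is only there so the snippet loads on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to load without PyTorch installed
    torch = None


def cuda_memory_report():
    """Return (allocated_mib, reserved_mib) for the current device, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated() / mib  # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / mib    # memory held by the caching allocator
    return allocated, reserved


def release_cached_memory():
    """Collect unreachable Python objects, then hand unused cached blocks back to the driver."""
    gc.collect()  # frees tensors that are only kept alive by reference cycles
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # releases cached blocks that no tensor occupies
    return cuda_memory_report()


if __name__ == "__main__":
    print("before:", cuda_memory_report())
    print("after: ", release_cached_memory())
```

If the allocated figure stays high after release_cached_memory(), something still references those tensors, and empty_cache() cannot free them.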

eqy · June 3, 2021, 6:43pm

Can you check that nothing from the tests inadvertently keeps tensors/variables around when they aren’t needed (e.g., returned values or global variables)? A litmus test for this is to use trivially small inputs for the tests so that it is expected that everything should fit in memory even if it is all kept.
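
One way to look for leftover tensors, sketched under the assumption that they are ordinary Python objects the garbage collector can see: walk gc.get_objects() and list whatever is still on the GPU. The function name live_cuda_tensors is hypothetical, and the import guard is again only there so the snippet loads without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to load without PyTorch installed
    torch = None


def live_cuda_tensors():
    """Return (shape, dtype) for every CUDA tensor still reachable from Python."""
    found = []
    if torch is None:
        return found
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                found.append((tuple(obj.shape), obj.dtype))
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            continue
    return found


if __name__ == "__main__":
    # Call this between tests; anything listed survived the previous test.
    for shape, dtype in live_cuda_tensors():
        print(shape, dtype)
```

Calling this between tests shows what survived the previous one. An empty list with memory still in use would point at the allocator cache or the CUDA context rather than at leaked tensors.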

pytorcher · June 3, 2021, 7:02pm

Hmm, global variables definitely could be it… I know there are a couple of global things defined. If any global ever holds a reference to the network itself, I guess that would mean the whole network stays in memory. Maybe just having multiple tests is a better idea. Thanks for the help!