Torch + pytest leads to memory fragmentation: How to do proper integration testing of a lot of torch models?

Hi there,

We serve a bunch of torch models in production. To minimize issues, we have a whole suite of integration tests: each model is loaded, some inference is run with it, and the result is checked. The tests are spread across many different files.
We start pytest by calling

pytest -s tests

where tests is the name of the directory in which all the tests are located.
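
To give an idea of the setup, one of the test files looks roughly like this; the import, the model names and the loader below are placeholders, not our real code:

import pytest
import torch

from our_package.models import load_model  # placeholder for our real loading code


@pytest.mark.parametrize("model_name", ["model_a", "model_b"])  # made-up names
def test_model_inference(model_name):
    # Load the model, run inference on a dummy input, check the result.
    model = load_model(model_name).to("cuda").eval()
    batch = torch.randn(1, 3, 224, 224, device="cuda")  # dummy input
    with torch.no_grad():
        output = model(batch)
    # The real tests check the output more thoroughly; this is just a sketch.
    assert output.shape[0] == 1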

Recently, we ran into the following issue:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 10.96 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 3.40 GiB is allocated by PyTorch, and 75.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

From my understanding, there is (at least according to this message) enough free memory to satisfy this allocation.
Also, all the tests run fine if I run the test-files one-by-one with pytest.
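
In case it matters: as far as I understand, the expandable_segments setting suggested in the error message would have to be applied before anything touches the GPU, e.g. at the very top of conftest.py. Untested sketch:

import os

# Untested: the allocator setting suggested in the error message. It has to be
# in the environment before the CUDA caching allocator is initialized, so the
# top of conftest.py (or the shell, before calling pytest) should be early enough.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")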

My best guess about what is happening:

  • Pytest runs all tests in a single process, so the torch GPU memory state somehow carries over between the test files.
  • After some time, torch suffers from memory fragmentation. I have read about this, but I don’t really know how to confirm that it is actually happening (see the sketch after this list). It would make sense, though, since within one pytest run we load many different models and inputs onto the GPU and release them again, so a lot of different allocations and deallocations happen.
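
The closest I have come to checking for fragmentation is logging the allocator statistics between tests: a large, growing gap between reserved and allocated memory, or an increasing number of allocation retries, would at least hint at it. Rough sketch (the helper name is made up):

import torch


def log_cuda_memory(tag: str) -> None:
    # Compare how much memory torch has reserved from the driver with how much
    # is actually allocated by live tensors; a large, growing gap plus
    # allocation retries points towards fragmentation.
    stats = torch.cuda.memory_stats()
    allocated = stats["allocated_bytes.all.current"]
    reserved = stats["reserved_bytes.all.current"]
    print(
        f"[{tag}] allocated={allocated / 2**20:.1f} MiB, "
        f"reserved={reserved / 2**20:.1f} MiB, "
        f"alloc_retries={stats['num_alloc_retries']}"
    )
    # torch.cuda.memory_summary() gives a much more detailed breakdown.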

What I have so far tried to get rid of this problem:

  • torch.cuda.empty_cache() → helped for a while, but the issue came back once another test was added. That makes sense, since this only releases cached blocks that torch should in principle free by itself when needed. (Roughly wired up as in the fixture sketch after this list.)
  • Use numba to reset the GPU via from numba import cuda; cuda.get_current_device().reset(); cuda.close() (done before torch is imported in each test file). This did not help.
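
For reference, this is roughly how the empty_cache() attempt is wired in, as an autouse fixture in conftest.py (the gc.collect() is only there to make sure Python references are gone first):

import gc

import pytest
import torch


@pytest.fixture(autouse=True)
def free_cuda_memory():
    # Runs after every test: drop Python references first, then ask torch to
    # hand its cached blocks back to the driver.
    yield
    gc.collect()
    torch.cuda.empty_cache()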

Our current solution is to split the tests into multiple pytest calls (roughly as in the sketch below), which doesn’t really solve the problem but works around it. I was wondering whether there is a better way to do this. I saw that torch has internal test classes (e.g. from torch.testing._internal.common_utils import TestCase).
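
The workaround is more or less the following idea (simplified; the runner script here is just an illustration, not our actual tooling): start one pytest process per test file, so the CUDA state of one file cannot affect the next one.

import subprocess
import sys
from pathlib import Path

# One pytest process per test file; the GPU memory is released when each
# process exits.
exit_code = 0
for test_file in sorted(Path("tests").glob("test_*.py")):
    result = subprocess.run([sys.executable, "-m", "pytest", "-s", str(test_file)])
    exit_code = exit_code or result.returncode

sys.exit(exit_code)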

So how do the torch-devs solve this issue? I would imagine that this came up at some point.