Could you also explain how the caching memory allocator is delaying the response?
Being able to reuse already allocated memory from the cache speeds up your script, since the synchronizing cudaMalloc calls are avoided. torch.cuda.empty_cache() will also synchronize your code and will most likely cause a performance drop.
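A minimal sketch of this behavior (assumes a CUDA device is available; the calls are the standard torch.cuda memory stats):

```python
import torch

if torch.cuda.is_available():
    # Allocate and then drop a tensor on the GPU.
    x = torch.randn(1024, 1024, device="cuda")
    del x
    # The block is no longer "allocated" by any tensor, but the caching
    # allocator keeps it "reserved" so the next allocation can reuse it
    # without a synchronizing cudaMalloc.
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    # Returns cached blocks to the driver; note this synchronizes.
    torch.cuda.empty_cache()
```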
If you want to globally disable the caching mechanism, use export PYTORCH_NO_CUDA_MEMORY_CACHING=1. This is meant as a debug flag, as it wouldn't make sense to use it in "production" code.
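If you prefer setting it from inside the script instead of the shell, it has to happen before CUDA is initialized, e.g. at the very top:

```python
import os

# Debug only: disable the caching allocator so every allocation goes
# through cudaMalloc/cudaFree directly. Must be set before torch
# initializes CUDA, hence before any CUDA work is done.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # subsequent CUDA allocations now bypass the cache (slow)
```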
I tested it via config.json, the from_pretrained argument, and the model.generate argument.
But every response leaks memory on my GPU.
The model's memory usage is bigger after each response.
And the response delay grows longer each time as well.
I think it is GPU caching, because the additional memory can be freed by torch.cuda.empty_cache().
Can I use the PYTORCH_NO_CUDA_MEMORY_CACHING variable?
The environment is the same as when it worked normally.
Only the weights file is different from before.
Could DeepSpeed training cause these issues in the finished file?
I don't know if DeepSpeed or any other part of your code is responsible for the increased memory usage.
Usually tensors that are still attached to the entire computation graph are stored in e.g. a list, which increases the memory usage and might look like a leak (but is in fact expected behavior in this case).
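A sketch of this pattern in a hypothetical training loop (the model and loop here are made up for illustration): appending the undetached loss keeps the whole graph referenced, while appending a detached scalar does not.

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

losses_bad, losses_good = [], []
for _ in range(3):
    out = model(torch.randn(8, 4))
    loss = out.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD: the stored tensor still references its computation graph,
    # so memory usage grows each iteration and looks like a leak.
    losses_bad.append(loss)
    # GOOD: store only the Python float; the graph can be freed.
    losses_good.append(loss.detach().item())

print(losses_bad[0].grad_fn is not None)  # True: graph is still referenced
```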
You can search this forum for similar issues and check how they were solved.
With that being said, you can certainly disable the cache and see if it helps. I would expect that you would see the same increasing memory usage and would additionally slow down your code.