Could you also explain how the caching memory allocator is delaying the response?
Being able to reuse already allocated memory from the cache speeds up your script, since the synchronizing cudaMalloc calls are avoided. torch.cuda.empty_cache() will also synchronize your code and will most likely cause a performance drop.
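A minimal sketch of this behavior (assumes a CUDA device is available; the calls are the standard torch.cuda memory stats):

```python
import torch

if torch.cuda.is_available():
    # Allocate and then drop a tensor on the GPU.
    x = torch.randn(1024, 1024, device="cuda")
    del x
    # The block is no longer "allocated" by any tensor, but the caching
    # allocator keeps it "reserved" so the next allocation can reuse it
    # without a synchronizing cudaMalloc.
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    # Returns cached blocks to the driver; note this synchronizes.
    torch.cuda.empty_cache()
```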
If you want to globally disable the caching mechanism, use export PYTORCH_NO_CUDA_MEMORY_CACHING=1. This is meant as a debug flag, as it wouldn't make sense to use it in "production" code.
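If you prefer setting it from inside the script instead of the shell, it has to happen before CUDA is initialized, e.g. at the very top:

```python
import os

# Debug only: disable the caching allocator so every allocation goes
# through cudaMalloc/cudaFree directly. Must be set before torch
# initializes CUDA, hence before any CUDA work is done.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # subsequent CUDA allocations now bypass the cache (slow)
```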
I tested it via config.json, the from_pretrained argument, and the model.generate argument.
But every response leaks memory on my GPU.
The model's memory usage is bigger after each response.
And the response delay grows longer each time as well.
I think it is GPU caching, because the additional memory can be freed by torch.cuda.empty_cache().
Can I use the PYTORCH_NO_CUDA_MEMORY_CACHING variable?
The environment is the same as when it worked normally.
Only the weights file is different from before.
Could DeepSpeed training cause these issues in the finished file?
I don't know if DeepSpeed or any other part of your code is responsible for the increased memory usage.
Usually tensors that are still attached to the entire computation graph are stored in e.g. a list, which increases the memory usage and might look like a leak (but is in fact expected behavior in this case).
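A sketch of this pattern in a hypothetical training loop (the model and loop here are made up for illustration): appending the undetached loss keeps the whole graph referenced, while appending a detached scalar does not.

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

losses_bad, losses_good = [], []
for _ in range(3):
    out = model(torch.randn(8, 4))
    loss = out.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD: the stored tensor still references its computation graph,
    # so memory usage grows each iteration and looks like a leak.
    losses_bad.append(loss)
    # GOOD: store only the Python float; the graph can be freed.
    losses_good.append(loss.detach().item())

print(losses_bad[0].grad_fn is not None)  # True: graph is still referenced
```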
You can search this forum for similar issues and check how they were solved.
With that being said, you can certainly disable the cache and see if it helps. I would expect that you would see the same increasing memory usage and would additionally slow down your code.