Llama-2 CUDA OOM during inference but not training

Sorry about the late reply; it took me some time to track down the issue, but your reproduction was really helpful!

Turns out the fix was to manually delete the model output with `del output` after every evaluation step; that seems to plug the memory leak. I still don’t understand why this isn’t necessary during training, though. For anyone hitting the same problem, a rough sketch of my eval loop is below.
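This is just a minimal sketch of my setup, not a general recipe: `model`, `eval_loader`, and `device` are placeholders, and it assumes a Hugging Face-style model that returns an output object with a `.loss` attribute when labels are passed in.

```python
import torch

@torch.no_grad()  # no autograd graph should be built during evaluation
def evaluate(model, eval_loader, device):
    model.eval()
    total_loss = 0.0
    for batch in eval_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        output = model(**batch)
        # Copy the scalar loss off the GPU before dropping the output.
        total_loss += output.loss.item()
        # This was the fix: without it, the output (logits included) stayed
        # referenced across iterations and GPU memory kept growing.
        del output
        # Optional: hand cached blocks back to the allocator. Not strictly
        # required for the fix, but it made the memory numbers easier to read.
        torch.cuda.empty_cache()
    return total_loss / len(eval_loader)
```

Calling `torch.cuda.empty_cache()` on every step can slow things down a bit, so you may want to drop it once you've confirmed the leak is gone; the `del output` alone was what mattered in my case.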

Anyway, thanks again!