I have an int8-quantized Llama 3.1 model that is continuously fine-tuned on incoming data (10 epochs over roughly 1000-1500 instances per cycle). One fine-tuning cycle takes around 5-7 hours; after it finishes, the system waits until about 1000 new instances have been collected before starting the next cycle. At the same time, in a different thread, the model listens for prediction calls.

After 24-30 hours, any forward pass (whether from training or prediction) starts to take about 10 times longer, and GPU and CPU load drop significantly. There are no errors while this happens; the only warnings are from waitress, saying that requests are queuing when predictions are called repeatedly.

The setup is two RTX 4060 Ti cards with 16 GB of memory each. The model is split manually, with 10 GB on the first card and 14 GB on the second. PyTorch 2.9.0+cu130.
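Simplified, the structure looks roughly like this (placeholder model and lock instead of the actual fine-tuning and serving code):

```python
import threading
import time
from queue import Queue

import torch
from flask import Flask, jsonify
from waitress import serve

app = Flask(__name__)
model_lock = threading.Lock()            # model is shared between the two threads
train_buffer: "Queue[torch.Tensor]" = Queue()

# Placeholder for the real int8 Llama 3.1, which is split manually with
# ~10 GB on cuda:0 and ~14 GB on cuda:1.
model = torch.nn.Linear(16, 16)


def training_loop() -> None:
    while True:
        batch = [train_buffer.get() for _ in range(1000)]   # wait for ~1000 instances
        for _ in range(10):                                 # 10 epochs, ~5-7 h on the real model
            with model_lock:
                pass                                        # fine-tune step on `batch` goes here
        time.sleep(1)


@app.route("/predict", methods=["POST"])
def predict():
    x = torch.randn(1, 16)               # the real endpoint builds inputs from the request body
    with model_lock, torch.no_grad():
        y = model(x)
    return jsonify(y.tolist())


threading.Thread(target=training_loop, daemon=True).start()
serve(app, host="0.0.0.0", port=8080)    # waitress with its default thread pool
```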
If any additional information would help, please let me know.
It is very difficult to tell from the description alone, and admittedly I am no expert in Llama, but the symptoms suggest you might have a memory leak: something is allocating memory and not giving it back, most likely in your own code.
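One quick way to check is to log the allocator state for both cards at fixed points (for example after every fine-tune cycle and every few hundred predictions) and see whether the allocated number keeps climbing over the day. A minimal helper, assuming nothing about your code:

```python
import torch


def log_gpu_memory(tag: str = "") -> None:
    # `allocated` = memory held by live tensors; `reserved` = what the caching
    # allocator has taken from the driver. A steady climb in `allocated`
    # across cycles points to tensors that are never released.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**20
        reserved = torch.cuda.memory_reserved(i) / 2**20
        print(f"[{tag}] cuda:{i} allocated={alloc:.0f} MiB  reserved={reserved:.0f} MiB")
```

Comparing those numbers across a full 24-30 hour run should make a leak obvious if there is one.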
The problem is that neither RAM nor VRAM nor shared memory shows any increase in load; that was the first thing I checked.
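What I can still add is a check for tensors accumulating on the Python side, in case something is being retained even though the memory totals look flat. Something like this, unless there is a better way:

```python
import gc

import torch


def count_cuda_tensors() -> dict:
    # Count live CUDA tensors per device. A count that grows from cycle to
    # cycle would mean references are being kept alive even though the
    # reported memory usage stays roughly constant.
    counts: dict = {}
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                key = str(obj.device)
                counts[key] = counts.get(key, 0) + 1
        except Exception:
            continue
    return counts
```

I will log this after each fine-tune cycle and report back if the counts drift upward.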