I have a pre-trained model, and I added an actor-critic method to it and trained only the RL-related parameters (the pre-trained model's parameters are frozen). However, I noticed that training gradually slows down with each batch, and GPU memory usage also increases. For example, the first batch takes only 10s, while the 10,000th batch takes 40s to train.
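For reference, here is a minimal sketch of how per-batch timings like the ones above could be collected and compared. The helper names `timed` and `slowdown_ratios` are hypothetical and framework-agnostic; they are not taken from my actual training code.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slowdown_ratios(durations):
    """Divide every batch duration by the first batch's duration.

    A ratio that keeps growing indicates that the per-batch cost is
    increasing over the course of training.
    """
    first = durations[0]
    return [d / first for d in durations]


# Example with the timings quoted above (seconds for the first and
# 10,000th batch); in a real run these would come from timed(...).
ratios = slowdown_ratios([10.0, 40.0])
print(ratios)
```

If the framework launches GPU work asynchronously, the device has to be synchronized before the clock is read, otherwise the measured time may not reflect the real cost of the batch.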
I am sure that all of the pre-trained model's parameters have been set to "autograd=false". Only four parameters are being updated in the current program. I also noticed that changing the gradient clipping threshold mitigates the problem, but training still eventually becomes very slow. For example, if I do not use any gradient clipping, the 1st batch takes 10s and the 100th batch takes 400s to train. If I set gradient clipping to 5, the 100th batch takes only 12s (compared to 10s for the 1st batch).
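To make "gradient clipping to 5" concrete, here is a rough sketch of clipping by global norm, assuming that norm-based clipping (rather than per-value clipping) is what is meant. The function `clip_by_global_norm` is a hypothetical plain-Python illustration, not the library call used in my program.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their combined L2 norm is at most max_norm.

    grads is a list of gradients, one list of floats per parameter.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        # Already within the threshold: return an unchanged copy.
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


# Two parameters whose gradients have a combined norm of 10.
clipped, norm_before = clip_by_global_norm([[6.0], [8.0]], max_norm=5.0)
print(norm_before, clipped)
```

Logging the norm before clipping at every batch would show whether the gradients themselves grow over time, which may be related to the slowdown seen without clipping.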
FYI, I am using SGD with a learning rate of 0.0001.
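The update I intend is roughly the following: plain SGD applied only to the trainable parameters, with the frozen ones skipped. This is a simplified sketch; the `Param` class and `sgd_step` function are hypothetical stand-ins and not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    grad: float = 0.0
    trainable: bool = True  # False for the frozen pre-trained parameters


def sgd_step(params, lr=0.0001):
    """Apply one plain SGD update to the trainable parameters only."""
    for p in params:
        if p.trainable:
            p.value -= lr * p.grad


# One frozen pre-trained parameter and one trainable RL parameter.
params = [
    Param(value=1.0, grad=2.0, trainable=False),
    Param(value=1.0, grad=2.0, trainable=True),
]
sgd_step(params, lr=0.0001)
print([p.value for p in params])
```

After the step, only the trainable parameter should have moved, which is the behaviour I expect from my setup.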
Does anyone know what is going wrong with my code? I have been trying to fix this problem for two weeks.