Backpropagation through the training procedure

It does look like what you want is very similar to truncated backpropagation through time, except that in your case the steps are not timesteps but optimization steps.
You can check this post that gives an example of how to do it manually: Implementing Truncated Backpropagation Through Time - #4 by albanD
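For illustration, here is a minimal sketch of that idea applied to optimization steps (not the exact code from the linked post; names like `outer_param`, `state`, and `k` are placeholders): do a few differentiable inner updates, backprop through the last `k` of them, then detach so the graph does not grow forever.

```python
import torch

# Placeholder setup: an outer parameter we want gradients for, and an
# inner "state" that is updated by differentiable optimization steps.
outer_param = torch.randn(3, requires_grad=True)
outer_opt = torch.optim.SGD([outer_param], lr=0.1)

state = torch.zeros(3, requires_grad=True)
k = 5  # truncation window: backprop through at most k optimization steps

for step in range(1, 101):
    # One differentiable inner update; `state` now depends on `outer_param`
    # through every update since the last truncation point.
    inner_loss = ((state - outer_param) ** 2).sum()
    grad = torch.autograd.grad(inner_loss, state, create_graph=True)[0]
    state = state - 0.01 * grad

    if step % k == 0:
        # Outer update: backprop through the last k inner steps only.
        outer_loss = (state ** 2).sum()  # illustrative outer objective
        outer_opt.zero_grad()
        outer_loss.backward()
        outer_opt.step()

        # Truncate: cut the graph here so the next window starts fresh.
        state = state.detach().requires_grad_()
```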

data_loader.tensors[0][idx_last_k1, :].detach_()

I am not sure what you’re trying to achieve here, but it most likely doesn’t do what you expect :smiley: The indexing `[idx_last_k1, :]` returns a new temporary tensor (a view or copy of the original, depending on the index), and you then detach that temporary tensor in place. That does not modify the original Tensor stored in `data_loader` at all!
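A small repro of why this is a no-op for the stored tensor (`base` and `idx` stand in for `data_loader.tensors[0]` and `idx_last_k1`):

```python
import torch

base = torch.randn(4, 3, requires_grad=True) * 2   # non-leaf, has a grad_fn
idx = torch.tensor([2, 3])

tmp = base[idx, :]    # indexing produces a new temporary tensor
tmp.detach_()         # detaches only that temporary result in place...

print(base.grad_fn)        # ...but `base` still has its grad_fn
print(base.requires_grad)  # True: the stored tensor is unchanged
```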

whether or not I have used retain_graph=True correctly

You should only ever use it if you actually need to backpropagate through the same graph multiple times. If you are adding it just to make an error go away, something else is wrong.
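A toy sketch of the only legitimate pattern: two backward calls that really do go through the same (shared) part of the graph.

```python
import torch

x = torch.randn(5, requires_grad=True)
h = x * 2            # shared intermediate node
y = (h ** 2).sum()
z = (h ** 3).sum()

y.backward(retain_graph=True)  # keep the shared graph alive for the next call
z.backward()                   # last backward can free it (no retain_graph)
```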