Training time keeps increasing

I am training an actor-critic algorithm with my own code, and I noticed that the training time keeps increasing every epoch.
I thought it might be a continuously growing computational graph, but I don't think I am doing anything wrong there: when I track the backward times of the actor_loss and critic_loss, they don't increase drastically, so that doesn't seem to be the issue.

This is my training function; the plot below shows the training time every 500 epochs (one epoch is one pass through the `train` function below).

Any hints?
Thanks a lot in advance!

def train(self, episode_num=None):
    for num_batch, obs in enumerate(self.data_loader):
        state, action, reward, next_state, done = obs[0]
        # NOTE: part of the next lines was lost when pasting; the
        # .to(self.device, copy=False) calls below are a best guess.
        state = state.to(self.device, copy=False)
        action = action.to(self.device, copy=False)
        reward = reward.to(self.device, copy=False)
        next_state = next_state.to(self.device, copy=False)
        done = done.to(self.device, copy=False)

        current_Z, target_Z, tau_k = self.(  # method name lost in the paste
            state, action, reward, next_state, done)
        self.critic_loss = self.criterion_critic(target_Z, current_Z, tau_k)

        if self.num_train_steps % self.policy_update_freq == 0:
            # Update policy network:
            actor_loss_ = self.compute_actor_loss(state)
            self.actor_loss = self.to_min * actor_loss_

            # call @parameters.setter in NNQFunction
            self.target_Q_function.parameters = \
                ...  # right-hand side lost in the paste
            # call @parameters.setter in NNPolicyFunction
            self.target_policy_function.parameters = \
                ...  # right-hand side lost in the paste
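One graph-related thing worth ruling out in code like the above: storing loss tensors as attributes (e.g. `self.critic_loss`) keeps their autograd graph alive until the attribute is overwritten. For logging or bookkeeping, a plain float via `.item()` avoids holding any graph reference. A minimal sketch (standalone tensors, not the author's actual networks):

```python
import torch

# A toy loss: keeping the raw tensor around retains its autograd graph;
# .item() extracts a detached Python float for logging instead.
pred = torch.randn(8, requires_grad=True)
target = torch.zeros(8)
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward()

logged = loss.item()  # plain float, no graph reference
```

This is usually a memory issue rather than a per-step slowdown, but it is cheap to check.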

Could it be that you keep increasing the size of self.memory there?
Or that you save on self more and more stuff?

Thanks so much for your reply,
self.memory does keep increasing indeed. self.memory is an experience-replay buffer with a maximum size of 1e6, so I keep adding environment "observations" to it until it is full, and then keep updating it on a FIFO (queue) basis.
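For reference, a FIFO replay buffer like that can be sketched with a bounded deque (a hypothetical minimal class, not my exact implementation): once the deque is full, appending automatically evicts the oldest transition.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO experience-replay buffer (illustrative sketch)."""

    def __init__(self, capacity=int(1e6)):
        # maxlen makes append() drop the oldest item once full
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement
        return random.sample(self.memory, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(i, 0, 0.0, i + 1, False)
print(len(buf.memory))  # 3: the two oldest transitions were evicted
```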

I finally found out that the step whose time grows at every training epoch is sampling the batch with the DataLoader. I am now fetching the batch data directly from the buffer without the DataLoader, and things run far faster.
The DataLoader is also slow when the Dataset is loaded directly from an HDF5 file (so its size does not grow over time).
Any insights into what I am doing wrong?
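For context, the direct-from-buffer sampling I switched to looks roughly like this (array names and sizes are illustrative): draw random indices and index the preallocated buffer arrays directly, with no DataLoader involved.

```python
import numpy as np
import torch

# Illustrative preallocated buffer storage
capacity, state_dim = 1000, 4
states = np.zeros((capacity, state_dim), dtype=np.float32)
rewards = np.zeros(capacity, dtype=np.float32)
size = 500  # transitions stored so far

# Sample a batch by indexing the arrays directly
idx = np.random.randint(0, size, size=64)
state_batch = torch.from_numpy(states[idx])    # advanced indexing copies
reward_batch = torch.from_numpy(rewards[idx])
print(state_batch.shape)  # torch.Size([64, 4])
```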

Increasing num_workers doesn't solve the issue; since no part of my code is parallelized, I guess increasing num_workers won't make any difference anyway.

The data loader is built with loading from disk in mind.
In your case, you definitely don't want multiple workers; use 0. Otherwise it will use multiprocessing to load the data, which will increase memory usage by a lot.

Okay, so if my data fits well in memory (and I hence don't need to load from disk), it is better in terms of speed to skip the DataLoader when fetching the batch data.
Am I correct?

Using the dataloader is fine for sampling but you should make sure to use 0 workers.
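If you do keep the DataLoader, the key point is just to construct it single-process, as in this sketch (dataset shapes are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# In-memory dataset; num_workers=0 keeps loading in the main process,
# avoiding worker subprocesses and their extra memory/startup cost.
dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

x, y = next(iter(loader))
print(x.shape)  # torch.Size([32, 4])
```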

The plot above showing the increase in training time was already produced with the DataLoader at num_workers = 0.
As soon as I removed the DataLoader from sampling, the speed increased a lot.