Training time increasing with steps

Hello! I have the following problem: my torch model slows down towards the end of an epoch and then runs fast again at the start of the next one.
I use tqdm to measure iterations per second and see the following picture: at the start of training the throughput is about 20 it/s, but it drops as the iteration count grows and ends up at about 4 it/s. In the next epoch it starts from 20 again and falls back to 4 by the end. I have removed everything extra from the training loop, so it now looks like the one in the PyTorch tutorial, with no additional steps.
I use a DataLoader with num_workers>1 and pin_memory. I have about 250 GB of data, which is passed in at init, so my Dataset just returns an index: no data is loaded from disk and no transformations are applied; everything is stored in __init__ in the form it is passed to forward.
How can I find the bottleneck in my model?

Could you check if you are appending some tensors to a list and could thus be storing the computation graphs throughout the epoch? This should be visible as increased memory usage during the first epoch. Also, are you using backward(retain_graph=True) or any other “special” setup?
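
As a rough illustration (a minimal sketch with made-up tensors, not the code in question): appending a loss tensor that still carries its grad_fn keeps the whole computation graph alive, while storing a detached tensor or a plain Python float via .item() does not:

    import torch

    preds = torch.randn(8, 1, requires_grad=True)
    losses_bad, losses_ok = [], []
    for _ in range(3):
        loss = (preds ** 2).mean()
        losses_bad.append(loss)          # keeps the computation graph alive
        losses_ok.append(loss.detach())  # or loss.item(): stores only the value
    print(losses_bad[0].grad_fn)  # <MeanBackward0 ...> -> graph is retained
    print(losses_ok[0].grad_fn)   # None -> nothing to backpropagate through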

Thanks for the answer!
I haven’t seen anything strange in the training loop:

    with tqdm(total=len(train_generator)) as prbar:
        for batch_idx, batch in enumerate(train_generator, 1):
            output = self._model(batch[:3]).squeeze(dim=1)
            target = batch[3].to(self._device)
            loss = loss_func(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            total_loss += float(loss.item()) * target.size(0)
            y_true.extend(list(target.data.cpu().numpy()))
            y_pred.extend(list(output.data.cpu().numpy()))
            l1_part += float(l1_weight * weights_norm_1 / float(loss) * 100)
            l2_part += float(l2_weight * weights_norm_2 / float(loss) * 100)

Could anything in the implemented model class be slowing down the performance?

Where are these values or tensors coming from?

            l1_part += float(l1_weight * weights_norm_1 / float(loss) * 100)
            l2_part += float(l2_weight * weights_norm_2 / float(loss) * 100)

Could you check if some of them are attached to a computation graph and would thus have a valid .grad_fn?
If so, detach() them before accumulating.
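
For illustration, a minimal self-contained sketch of that suggestion (the tiny Linear model and the hyperparameter value are made up; only the detach-before-accumulating pattern matters):

    import torch

    model = torch.nn.Linear(4, 1)  # stand-in for the real model
    l1_weight, l1_part = 1e-3, 0.0

    out = model(torch.randn(8, 4))
    loss = out.pow(2).mean()
    weights_norm_1 = sum(p.norm(1) for p in model.parameters())
    loss = loss + l1_weight * weights_norm_1
    loss.backward()

    print(weights_norm_1.grad_fn)  # a Function object: attached to the graph
    # detach before accumulating the logging statistic, so no graph is kept around
    l1_part += float(l1_weight * weights_norm_1.detach() / loss.detach() * 100)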

Yes, if your model increases the workload in each iteration, e.g. by using longer sequences or by backpropagating through all previous iterations, you would see a slowdown.
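
A toy example of the second case (made-up sizes, unrelated to the model in question): if a tensor carried across iterations is never detached, every backward() walks through all previous steps and the per-iteration time keeps growing:

    import time

    import torch

    w = torch.randn(1000, 1000, requires_grad=True)
    state = torch.zeros(1000)

    for step in range(50):
        t0 = time.time()
        state = torch.tanh(w @ state + 1)  # graph keeps growing across steps
        loss = state.sum()
        # retain_graph=True is needed here precisely because earlier steps
        # are part of the graph; backward gets slower with every iteration
        loss.backward(retain_graph=True)
        if step % 10 == 0:
            print(step, f"{time.time() - t0:.4f}s")
    # detaching the carried tensor (state = state.detach()) keeps each step's
    # graph at a constant size and the per-step time roughly flat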

This part is just additional regularization + sample weights:
    if use_weights:
        weights = batch[4].to(self._device)
        loss = (loss * weights).mean()
    if l1_weight:
        weights_norm_1 = 0
        for i, params in enumerate(self._model.parameters(), 0):
            if i != 0:  # not embedding layer
                weights_norm_1 += torch.norm(params, 1)
        loss = loss + l1_weight * weights_norm_1
How can I use .grad_fn to check whether they have children?

You can print it directly as it’s a tensor attribute:

print(loss.grad_fn)
print(l1_weight.grad_fn)
...
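
If the output is a Function object such as <AddBackward0 object at 0x...>, the tensor is still attached to a computation graph; a detached tensor (or a leaf created without any operation) prints None, and a plain Python float has no grad_fn attribute at all.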

Thanks, will try to do this.
We have two similar models (torch/keras), and right now the Keras one is about twice as fast (Keras ~23 min per epoch vs. torch ~40-50 min), which looks a bit strange.