Hello! I have the following problem: my torch model slows down towards the end of an epoch and performs well again at the start of a new epoch.

So, I use tqdm to measure iterations/second and see the following picture: at the start of training the performance is about 20 it/s, but it slows down as the iteration count increases and finishes at about 4 it/s. In a new epoch it starts again at 20 and returns to 4 by the end. I have removed everything extra from the training loop, so it now looks like the one in the PyTorch tutorial, with no additional steps.

I use a DataLoader with num_workers>1 and pin_memory=True. I have about 250 GB of data, which is passed in `__init__`, so my Dataset just returns an index: there is no loading from disk and no transformations, and all data is stored in `__init__` in the form it is passed to `forward`.

How can I find this bottleneck in my model?
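One general way to localize such a slowdown is to time the data-loading and compute phases of each iteration separately and watch which one grows over the epoch. A minimal self-contained sketch (synthetic data and a toy model, just for illustration):

```
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup; replace with your own loader/model.
device = "cuda" if torch.cuda.is_available() else "cpu"
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)
model = nn.Linear(16, 1).to(device)
loss_func = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

end = time.perf_counter()
for batch_idx, (inputs, target) in enumerate(loader, 1):
    data_time = time.perf_counter() - end          # time spent waiting for the DataLoader
    inputs, target = inputs.to(device), target.to(device)
    loss = loss_func(model(inputs), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()                   # flush queued GPU work before timing
    print(f"iter {batch_idx}: data {data_time:.4f}s, total {time.perf_counter() - end:.4f}s")
    end = time.perf_counter()
```

If `data_time` grows, the DataLoader is the bottleneck; if the total grows while `data_time` stays flat, the compute side (or an ever-growing graph) is.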

Could you check if you are appending some tensors to a list and could thus be storing the computation graphs throughout the epoch? This should be visible as increasing memory usage during the first epoch. Also, are you using `backward(retain_graph=True)` or any other “special” setup?
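For illustration, a minimal sketch of this pattern (the names here are made up, not from the actual code): appending the raw `loss` keeps every iteration's graph alive, while `loss.item()` or `loss.detach()` does not.

```
import torch
from torch import nn

model = nn.Linear(8, 1)
losses_bad, losses_ok = [], []
for _ in range(3):
    loss = model(torch.randn(4, 8)).pow(2).mean()
    losses_bad.append(loss)         # tensor with grad_fn -> whole graph stays alive
    losses_ok.append(loss.item())   # plain Python float -> graph can be freed

print(losses_bad[0].grad_fn)  # e.g. <MeanBackward0 ...> -> still attached
print(losses_ok[0])           # just a number
```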

Thanks for the answer!

I haven’t seen anything strange in the training loop:

```
with tqdm(total=len(train_generator)) as prbar:
    for batch_idx, batch in enumerate(train_generator, 1):
        output = self._model(batch[:3]).squeeze(dim=1)
        target = batch[3].to(self._device)
        loss = loss_func(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += float(loss.item()) * target.size(0)
        y_true.extend(list(target.data.cpu().numpy()))
        y_pred.extend(list(output.data.cpu().numpy()))
        l1_part += float(l1_weight * weights_norm_1 / float(loss) * 100)
        l2_part += float(l2_weight * weights_norm_2 / float(loss) * 100)
```

Could anything in the implemented model class be slowing down the performance?

Where are these values or tensors coming from?

```
l1_part += float(l1_weight * weights_norm_1 / float(loss) * 100)
l2_part += float(l2_weight * weights_norm_2 / float(loss) * 100)
```

Could you check if some of them are attached to a computation graph and would thus have a valid `.grad_fn`? If so, `detach()` them before accumulating.
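As a sketch, assuming for illustration that `weights_norm_1` is an L1 norm computed over the model parameters (so a tensor, not a float), the check and the fix could look like this:

```
import torch
from torch import nn

# Hypothetical check: is the accumulated value still attached to a graph?
model = nn.Linear(8, 1)
l1_weight = 1e-4
weights_norm_1 = sum(torch.norm(p, 1) for p in model.parameters())  # a tensor

reg = l1_weight * weights_norm_1
print(reg.grad_fn)                       # non-None (e.g. <MulBackward0 ...>) -> attached

l1_part = 0.0
l1_part += (reg * 100).detach().item()   # detach before accumulating
```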

Yes, if your model increases the workload in each iteration, e.g. by using longer sequences or by backpropagating through all previous iterations, you would see a slowdown.
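A classic example of the second case (a hypothetical sketch, not taken from the code in this thread) is carrying a recurrent hidden state across iterations without detaching it, so each step's graph chains to all previous ones:

```
import torch
from torch import nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
hidden = torch.zeros(1, 4, 16)

for step in range(100):
    x = torch.randn(4, 10, 8)
    out, hidden = rnn(x, hidden)
    loss = head(out[:, -1]).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    hidden = hidden.detach()   # without this line each backward() would have to
                               # traverse the whole history (and would need
                               # retain_graph=True), getting slower every step
```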

This part is just additional regularization + sample weights:

```
if use_weights:
    weights = batch[4].to(self._device)
    loss = (loss * weights).mean()

if l1_weight:
    weights_norm_1 = 0
    for i, params in enumerate(self._model.parameters(), 0):
        if i != 0:  # not the embedding layer
            weights_norm_1 += torch.norm(params, 1)
    loss = loss + l1_weight * weights_norm_1
```

How can I use `.grad_fn` to check whether they have child nodes in the graph?


You can print it directly as it’s a tensor attribute:

```
print(loss.grad_fn)
print(l1_weight.grad_fn)
...
```
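If the print shows `None`, the tensor is detached from any graph; a non-`None` value such as `<MulBackward0 ...>` means the graph is still attached and will be kept alive as long as you hold a reference to the tensor. (Note that a plain Python float has no `.grad_fn` attribute, so this check only applies to tensors.)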

Thanks, will try to do this.

We have two similar models (torch/keras), and the Keras one now runs about twice as fast (Keras ~23 min per epoch vs. torch ~40–50 min), which looks a bit strange.