Performance slowing down after a few batches

I am trying to implement a model in PyTorch. The training procedure is quite complex and takes a while, but what I have noticed is that the model is very fast on the first few batches and then suddenly becomes roughly 500 times slower. I guess it is due to some memory leak, as if Python were not actually freeing the memory of large tensors that have been released.
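As a quick sanity check of the leak hypothesis, something like the following sketch can be printed between iterations to see whether the allocated GPU memory actually keeps growing:

import torch

# Sketch: report what the CUDA caching allocator is currently holding.
# A real leak would show memory_allocated() growing from iteration to iteration.
print(f'allocated: {torch.cuda.memory_allocated() / 1e9:.3f} GB')
print(f'reserved:  {torch.cuda.memory_reserved() / 1e9:.3f} GB')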

At first I thought the problem was related to storing gradients, but the same issue appears even with torch.no_grad().

Here is an example that replicates the problem (note that I am not trying to train this specific network; the problem just looks the same). To keep things simple, I am not using gradients and I am iterating over the same batch.

import torch
import torch.nn as nn
from torchvision.datasets import MNIST
import torchvision.transforms as T

dataset = MNIST(root='./MNIST', train=True, download=True,
                transform=T.Compose([T.ToTensor(), T.Lambda(lambda x: torch.flatten(x))]))
data_loader = torch.utils.data.DataLoader(dataset, batch_size=500)

X, _ = next(iter(data_loader))
X = X.to('cuda')

in_features = 28*28
out_features = 10
width = 15000

# defining a huge network: in_features -> 10 hidden layers of size `width` -> out_features
layers = [nn.Linear(in_features=in_features, out_features=width, bias=False), nn.ReLU()]
for _ in range(9):
  layers += [nn.Linear(in_features=width, out_features=width, bias=False), nn.ReLU()]
layers.append(nn.Linear(in_features=width, out_features=out_features, bias=False))
NN = nn.Sequential(*layers).to('cuda')

import time

iterations = 100

with torch.no_grad():
  for idx in range(iterations):
    print(f'Iteration {idx+1}')
    start = time.time()
    Y = NN(X)
    print(f'Time: {time.time() - start}')

The output shows that everything is very fast until around the 47th iteration, where it suddenly slows down.

Iteration 44
Time: 0.00035953521728515625
Iteration 45
Time: 0.00035309791564941406
Iteration 46
Time: 0.00035309791564941406
Iteration 47
Time: 0.048192501068115234
Iteration 48
Time: 0.1714644432067871
Iteration 49
Time: 0.16771984100341797
Iteration 50
Time: 0.1681973934173584
Iteration 51
Time: 0.16853046417236328
Iteration 52
Time: 0.16821908950805664

Why is there such a slowdown? Is it possible to avoid it somehow?

Unfortunately, timing GPU operations like this is unreliable because CUDA kernel launches are asynchronous: NN(X) returns as soon as the kernels have been queued, not when they have finished. The early iterations therefore mostly measure how long it takes to enqueue the work; once the queue of pending GPU work fills up, the enqueue calls start to block, which is why the times suddenly jump after a few dozen iterations. To get a reliable measurement of the time per iteration, you can add a torch.cuda.synchronize() call before reading the stop time. (Note that you probably want to remove it before running a long training job, as the extra synchronization can reduce performance.)
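For example, a minimal sketch of the same loop with synchronization added (the only new line is the synchronize() call):

with torch.no_grad():
  for idx in range(iterations):
    print(f'Iteration {idx+1}')
    start = time.time()
    Y = NN(X)
    torch.cuda.synchronize()  # block until all queued GPU work has finished
    print(f'Time: {time.time() - start}')

With this in place, each iteration should report roughly the same time from the start (apart from some warm-up on the very first iterations), because every measurement now includes the actual GPU compute rather than just the launch overhead.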