Training time gets slower and slower on CPU

Hi all,

I am facing an issue when training an autoencoder on CPU (I am designing a lab for students to be made on a platform with no GPU, but the problem I will describe does not happen on GPU).

After some point, the time per epoch starts increasing a lot. I provide a minimal working example below to reproduce the issue. At the beginning, one "get the batch, forward, backward, and optimizer step" iteration takes about 0.25 seconds; after some time, it can be 2 or 3 times higher. This is what the graph below shows:

I tried many things (including the gc.collect() and explicit del calls you can see in the code below), but nothing helped.

The issue arises on "PyTorch 1.10.2 + CUDA 11.3", on "PyTorch 1.10.2, CPU only", and on "PyTorch with CUDA 11.1 on Google Colab".

Here is a minimal working example to reproduce the issue:

# coding=utf-8

import argparse
import time
import gc
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Convolutional autoencoder for 32x32 single-channel images."""

    def __init__(self, h=128, e=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, h, 5, 1, 0), nn.ReLU(),
            nn.Conv2d(h, h, 5, 1, 0), nn.ReLU(),
            nn.Conv2d(h, h, 4, 2, 0), nn.ReLU(),
            nn.Conv2d(h, h, 3, 2, 0), nn.ReLU(),
            nn.Conv2d(h, e, 5, 1, 0),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(e, h, 5, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(h, h, 3, 2, 0), nn.ReLU(),
            nn.ConvTranspose2d(h, h, 4, 2, 0), nn.ReLU(),
            nn.ConvTranspose2d(h, h, 5, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(h, 1, 5, 1, 0),
        )

    def forward(self, x):
        x = self.decoder(self.encoder(x))
        return x

model = AutoEncoder(32, 32)
print(model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.002, weight_decay=1e-4)
train_images = torch.randn(16384, 1, 32, 32)  # 16384 random 1x32x32 samples as dummy training data
for epoch in range(40):
    for i, batch in enumerate(train_images.split(64)):
        # Time one forward/backward/optimizer step.
        tic = time.perf_counter()
        output = model(batch)
        loss = 0.5 * (output - batch).pow(2).sum() / batch.size(0)  # reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        toc = time.perf_counter()
        gc.collect()
        del loss, output, batch

        print(f"epoch {epoch} {i} took {toc-tic:.2f} seconds")

Again, I have stripped the code down just to show the issue. Do you have any idea what is going on? As I am designing a lab for students, I would like to avoid hacks like "reload the model every epoch to keep the training fast".

Thanks for your help.

Could you check if your workstation is reducing its clocks due to overheating, etc.?
Based on your description, I would probably start by profiling the system and making sure it can run at a high load for an extended period of time.
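
For reference, one quick way to spot-check the clocks and temperatures is psutil (a sketch, assuming psutil is installed; sensors_temperatures() is Linux/FreeBSD only):

import psutil

# Current, min, and max CPU frequency in MHz.
print(psutil.cpu_freq())

# Per-sensor temperatures (available on Linux and FreeBSD only).
print(psutil.sensors_temperatures())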

Thanks a lot for your reply. I tried to monitor the CPU frequency; I am not sure I did it properly, but I used psutil.cpu_freq().current.

However, I have run the same experiment on many different platforms and the behaviour is consistent across them. The reported frequency does not seem to change (although I am not sure this is the right way to check whether a processor is down-clocking). And since the slowdown always starts around minibatch 1500, it would be strange if down-clocking were the cause.

On the graph below,

  • the x-axis is the number of minibatches used for the forward/backward/step,
  • the left y-axis is the time it takes for one forward/backward/step,
  • the right y-axis is the CPU frequency returned by psutil.cpu_freq().current.

This is strikingly consistent across platforms: once 1500 minibatches have been forwarded, the time increases. This is not the case on the AMD processor. On the JupyterHub/Kubernetes setup the trend is less obvious, but the time still increases a bit from minibatch 1500 onward.
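
For reference, a sketch of how the per-step time and CPU frequency can be logged together (assuming psutil is installed, with model, optimizer, and train_images defined as in the example above; the plotting itself is omitted):

import time
import psutil

step_times, cpu_freqs = [], []
for i, batch in enumerate(train_images.split(64)):
    tic = time.perf_counter()
    output = model(batch)
    loss = 0.5 * (output - batch).pow(2).sum() / batch.size(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    step_times.append(time.perf_counter() - tic)   # seconds per minibatch
    cpu_freqs.append(psutil.cpu_freq().current)    # CPU frequency in MHz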

I also tried (not shown here) PyTorch 1.1, 1.6, 1.7, 1.8, and 1.9, all showing the same trend.

Does that ring any bells? I am less sure about the AMD result, as it ran on a server where I do not control what else is going on. I will try to find a laptop with an AMD processor so that I control what is running on the machine.

Thanks for your help.

Hi,

Could you try adding torch.set_flush_denormal(True) at the top of your script to see if that fixes the issue?
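
For reference, that call would go right after the imports; it returns True only if the platform supports flushing denormals (x86 with SSE3):

import torch

# Flush denormal (subnormal) floats to zero on the CPU.
supported = torch.set_flush_denormal(True)
print("flush denormal supported:", supported)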

Thanks a lot for your suggestion: adding torch.set_flush_denormal(True) solved the issue. I ran each experiment multiple times and the results are consistent: no more increase in time after step 1500.

Thanks a lot.

Great to hear!

For future reference, this happens when numbers get very close to 0 and become denormal (subnormal) floats. To preserve precision for these tiny values, the CPU has to fall back to much slower arithmetic, which slows everything down.
This happened to me before on "toy" examples where the model converges very well, so a lot of the values end up very close to 0.
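
To make the effect concrete, here is a rough micro-benchmark sketch (not from this thread; exact timings depend on the CPU) comparing an elementwise multiply on ordinary float32 values with the same multiply on denormal values, before and after flushing denormals to zero:

import time
import torch

def bench(x, repeats=10):
    # Average time of one elementwise multiply over `repeats` runs.
    tic = time.perf_counter()
    for _ in range(repeats):
        _ = x * 1.5
    return (time.perf_counter() - tic) / repeats

normal = torch.full((4096, 4096), 1e-3)   # ordinary float32 values
tiny = torch.full((4096, 4096), 1e-40)    # below the float32 normal range -> denormal

print(f"normal values:     {bench(normal):.4f} s")
print(f"denormal values:   {bench(tiny):.4f} s")   # typically much slower on x86 CPUs

torch.set_flush_denormal(True)                     # treat denormals as zero from now on
print(f"denormals flushed: {bench(tiny):.4f} s")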
