Training gets slower with each batch

cosmmb · June 30, 2017, 1:22am

Hi there,

I have a pre-trained model, and I added an actor-critic method to the model and trained only the RL-related parameters (I fixed the parameters from the pre-trained model). However, I noticed that training gets gradually slower with each batch, and memory usage on the GPU also increases. For example, the first batch only takes 10s and the 10,000th batch takes 40s to train.

I am sure that all the pre-trained model’s parameters have been set to “autograd=false”. Only four parameters are being updated in the current program. I also noticed that changing the gradient clipping threshold mitigates this phenomenon, but the training still eventually gets very slow. For example, if I do not use any gradient clipping, the 1st batch takes 10s and the 100th batch takes 400s to train. And if I set gradient clipping to 5, the 100th batch only takes 12s (compared to 10s for the 1st batch).

FYI, I am using SGD with learning rate equal to 0.0001.

Does anyone know what is going wrong with my code? I have been working on fixing this problem for two weeks…

Many Thanks!

albanD · June 30, 2017, 9:00am

Hi,

This is most likely due to your training loop holding on to some things it shouldn’t.
You should make sure to wrap your input into a Variable at every iteration. Also make sure that you are not storing temporary computations in an ever-growing list without deleting them.

cosmmb · June 30, 2017, 5:08pm

Thanks for your reply! Your suggestions are really helpful. I deleted some variables that I generated during training for each batch. Now the memory usage no longer increases, but training still gets slower batch by batch. I double-checked the calculation of the loss and did not find anything that is accumulated from the previous batch. Do you know why it is still getting slower?

I also tried another test. For example, the average training time for epoch 1 is 10s. After I trained this model for a few hours, the average training time for epoch 10 had slowed down to 40s. So I stopped the training, loaded the learned parameters from epoch 10, and restarted the training from epoch 10. I thought that if anything related to accumulated memory were slowing down the training, restarting would help. However, after I restarted the training from epoch 10, it got even slower: it now takes 50s per epoch.

Thanks!

albanD · July 3, 2017, 8:18am

Hi,

If you are using a custom network/loss function, it is also possible that the computation gets more expensive as you get closer to the optimal solution?

To track this down, you could get timings for different parts separately: data loading, network forward, loss computation, backward pass and parameter update. Hopefully just one will increase and you will be able to see better what is going on.
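A minimal sketch of this kind of per-stage timing, using only the standard library. The `timed` helper and the `load_batch`, `forward` and `compute_loss` stand-ins are hypothetical placeholders for the real training code, not part of PyTorch. On a GPU, CUDA calls are asynchronous, so you would also want to call `torch.cuda.synchronize()` before reading the clock, otherwise the time may be attributed to the wrong stage.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> list of durations in seconds, one entry per iteration
timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


# Hypothetical stand-ins for the real data loading / model / loss code.
def load_batch():
    return list(range(1000))


def forward(batch):
    return [value * 0.5 for value in batch]


def compute_loss(outputs):
    return sum(outputs) / len(outputs)


for step in range(3):
    with timed("data loading"):
        batch = load_batch()
    with timed("forward"):
        outputs = forward(batch)
    with timed("loss"):
        loss = compute_loss(outputs)
    # In a real loop, wrap loss.backward() and optimizer.step()
    # in their own timed(...) blocks as well.

# Compare the first and last iteration of each stage to see which one grows.
for stage, durations in timings.items():
    print(f"{stage}: first={durations[0]:.6f}s last={durations[-1]:.6f}s")
```

If only one stage gets slower over time, that narrows down where to look.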

ywu36 · September 9, 2017, 1:14am

Problem confirmed. When generating training data on-the-fly, the speed is very fast at the beginning but slows down significantly after a few iterations (3000). It becomes at least 2-3 times slower.

dslate · November 1, 2017, 2:36pm

I have observed a similar slowdown in training with pytorch running under R using the reticulate package.
System: Linux pixel 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04.2 LTS
R version 3.4.2 (2017-09-28) with reticulate_1.2
Python 3.6.3 with pytorch version ‘0.2.0_3’

Net architecture:

Sequential (
(Linear-1): Linear (277 -> 8)
(PReLU-1): PReLU (1)
(Linear-2): Linear (8 -> 6)
(PReLU-2): PReLU (1)
(Linear-3): Linear (6 -> 4)
(PReLU-3): PReLU (1)
(Linear-Last): Linear (4 -> 1)
)

Loss function: BCEWithLogitsLoss()
The run was CPU only (no GPU). Although the system had multiple Intel Xeon E5-2640 v4 cores @ 2.40GHz, this run used only 1.
The net was trained with SGD, batch size 32. Each batch contained a random selection of training records. There was a steady drop in the number of batches processed per second over the course of 20000 batches, such that the last batches were about 4 times slower than the first. Although memory requirements did increase over the course of the run, the system had a lot more memory than was needed, so the slowdown could not be attributed to paging.

dslate · November 6, 2017, 7:35pm

Turns out I had declared the Variable tensors holding a batch of features and labels outside the loop over the 20000 batches, then filled them up for each batch. Moving the declarations of those tensors inside the loop (which I thought would be less efficient) solved my slowdown problem. Now the final batches take no more time than the initial ones.

fengziyue · February 3, 2018, 4:20am

I had the same problem as you, and your solution fixed it. Do you know why moving the declaration inside the loop solves it?

albanD · February 5, 2018, 10:32am

Hi,

It is because, since you’re working with Variables, the history is saved for every operation you’re performing. And when you call backward(), the whole history is scanned.
So if you have a shared element in your training loop, the history just keeps growing and so the scanning takes more and more time.
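A small sketch of this effect, assuming PyTorch 0.4 or later where Tensors track history directly. The `count_graph_nodes` helper is hypothetical (it is not a PyTorch function); it just walks `grad_fn.next_functions` to count how many autograd nodes are reachable from a tensor. It is a simplified illustration, not a reproduction of the code discussed above.

```python
import torch


def count_graph_nodes(tensor):
    """Count the autograd nodes reachable from tensor.grad_fn."""
    seen = set()
    stack = [tensor.grad_fn] if tensor.grad_fn is not None else []
    while stack:
        node = stack.pop()
        if node is None or node in seen:
            continue
        seen.add(node)
        stack.extend(fn for fn, _ in node.next_functions)
    return len(seen)


w = torch.ones(1, requires_grad=True)

# Shared element: 'state' is created once and carried across iterations,
# so every iteration appends to the same history.
state = torch.zeros(1)
shared_sizes = []
for _ in range(5):
    state = state + w * 2.0
    shared_sizes.append(count_graph_nodes(state))

# Fresh element: 'state' is re-created inside the loop,
# so each iteration starts with an empty history.
fresh_sizes = []
for _ in range(5):
    state = torch.zeros(1)
    state = state + w * 2.0
    fresh_sizes.append(count_graph_nodes(state))

print("shared:", shared_sizes)  # grows with every iteration
print("fresh: ", fresh_sizes)   # stays constant
```

The same walk over `grad_fn` can be used as a rough check for whether a tensor in your own loop is dragging an ever-growing graph behind it.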

negrinho · May 11, 2018, 7:44pm

What is the right way of handling this now that Tensor also tracks history?

I migrated to PyTorch 0.4 (e.g., removed some code wrapping tensors into variables), and now the training loop is getting progressively slower. I’m not sure where this problem is coming from. Is there a way of drawing the computational graphs that are currently being tracked by PyTorch? These issues seem hard to debug.

acgtyrant · July 5, 2018, 2:19am

If a shared tensor does not have requires_grad set, is its history still scanned?

albanD · July 5, 2018, 11:24am

No, if a tensor does not require grad, its history is not built when using it. Note that you cannot change this attribute after the forward pass to change how the backward pass behaves on an already created computational graph. It has to be set to False while you create the graph.
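As a rough illustration, assuming PyTorch 0.4 or later: the `pretrained` and `head` modules below are hypothetical stand-ins for a frozen pre-trained model and a small trainable part. The flags are set before the forward pass, so no history is recorded for the frozen part.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained model whose parameters are fixed.
pretrained = nn.Linear(4, 4)
for param in pretrained.parameters():
    param.requires_grad = False  # must be set before the forward pass

# Hypothetical stand-in for the small trainable part.
head = nn.Linear(4, 1)

x = torch.randn(2, 4)
features = pretrained(x)  # nothing here requires grad, so no history is built
value = head(features)    # history starts at the trainable layer

print(features.requires_grad, features.grad_fn)  # False None
print(value.requires_grad, value.grad_fn is not None)  # True True
```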

mciccone · October 13, 2018, 6:53am

I’m experiencing the same issue with PyTorch 0.4.1.
I implemented adversarial training with the cleverhans wrapper, and the training time increases with each batch.
How can I track the problem down to find a solution?

Bassel · January 3, 2019, 9:51pm

Hi, why does the speed slow down when generating data on-the-fly (reading every batch from the hard disk while training)? Does that continue forever, or does the speed stay the same after a number of iterations?

marcinplata · February 21, 2019, 9:07am

I observed the same problem. The solution in my case was to replace itertools.cycle() on the DataLoader with a standard iter(), handling the StopIteration exception. You can also check whether /dev/shm usage increases during training.
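A sketch of that pattern is below. `itertools.cycle()` is documented to save a copy of every item it yields, so wrapped around a DataLoader it keeps every batch of the first pass alive and replays those copies instead of iterating the loader again. The list `loader` here is a hypothetical stand-in for a real DataLoader so the example runs without any dataset.

```python
# Hypothetical stand-in for a torch.utils.data.DataLoader with three batches.
loader = [[0, 1], [2, 3], [4, 5]]

iterator = iter(loader)
seen = []
for step in range(7):
    try:
        batch = next(iterator)
    except StopIteration:
        # The loader is exhausted: start a new pass instead of replaying
        # cached batches. With a real DataLoader this also reshuffles
        # when shuffle=True.
        iterator = iter(loader)
        batch = next(iterator)
    seen.append(batch)

print(seen)
```

Nothing from a finished pass is kept around, so memory use should stay flat across passes.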

unnir · February 27, 2019, 10:54am

Have the same issue:

0%| | 0/66 [00:00<?, ?it/s]
2%|▏ | 1/66 [05:53<6:23:05, 353.62s/it]
3%|▎ | 2/66 [06:11<4:29:46, 252.91s/it]
5%|▍ | 3/66 [06:28<3:11:06, 182.02s/it]
6%|▌ | 4/66 [06:41<2:15:39, 131.29s/it]
8%|▊ | 5/66 [06:43<1:34:15, 92.71s/it]
9%|▉ | 6/66 [06:46<1:05:41, 65.70s/it]
11%|█ | 7/66 [06:49<46:00, 46.79s/it]
12%|█▏ | 8/66 [06:51<32:26, 33.56s/it]
14%|█▎ | 9/66 [06:54<23:04, 24.30s/it]
15%|█▌ | 10/66 [06:57<16:37, 17.81s/it]
17%|█▋ | 11/66 [06:59<12:09, 13.27s/it]
18%|█▊ | 12/66 [07:02<09:04, 10.09s/it]
20%|█▉ | 13/66 [07:05<06:56, 7.86s/it]
21%|██ | 14/66 [07:07<05:27, 6.30s/it]

Cannot understand this behavior… sometimes it takes 5 minutes for a mini batch or just a couple of seconds.

My first epoch took me just 5 minutes.

94%|█████████▍| 62/66 [05:06<00:15, 3.96s/it]
95%|█████████▌| 63/66 [05:09<00:10, 3.56s/it]
97%|█████████▋| 64/66 [05:11<00:06, 3.29s/it]
98%|█████████▊| 65/66 [05:14<00:03, 3.11s/it]

unnir · February 27, 2019, 1:08pm

It turned out the batch size matters. So, my advice is to select a smaller batch size, and also to play around with the number of workers.

satheesh · March 5, 2019, 5:58pm

Hi, could you please explain how to clear the temporary computations?

thanks,

albanD · March 5, 2019, 6:34pm

You should not keep a Tensor that has requires_grad=True from one iteration to the next. If you want to save it for later inspection (or to accumulate the loss), you should .detach() it first, so that PyTorch knows you won’t try to backpropagate through it.
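For example, a minimal training loop that keeps the loss around without keeping its graph might look like the sketch below (assuming PyTorch 0.4 or later). The model, data and hyperparameters are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up model and data, just to have something to train.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

running_loss = 0.0  # plain Python float, carries no autograd history
loss_history = []   # detached tensors kept for later inspection

for step in range(5):
    x = torch.randn(8, 4)
    y = torch.randn(8, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    # .item() converts the scalar loss to a Python number.
    running_loss += loss.item()
    # .detach() returns a tensor that is cut off from the graph.
    loss_history.append(loss.detach())

print(running_loss / 5)
```

Writing `running_loss += loss` instead would turn `running_loss` into a tensor that carries history, which is the pattern to avoid.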

sizhky · April 26, 2019, 7:39pm

The answer comes from here - Why the training slow down with time if training continuously? And Gpu utilization begins to jitter dramatically?

I used torch.cuda.empty_cache() at the end of every loop iteration.