I’m encountering strange behavior when training any model: models that previously needed little time to complete a training epoch now take much longer. For example, a relatively simple model like DGCNN, with a fixed dataset and trained on CUDA, used to take approximately 100 seconds per epoch, including the forward and backward passes, the loss computation, and other small computations in the training script, but now the exact same script takes 30 minutes!
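For reference, this is roughly how the per-epoch time is measured (a simplified sketch, not my actual script; `model`, `loader`, `criterion`, and `optimizer` are placeholders):

```python
import time
import torch

# Simplified sketch of the per-epoch timing; "model", "loader", "criterion"
# and "optimizer" stand in for the actual DGCNN training objects.
def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    model.train()
    start = time.time()
    for points, labels in loader:
        points, labels = points.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(points)            # forward pass
        loss = criterion(logits, labels)  # loss computation
        loss.backward()                   # backward pass
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()          # wait for pending GPU kernels before stopping the clock
    return time.time() - start
```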
The same behavior occurs when training other models, which have also become very slow.
I also noticed that CPU training has slowed down (could that be related?).
I originally encountered this problem with PyTorch 2.2 and CUDA 12.1, so, thinking it was due to those versions, I did a clean reinstall of both, but the same problem also occurs with PyTorch 2.4 and CUDA 12.4.
Can anyone tell me what the cause could be and/or what to check?