Training time increases

Mohammad_Hassan_Soha · May 24, 2020, 8:46pm

Hi

I am training a speaker recognition model based on ResNet34 (with minor modifications). The code is available on GitHub. During the first epochs, each epoch takes about 15 seconds.

After a while, the time to complete an epoch rises to about 30 seconds.

I really can’t understand what is going wrong! There is plenty of free RAM, and the GPU memory footprint does not change during this transition.

P.S. I first saw this behavior after adding a learning rate scheduler. I’m not sure whether it is related to the scheduler or not.
Another weird behavior is that killing and restarting the process (even after rebooting the system!) does not reset the epoch time to 15 seconds! The only fix I have found is to power the system off and on again!

Any help is appreciated.
regards

ptrblck · May 25, 2020, 4:16am

Could you check the temperature and clocks of your system?
Does the performance recover after you kill the process and wait for some time?
Maybe your system is overheating and thus reducing the clocks.
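
One way to run this check on an NVIDIA GPU is to poll nvidia-smi while training and log the readings next to the epoch times. The following is a minimal sketch, not something posted in the thread: the helper names and dictionary keys are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the query fields temperature.gpu, clocks.sm, clocks.mem and power.draw.

```python
import subprocess

# Query fields passed to nvidia-smi; the dictionary keys are names chosen for this sketch.
FIELDS = {
    "temperature_c": "temperature.gpu",
    "sm_clock_mhz": "clocks.sm",
    "mem_clock_mhz": "clocks.mem",
    "power_draw_w": "power.draw",
}


def _to_float(text):
    """Convert one reading to float; return None for values such as '[N/A]'."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        stats.append({key: _to_float(v) for key, v in zip(FIELDS, values)})
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed readings (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS.values()),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling query_gpu_stats() once per epoch and comparing the readings from a 15-second epoch with those from a 30-second epoch would show whether the SM clock drops when the slowdown starts.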

Mohammad_Hassan_Soha · May 25, 2020, 3:25pm

Thanks a lot, ptrblck. I removed the learning rate scheduler and the problem went away! So I have not reproduced the situation to check for overheating. I’m wondering whether I am doing something wrong with the learning rate scheduler.
Currently I have replaced scheduler.step() with:

if (epoch + 1) % 5000 == 0:
    for g in optimizer.param_groups:
        g['lr'] = g['lr'] * 0.75

and I’m waiting to see whether the problem comes back or not, but the tests are a little time consuming…
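
For reference, the manual decay above should follow the same schedule as StepLR with step_size=5000 and gamma=0.75 when scheduler.step() is called once per epoch. The sketch below checks that in plain Python, without PyTorch. The function names are hypothetical, and it tracks a single learning rate rather than every parameter group.

```python
import math


def manual_lr(base_lr, num_epochs, step_size=5000, gamma=0.75):
    """Learning rate after num_epochs epochs using the manual rule from the post."""
    lr = base_lr
    for epoch in range(num_epochs):
        # Same condition as the snippet above: decay at the end of every step_size-th epoch.
        if (epoch + 1) % step_size == 0:
            lr = lr * gamma
    return lr


def step_decay_lr(base_lr, num_epochs, step_size=5000, gamma=0.75):
    """Closed form of a step decay: one multiplication by gamma per completed step_size epochs."""
    return base_lr * gamma ** (num_epochs // step_size)


# Both rules give the same learning rate after 12000 epochs (two decays).
assert math.isclose(manual_lr(1e-3, 12000), step_decay_lr(1e-3, 12000))
```

Because the two rules agree, swapping one for the other changes only which code path updates the learning rate, not the learning rate itself.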

ptrblck · May 26, 2020, 2:29am

Please let me know if you can deterministically reproduce the issue with the scheduler and avoid it by removing the scheduler.
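
To make such a comparison less dependent on watching the progress bar, per-epoch wall time can be recorded and checked against an early baseline. This is a minimal sketch under stated assumptions: EpochTimer, time_call, warmup_epochs and slowdown_factor are hypothetical names and defaults chosen for illustration, not part of the code in this thread.

```python
import time
from statistics import median


class EpochTimer:
    """Flags epochs that run much slower than the median of the first few epochs."""

    def __init__(self, warmup_epochs=5, slowdown_factor=1.5):
        self.warmup_epochs = warmup_epochs
        self.slowdown_factor = slowdown_factor
        self.durations = []

    def record(self, seconds):
        """Store one epoch duration; return True if it looks like a slowdown."""
        self.durations.append(seconds)
        if len(self.durations) <= self.warmup_epochs:
            return False  # still collecting the baseline
        baseline = median(self.durations[:self.warmup_epochs])
        return seconds > self.slowdown_factor * baseline


def time_call(fn):
    """Return the wall-clock seconds taken by calling fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start
```

In a training loop this could be used as timer.record(time_call(train_one_epoch)), where train_one_epoch stands for whatever function runs a single epoch. Logging the epoch index at which record() first returns True, in runs with and without the scheduler, gives a concrete number to compare.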

Mohammad_Hassan_Soha · May 28, 2020, 6:28am

Hi @ptrblck

I confirm that using the scheduler reproduces the issue (after about 8 hours of training), and removing the scheduler avoids it. I have now been running the model without the scheduler for 48 hours and have not observed the problem.
Here are all the scheduler-related lines of my code:

# optimizer
optimizer = torch.optim.Adam(
    [
        {'params': model.parameters(), 'lr': args.lr},
        {'params': criterion.parameters(), 'lr': args.criterion_lr}
    ]
)

# lr schedule
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer,
    step_size=args.step_size,
    gamma=args.gamma
)

for epoch in range(scheduler.last_epoch, args.num_epochs):
    for x, target in tqdm(dataloader):
        x = x.to(device)
        target = target.to(device)
        y = model(x)
        scores, loss = criterion(y, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    lr = optimizer.state_dict()['param_groups'][0]['lr']
    log.add_scalar('train-lr', lr, epoch + 1)

Could you please tell me whether I am using the scheduler in a wrong way?
regards

ptrblck · May 28, 2020, 6:32am

That’s really weird.
Is the linked code executable or would I need to get some input shapes or other arguments to run it?

Mohammad_Hassan_Soha · May 28, 2020, 6:48am

train.sh provides the minimal arguments needed to run the code, after downloading the data and converting the audio formats. You could drop the data loader and feed the model an NxCxHxT shaped tensor, where N=batch size (256), C=channels (1), H=number of filter banks (40 lmfb/mfsc) and T=time steps (200).

I also see that other repos (like the clovaai baseline for SR, or deepspeech.pytorch for ASR) avoid using schedulers. I’m not sure whether they had the same experience or not.

ptrblck · May 28, 2020, 6:54am

It’s the first time I see this issue, so I’ll try to run the code for some time and report back.
Please ping me in the next two days, if I haven’t posted an update.

Mohammad_Hassan_Soha · May 28, 2020, 6:59am

sure.

Thanks for your reply and patience.

Mohammad_Hassan_Soha · June 1, 2020, 5:57am

Hey @ptrblck

Did you manage to reproduce the issue? I have some free time and could spend it exploring the problem. Would it help if I wrote a test that reproduces it?

ptrblck · June 1, 2020, 7:10am

A minimal code snippet would be great, as I have trouble getting the code to run.
After removing the data loader, I get e.g. this error:

AttributeError: 'Namespace' object has no attribute 'num_spkr'

and train.sh doesn’t seem to specify this argument.

Mohammad_Hassan_Soha · June 1, 2020, 8:37am

The trunk network is trained as a classifier, and num_spkr defines the number of classes. Adding:
args.num_spkr = 5994
will resolve that error. But running the code without downloading and preprocessing the data is not straightforward, so I’ll write a snippet and let you know soon.

Mohammad_Hassan_Soha · June 22, 2020, 6:38am

The issue is unrelated to the PyTorch scheduler. Its onset happened to coincide with a hardware malfunction, which misled me.

In fact, the issue was caused by an underpowered PSU!

ptrblck · June 22, 2020, 7:34am

Thanks for the follow-up! That’s an unfortunate coincidence, but at least you’ve narrowed it down.

el_youssfi_azeddine · August 4, 2021, 8:11am

Hello Mohammad,
I think I have the same issue, and it seems you resolved it. Can you tell me what the cause was?
Also, what do you mean by an underpowered PSU? How can we recognize that, since the setup seems to be working fine?
Thanks
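
The thread does not say how the PSU problem was diagnosed. One possible starting point on NVIDIA GPUs is to ask the driver why it is lowering clocks. The sketch below is an assumption-laden illustration, not the method used above: the helper names are hypothetical, and the clocks_throttle_reasons.* query fields should be checked against the output of `nvidia-smi --help-query-gpu` for the installed driver, since field names can differ between driver versions.

```python
import subprocess

# Throttle-reason fields to query; verify these names against your driver version.
THROTTLE_FIELDS = [
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.hw_power_brake_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_thermal_slowdown",
]


def parse_throttle_flags(csv_line):
    """Map each queried reason (short name) to True when nvidia-smi reports 'Active'."""
    values = [v.strip() for v in csv_line.strip().split(",")]
    return {
        field.split(".")[-1]: value == "Active"
        for field, value in zip(THROTTLE_FIELDS, values)
    }


def query_throttle_flags(gpu_index=0):
    """Query one GPU through nvidia-smi (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        f"--id={gpu_index}",
        "--query-gpu=" + ",".join(THROTTLE_FIELDS),
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_throttle_flags(out)
```

A power-related flag that turns on at the moment the epoch time doubles would be a hint worth following up, but it does not prove a PSU fault by itself. Testing with a different or higher-rated power supply is the more direct check.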