Lower accuracy after speeding up the training process

I am training an object detection model (it draws a bounding box around the target object and predicts its class) with a very common training loop, shown below:

for i, (inputs, labels) in enumerate(data_loader):

    optimizer.zero_grad()

    inputs = inputs.to(device)
    labels = labels.to(device)

    preds = model(inputs)
    loss  = criterion(preds, labels)

    loss.backward() 

    optimizer.step()
    

The training is very time-consuming, and I am limited by my hardware, so I looked for strategies that can speed up the training process and reduce the training time. I tried them, but the accuracy is worse than it was before the changes.
(Before the changes, even when the predicted class was incorrect, the bounding box was drawn at an approximately correct position. After the changes, both are incorrect.)

These are the strategies I tried to implement: 1) enable num_workers, 2) set pin_memory=True, 3) use the cuDNN autotuner, 4) use Automatic Mixed Precision (AMP), 5) use gradient accumulation.

Below is the modified code:

from torch.cuda.amp import autocast, GradScaler

accumulation_steps = 4
torch.backends.cudnn.benchmark = True
scaler = GradScaler()

data_loader = DataLoader(..., num_workers=num_workers, pin_memory=True)

for i, (inputs, labels) in enumerate(data_loader):

    optimizer.zero_grad()

    inputs = inputs.to(device)
    labels = labels.to(device)

    with autocast():
        preds = model(inputs)
        loss  = criterion(preds, labels)
        loss = loss/accumulation_steps

    scaler.scale(loss).backward()

    if ((i + 1) % accumulation_steps == 0) or (i + 1 == len(data_loader)):
        scaler.step(optimizer)
        scaler.update()
    

These strategies are only supposed to speed up the training process; they should have no impact on the accuracy of the model. But I really have no idea why the performance got so much worse after applying them. Can someone help me out with this?

Could you try applying these changes separately and see how the results change?
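
For example, something along these lines (a rough sketch with placeholder names) lets you turn one change on at a time and compare the accuracy against your original run:

use_workers    = True    # 1) num_workers
use_pin_memory = False   # 2) pin_memory
use_cudnn_auto = False   # 3) cuDNN autotuner

torch.backends.cudnn.benchmark = use_cudnn_auto

data_loader = DataLoader(
    dataset,                # placeholder: your detection dataset
    batch_size=batch_size,  # placeholder: your batch size
    shuffle=True,
    num_workers=num_workers if use_workers else 0,
    pin_memory=use_pin_memory,
)

AMP and gradient accumulation can be disabled in the same way, by switching back to the plain loss.backward() / optimizer.step() loop for that run.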


I am removing the changes one by one to see how the results change.

I found that the docs mention combining AMP with gradient accumulation:
https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation
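
For reference, here is a minimal sketch of the pattern from that page, adapted to the names used in my loop above (scaler is a torch.cuda.amp.GradScaler, everything else is as before):

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
accumulation_steps = 4

for i, (inputs, labels) in enumerate(data_loader):
    inputs = inputs.to(device)
    labels = labels.to(device)

    with autocast():
        preds = model(inputs)
        loss  = criterion(preds, labels)
        loss  = loss / accumulation_steps  # normalize over the accumulated batches

    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        # gradients are cleared only after the optimizer step,
        # so they accumulate across the preceding iterations
        optimizer.zero_grad()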

The problem might come from cuDNN. I will try disabling it.
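
That is, setting the flag from my modified code back to its default:

import torch

torch.backends.cudnn.benchmark = False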

Can you try training with the torchvision reference scripts? They should have a proper recipe for getting good results in a reasonable amount of time.