Half precision training

I would like to know if these lines of code are arranged correctly:

lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, len(dataloader))               
for batch_i, (_, imgs, targets) in enumerate(dataloader):
    optimizer.zero_grad() 
    with torch.cuda.amp.autocast(enabled=use_amp):
        imgs = Variable(imgs.type(Tensor))
        targets = Variable(targets.type(Tensor), requires_grad=False)
        loss = model(imgs, targets)
    
    scaler.scale(loss).backward()

    # accumulate gradient for x batches before optimizing
    if ((batch_i + 1) % accumulated_batches == 0) or (batch_i == len(dataloader) - 1):
        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(optimizer)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
        scaler.step(optimizer)
        lr_scheduler.step()
        scaler.update()
        optimizer.zero_grad() # set_to_none=True here can modestly improve performance

Thanks

The AMP usage in your code snippet matches the AMP - Gradient accumulation example, so I think it's correct from that point of view.

However, from a general perspective, I think you should remove the first zero_grad() call inside the loop, as it would wipe out the gradients you are trying to accumulate.
Also, Variables have been deprecated since PyTorch 0.4, so you can use plain tensors now.
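
For reference, this is roughly how the loop could look with those two changes applied. It is only a sketch, assuming device, use_amp, and accumulated_batches are defined as in your setup:

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, len(dataloader))

optimizer.zero_grad()  # zero once before the loop instead of at every iteration
for batch_i, (_, imgs, targets) in enumerate(dataloader):
    # plain tensors instead of Variables; move them to the model's device
    imgs = imgs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(imgs, targets)

    scaler.scale(loss).backward()

    # step only every `accumulated_batches` iterations (or at the last batch)
    if ((batch_i + 1) % accumulated_batches == 0) or (batch_i == len(dataloader) - 1):
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
        scaler.step(optimizer)
        scaler.update()
        lr_scheduler.step()
        optimizer.zero_grad()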

Thanks for your reply.
I would also like to know: when I resume training a model like YOLOv3 with a Darknet backbone, should I use model.load_weights() with the Darknet weights or model.load_state_dict(checkpoint['model']), and what should I do with this piece of code:

Freeze the darknet53.conv.74 layers for the first few epochs:

if freeze_backbone:
    if epoch < 20:
        for name, p in model.named_parameters():
            if int(name.split('.')[1]) < 75:  # if layer < 75
                p.requires_grad = False
    else:  # epoch >= 20
        for name, p in model.named_parameters():
            if int(name.split('.')[1]) < 75:  # if layer < 75
                p.requires_grad = True

Also, when I wrote imgs = torch.tensor(imgs, requires_grad=True), it said it is recommended to use imgs.clone().detach().requires_grad_(True), and then I got:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

I don't know how load_weights is defined; it's not an nn.Module method, so I assume it's a custom Darknet method?

This code snippet freezes the backbone weights for the first 20 epochs. I would guess you have to keep these weights frozen until the model has been trained for 20 epochs in total. Depending on when the checkpoint was saved (before or after reaching 20 epochs), you might need to keep this freezing logic when you resume training.
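
A minimal sketch of how resuming could preserve that logic, assuming the checkpoint dict also stores the optimizer state and the epoch under 'optimizer' and 'epoch' keys (these key names, like the file name below, are just an assumption based on checkpoint['model'] from your snippet):

checkpoint = torch.load('checkpoint.pth', map_location='cuda:0')  # hypothetical file name
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
start_epoch = checkpoint['epoch'] + 1

for epoch in range(start_epoch, num_epochs):
    # the freeze/unfreeze block from your snippet keeps working, because `epoch`
    # now reflects how many epochs the model has already been trained for
    if freeze_backbone:
        ...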

This error points to a device mismatch, so make sure the input tensor is on the GPU via:

input = input.to('cuda:0')  # change 'cuda:0' to the desired GPU id
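
Applied to your training loop, that would mean moving the batch before the forward pass (again only a sketch; 'cuda:0' assumes a single-GPU setup):

imgs = imgs.to('cuda:0', non_blocking=True)
targets = targets.to('cuda:0', non_blocking=True)
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = model(imgs, targets)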

Thanks for your reply.
When reaching 20 epochs with the frozen weights, should I stop using the Darknet weights and use model.load_state_dict(checkpoint['model']) instead?

If "darknet weights" refers to a pretrained model and you've fine-tuned it for 20 epochs, I would assume you want to load your fine-tuned weights instead. So yes, I guess you might want to use model.load_state_dict(your_finetuned_checkpoint), but it depends on your actual use case.