I'm currently doing contrastive learning with a dual-stream model: one XLM-RoBERTa encoder and one CLIP text encoder. I load the pretrained parameters, add a new pooler that projects the [CLS] embedding into a shared space, and train with an InfoNCE loss.
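For context, here is a minimal sketch of what I mean by this setup (the class name DualStreamITC, the projection dimension, the temperature, the batch layout, and the checkpoint names are placeholders for illustration, not my actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from transformers import XLMRobertaModel, CLIPTextModel

    class DualStreamITC(nn.Module):
        def __init__(self, proj_dim=256, temp=0.05):
            super().__init__()
            self.text_encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
            self.clip_text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
            # new poolers projecting the [CLS]/EOS hidden states into a shared space
            self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, proj_dim)
            self.clip_proj = nn.Linear(self.clip_text_encoder.config.hidden_size, proj_dim)
            self.temp = nn.Parameter(torch.tensor(temp))

        def forward(self, batch):
            input_ids, attention_mask, clip_input_ids, clip_attention_mask = batch
            cls_a = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
            cls_b = self.clip_text_encoder(clip_input_ids, attention_mask=clip_attention_mask).pooler_output
            feat_a = F.normalize(self.text_proj(cls_a), dim=-1)
            feat_b = F.normalize(self.clip_proj(cls_b), dim=-1)
            # InfoNCE with in-batch negatives: symmetric cross-entropy over the similarity matrix
            logits = feat_a @ feat_b.t() / self.temp
            targets = torch.arange(logits.size(0), device=logits.device)
            loss_itc = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
            return {"loss_itc": loss_itc}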
As the loss curve shows, the contrastive loss drops drastically, reaching nearly zero at around 2000 steps (sometimes even faster with a larger learning rate). If training continues, the loss suddenly jumps to about 4, which I suspect is caused by a gradient explosion.
I've added autocast and a GradScaler for fp16 training, and I think they're used correctly; here is a snippet:
import torch
from torch.cuda.amp import autocast

def ttc_iter(model, batch, optimizer, scaler, metric_logger, device):
    # move the batch to the device (some entries may be None)
    train_batch = [t.to(device) if t is not None else None for t in batch]
    optimizer.zero_grad()
    # forward pass under autocast for fp16
    with autocast():
        loss = model(train_batch)
    # scale the loss before backward
    scaler.scale(loss['loss_itc']).backward()
    metric_logger.update(loss_mln_eng_ttc=loss['loss_itc'].item())
    # unscale before clipping so the threshold applies to the real gradients
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss
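And this is roughly how the function is driven each epoch (the dataloader and metric_logger here are simplified placeholders, not my real training loop):

    from torch.cuda.amp import GradScaler

    scaler = GradScaler()
    model.train()
    for step, batch in enumerate(dataloader):
        loss = ttc_iter(model, batch, optimizer, scaler, metric_logger, device)
        if step % 50 == 0:
            print(f"step {step}: loss_itc = {loss['loss_itc'].item():.4f}")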
I can't figure out where the problem occurs. Maybe it's somewhere in the model, since I added extra encoder layers on top of the pretrained encoder structure? Thanks in advance for any suggestions!