How to handle criterion with trainable params in DDP setup?

Related thread


I have this loss:

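import torch
import torch.nn as nn
import torch.nn.functional as F

# ctc_symbols is defined elsewhere in the project (not shown here).

class Tacotron2Loss(nn.Module):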
	def __init__(self, hparams):
		super(Tacotron2Loss, self).__init__()
		self.gate_loss_fn = nn.BCEWithLogitsLoss()
		self.emotion_loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)

		num_losses = 3
		self.use_mmi = hparams.use_mmi
		if self.use_mmi:
			self.ctc_loss_fn = torch.nn.CTCLoss(
				blank=len(ctc_symbols), reduction='none')
			num_losses += 1

		# loss weights
		self.eta = nn.Parameter(torch.ones(num_losses, dtype=torch.float32))

	@staticmethod
	def masked_l2_loss(out, target, lengths):
		num_not_padded = lengths.sum() * out.size(1)
		loss = F.mse_loss(out, target, reduction="sum")
		loss = loss / num_not_padded
		return loss

	def forward(self, y_pred, y, output_lengths):
		mel_target, gate_target, ctc_text, ctc_text_lengths, emotion_label = y
		# mel_target.requires_grad = False
		# gate_target.requires_grad = False
		gate_target = gate_target.view(-1, 1)

		_, mel_out, mel_out_postnet, gate_out, _, log_probs, emotion_weights = y_pred

		gate_out = gate_out.view(-1, 1)

		losses = []

		mel_loss = self.masked_l2_loss(mel_out, mel_target, output_lengths) + \
			self.masked_l2_loss(mel_out_postnet, mel_target, output_lengths)
		losses.append(mel_loss)

		gate_loss = self.gate_loss_fn(gate_out, gate_target)
		losses.append(gate_loss)

		emotion_loss = self.emotion_loss_fn(emotion_weights, emotion_label)
		losses.append(emotion_loss)

		if self.use_mmi:
			ctc_loss = (self.ctc_loss_fn(log_probs, ctc_text, output_lengths, ctc_text_lengths) /
						output_lengths.float()).mean()
			losses.append(ctc_loss)

		total_loss = torch.stack(losses) * torch.exp(-self.eta) + self.eta
		return total_loss.sum(), losses, self.eta

Then I put it in the optimizer like this:

optimizer = torch.optim.AdamW(list(
		model.parameters()) + list(criterion.parameters()), lr=hparams.learning_rate)

So, what is the right way to use it in a DDP setup?
Should I put the criterion in the main model's forward function as a submodule, wrap the criterion in DDP separately, or do something else?

IMHO, adding trainable parameters to the loss function makes it part of the network being trained, so we need to think outside the box a bit here. What I reckon you could do is wrap the criterion into your network. This requires a small change to how people usually write the forward() method. Here is an example.

def forward(self, x, y=None):
    # Regular forward pass
    output = self.model(x)
    # Insert your criterion here
    if self.training:
        assert y is not None, "Target should be passed during training."
        loss = self.criterion(output, y)
        return loss
    return output

Basically, the snippet above computes the loss as part of the model's forward pass. Because the criterion is now a submodule of your network, its parameters are included in the network's parameters(), so you no longer have to add them to the optimiser separately. And once the whole module is wrapped in torch.nn.parallel.DistributedDataParallel, the gradients of the criterion's parameters are all-reduced across GPUs during the backward pass just like those of the rest of the network, which keeps them in sync.
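To make the idea concrete, here is a minimal sketch of such a wrapper. The names ToyCriterion and ModelWithLoss are hypothetical and only for illustration; the toy criterion stands in for a loss with trainable weights like eta, not for the actual Tacotron2Loss. The DDP wrapping is left as a comment because it needs an initialised process group, and local_rank there is an assumed variable from your launcher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCriterion(nn.Module):
    """Hypothetical stand-in for a loss with trainable weights (like `eta`)."""

    def __init__(self, num_losses=2):
        super().__init__()
        self.eta = nn.Parameter(torch.ones(num_losses))

    def forward(self, output, target):
        # Two toy loss terms, combined with the same exp(-eta) weighting
        # pattern used in the question.
        losses = torch.stack([
            F.mse_loss(output, target),
            F.l1_loss(output, target),
        ])
        return (losses * torch.exp(-self.eta) + self.eta).sum()


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: holds the network and the criterion together."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, y=None):
        output = self.model(x)
        if self.training:
            assert y is not None, "Target should be passed during training."
            return self.criterion(output, y)
        return output


torch.manual_seed(0)
wrapper = ModelWithLoss(nn.Linear(4, 3), ToyCriterion())

# In a real DDP job (after torch.distributed.init_process_group), the wrapper
# would be wrapped here. `local_rank` is assumed to come from the launcher:
# wrapper = nn.parallel.DistributedDataParallel(wrapper, device_ids=[local_rank])

# One parameter list covers both the network and the criterion.
optimizer = torch.optim.AdamW(wrapper.parameters(), lr=1e-3)

x, y = torch.randn(8, 4), torch.randn(8, 3)
loss = wrapper(x, y)   # training mode: forward returns the scalar loss
loss.backward()
optimizer.step()

param_names = [name for name, _ in wrapper.named_parameters()]
```

With this layout, criterion.eta shows up in wrapper.named_parameters() next to the network weights, so a single optimizer and a single DDP wrapper cover everything. Calling wrapper.eval() switches forward back to returning the raw model output.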

Hope this helps.

Is there any difference between adding the criterion to the model and keeping it separate but wrapped in DDP on its own?