I have this function for running one training epoch:
```python
def _run_epoch(self, epoch):
    # Tell the DistributedSampler which epoch this is so it reshuffles each epoch
    self.train_data.sampler.set_epoch(epoch)
    self.model.train()
    assert self.model.training, "Model is in eval mode while training"

    for idx, (_, img, fidt_map, point_map) in enumerate(self.train_data):
        img = img.to(self.gpu_id)
        fidt_map = fidt_map.type(torch.FloatTensor).unsqueeze(1).to(self.gpu_id)
        point_map = point_map.type(torch.FloatTensor).unsqueeze(1).to(self.gpu_id)

        output = self.model(img)
        loss = self.binary_fl(output, point_map)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```
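For context, the train loader is built around a `DistributedSampler`, which is why `set_epoch` is called above. It looks roughly like this (a simplified sketch; the dataset class and the `args` keys here are placeholders, not my exact code):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Sketch of the assumed setup: each DDP rank gets its own shard of the dataset.
train_dataset = MyCrowdDataset(args["train_list"])   # placeholder dataset class
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_data = DataLoader(
    train_dataset,
    batch_size=args["batch_size"],   # placeholder value
    sampler=train_sampler,           # shuffling is delegated to the sampler
    num_workers=args["workers"],     # placeholder value
    pin_memory=True,
)
```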
And this is the training loop:
```python
def train(self, max_epochs: int):
    for epoch in range(self.start_epoch, max_epochs):
        self._run_epoch(epoch=epoch)
        # One scheduler step per epoch, then log the current LR
        self.scheduler.step()
        self.writer.add_scalar(
            "Learning Rate", self.optimizer.param_groups[0]["lr"], epoch
        )
```
This is the scheduler:
```python
scheduler = CosineAnnealingLR(
    optimizer, T_max=args["epochs"], eta_min=args["eta_min"], verbose=False
)
```
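To show what I mean by "this curve", here is a minimal standalone snippet that steps the scheduler once per epoch exactly like my train loop and prints the LR (dummy model/optimizer; the base LR, `T_max`, and `eta_min` values are just assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter/optimizer purely to drive the scheduler
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for epoch in range(1000):
    optimizer.step()      # step the optimizer first to avoid the ordering warning
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```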
I set T_max equal to the number of epochs because I want this LR curve:
The curve above is what I get with DP (DataParallel) training. When I switch to DDP (DistributedDataParallel), this happens instead:
Which is really weird.
Check these values here:
The learning rate at epoch 660 goes crazy…
Does anyone know why this is happening?
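In case it helps, this is the kind of per-epoch logging I can add to capture the values (just a sketch, meant to go inside `train()` right after `self.scheduler.step()`):

```python
# Compare what the scheduler reports with what the optimizer actually uses
print(
    f"epoch={epoch} "
    f"scheduler.last_epoch={self.scheduler.last_epoch} "
    f"scheduler_lr={self.scheduler.get_last_lr()[0]:.8f} "
    f"optimizer_lr={self.optimizer.param_groups[0]['lr']:.8f}"
)
```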