I have this function for running one training epoch:
```python
def _run_epoch(self, epoch):
    # Tell the DistributedSampler which epoch this is so it reshuffles each epoch
    self.train_data.sampler.set_epoch(epoch)
    self.model.train()
    assert self.model.training, "Model is in eval mode while training"

    for idx, (_, img, fidt_map, point_map) in enumerate(self.train_data):
        img = img.to(self.gpu_id)
        fidt_map = fidt_map.type(torch.FloatTensor).unsqueeze(1).to(self.gpu_id)
        point_map = point_map.type(torch.FloatTensor).unsqueeze(1).to(self.gpu_id)

        output = self.model(img)
        loss = self.binary_fl(output, point_map)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
```
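For context, the train loader is built around a `DistributedSampler`, which is why `set_epoch` is called above. It looks roughly like this (a simplified sketch; the dataset class and the `args` keys here are placeholders, not my exact code):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Sketch of the assumed setup: each DDP rank gets its own shard of the dataset.
train_dataset = MyCrowdDataset(args["train_list"])   # placeholder dataset class
train_sampler = DistributedSampler(train_dataset, shuffle=True)
train_data = DataLoader(
    train_dataset,
    batch_size=args["batch_size"],   # placeholder value
    sampler=train_sampler,           # shuffling is delegated to the sampler
    num_workers=args["workers"],     # placeholder value
    pin_memory=True,
)
```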
And this is the training loop:
```python
def train(self, max_epochs: int):
    for epoch in range(self.start_epoch, max_epochs):
        self._run_epoch(epoch=epoch)
        # One scheduler step per epoch, then log the current LR
        self.scheduler.step()
        self.writer.add_scalar(
            "Learning Rate", self.optimizer.param_groups[0]["lr"], epoch
        )
```
This is the scheduler:
```python
scheduler = CosineAnnealingLR(
    optimizer, T_max=args["epochs"], eta_min=args["eta_min"], verbose=False
)
```
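To show what I mean by "this curve", here is a minimal standalone snippet that steps the scheduler once per epoch exactly like my train loop and prints the LR (dummy model/optimizer; the base LR, `T_max`, and `eta_min` values are just assumptions for illustration):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter/optimizer purely to drive the scheduler
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for epoch in range(1000):
    optimizer.step()      # step the optimizer first to avoid the ordering warning
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```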
I set T_max equal to the number of epochs because I want this LR curve:
The curve above is what I get with DP (DataParallel) training. When I switch to DDP (DistributedDataParallel), this happens instead:
Which is really weird.
Check these values here:
The learning rate at epoch 660 goes crazy…
Does anyone know why this is happening?
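In case it helps, this is the kind of per-epoch logging I can add to capture the values (just a sketch, meant to go inside `train()` right after `self.scheduler.step()`):

```python
# Compare what the scheduler reports with what the optimizer actually uses
print(
    f"epoch={epoch} "
    f"scheduler.last_epoch={self.scheduler.last_epoch} "
    f"scheduler_lr={self.scheduler.get_last_lr()[0]:.8f} "
    f"optimizer_lr={self.optimizer.param_groups[0]['lr']:.8f}"
)
```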