I trained a ResNet18 on a step-imbalanced TinyImageNet dataset. First I used the built-in weight decay of the SGD optimizer and set lambda to 0.005; the accuracy was around 19%, which is bad. I then tried lambda=0, and this time accuracy was near 45%. So far everything looks okay. Then I tried implementing L2 regularization myself: the parameters summed into the regularization term are the conv layer parameters, not the fc layer and BN parameters. I set lambda=0.005, but I get an accuracy similar to when lambda=0.
These are the relevant parts of my code:
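For reference, the optimizer in the two runs above was set up roughly like this (a minimal sketch; the learning rate and momentum below are placeholders, not my exact values):

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=200)  # TinyImageNet has 200 classes

# Run 1: built-in L2 via the SGD weight_decay argument, lambda = 0.005 -> ~19% accuracy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=0.005)

# Run 2: weight decay disabled, lambda = 0 -> ~45% accuracy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=0.0)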
# Reset gradients
# -------------------------------------------------------------
self._optimizer.zero_grad()
# -------------------------------------------------------------
# Run forward pass to get losses and outputs.
# -------------------------------------------------------------
loss_dict, _, output_dict = self._model_and_loss(example_dict)
# -------------------------------------------------------------
# Check total_loss for NaNs
# -------------------------------------------------------------
loss = loss_dict[self._args.training_key] # xe loss - cross-entropy loss for the current batch
# -------------------------------------------------------------
# Calculate the regularization term
# -------------------------------------------------------------
reg, L2_loss = calculate_regularization(self._model_and_loss._model.state_dict(), self._prior_mean, self._prior_variance)
# ------------------------------------------------------------------------------
# Log the xe and reg terms separately
# ------------------------------------------------------------------------------
writer.add_scalar('Loss-train/xe', loss, self._epoch)
writer.add_scalar('Loss-train/L2-Reg', L2_loss, self._epoch)
writer.add_scalar('Loss-train/regularization-term', reg, self._epoch)
# ------------------------------------------------------------------------------
# add the term to the loss function
# pdb.set_trace()
#loss += reg
loss += L2_loss
# -------------------------------------------------------------
# Update the loss dictionary - otherwise I am only plotting xe and not the total loss
# --------------------------------------------------------------
loss_dict[self._args.training_key] = loss
# -------------------------------------------------------------
# Back propagation
# -------------------------------------------------------------
loss.backward()
# -------------------------------------------------------------
# Optimizer step
# -------------------------------------------------------------
torch.nn.utils.clip_grad_norm_(self._model_and_loss._model.parameters(), 0.1, norm_type='inf')
self._optimizer.step()
And this is how I compute the L2_loss:
# calculate L2
value = layer.pow(2.0).sum()
if L2_loss is None:
    L2_loss = value
else:
    L2_loss = L2_loss + value
The code above is part of a function and at the end I return:
return reg*1/2, 0.005*L2_loss
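Putting these pieces together, the helper looks roughly like this (a sketch rather than my exact code; in particular, the name filter that keeps only conv weights and skips fc/BN is an assumption, and the computation of reg from the prior mean/variance is omitted):

def calculate_regularization(state_dict, prior_mean, prior_variance):
    L2_loss = None
    reg = 0.0  # prior-based term computed from prior_mean / prior_variance (omitted here)
    for name, layer in state_dict.items():
        # keep only conv layer weights; skip the fc layer and BN parameters (assumed filter)
        if 'conv' not in name:
            continue
        # calculate L2
        value = layer.pow(2.0).sum()
        if L2_loss is None:
            L2_loss = value
        else:
            L2_loss = L2_loss + value
    return reg * 1/2, 0.005 * L2_loss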
I use model.state_dict() rather than model.parameters() to access the values per layer.
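To make that access pattern concrete, this is the difference between the two calls on a fresh torchvision ResNet18 (just an illustration, not my training code):

import torchvision

model = torchvision.models.resnet18()

# state_dict() returns detached tensors (requires_grad is False by default)
w_sd = model.state_dict()['conv1.weight']
print(w_sd.requires_grad)   # False

# parameters() / named_parameters() yield the live nn.Parameter objects
w_p = dict(model.named_parameters())['conv1.weight']
print(w_p.requires_grad)    # True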
I think my L2_loss is not taking any effect. I would appreciate it if you could tell me what might be wrong here.