Using the optimizer's weight_decay vs. implementing L2 from scratch

I trained a ResNet18 on a step-imbalanced TinyImageNet dataset. First I used the built-in weight decay of the SGD optimizer with lambda set to 0.005; the accuracy was around 19%, which is bad. I then tried lambda=0 and this time the accuracy was near 45%. So far everything looks okay. Then I tried implementing L2 myself. The parameters summed into the regularization term are the conv-layer parameters only, not the fc-layer or BN parameters. I set lambda=0.005, but I get an accuracy similar to the lambda=0 run.
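
For reference, the three runs were configured roughly like this (a minimal sketch, not my actual script; the learning rate, momentum, and model construction are placeholders):

    import torch
    import torchvision

    model = torchvision.models.resnet18(num_classes=200)  # TinyImageNet has 200 classes

    # Run 1: built-in weight decay on all parameters (accuracy ~19%)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.005)

    # Run 2: no weight decay at all (accuracy ~45%)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0)

    # Run 3: weight_decay=0 in the optimizer; an L2 term over the conv weights
    # is added to the loss manually inside the training step (code below)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=0.0)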

These are the relevant parts of my code:

        # -------------------------------------------------------------
        # Reset gradients
        # -------------------------------------------------------------
        self._optimizer.zero_grad()

        # -------------------------------------------------------------
        # Run forward pass to get losses and outputs.
        # -------------------------------------------------------------
        loss_dict, _, output_dict = self._model_and_loss(example_dict)

        # -------------------------------------------------------------
        # Check total_loss for NaNs
        # -------------------------------------------------------------
        loss = loss_dict[self._args.training_key]  # xe loss - cross-entropy loss for the current batch

        # -------------------------------------------------------------
        # Calculate the regularization term
        # -------------------------------------------------------------
        reg, L2_loss = calculate_regularization(self._model_and_loss._model.state_dict(),
                                                self._prior_mean, self._prior_variance)

        # -------------------------------------------------------------
        # Log the cross-entropy and regularization terms separately
        # -------------------------------------------------------------
        writer.add_scalar('Loss-train/xe', loss, self._epoch)
        writer.add_scalar('Loss-train/L2-Reg', L2_loss, self._epoch)
        writer.add_scalar('Loss-train/regularization-term', reg, self._epoch)

        # -------------------------------------------------------------
        # Add the regularization term to the loss
        # -------------------------------------------------------------
        # loss += reg
        loss += L2_loss

        # -------------------------------------------------------------
        # Update the loss dictionary - at the moment I only plot xe, not the total loss
        # -------------------------------------------------------------
        loss_dict[self._args.training_key] = loss

        # -------------------------------------------------------------
        # Backpropagation
        # -------------------------------------------------------------
        loss.backward()

        # -------------------------------------------------------------
        # Optimizer step (with inf-norm gradient clipping)
        # -------------------------------------------------------------
        torch.nn.utils.clip_grad_norm_(self._model_and_loss._model.parameters(), 0.1, norm_type='inf')
        self._optimizer.step()

And this is how I compute the L2_loss:

        # calculate L2
        value = layer.pow(2.0).sum()

        if L2_loss is None:
            L2_loss = value
        else:
            L2_loss = L2_loss + value

The code above is part of a function, and at the end I return:

    return reg*1/2, 0.005*L2_loss

I use model.state_dict() rather than model.parameters() to access the values per layer.
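
For completeness, here is a simplified sketch of how calculate_regularization is structured (the key-name filtering shown here is illustrative; my actual conv/fc/BN selection and the prior-based reg computation are a bit more involved):

    def calculate_regularization(state_dict, prior_mean, prior_variance):
        """Return (prior-based reg term, plain L2 term) computed over the conv layers."""
        reg = 0.0      # prior-based term, built from prior_mean / prior_variance
                       # in the real code; elided in this sketch
        L2_loss = None

        for name, layer in state_dict.items():
            # keep only conv weights; skip fc, BatchNorm, and running-stat entries
            if 'fc' in name or 'bn' in name or 'running' in name or 'num_batches' in name:
                continue

            # calculate L2 for this layer and accumulate
            value = layer.pow(2.0).sum()

            if L2_loss is None:
                L2_loss = value
            else:
                L2_loss = L2_loss + value

        return reg*1/2, 0.005*L2_loss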

I think my L2_loss is not taking any effect. I would appreciate it if you could tell me what could be wrong here.