How to properly implement an autograd.Function in PyTorch?

Shisho_Sama · September 4, 2019, 4:47am

Hello everyone, I hope you are having a great time.
I recently wanted to create a simple autoencoder, and for that I used this thread, where @smth provided an example of how to create an autograd Function for the aforementioned autoencoder.
The code he wrote is this:

import torch
from torch.autograd import Function

class L1Penalty(Function):

    @staticmethod
    def forward(ctx, input, l1weight):
        ctx.save_for_backward(input)
        ctx.l1weight = l1weight
        return input

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_variables
        grad_input = input.clone().sign().mul(self.l1weight)
        grad_input += grad_output
        return grad_input

However, this code fails on newer versions of PyTorch (e.g. 1.1.0) with an error indicating that the backward method needs to return as many values as the forward method received as inputs.

I asked but got no answer, and I couldn't find a way to specify that I don't need a gradient for the second argument. I tried to set ctx.needs_input_grad, but that's read-only.
What should I do here? Should I simply return None or 0 for the arguments that I’m not interested in?

By the way, what exactly does this part do?

 grad_input = input.clone().sign().mul(self.l1weight)

Can anyone please clarify this as well? Thanks a lot.

Shisho_Sama · September 4, 2019, 6:57am

Based on the discussions here, I found out that I should return None for any inputs that I don't want gradients for, so that part is solved.
However, I would appreciate it if anyone could tell me what that last snippet does.
It would be greatly appreciated.
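
For reference, here is a minimal sketch of how the Function from the first post could look with that fix applied. It assumes a reasonably recent PyTorch: backward returns one value per forward input (None for l1weight), ctx.saved_tensors replaces the deprecated ctx.saved_variables, and ctx.l1weight replaces self.l1weight, since static methods have no self. The sample tensor and the 0.1 weight are made up for illustration.

```python
import torch
from torch.autograd import Function


class L1Penalty(Function):
    """Identity in forward; adds l1weight * sign(input) to the gradient in backward."""

    @staticmethod
    def forward(ctx, input, l1weight):
        ctx.save_for_backward(input)
        ctx.l1weight = l1weight  # non-tensor argument, stored directly on ctx
        return input

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        grad_input = input.sign().mul(ctx.l1weight) + grad_output
        # One return value per forward() input: None means
        # "no gradient for l1weight".
        return grad_input, None


# Illustrative values only (not from the thread).
x = torch.tensor([1.5, -2.0, 0.0, 3.0], requires_grad=True)
y = L1Penalty.apply(x, 0.1)
y.sum().backward()
print(x.grad)  # incoming gradient of ones plus 0.1 * sign(x)
```

Returning None rather than 0 for l1weight tells autograd that no gradient exists for that argument, instead of supplying a numeric one.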

smth · September 4, 2019, 7:34pm

The last snippet applies the L1 penalty: it adds a constant factor (l1weight) times the sign of the input to the gradient, which is the gradient an L1 term on the input would contribute.
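
A small check of that reading, as a sketch: if the penalty is written out explicitly as l1weight * sum(|x|), autograd produces the same l1weight * sign(x) term that the snippet computes by hand. The tensor values and the 0.1 weight are made up for illustration.

```python
import torch

l1weight = 0.1  # hypothetical penalty weight
x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

# Explicit L1 penalty on x, differentiated by autograd.
penalty = l1weight * x.abs().sum()
penalty.backward()

# The hand-written term from the snippet: sign(x) * l1weight.
manual = x.detach().clone().sign().mul(l1weight)

print(torch.allclose(x.grad, manual))
```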

Shisho_Sama · September 4, 2019, 7:36pm

Thanks a lot, I really appreciate it.

Shisho_Sama · September 5, 2019, 6:43am

May I ask another question, if you don't mind?
I noticed there are different ways of imposing the sparsity constraint. Another way to achieve sparsity is to take the mean of the absolute values of a layer's output, treat that as a sparsity loss, add it to e.g. the reconstruction loss, and do the backward pass.
That is, simply do this in the forward pass:

    def forward(self, input):
        input = input.view(input.size(0), -1)
        sparsity_loss = 0.0
        output = self.encoder(input)
        sparsity_loss += torch.mean(abs(output))
        if self.use_l1_penalty:
            output = L1Penalty.apply(output, self.l1_weight)
        output = self.decoder(output)
        sparsity_loss += torch.mean(abs(output))
        output = output.view(input.size(0), 1, 28, 28)
        return output, sparsity_loss

and in the training loop:

for e in range(epochs):
    for imgs,_ in dataloader_train:
        imgs = imgs.to(device)
        output, sparsity_loss = sae_model(imgs)
        loss = criterion(output, imgs)
        
        loss_f = loss + (sparsity_ratio * sparsity_loss)
        optimizer.zero_grad()
        loss_f.backward()
        optimizer.step()
    print(f'epoch: {e}/{epochs} loss_f: {loss_f.item():.6f} loss: { loss.item():.6f}'\
          f' sparsity loss: {sparsity_loss.item():.6f} lr = {scheduler.get_lr()}')
    scheduler.step()

However, I noticed that the outcomes of these two methods are not exactly the same. Why is that?
These are the weights obtained using the first method (L1Penalty):

[Images: sample reconstructions at each epoch; encoder_weights_trained; weight_diff (initial weights - trained weights); decoder_weights_trained]

And using the sparsity loss from activations:

[Images: encoder_weights_trained; weight_diff (initial weights - trained weights); decoder_weights_trained]

Both runs use exactly the same optimization hyperparameters: SGD with lr=0.98 for 20 epochs.
Shouldn’t they have the same effect on the weights? Or is it because of the mean() over the activations that the gradients are spread more evenly, so their effect is much milder than when we directly add a term to the gradients?

Thank you very much in advance
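
One factor that can be checked directly is the scaling introduced by mean(). The sketch below uses made-up activations and a made-up coefficient: the gradient of coeff * mean(|a|) is coeff * sign(a) / N, where N is the number of elements, whereas the L1Penalty approach adds coeff * sign(a) with no division. With the same coefficient, the loss-based term is therefore N times smaller per element. Note also that the forward pass above adds the mean absolute value of the decoder output to sparsity_loss, which the L1Penalty version does not penalize, so the two setups are not equivalent even apart from the scaling.

```python
import torch

coeff = 0.1  # hypothetical value, used for both approaches
acts = torch.tensor([[0.5, -1.0, 2.0, -0.25]], requires_grad=True)  # made-up activations

# Loss-based approach: gradient of coeff * mean(|acts|) via autograd.
(coeff * torch.mean(torch.abs(acts))).backward()
grad_from_mean = acts.grad.clone()

# Direct approach: the term L1Penalty adds to the gradient.
grad_direct = coeff * acts.detach().sign()

# The two differ by a factor equal to the number of elements.
print(grad_from_mean)
print(grad_direct)
print(acts.numel())
```

Under these assumptions, matching the two per-element gradients would require scaling the coefficient by the number of activation elements, or using sum() instead of mean().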