Gradient not passing through sparsity regularization

I would like to add a sparsity regularization to the encodings of my VAE. To do this, I build an activation tensor that is 1 at every index where the encoding is nonzero, then take its mean over the batch dimension to get the empirical activation probability of each unit. Finally, I take the pointwise Kullback-Leibler divergence between that and the desired sparsity probability.

However, the output of the function has no grad_fn, so I assume the gradient is not propagated back through it. Here is a minimal working example with the function I use as the regularizer.

import torch
import torch.nn.functional as F

def sparsity_regularizer(enc, sparsity=.05):
    activations = torch.zeros(size=enc.size())
    # mark every nonzero entry of the encoding as active (1)
    activations[torch.nonzero(enc, as_tuple=True)] = 1
    # take mean along batch dimension
    mean = torch.mean(activations, dim=0)
    reg = -F.kl_div(mean.log(), sparsity*torch.ones(size=mean.size()), reduction="sum")
    return reg

enc = torch.tensor([0.0, 2.0], requires_grad=True)

out = sparsity_regularizer(enc)
print(out.grad_fn)  # None: out is not connected to enc in the autograd graph


What do I need to change?

The zero “norm”, which counts the number of zero (or nonzero) entries, is not differentiable. It is piecewise constant with a jump at zero, so its gradient is zero wherever it exists. In your code this shows up directly: the indexed assignment writes the constant 1 into a fresh tensor, so nothing connects the result back to enc. So even if you get the PyTorch code working, it likely won’t behave as you expect. Maybe try the l1-norm instead, which is related to sparsity but is differentiable (almost everywhere).
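
As a rough sketch of that suggestion (not tested against your VAE), an l1 penalty could look like the code below. The second function is only an assumption on my side: it keeps your KL idea but swaps the hard 0/1 indicator for a smooth tanh-based stand-in. The names l1_sparsity_penalty and soft_kl_sparsity_penalty and the temperature and eps parameters are made up for illustration.

```python
import torch


def l1_sparsity_penalty(enc: torch.Tensor) -> torch.Tensor:
    # Mean absolute value of the encoding. Differentiable everywhere except
    # at exactly 0, where PyTorch returns a gradient of 0.
    return enc.abs().mean()


def soft_kl_sparsity_penalty(enc, sparsity=0.05, temperature=1.0, eps=1e-6):
    # Hypothetical smooth replacement for the nonzero indicator:
    # 0 at enc == 0 and approaching 1 as |enc| grows.
    soft_active = torch.tanh(enc.abs() / temperature)
    # Mean over the batch dimension, clamped so the logs stay finite.
    rho_hat = soft_active.mean(dim=0).clamp(eps, 1 - eps)
    rho = torch.full_like(rho_hat, sparsity)
    # Bernoulli KL divergence KL(rho || rho_hat), summed over units.
    kl = rho * (rho / rho_hat).log() + (1 - rho) * ((1 - rho) / (1 - rho_hat)).log()
    return kl.sum()


# Toy batch of 2 encodings with 2 units each.
enc = torch.tensor([[0.0, 2.0], [-1.0, 0.5]], requires_grad=True)

l1 = l1_sparsity_penalty(enc)
soft_kl = soft_kl_sparsity_penalty(enc)

# Both penalties are attached to the autograd graph, unlike the hard count.
(l1_grad,) = torch.autograd.grad(l1, enc)
(kl_grad,) = torch.autograd.grad(soft_kl, enc)

print(l1.grad_fn is not None, soft_kl.grad_fn is not None)
print(l1_grad)
print(kl_grad)
```

In both cases the gradient at an entry that is exactly zero is zero, so these penalties shrink active units rather than reviving dead ones. How well the soft version approximates the hard count depends on the temperature, which you would have to tune.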