Differentiating w.r.t. empirical probability distributions

I’m wondering how I could pass functions of the empirical probability distributions of variables into the loss function.

For example, imagine a hypothetical scenario where I have a linear layer followed by a ReLU. On its output, I define a binary random variable which equals zero if the output is zero and equals one if the output is non-zero. I want to pass, say, the entropy of this binary random variable into my loss function.
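Roughly, the quantity I have in mind is something like this (just a sketch of what I mean; the layer sizes are placeholders, and it is of course not differentiable as written because of the comparison against zero):

import torch

layer = torch.nn.Linear(10, 20)        # placeholder sizes
x = torch.randn(64, 10)
out = torch.relu(layer(x))

# binary variable: 0 where the ReLU output is zero, 1 where it is non-zero
b = (out > 0).float()
p = b.mean().clamp(1e-6, 1 - 1e-6)     # empirical P(non-zero)

# entropy of that binary variable -- this is what I'd like to feed into the loss,
# but the (out > 0) comparison carries no gradient
entropy = -(p * torch.log(p) + (1 - p) * torch.log(1 - p))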

Any thoughts on this? Does this have to do with the distributions package?


Interesting stuff. I have no solution, but here are some things to play around with that I think at least give some components of a solution: how to compute an ECDF, and how to autograd through the un-autogradable :smiley:

import matplotlib.pyplot as plt
import numpy as np
import torch

model = torch.nn.Sequential(torch.nn.Linear(1, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 1))

losses = []
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
for i in range(10000):
    def closure():
        x = torch.randn(500, 1)
        y = model(x)

        # sort() returns the sorted values and the permutation that sorts them;
        # only the permutations (argsorts) are used below
        x_sorted, x_argsort = x.sort(0)
        y_sorted, y_argsort = y.sort(0)

        # normalize the sort permutations to [0, 1]; the loss below penalizes y
        # whenever its ordering disagrees with the ordering of x
        ecdf_actual = x_argsort.float() / x_argsort.max().float()
        ecdf_pred = y_argsort.float() / y_argsort.max().float()

        # argsort is not differentiable, so pass the gradient of y straight
        # through the (detached) ECDF estimate
        ecdf_pred = (ecdf_pred - y).detach() + y

        loss = ((ecdf_actual - ecdf_pred) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        losses.append(loss.item())
        return loss
    optimizer.step(closure)

plt.semilogy(losses)
# running mean of the loss
losses = np.array(losses).flatten().cumsum()
plt.semilogy(losses / np.ones_like(losses).cumsum())

It actually seems to learn to map x such that the mapping's ECDF corresponds to the ECDF of x (granted, a single linear layer would manage that), but I imagine your target is different.

This is probably not a case for Distributions, unless you think you'll estimate the ECDF with some parametric distribution.

This idea of CDF optimization with an L2 loss sounds interesting. But what if we want to impose less than a whole CDF? Say we want to control only the probability of zeros vs. non-zeros, no matter what those non-zeros are. Then I assume we can put a parametric binary random variable on top. Any idea how Distributions could be used here?

Maximizing the entropy would force the binary random variable towards probability 0.5; since P(ReLU(x) > 0) = P(x > 0), that amounts to centering the median of the ReLU's input distribution around 0.

Some very random ideas:

  1. You essentially want to penalize whenever the median of x deviates from 0. Maybe there's an easier way to do this that propagates gradients in the correct direction directly towards whatever goes into the ReLU: a sigmoid on the inputs with a very high temperature H, i.e. if z = ReLU(x), then p = Sigmoid(H*x) tends steeply towards 0 or 1. Adding the binary entropy of p to the loss can be done manually, and I imagine H controls how steeply deviations from 0.5 are penalized.

If you want to use Distributions to encapsulate this, initialize dist = Bernoulli(logits=H*x) in each forward pass and add its entropy (negated, since you want to maximize it) to the loss; see the sketch after this list.

  2. A second idea is to use the straight-through Gumbel trick (similar to what I did with the ECDF, but with added stochastic noise). Someone else should answer exactly how this is done properly; I imagine the expected value of that solution is closely related to solution 1.

  3. A third idea could be to add Gumbel or Normal noise to avoid saturated ReLUs; I'm sure there's lots of research on this.
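To make idea 1 a bit more concrete, here's a rough sketch of what that entropy term could look like, both manually and via Distributions. The layer sizes, the temperature H, the weight lam, and the stand-in task loss are all made up for illustration:

import torch
from torch.distributions import Bernoulli

H = 50.0            # temperature (made up); higher -> sharper indicator around 0
lam = 0.1           # weight of the entropy term (made up)

linear = torch.nn.Linear(10, 20)
optimizer = torch.optim.RMSprop(linear.parameters(), lr=1e-4)

x = torch.randn(64, 10)
pre = linear(x)                  # what goes into the ReLU
z = torch.relu(pre)

# manual version: sharp sigmoid as a soft zero/non-zero indicator
p = torch.sigmoid(H * pre).clamp(1e-6, 1 - 1e-6)
entropy_manual = -(p * p.log() + (1 - p) * (1 - p).log()).mean()

# Distributions version: the same thing via Bernoulli(logits=H * pre)
dist = Bernoulli(logits=H * pre)
entropy_dist = dist.entropy().mean()

# maximize the entropy by subtracting it from whatever the task loss is
task_loss = z.mean()             # stand-in for the real loss
loss = task_loss - lam * entropy_dist
optimizer.zero_grad()
loss.backward()
optimizer.step()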