# Differentiating w.r.t. empirical probability distributions

I’m wondering how I could pass functions of the empirical probability distributions of variables into the loss function.

For example, imagine a hypothetical scenario where I have a linear layer followed by a ReLU. On its output, I define a binary random variable that equals zero if the output is zero and one if the output is non-zero. I want to pass, say, the entropy of this binary random variable into my loss function.

Any thoughts on this? Does this have to do with the `distributions` package?
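To make the quantity concrete, here is a minimal sketch of the hard empirical estimate over one batch. The layer sizes, the batch size and the helper name `binary_entropy` are made up for illustration. Because the indicator comes from a comparison, the result carries no gradient, which is the core of the problem.

```python
import torch

torch.manual_seed(0)

def binary_entropy(p, eps=1e-6):
    # Entropy (in nats) of a Bernoulli variable with success probability p.
    p = p.clamp(eps, 1 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log())

# Hypothetical sizes, chosen only for illustration.
layer = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())
x = torch.randn(256, 4)
out = layer(x)

# Binary variable: 1 where the ReLU output is non-zero, 0 otherwise.
active = (out != 0).float()

# Empirical probability of "non-zero" per unit, estimated over the batch.
p_active = active.mean(dim=0)
entropy = binary_entropy(p_active)

# The comparison cuts the autograd graph, so nothing can be backpropagated.
print(entropy.requires_grad)  # False
```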

Interesting stuff. I have no solution, but here are some things to play around with that I think give at least some components of one: how to calculate the ECDF, and how to autograd through something non-differentiable.

```python
import matplotlib.pyplot as plt
import numpy as np
import torch

model = torch.nn.Sequential(torch.nn.Linear(1, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 1))

losses = []
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
for i in range(10000):
    def closure():
        optimizer.zero_grad()  # clear gradients left over from the previous step
        x = torch.randn(500, 1)
        y = model(x)

        # Rank of each sample = argsort of the argsort.
        # (The indices returned by sort() alone are not the ranks.)
        x_rank = x.argsort(0).argsort(0)
        y_rank = y.argsort(0).argsort(0)

        # ECDF value of each sample: rank scaled to [0, 1].
        ecdf_actual = x_rank.float() / x_rank.max().float()
        ecdf_pred = y_rank.float() / y_rank.max().float()

        # Straight-through trick: the forward value is the ECDF,
        # but the gradient flows as if ecdf_pred were y itself.
        ecdf_pred = (ecdf_pred - y).detach() + y

        loss = ((ecdf_actual - ecdf_pred) ** 2).mean()
        loss.backward()
        losses.append(loss.item())
        return loss
    optimizer.step(closure)

plt.semilogy(losses)
losses = np.array(losses).cumsum()
plt.semilogy(losses / np.arange(1, len(losses) + 1))  # running mean of the loss
```

It actually seems to learn to map x such that the ECDF of the mapping matches the ECDF of x (granted, a single linear layer would do that too), but I imagine your target value is different.
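The reassignment of `ecdf_pred` just before the loss in the code above is a straight-through trick. Here is a minimal sketch of it in isolation, using a hard 0/1 indicator as the non-differentiable function (the example values are made up):

```python
import torch

y = torch.tensor([0.3, -1.2, 2.0], requires_grad=True)

# Non-differentiable function of y: a hard 0/1 indicator.
hard = (y > 0).float()

# Straight-through: the forward value equals `hard` (up to float rounding),
# but the detached difference is a constant for autograd, so the gradient
# of `out` with respect to y is the identity.
out = (hard - y).detach() + y

out.sum().backward()
print(out)
print(y.grad)  # tensor([1., 1., 1.])
```

The gradient is of course a deliberate fiction: it only tells the optimizer which way `y` would have to move, not how the indicator actually changes.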

This is probably not a case for `Distributions` unless you plan to approximate the ECDF with some parametric distribution.

This idea of CDF optimization with an L2 constraint sounds interesting. But what if we want to constrain less than the whole CDF? Say we only want to control the probability of zeros vs. non-zeros, no matter what those non-zeros are. Then I assume we can put a parametric binary random variable on top. Any idea how `Distributions` can be used here?

Maximizing the entropy would push the probability that the binary random variable equals one towards 0.5, i.e. it would center the median of the ReLU’s input distribution at 0.
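To see why, write $p$ for the probability that the binary variable equals one (the symbols $p$ and $\mathcal{H}$ are introduced here only for illustration). The binary entropy and its derivative are:

$$
\mathcal{H}(p) = -p \log p - (1-p)\log(1-p), \qquad \frac{d\mathcal{H}}{dp} = \log\frac{1-p}{p}
$$

The derivative vanishes only at $p = 1/2$, and $\mathcal{H}$ is concave, so that is the maximum. At $p = 1/2$, half of the ReLU inputs are below zero, which is the same as saying their median is 0.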

Some very random ideas:

1. You essentially want to penalize whenever the median of `x` deviates from 0. There may be an easier way to do this that propagates gradients in the correct direction directly to whatever goes into the ReLU: a steep Sigmoid on the inputs, with a large scale factor H (an inverse temperature). That is, if `z = ReLU(x)`, then `p = Sigmoid(H*x)` tends steeply towards 0 or 1. The binary entropy of `p` could be added to the loss manually, and I imagine H controls how steeply deviations from 0.5 are penalized.

If you want to use `Distributions` to encapsulate this, initialize `dist = Bernoulli(logits=H*x)` in each forward pass and add `dist.entropy()`, reduced to a scalar, to the loss function (with a negative sign if the entropy should be maximized).

2. A second idea is to use the straight-through Gumbel trick (similar to what I did with the ECDF, but with added stochastic noise). Someone else should answer exactly how this is done properly. I imagine the expected value of its solutions is closely related to solution 1.

3. A third idea could be to add `Gumbel` or `Normal` noise to avoid saturated ReLUs. I’m sure there’s lots of research on this.
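The first idea can be sketched roughly as follows. The layer sizes, batch size and the value of `H` are placeholders, and variant B is my own reading of "entropy of the empirical distribution", not something confirmed above: it takes the entropy of the batch-averaged soft indicator instead of averaging per-sample entropies.

```python
import torch
from torch.distributions import Bernoulli

torch.manual_seed(0)

H = 10.0  # steepness of the sigmoid; hypothetical value, tune for your data

linear = torch.nn.Linear(4, 8)  # hypothetical layer sizes
inputs = torch.randn(256, 4)

x = linear(inputs)        # what goes into the ReLU
z = torch.relu(x)         # the actual layer output
p = torch.sigmoid(H * x)  # soft stand-in for the indicator (z != 0)

# Variant A: mean per-sample entropy via the distributions package.
per_sample_entropy = Bernoulli(logits=H * x).entropy().mean()

# Variant B (assumption about the goal): entropy of the batch-level
# probability of being non-zero, one value per unit, then averaged.
p_batch = p.mean(dim=0).clamp(1e-6, 1 - 1e-6)
batch_entropy = -(p_batch * p_batch.log()
                  + (1 - p_batch) * (1 - p_batch).log()).mean()

# Minimizing the negative entropy maximizes the entropy.
loss = -batch_entropy
loss.backward()

# Unlike the hard indicator, this surrogate sends gradients to the layer.
print(linear.weight.grad.abs().sum() > 0)
```

The two variants pull in different directions: A is largest when every individual input to the ReLU sits near zero, while B only asks that about half of the batch lands on each side of zero, which is closer to the median argument above.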