I’m wondering how I could pass functions of the empirical probability distributions of variables into the loss function.

For example, imagine a hypothetical scenario where I have a linear layer followed by a ReLU. On its output, I define a binary random variable that equals zero if the output is zero and one if the output is non-zero. I want to pass, say, the entropy of this binary random variable into my loss function.
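To make the quantity concrete, here is a minimal sketch of the hard (non-differentiable) version. The layer sizes, the random batch, and the choice to pool the indicator over all units are made up for illustration.

```python
import torch

torch.manual_seed(0)

layer = torch.nn.Linear(4, 8)          # hypothetical layer sizes
x = torch.randn(256, 4)                # hypothetical input batch
out = torch.relu(layer(x))

# Binary variable: 1 where the ReLU output is non-zero, 0 where it is zero.
b = (out != 0).float()

# Empirical probability of a non-zero output, pooled over batch and units.
p = b.mean()

# Binary entropy in nats; the clamp avoids log(0) when p is exactly 0 or 1.
eps = 1e-6
p_c = p.clamp(eps, 1 - eps)
entropy = -(p_c * p_c.log() + (1 - p_c) * (1 - p_c).log())

print(float(p), float(entropy))
print(entropy.requires_grad)  # the comparison cuts the autograd graph
```

The comparison `out != 0` produces a tensor that autograd does not track, so this value can be logged but cannot be trained on directly.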

Any thoughts on this? Does this have to do with the distributions package?

Interesting stuff. I have no solution, but here are some things to play around with that I think give at least some components of one: how to calculate the ECDF, and how to autograd through the un-autogradable.

import numpy as np
import matplotlib.pyplot as plt
import torch

model = torch.nn.Sequential(torch.nn.Linear(1, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 10),
                            torch.nn.Tanh(),
                            torch.nn.Linear(10, 1))
losses = []
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
for i in range(10000):
    def closure():
        x = torch.randn(500, 1)
        y = model(x)
        x_sorted, x_argsort = x.sort(0)
        y_sorted, y_argsort = y.sort(0)
        # Sort indices scaled to [0, 1] stand in for the ECDF here. Note that
        # these are positions in the batch; true ranks would be argsort of argsort.
        ecdf_actual = x_argsort.float() / x_argsort.max().float()
        ecdf_pred = y_argsort.float() / y_argsort.max().float()
        # Straight-through trick: forward value is ecdf_pred, gradient flows via y.
        ecdf_pred = (ecdf_pred - y).detach() + y
        loss = ((ecdf_actual - ecdf_pred) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        losses.append(loss.item())
        return loss
    optimizer.step(closure)
plt.semilogy(losses)
# Running mean of the loss on the same axes.
losses = np.array(losses).flatten().cumsum()
plt.semilogy(losses / np.ones_like(losses).cumsum())

It actually seems to learn to map x such that the ECDF of the mapping matches the ECDF of x (granted, a single linear layer would do that too), but I imagine your target is different.
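For reference, a minimal sketch of one way to get ECDF values at each sample, assuming the rank of each sample is what is wanted. `ecdf_values` is a hypothetical helper name. The indices returned by `sort` are positions in the original tensor, so the rank is recovered by applying `argsort` twice.

```python
import torch

def ecdf_values(v):
    """Return the empirical CDF evaluated at each element of a 1-D tensor."""
    # argsort of argsort gives each element's 0-based rank (ties broken arbitrarily).
    ranks = v.argsort().argsort()
    return (ranks.float() + 1) / v.numel()

v = torch.tensor([2.0, 0.3, -1.2, 0.7])
print(v.argsort())      # sort indices: tensor([2, 1, 3, 0])
print(ecdf_values(v))   # tensor([1.0000, 0.5000, 0.2500, 0.7500])
```

Like the sort indices, these values are piecewise constant in `v`, so they still need something like the straight-through substitution to carry a gradient.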

This is probably not a case for Distributions unless you plan to approximate the ECDF with some parametric distribution.

This idea of CDF optimization with an L2 constraint sounds interesting. But what if we want to impose less than a whole CDF? Say we want to control only the probability of zeros vs. non-zeros, no matter what those non-zeros are. Then I assume we can consider a parametric binary random variable on top. Any idea how Distributions can be used here?

Maximizing the entropy would push the probability of the binary random variable towards 0.5, i.e. center the median of the ReLU input distribution at 0.

Some very random ideas:

You essentially want to penalize the median of x deviating from 0. Maybe there is an easier way to do this, one that propagates gradients of the correct sign directly to whatever goes into the ReLU: a sigmoid on the inputs with a very large scale factor H (i.e. a very low temperature). If z = ReLU(x), then p = Sigmoid(H*x) tends steeply towards 0 or 1. Adding the binary entropy of p to the loss could be done manually, and I imagine H controls how steeply deviations from 0.5 are penalized.
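A rough sketch of how this could look, under a few assumptions of mine: p is averaged over the batch before the entropy is taken (so it estimates the fraction of non-zero outputs), the entropy is subtracted from the loss so that it is maximized, and `soft_nonzero_entropy`, `weight`, the layer sizes, and the placeholder task loss are all hypothetical.

```python
import torch

torch.manual_seed(0)

def soft_nonzero_entropy(pre_act, H=10.0, eps=1e-6):
    """Differentiable surrogate for the entropy of 1[ReLU(pre_act) != 0]."""
    # A steep sigmoid approximates the indicator pre_act > 0.
    p = torch.sigmoid(H * pre_act)
    # Soft estimate of P(output is non-zero), pooled over the whole batch.
    p_mean = p.mean().clamp(eps, 1 - eps)
    return -(p_mean * p_mean.log() + (1 - p_mean) * (1 - p_mean).log())

layer = torch.nn.Linear(4, 8)      # hypothetical sizes
x = torch.randn(256, 4)
pre_act = layer(x)
z = torch.relu(pre_act)

task_loss = z.pow(2).mean()        # placeholder for the real objective
weight = 0.1                       # hypothetical penalty weight
# Subtracting the entropy rewards a 50/50 split of zeros and non-zeros.
loss = task_loss - weight * soft_nonzero_entropy(pre_act)
loss.backward()
print(layer.weight.grad.abs().sum() > 0)
```

Taking the entropy of each element of p instead of the batch mean is a different objective: it is largest when every individual input sits at 0, not when half of them are positive.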

If you want to use Distributions to encapsulate this, initialize dist = Bernoulli(logits=H*x) in each forward pass and add dist.entropy() to the loss function.
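A minimal sketch of the Distributions variant, with made-up layer sizes and H. `entropy()` returns one value per element, so it has to be reduced to a scalar before it goes into the loss. The batch-level variant at the end is my own addition, for the case where the entropy of the overall zero/non-zero split is the target.

```python
import torch
from torch.distributions import Bernoulli

torch.manual_seed(0)

H = 10.0                               # hypothetical steepness
layer = torch.nn.Linear(4, 8)          # hypothetical sizes
pre_act = layer(torch.randn(256, 4))

# One Bernoulli per pre-activation; entropy() returns a tensor of the same shape.
dist = Bernoulli(logits=H * pre_act)
per_element_entropy = dist.entropy()
entropy_term = per_element_entropy.mean()   # reduce to a scalar for the loss

# Entropy of the batch-level probability instead of the per-element ones.
batch_dist = Bernoulli(probs=dist.probs.mean())
batch_entropy = batch_dist.entropy()

print(per_element_entropy.shape, float(entropy_term), float(batch_entropy))
```

Because binary entropy is concave, the batch-level entropy is never smaller than the mean of the per-element entropies, so the two terms can pull the network in different directions.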

A second idea is to use the straight-through Gumbel trick (similar to what I did with the ECDF, but with added stochastic noise). Someone else should answer exactly how this is done properly. I imagine the expected values of the solutions are closely related to those of the first idea.
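One possible way to try this, offered as a sketch and not as the proper answer: `torch.nn.functional.gumbel_softmax` with `hard=True` gives hard samples in the forward pass and soft gradients in the backward pass. Treating each unit as a two-class problem with logits `[0, x]`, the layer sizes, and maximizing the entropy are all assumptions here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

layer = torch.nn.Linear(4, 8)          # hypothetical sizes
pre_act = layer(torch.randn(256, 4))

# Two-class logits per unit: class 0 = "zero", class 1 = "non-zero".
logits = torch.stack([torch.zeros_like(pre_act), pre_act], dim=-1)

# hard=True returns one-hot samples in the forward pass while gradients
# follow the soft Gumbel-softmax relaxation (straight-through).
sample = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
b = sample[..., 1]                      # hard 0/1 indicator, still differentiable

eps = 1e-6
p = b.mean().clamp(eps, 1 - eps)        # empirical P(non-zero) from the samples
entropy = -(p * p.log() + (1 - p) * (1 - p).log())

(-entropy).backward()                   # e.g. maximize the entropy
print(float(p), float(entropy), layer.weight.grad is not None)
```

With logits `[0, x]`, the probability of the "non-zero" class is Sigmoid(x), which is why the result should be close in expectation to the sigmoid-based surrogate.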

A third idea could be to add Gumbel or Normal noise to avoid saturated ReLUs. I'm sure there's lots of research on this.