# Variational dropout?

Hi, what is the standard-ish way to do variational dropout in PyTorch?

(Edit: I just need something that works and that I can plug in; I don’t need to understand how it works, just how to use it.)

(Edit 2: though one or two sentences of intuition behind how it works / what it is doing would be very welcome.)


These implementations seem pretty similar and straightforward:

Regarding your second edit: I haven’t even tried understanding it, since the paper is still a part of my (always growing) reading list.


Thank you @justusschock! It looks like these are both dropout, though. The first looks like ZoneOut? And the second looks like standard dropout? Am I misreading? (I’m looking specifically for ‘variational’ dropout.)

In the first repo there is a Jupyter notebook containing several variants of dropout, including this:

```python
import torch
import torch.nn as nn
from torch.autograd import Variable


class VariationalDropout(nn.Module):
    def __init__(self, alpha=1.0, dim=None):
        super(VariationalDropout, self).__init__()

        self.dim = dim
        self.max_alpha = alpha
        # Initial alpha
        log_alpha = (torch.ones(dim) * alpha).log()
        self.log_alpha = nn.Parameter(log_alpha)

    def kl(self):
        # Cubic-polynomial approximation of the negative KL divergence
        # against the log-uniform prior (Kingma et al., 2015)
        c1 = 1.16145124
        c2 = -1.50204118
        c3 = 0.58629921

        alpha = self.log_alpha.exp()

        negative_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha ** 2 + c3 * alpha ** 3

        kl = -negative_kl

        return kl.mean()

    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        if self.training:
            # N(0, 1)
            epsilon = Variable(torch.randn(x.size()))
            if x.is_cuda:
                epsilon = epsilon.cuda()

            # Clip alpha
            self.log_alpha.data = torch.clamp(self.log_alpha.data, max=self.max_alpha)
            alpha = self.log_alpha.exp()

            # N(1, alpha): unit-mean Gaussian noise with variance alpha
            epsilon = epsilon * alpha.sqrt() + 1

            return x * epsilon
        else:
            return x
```
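
If it’s mainly the plug-in usage you’re after, a minimal sketch of how that module could be dropped into a small classifier might look like this (my own example, not from the notebook; the layer sizes and the weight on the KL term are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# assumes the VariationalDropout class from the notebook above is in scope


class SmallNet(nn.Module):
    def __init__(self):
        super(SmallNet, self).__init__()
        self.fc1 = nn.Linear(784, 500)
        self.vd1 = VariationalDropout(alpha=0.5, dim=500)  # multiplicative noise on fc1's activations
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = self.vd1(h)  # noisy in train mode, identity in eval mode
        return self.fc2(h)


model = SmallNet()
x = torch.randn(32, 784)
target = torch.randint(0, 10, (32,))

logits = model(x)
# the KL term from the dropout layer is added to the task loss;
# the 1e-3 weighting is just a placeholder knob
loss = F.cross_entropy(logits, target) + 1e-3 * model.vd1.kl()
loss.backward()
```

Note that the layer is a no-op in eval mode, so you have to switch with model.train() / model.eval() as usual.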

And the second repo contains this implementation:

```python
import math

import torch as t
import torch.nn as nn
from torch.autograd import Variable
from torch.nn import Parameter


class VariationalDropout(nn.Module):
    def __init__(self, input_size, out_size, log_sigma2=-10, threshold=3):
        """
        :param input_size: An int of input size
        :param log_sigma2: Initial value of log sigma ^ 2.
               It is crucial for training since it determines the initial value of alpha
        :param threshold: Value for thresholding at validation. If log_alpha > threshold, the weight is zeroed
        :param out_size: An int of output size
        """
        super(VariationalDropout, self).__init__()

        self.input_size = input_size
        self.out_size = out_size

        self.theta = Parameter(t.FloatTensor(input_size, out_size))
        self.bias = Parameter(t.Tensor(out_size))

        self.log_sigma2 = Parameter(t.FloatTensor(input_size, out_size).fill_(log_sigma2))

        self.reset_parameters()

        self.k = [0.63576, 1.87320, 1.48695]

        self.threshold = threshold

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.out_size)

        self.theta.data.uniform_(-stdv, stdv)
        self.bias.data.uniform_(-stdv, stdv)

    @staticmethod
    def clip(input, to=8):
        input = input.masked_fill(input < -to, -to)
        input = input.masked_fill(input > to, to)

        return input

    def kld(self, log_alpha):
        # Sigmoid-based approximation of the negative KL divergence
        # against the log-uniform prior (Molchanov et al., 2017)
        first_term = self.k[0] * t.sigmoid(self.k[1] + self.k[2] * log_alpha)
        second_term = 0.5 * t.log(1 + t.exp(-log_alpha))

        return -(first_term - second_term - self.k[0]).sum() / (self.input_size * self.out_size)

    def forward(self, input):
        """
        :param input: A float tensor with shape of [batch_size, input_size]
        :return: A float tensor with shape of [batch_size, out_size], plus the layer's kld estimation in training mode
        """
        log_alpha = self.clip(self.log_sigma2 - t.log(self.theta ** 2))
        kld = self.kld(log_alpha)

        if not self.training:
            # Weights whose log_alpha exceeds the threshold are dropped (zeroed),
            # and the layer reduces to a deterministic linear layer
            mask = log_alpha > self.threshold
            return t.mm(input, self.theta.masked_fill(mask, 0)) + self.bias

        # Local reparameterization trick: sample the pre-activations from
        # N(mu, std^2) instead of sampling the weights themselves
        mu = t.mm(input, self.theta)
        std = t.sqrt(t.mm(input ** 2, self.log_sigma2.exp()) + 1e-6)

        eps = Variable(t.randn(*mu.size()))
        if input.is_cuda:
            eps = eps.cuda()

        return std * eps + mu + self.bias, kld

    def max_alpha(self):
        log_alpha = self.log_sigma2 - t.log(self.theta ** 2)
        return t.max(log_alpha.exp())
```

From scrolling through the paper and skimming the equations, it looks like they are both fine, although I might also have misread something.
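
For a bit of the intuition asked for above, and as far as I can tell: alpha is the variance of the multiplicative Gaussian noise on each unit/weight, and the kl / kld methods are closed-form approximations of the KL term that gets added to the training loss (the first follows Kingma et al. 2015, the second Molchanov et al. 2017):

```latex
% First snippet (Kingma et al., 2015), up to an additive constant:
-D_{\mathrm{KL}} \approx 0.5\,\log\alpha + c_1\alpha + c_2\alpha^2 + c_3\alpha^3,
\qquad c_1 = 1.16145124,\; c_2 = -1.50204118,\; c_3 = 0.58629921

% Second snippet (Molchanov et al., 2017):
-D_{\mathrm{KL}} \approx k_1\,\sigma(k_2 + k_3\log\alpha) - 0.5\,\log\!\left(1 + \alpha^{-1}\right) - k_1,
\qquad k_1 = 0.63576,\; k_2 = 1.87320,\; k_3 = 1.48695
```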


Ah, awesome, thanks! Question: how do we use them? Like:

• where do we put them in an RNN?
• are the masks shared between e.g. different timesteps, or something like that?
• will this work with e.g. an nn.LSTM? Or do we need to use an nn.LSTMCell and plug those together, with this dropout in between?
• if so, is there an LSTM implementation that handles this for us?

To be honest:

I don’t know if this works with an LSTM, but my first guess would be no, since both of them seem to be designed for plain feed-forward networks:

```python
import torch.nn as nn
import torch.nn.functional as F

from variational_dropout.variational_dropout import VariationalDropout


class VariationalDropoutModel(nn.Module):
    def __init__(self):
        super(VariationalDropoutModel, self).__init__()

        self.fc = nn.ModuleList([
            VariationalDropout(784, 500),
            VariationalDropout(500, 50),
            nn.Linear(50, 10)
        ])

    def forward(self, input, train=False):
        """
        :param input: A float tensor with shape of [batch_size, 784]
        :param train: A boolean indicating whether forward propagation is called during training
        :return: A float tensor with shape of [batch_size, 10]
                 filled with logits of the likelihood, plus the kld estimation when training
        """

        result = input

        if train:
            total_kld = 0

            for i, layer in enumerate(self.fc):
                if i != len(self.fc) - 1:
                    # each VariationalDropout layer returns its own kld in training mode
                    # (the layers switch on self.training, so keep model.train()/eval() in sync with `train`)
                    result, kld = layer(result)
                    result = F.elu(result)
                    total_kld += kld

            return self.fc[-1](result), total_kld

        for i, layer in enumerate(self.fc):
            if i != len(self.fc) - 1:
                result = F.elu(layer(result))

        return self.fc[-1](result)

    def loss(self, **kwargs):
        if kwargs['train']:
            out, kld = self(kwargs['input'], kwargs['train'])
            return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average']), kld

        out = self(kwargs['input'], kwargs['train'])
        return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average'])
```

So I guess you would have to plug those together yourself.
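
If you do want to wire something up for an RNN yourself, here is roughly the shape it could take: a dropout mask sampled once per sequence and reused at every timestep around an nn.LSTMCell, in the spirit of Gal & Ghahramani’s variational RNN dropout. This is only my own sketch (sizes and rates are made up), not code from either repo, and it covers the input/hidden masks but not dropout on the recurrent weights:

```python
import torch
import torch.nn as nn


class VariationalLSTM(nn.Module):
    """Single-layer LSTM that reuses the same dropout masks at every timestep."""

    def __init__(self, input_size, hidden_size, dropout=0.25):
        super(VariationalLSTM, self).__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.dropout = dropout

    def forward(self, x):
        # x: [seq_len, batch, input_size]
        seq_len, batch, _ = x.size()
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)

        if self.training and self.dropout > 0:
            # one Bernoulli mask per sequence, shared across all timesteps
            keep = 1 - self.dropout
            input_mask = x.new_empty(batch, x.size(2)).bernoulli_(keep) / keep
            hidden_mask = x.new_empty(batch, self.hidden_size).bernoulli_(keep) / keep
        else:
            input_mask = hidden_mask = None

        outputs = []
        for step in range(seq_len):
            x_t = x[step]
            if input_mask is not None:
                x_t = x_t * input_mask      # same input mask at every timestep
            if hidden_mask is not None:
                h = h * hidden_mask         # same recurrent (hidden-to-hidden) mask at every timestep
            h, c = self.cell(x_t, (h, c))
            outputs.append(h)

        return torch.stack(outputs), (h, c)


# usage
lstm = VariationalLSTM(input_size=100, hidden_size=256, dropout=0.25)
out, (h, c) = lstm(torch.randn(35, 8, 100))
```

The important bit, as far as I understand it, is that the masks are sampled once per sequence and reused at every timestep, rather than resampled at each step like ordinary dropout.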

Ya, that was sort of my conclusion too. Unless there is some paper implementation somewhere that is in PyTorch and uses variational dropout?

Hello guys
Any new updates on this a year later?

Hey all, new here, perhaps I can help.

Thanks to kreitkurita of Carnegie Mellon University, who sub-classed LSTM to use variational dropout: Better LSTM with Variational Dropout

Just use this as a drop-in replacement for nn.LSTM. It is an almost faithful implementation of the original paper, https://arxiv.org/abs/1512.05287 (see the code comments for minor deviations).

Some tips :

• 0.25 is a good initial choice for dropouti, dropoutw and dropouto (input, weight, and output dropout, respectively).

• It is probably best to avoid using other dropout techniques alongside this one (embedding, batch, layer, etc.), at least at first, and possibly always. I need to look into this more.

• In their original paper, Gal and Ghahramani note that weight_decay takes on new importance. They suggest 0.001 as a default. (This is set on the Optimizer. If you’re using Adam, I suggest looking into AdamW instead.)
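
To make the “drop-in” part concrete, usage looks roughly like this (the class comes from the linked repo; the sizes and learning rate below are just placeholders of mine):

```python
import torch

# the subclassed LSTM from the linked repo; adjust the import to however you
# installed or copied it, e.g.:
# from better_lstm import LSTM

lstm = LSTM(input_size=300, hidden_size=512,
            dropouti=0.25, dropoutw=0.25, dropouto=0.25)

x = torch.randn(35, 8, 300)   # [seq_len, batch, features], same layout as nn.LSTM's default
out, (h, c) = lstm(x)         # called exactly like nn.LSTM

# weight decay matters more with this kind of dropout (see the tip above);
# 1e-3 is the suggested default, the learning rate is just a placeholder
optimizer = torch.optim.AdamW(lstm.parameters(), lr=1e-3, weight_decay=1e-3)
```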
