Variational dropout?

Hi, what is the standard-ish way to do variational dropout in PyTorch?

(Edit: I just need something that works and that I can plug in; I don’t need to understand how it works, just how to use it :slight_smile: )

(edit2: though one or two sentences of intuition about how it works / what it is doing would be very welcome :slight_smile: )

These implementations seem pretty similar and straightforward:

Number One
Number two

Regarding your second edit: I haven’t even tried understanding it, since the paper is still a part of my (always growing) reading list.

Thank you @justusschock. It looks like these are both dropout, though. The first looks like ZoneOut? And the second looks like standard dropout? Am I misreading? (I’m looking specifically for ‘variational’ dropout.)

In the first repo there is a Jupyter notebook containing several variations of dropout, including this one:

import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    def __init__(self, alpha=1.0, dim=None):
        super(VariationalDropout, self).__init__()
        self.dim = dim
        self.max_alpha = alpha
        # Initial alpha
        log_alpha = (torch.ones(dim) * alpha).log()
        self.log_alpha = nn.Parameter(log_alpha)
    def kl(self):
        c1 = 1.16145124
        c2 = -1.50204118
        c3 = 0.58629921
        alpha = self.log_alpha.exp()
        negative_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha**2 + c3 * alpha**3
        kl = -negative_kl
        return kl.mean()
    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        # self.training is the mode flag; self.train() would always be truthy
        if self.training:
            # N(0, 1), created on the same device and dtype as x
            epsilon = torch.randn_like(x)

            # Clip alpha
            self.log_alpha.data = torch.clamp(self.log_alpha.data, max=self.max_alpha)
            alpha = self.log_alpha.exp()

            # N(1, alpha): mean 1, variance alpha
            epsilon = 1.0 + epsilon * alpha.sqrt()

            return x * epsilon
        else:
            return x
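
To make the "how to use it" part concrete, here is a minimal sketch of plugging a layer like the one above into a small network. It is not the repo's code: `GaussianDropout`, `TinyNet`, and `kl_weight` are names made up for illustration, and the noise is sampled as 1 + sqrt(alpha) * N(0, 1), which is an assumption based on the "N(1, alpha)" comment. The only extra step compared with ordinary dropout is adding the layer's KL term to the task loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianDropout(nn.Module):
    """Hypothetical stand-in for the first VariationalDropout above."""

    def __init__(self, dim, alpha=1.0):
        super().__init__()
        # One learnable log(alpha) per feature.
        self.log_alpha = nn.Parameter(torch.full((dim,), float(alpha)).log())

    def kl(self):
        # Same cubic approximation and constants as the kl() method above.
        c1, c2, c3 = 1.16145124, -1.50204118, 0.58629921
        alpha = self.log_alpha.exp()
        neg_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha**2 + c3 * alpha**3
        return -neg_kl.mean()

    def forward(self, x):
        if not self.training:
            return x  # no noise at evaluation time
        alpha = self.log_alpha.exp()
        # Multiplicative noise with mean 1 and variance alpha.
        noise = 1.0 + torch.randn_like(x) * alpha.sqrt()
        return x * noise


class TinyNet(nn.Module):
    def __init__(self, in_dim=20, hidden=32, n_classes=3):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.drop = GaussianDropout(hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x):
        return self.fc2(self.drop(F.relu(self.fc1(x))))


torch.manual_seed(0)
model = TinyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 20), torch.randint(0, 3, (8,))

kl_weight = 1e-3  # assumed value; tune for your problem
model.train()
loss = F.cross_entropy(model(x), y) + kl_weight * model.drop.kl()
optimizer.zero_grad()
loss.backward()
optimizer.step()

model.eval()
with torch.no_grad():
    eval_out = model(x)  # deterministic: the dropout layer is a no-op here
```

Remember to switch between `model.train()` and `model.eval()`; the noise is only applied in training mode.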

And the second repo contains this implementation:

import math

import torch as t
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter


class VariationalDropout(nn.Module):
    def __init__(self, input_size, out_size, log_sigma2=-10, threshold=3):
        """
        :param input_size: An int of input size
        :param log_sigma2: Initial value of log sigma ^ 2.
               It is crucial for training since it determines the initial value of alpha
        :param threshold: Threshold used at evaluation time. If log_alpha > threshold, the weight is zeroed
        :param out_size: An int of output size
        """
        super(VariationalDropout, self).__init__()
        self.input_size = input_size
        self.out_size = out_size
        self.theta = Parameter(t.FloatTensor(input_size, out_size))
        self.bias = Parameter(t.Tensor(out_size))
        self.log_sigma2 = Parameter(t.FloatTensor(input_size, out_size).fill_(log_sigma2))
        self.k = [0.63576, 1.87320, 1.48695]
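        self.threshold = threshold
        # Initialise theta and bias (otherwise they hold uninitialised memory)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.out_size)
        self.theta.data.uniform_(-stdv, stdv)
        self.bias.data.uniform_(-stdv, stdv)

    @staticmethod
    def clip(input, to=8):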
        input = input.masked_fill(input < -to, -to)
        input = input.masked_fill(input > to, to)
        return input
    def kld(self, log_alpha):
        first_term = self.k[0] * t.sigmoid(self.k[1] + self.k[2] * log_alpha)
        second_term = 0.5 * t.log(1 + t.exp(-log_alpha))
        return -(first_term - second_term - self.k[0]).sum() / (self.input_size * self.out_size)
    def forward(self, input):
        """
        :param input: A float tensor with shape of [batch_size, input_size]
        :return: In training mode, a float tensor with shape of [batch_size, out_size]
                 and the layer's kld estimate; in evaluation mode, only the tensor
        """
        log_alpha = self.clip(self.log_sigma2 - t.log(self.theta ** 2))
        kld = self.kld(log_alpha)
        if not self.training:
            # Evaluation: zero out weights whose noise level exceeds the threshold
            mask = log_alpha > self.threshold
            return t.addmm(self.bias, input, self.theta.masked_fill(mask, 0))
        # Training: sample the pre-activations directly from their mean and std
        mu = t.mm(input, self.theta)
        std = t.sqrt(t.mm(input ** 2, self.log_sigma2.exp()) + 1e-6)
        eps = t.randn_like(mu)
        return std * eps + mu + self.bias, kld
    def max_alpha(self):
        # log_alpha = log(sigma^2) - log(theta^2), matching forward()
        log_alpha = self.log_sigma2 - t.log(self.theta ** 2)
        return t.max(log_alpha.exp())

From scrolling through the paper and skimming the equations, both look fine to me, although I may have misread something.
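
For reference, this is what the two KL methods compute as far as I can tell, read directly off the constants in the two snippets (per element, before the `.mean()` or size normalisation that the code applies). The symbol names simply mirror the variable names in the code.

```latex
% First implementation, kl():
-\mathrm{KL}(\alpha) \approx \tfrac{1}{2}\log\alpha + c_1\alpha + c_2\alpha^2 + c_3\alpha^3,
\quad c_1 = 1.16145124,\; c_2 = -1.50204118,\; c_3 = 0.58629921

% Second implementation, kld():
-\mathrm{KL}(\alpha) \approx k_1\,\sigma(k_2 + k_3\log\alpha) - \tfrac{1}{2}\log\!\left(1 + \alpha^{-1}\right) - k_1,
\quad k_1 = 0.63576,\; k_2 = 1.87320,\; k_3 = 1.48695,
\quad \log\alpha = \log\sigma^2 - \log\theta^2
```

In the first snippet alpha is the variance of the multiplicative noise on each activation; in the second it is the per-weight ratio sigma^2 / theta^2. In both cases the returned KL value is meant to be added to the training loss.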


Ah, awesome, thanks! Question: how to use them? Like:

  • where do we put them in an RNN?
  • are the masks shared between e.g. different timesteps, or something like that?
  • will this work with e.g. an nn.LSTM? Or do we need to use an nn.LSTMCell, and plug those together, with this dropout in between?
    • if so, is there an LSTM implementation that handles this for us?

To be honest, I don’t know whether this works with an LSTM, but my first guess would be no, since both seem to be designed for feed-forward networks, where the layers are simply applied one after another:

import torch.nn as nn
import torch.nn.functional as F
from variational_dropout.variational_dropout import VariationalDropout
class VariationalDropoutModel(nn.Module):
    def __init__(self):
        super(VariationalDropoutModel, self).__init__()
        self.fc = nn.ModuleList([
            VariationalDropout(784, 500),
            VariationalDropout(500, 50),
            nn.Linear(50, 10)
        ])

    def forward(self, input, train=False):
        """
        :param input: A float tensor with shape of [batch_size, 784]
        :param train: A boolean indicating whether forward propagation is called during training
        :return: A float tensor with shape of [batch_size, 10]
                 filled with logits of likelihood, plus the kld estimation when train is True
        """
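        result = input
        # The layer shown above takes only the input and switches on self.training,
        # so call model.train() / model.eval() to match the train flag.
        if train:
            kld = 0
            for i, layer in enumerate(self.fc):
                if i != len(self.fc) - 1:
                    # Keep the per-layer value separate so the running sum is not overwritten
                    result, layer_kld = layer(result)
                    result = F.elu(result)
                    kld += layer_kld
            return self.fc[-1](result), kld
        for i, layer in enumerate(self.fc):
            if i != len(self.fc) - 1:
                result = F.elu(layer(result))
        return self.fc[-1](result)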
    def loss(self, **kwargs):
        # size_average is deprecated; reduction='mean' / 'sum' is the current equivalent
        reduction = 'mean' if kwargs['average'] else 'sum'
        if kwargs['train']:
            out, kld = self(kwargs['input'], kwargs['train'])
            return F.cross_entropy(out, kwargs['target'], reduction=reduction), kld
        out = self(kwargs['input'], kwargs['train'])
        return F.cross_entropy(out, kwargs['target'], reduction=reduction)

So I guess you would have to plug those together yourself.
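
On the question of masks being shared between timesteps, here is a rough sketch of what "plugging it together yourself" could look like with an nn.LSTMCell: one Bernoulli mask is sampled per sequence and reused at every step. Note the assumptions: this is the "same mask at every timestep" idea, not the Gaussian-noise layers quoted above; `SharedMaskLSTM`, `p_in`, and `p_hidden` are made-up names; and only the inputs and the hidden state are masked, not the weights.

```python
import torch
import torch.nn as nn


class SharedMaskLSTM(nn.Module):
    """Hypothetical single-layer LSTM that reuses one dropout mask per sequence."""

    def __init__(self, input_size, hidden_size, p_in=0.25, p_hidden=0.25):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.p_in, self.p_hidden = p_in, p_hidden

    @staticmethod
    def _mask(like, p):
        # Inverted-dropout mask: zero with probability p, else scale by 1/(1-p).
        return torch.bernoulli(torch.full_like(like, 1.0 - p)) / (1.0 - p)

    def forward(self, x):
        # x: [seq_len, batch, input_size]
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        if self.training:
            # Sampled once, then reused at every timestep below.
            in_mask = self._mask(x[0], self.p_in)
            h_mask = self._mask(h, self.p_hidden)
        outputs = []
        for step in range(seq_len):
            x_t, h_in = x[step], h
            if self.training:
                x_t, h_in = x_t * in_mask, h * h_mask
            h, c = self.cell(x_t, (h_in, c))
            outputs.append(h)
        return torch.stack(outputs), (h, c)


torch.manual_seed(0)
rnn = SharedMaskLSTM(input_size=10, hidden_size=16)
seq = torch.randn(5, 4, 10)  # [seq_len, batch, input_size]

rnn.train()
train_out, _ = rnn(seq)  # masks active, shared across the 5 timesteps

rnn.eval()
with torch.no_grad():
    eval_out, (h_n, c_n) = rnn(seq)  # no masks at evaluation time
```

The Python loop over timesteps is slower than nn.LSTM's fused kernel, which is the usual price of controlling the mask yourself.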

Ya, that was sort of my conclusion too :stuck_out_tongue: Unless there is a paper implementation somewhere that is in PyTorch and uses variational dropout?

Hello guys
Any new updates on this a year later?

Hey all, new here, perhaps I can help.

Thanks go to kreitkurita of Carnegie Mellon University, who subclassed LSTM to use variational dropout: Better LSTM with Variational Dropout

Just use this as a drop-in replacement for LSTM. It follows the original paper almost faithfully (see the code comments for minor deviations).

Some tips:

  • 0.25 is a good initial choice for dropouti, dropoutw and dropouto (input, weight, and output respectively.)

  • It is probably best to avoid using other dropout techniques alongside this one (embedding-, batch-, layer-, etc.) At least at first. And possibly always. I need to look into this more.

  • In their original paper, Gal and Ghahramani note that weight_decay takes on new importance. They suggest 0.001 as a default. (This is set on the Optimizer. If you’re using Adam, I suggest looking into AdamW instead.)
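
A minimal sketch of the weight-decay tip, assuming a plain nn.LSTM as a stand-in for the drop-in replacement (its import path and constructor are not reproduced here). The learning rate and the dummy loss are assumed values, only there to make the snippet run.

```python
import torch
import torch.nn as nn

# Stand-in model; any nn.Module (including an LSTM subclass) works the same way.
model = nn.LSTM(input_size=10, hidden_size=16)

# weight_decay=1e-3 follows the 0.001 suggestion above; lr is an assumed value.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)

seq = torch.randn(5, 4, 10)  # [seq_len, batch, input_size]
output, _ = model(seq)
loss = output.pow(2).mean()  # dummy loss, just to take one optimizer step

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

AdamW applies the decay directly to the weights instead of folding it into the gradient, which is why it is usually preferred over Adam's `weight_decay` argument.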