Variational dropout?

Hi, what is the standard-ish way to do variational dropout in PyTorch?

(Edit: I just need something that works and that I can plug in; I don’t need to understand how it works, just how to use it :slight_smile: )

(edit2: though one or two sentences of intuition behind how it works / what it is doing would be very welcome :slight_smile: )

1 Like

These implementations seem pretty similar and straightforward:

Number one
Number two

Regarding your second edit: I haven’t even tried understanding it, since the paper is still a part of my (always growing) reading list.

1 Like

Thank you @justusschock! It looks like these are both dropout. The first looks like ZoneOut? And the second looks like standard dropout? Am I misreading? (I’m looking specifically for ‘variational’ dropout.)

In the first repo there is a Jupyter notebook containing several variations of dropout, including this one:

import torch
import torch.nn as nn


class VariationalDropout(nn.Module):
    def __init__(self, alpha=1.0, dim=None):
        super(VariationalDropout, self).__init__()

        self.dim = dim
        self.max_alpha = alpha
        # Initial alpha
        log_alpha = (torch.ones(dim) * alpha).log()
        self.log_alpha = nn.Parameter(log_alpha)

    def kl(self):
        # cubic polynomial approximation of the negative KL term as a function of alpha
        c1 = 1.16145124
        c2 = -1.50204118
        c3 = 0.58629921

        alpha = self.log_alpha.exp()

        negative_kl = 0.5 * self.log_alpha + c1 * alpha + c2 * alpha**2 + c3 * alpha**3

        kl = -negative_kl

        return kl.mean()

    def forward(self, x):
        """
        Sample noise   e ~ N(1, alpha)
        Multiply noise h = h_ * e
        """
        if self.training:
            # N(0, 1)
            epsilon = torch.randn_like(x)

            # Clip alpha
            self.log_alpha.data = torch.clamp(self.log_alpha.data, max=self.max_alpha)
            alpha = self.log_alpha.exp()

            # e ~ N(1, alpha): unit mean, variance alpha
            epsilon = 1.0 + epsilon * alpha.sqrt()

            return x * epsilon
        else:
            return x
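
If I’m reading it right, you’d use it roughly like this: it multiplies the activations by Gaussian noise during training, and you fold its kl() term into your loss. This is an untested sketch; the layer sizes and the 0.01 KL weight are placeholders I made up:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    # tiny throwaway model using the VariationalDropout module above
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 256)
        self.drop = VariationalDropout(alpha=1.0, dim=256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.drop(x)  # multiplies activations by Gaussian noise in training mode
        return self.fc2(x)


model = Net()
x = torch.randn(32, 784)
target = torch.randint(0, 10, (32,))

logits = model(x)
# fold the layer's KL term into the objective; the 0.01 weight is an arbitrary placeholder
loss = F.cross_entropy(logits, target) + 0.01 * model.drop.kl()
loss.backward()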

And the second repo contains this implementation:


import math

import torch as t
import torch.nn as nn
from torch.nn import Parameter


class VariationalDropout(nn.Module):
    def __init__(self, input_size, out_size, log_sigma2=-10, threshold=3):
        """
        :param input_size: An int of input size
        :param out_size: An int of output size
        :param log_sigma2: Initial value of log sigma ^ 2.
               It is crucial for training since it determines the initial value of alpha
        :param threshold: Value used for thresholding at validation. If log_alpha > threshold, the weight is zeroed
        """
        super(VariationalDropout, self).__init__()

        self.input_size = input_size
        self.out_size = out_size

        self.theta = Parameter(t.FloatTensor(input_size, out_size))
        self.bias = Parameter(t.Tensor(out_size))

        self.log_sigma2 = Parameter(t.FloatTensor(input_size, out_size).fill_(log_sigma2))

        self.reset_parameters()

        self.k = [0.63576, 1.87320, 1.48695]

        self.threshold = threshold

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.out_size)

        self.theta.data.uniform_(-stdv, stdv)
        self.bias.data.uniform_(-stdv, stdv)

    @staticmethod
    def clip(input, to=8):
        input = input.masked_fill(input < -to, -to)
        input = input.masked_fill(input > to, to)

        return input

    def kld(self, log_alpha):
        # approximation of the KL divergence, averaged over the weight matrix
        first_term = self.k[0] * t.sigmoid(self.k[1] + self.k[2] * log_alpha)
        second_term = 0.5 * t.log(1 + t.exp(-log_alpha))

        return -(first_term - second_term - self.k[0]).sum() / (self.input_size * self.out_size)

    def forward(self, input):
        """
        :param input: A float tensor with shape of [batch_size, input_size]
        :return: In training mode, a float tensor with shape of [batch_size, out_size]
                 together with the layer's kld estimate; in eval mode, just the output tensor
        """

        log_alpha = self.clip(self.log_sigma2 - t.log(self.theta ** 2))
        kld = self.kld(log_alpha)

        if not self.training:
            # zero the weights whose dropout rate exceeds the threshold
            mask = log_alpha > self.threshold
            return t.addmm(self.bias, input, self.theta.masked_fill(mask, 0))

        # local reparameterization: sample the pre-activations directly
        mu = t.mm(input, self.theta)
        std = t.sqrt(t.mm(input ** 2, self.log_sigma2.exp()) + 1e-6)

        eps = t.randn_like(mu)

        return std * eps + mu + self.bias, kld

    def max_alpha(self):
        log_alpha = self.log_sigma2 - t.log(self.theta ** 2)
        return t.max(log_alpha.exp())

From scrolling through the paper and skimming the equations, both look fine to me, although I might have misread something.

2 Likes

Ah, awesome, thanks! Question: how do we use them? For example:

  • where do we put them in an RNN?
  • are the masks shared between e.g. different timesteps, or something like that?
  • will this work with e.g. an nn.LSTM? Or do we need to use an nn.LSTMCell, plug those together ourselves, and put this dropout in between?
    • if so, is there an LSTM implementation that handles this for us?
1 Like

To be honest:

I don’t know if this works with an LSTM, but my first guess would be no, since both of them seem to be designed for plain feed-forward (nn.Sequential-style) networks. This is how the second repo uses its layer:

import torch.nn as nn
import torch.nn.functional as F
 
from variational_dropout.variational_dropout import VariationalDropout
 
 
class VariationalDropoutModel(nn.Module):
    def __init__(self):
        super(VariationalDropoutModel, self).__init__()

        self.fc = nn.ModuleList([
            VariationalDropout(784, 500),
            VariationalDropout(500, 50),
            nn.Linear(50, 10)
        ])

    def forward(self, input, train=False):
        """
        :param input: A float tensor with shape of [batch_size, 784]
        :param train: A boolean indicating whether forward propagation is performed during training
        :return: A float tensor with shape of [batch_size, 10]
                 filled with logits of the likelihood, plus the kld estimation when training
        """

        result = input

        if train:
            # accumulate the KL term from every VariationalDropout layer
            # (assumes the model is in training mode, i.e. model.train() has been called)
            total_kld = 0

            for i, layer in enumerate(self.fc):
                if i != len(self.fc) - 1:
                    result, kld = layer(result)
                    result = F.elu(result)
                    total_kld = total_kld + kld

            return self.fc[-1](result), total_kld

        # assumes model.eval() has been called, so each layer returns only its output
        for i, layer in enumerate(self.fc):
            if i != len(self.fc) - 1:
                result = F.elu(layer(result))

        return self.fc[-1](result)

    def loss(self, **kwargs):
        if kwargs['train']:
            out, kld = self(kwargs['input'], kwargs['train'])
            return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average']), kld

        out = self(kwargs['input'], kwargs['train'])
        return F.cross_entropy(out, kwargs['target'], size_average=kwargs['average'])

So I guess you would have to plug those together yourself.
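
For the RNN case, a very rough sketch of what that plugging-together might look like is below: an nn.LSTMCell unrolled by hand, with one dropout mask sampled per sequence and reused at every timestep (the shared-mask idea from the Gal & Ghahramani paper). The class name and everything else here is made up for illustration, and it only drops the cell’s output; the paper also drops the inputs and the recurrent connections with their own shared masks, so treat it as a starting point rather than a faithful implementation:

import torch
import torch.nn as nn


class SharedMaskLSTM(nn.Module):
    """Sketch: nn.LSTMCell unrolled by hand, with a single dropout mask
    sampled once per sequence and reused at every timestep."""

    def __init__(self, input_size, hidden_size, p=0.25):
        super(SharedMaskLSTM, self).__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.p = p

    def forward(self, x):
        # x: [seq_len, batch, input_size]
        seq_len, batch, _ = x.size()
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)

        mask = None
        if self.training and self.p > 0:
            # one Bernoulli mask per sequence, shared across all timesteps
            mask = x.new_empty(batch, self.hidden_size).bernoulli_(1 - self.p) / (1 - self.p)

        outputs = []
        for step in range(seq_len):
            h, c = self.cell(x[step], (h, c))
            outputs.append(h * mask if mask is not None else h)

        return torch.stack(outputs)  # [seq_len, batch, hidden_size]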

Ya, that was sort of my conclusion too :stuck_out_tongue: Unless there is some paper implementation somewhere that is in PyTorch and uses variational dropout?

Hello guys,
Any updates on this a year later?

Hey all, new here, perhaps I can help.

Thanks to keitakurita of Carnegie Mellon University, who subclassed LSTM to use variational dropout: Better LSTM with Variational Dropout

Just use this as a drop-in replacement for nn.LSTM. It is an almost faithful implementation of the original paper https://arxiv.org/abs/1512.05287 (see the code comments for minor deviations); a rough usage sketch follows the tips below.

Some tips:

  • 0.25 is a good initial choice for dropouti, dropoutw, and dropouto (input, weight, and output dropout, respectively).

  • It is probably best to avoid using other dropout techniques alongside this one (embedding-, batch-, layer-, etc.), at least at first, and possibly always. I need to look into this more.

  • In their original paper, Gal and Ghahramani note that weight_decay takes on new importance. They suggest 0.001 as a default. (This is set on the Optimizer. If you’re using Adam, I suggest looking into AdamW instead.)
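
If it helps, usage would look roughly like this. I’m assuming the package installs as better_lstm and exposes an LSTM class with the dropouti / dropoutw / dropouto arguments mentioned above; check the repo’s README for the exact import path:

import torch
from better_lstm import LSTM  # assumed import path; see the linked repo

# drop-in replacement for nn.LSTM, with variational dropout on
# inputs (dropouti), recurrent weights (dropoutw), and outputs (dropouto)
lstm = LSTM(input_size=100, hidden_size=256,
            dropouti=0.25, dropoutw=0.25, dropouto=0.25,
            batch_first=True)

x = torch.randn(32, 50, 100)  # [batch, seq_len, features]
out, (h_n, c_n) = lstm(x)

# Gal & Ghahramani suggest weight decay matters more with this kind of dropout
optimizer = torch.optim.AdamW(lstm.parameters(), lr=1e-3, weight_decay=0.001)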

3 Likes