How to implement "Shake Shake"

For reference, I am trying to implement this paper: https://openreview.net/pdf?id=HkO-PCmYl, but you should not have to read it to help me.

This is my forward pass in my nn.Module:

    def forward(self, x):
        if self.training:
            # one random alpha per sample during training
            alpha = Variable(torch.rand(x.size()[0]).cuda())
        else:
            # the expected value 0.5 at evaluation time
            alpha = Variable(torch.FloatTensor([0.5]).cuda())

        p1 = self.path1(x)
        p2 = self.path2(x)

        alpha = torch.unsqueeze(alpha, dim=1)
        alpha = torch.unsqueeze(alpha, dim=2)
        alpha = torch.unsqueeze(alpha, dim=3)

        alpha = alpha.expand(p1.size())

        x = alpha * p1 + (1 - alpha) * p2
        return x

So two questions:

  1. Is there a better way to handle alpha? All the torch.unsqueeze calls look ugly.
  2. In the backward pass I want to compute the gradient with alpha=0.5 instead of the random alpha from the forward pass. How would I implement this? Could you maybe give a working implementation? I looked at http://pytorch.org/docs/master/autograd.html#torch.autograd.Function, but I am not sure how to use it…
For your first question, you could write it like this:

    def forward(self, x):
        if self.training:
            alpha = Variable(torch.rand(x.size(0)).cuda())
        else:
            alpha = Variable(torch.FloatTensor([0.5]).cuda())

        p1 = self.path1(x)
        p2 = self.path2(x)

        alpha = alpha.view(alpha.size(0), 1, 1, 1).expand_as(p1)

        x = alpha * p1 + (1 - alpha) * p2
        return x
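
If your version of PyTorch already supports broadcasting (I believe it was added around 0.2, so please check), you could even drop the expand_as and let the multiplication broadcast; a sketch:

    alpha = alpha.view(-1, 1, 1, 1)
    x = alpha * p1 + (1 - alpha) * p2
    return x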

For your second question, wouldn’t you need to do another forward pass with alpha=0.5 then? Otherwise the output and the intermediate activations would not be the right ones for computing the gradients.

Thank you for answering!
So do one forward pass, save the loss, reset the gradients, and then do another forward pass with 0.5 and the saved loss? Is that what you mean?

I guess I’m not understanding the idea of the algorithm. It is unclear to me what it is trying to do. Does it still use the result from the forward pass with a random alpha?

Hi,

I think you might want something like

    x = Variable(torch.randn(5), requires_grad=True)
    # y is a random 0/1 coefficient per element
    y = Variable(torch.bernoulli(torch.FloatTensor(5).fill_(0.5)))
    # forward value is x * y, but only the 0.5 * x term is seen by backward
    z = 0.5 * x + (x * (y - 0.5)).detach()
    # after this, x.grad is 0.5 everywhere, independent of y
    z.sum().backward()

The trick is to put the mean (0.5 * x) on the normal autograd path, which is what backward will see, and to detach the correction term x * (y - 0.5), so it adjusts the forward value to the random result without contributing to the gradient.
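
Applied to your forward pass (keeping your p1, p2 and the alpha already broadcast to p1's shape), a minimal sketch of the same trick might look like:

    # forward value uses the random alpha, backward behaves as if alpha were 0.5
    mean = 0.5 * (p1 + p2)                                   # differentiable path
    shift = (alpha * p1 + (1 - alpha) * p2 - mean).detach()  # correction term, no gradient
    x = mean + shift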

Note that implementing your own autograd.Function would likely be somewhat more efficient computationally (it saves computing the product x * (y - 0.5)).
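
If you do want to go down the autograd.Function road, here is a rough, untested sketch (the class name ShakeShakeMix is just a placeholder, and it assumes the static-method Function API): it returns the random-alpha mixture in forward and splits the incoming gradient evenly between the two branches, i.e. as if alpha were 0.5, in backward.

    import torch
    from torch.autograd import Function

    class ShakeShakeMix(Function):
        """Mix two branches with a random alpha in forward,
        but backpropagate as if alpha were 0.5."""

        @staticmethod
        def forward(ctx, p1, p2, alpha):
            # alpha is expected to broadcast against p1/p2, e.g. shape (N, 1, 1, 1)
            return alpha * p1 + (1 - alpha) * p2

        @staticmethod
        def backward(ctx, grad_output):
            # gradient as if alpha == 0.5 for both branches; no gradient w.r.t. alpha
            return 0.5 * grad_output, 0.5 * grad_output, None

You would then call x = ShakeShakeMix.apply(p1, p2, alpha) in place of the mixing line, but please check this against your PyTorch version; the detach trick above does the same thing without a custom Function.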

I saw something similar in the discussion of Gumbel-softmax by Hugh Perkins.

Best regards

Thomas

Thank you guys, I will work my way through this and do some testing!

@tjoseph did you have any success with this? I’m interested in Shake Shake in pytorch too. The method shown by @tom is really clever so I’m glad to come across this discussion!

Hi Jeremy,

thank you. I admire your course material!

Here is a work-in-progress PyTorch Shake Shake CIFAR10 notebook. There might be bugs in the net itself, and the training code around it needs serious improvement.

Best regards

Thomas
