Alright, so I've dug further into this and found some interesting things. TL;DR: I solved it for my use-case, but I might have stumbled onto some bugs. Don't use in-place operations; bad things happen.
I managed to achieve what I was going for by creating the main weight parameter, w, and two intermediate variables:
W, a zeros-filled tensor of the desired final shape, and
M, a mask of the same shape as W, filled with 1s at the elements of W that I want to fill with w. For consistency, the number of 1s in M has to equal the number of elements of w, of course.
I initialize both W and M in the init method of the module, then call W[M] = w during the forward() method, and convolve using the modified W. This works, but for some reason training starts out fast and then progressively slows down (from around 5 batches/s to 0.2 batches/s over the course of the first 1000 batches). It also throws an error about non-leaf variables not yet being serializable when I try to use torch.save, presumably because I'm creating some naughty nodes that I shouldn't be.
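For concreteness, that first version looked roughly like this (a simplified reconstruction with illustrative names; the cuda() calls and module boilerplate are elided):
in init:
k = 3 + 2 * (dilation - 1)                              # spatial size of the dilated kernel
self.M = torch.zeros(n_out, n_in, k, k).byte()
self.M[:, :, ::dilation, ::dilation] = 1                # 1s at the 3x3 taps, spaced `dilation` apart
self.w = Parameter(torch.randn(n_out * n_in * 3 * 3))   # one real weight per 1 in M
self.W = Variable(torch.zeros(n_out, n_in, k, k))
in forward:
self.W[self.M] = self.w                                 # the in-place fill that causes the trouble below
out = F.conv2d(input, weight=self.W, padding=dilation, bias=None)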
My initial suspicion was that I was creating additional subgraphs that weren't being deleted, or not freeing memory appropriately (more on that in a moment), but the memory usage in this case was constant. Investigating further, I found that if I replace W[M] = w with W.masked_copy_(M, w), I get an error after the first batch saying that I need to use "retain_variables=True" if I want to backpropagate a second time through the graph. The error message is a bit confusing here, as I am only calling backward() once.
My intuition is that the above error occurs because I'm calling an in-place method in forward(), which seems to be against best PyTorch practice at the moment, so whatever variables autograd needs to do the backprop aren't getting saved. Calling backward(retain_variables=True) results in the same behavior as using W[M] = w: it works, but it progressively slows down throughout training. I'm still not sure what's causing the slowdown; my best guess is that some part of the graph isn't getting freed appropriately, in such a way that rather than creating multiple subgraphs that take up memory, it's backpropagating through the same graph elements an increasing number of times on each successive iteration.
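For reference, the workaround the error message suggests is just:
loss.backward(retain_variables=True)  # suppresses the error, but training still slows down over time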
I ran into another interesting issue while messing with this: while running Brandon Amos's DenseNet model inside my own training code, if I swapped out standard filters for dilated filters using the dilation keyword in conv2d, I would see a memory explosion that would overflow my GPU within ~50 batches. It turned out this was because I was using saved_loss += loss rather than saved_loss += loss.data, so it was creating multiple copies of subgraphs and not freeing them appropriately. The user error isn't interesting, but the fact that I do not observe this memory explosion when using undilated filters, despite the bad += loss line, is interesting.
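In other words, the culprit was just accumulating the loss Variable instead of its underlying tensor:
# wrong: keeps every batch's graph alive, so memory grows
saved_loss += loss
# right: just accumulates the value and lets each batch's graph get freed
saved_loss += loss.data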
Anyhow, I did manage to get things working for my use-case: rather than trying to do a masked copy or in-place operation, I just instantiate W as a full, dense tensor of the dilated kernel shape and drop W*M into the F.conv2d call. Works great and is about twice as fast as using the dilation parameter (presumably because it allows for the use of the cuDNN backend).
Here’s a code snippet with which I’m currently getting ~80-100% speedup over using the dilation keyword for my use-case. Note that this currently prevents you from saving with torch.save due to a “can’t serialize non-leaf variables yet” dealio.
in init:
# Build the binary mask: 1s at the taps of a 3x3 kernel dilated by `dilation`, 0s elsewhere.
# Shape is (n_out, n_in, k, k) with k = 3 + 2*(dilation-1); the slicing trick assumes dilation > 1.
row = (([1] + [0] * (dilation - 1)) * 3)[:-(dilation - 1)]
plane = (([row] + [[0] * (3 + 2 * (dilation - 1))] * (dilation - 1)) * 3)[:-(dilation - 1)]
self.m = Variable(torch.cuda.FloatTensor([[plane] * n_in] * n_out))
# (I suspect the .cuda() after Parameter() is what makes W a non-leaf Variable and trips torch.save.)
self.W = Parameter(torch.zeros(n_out, n_in, 3 + 2 * (dilation - 1), 3 + 2 * (dilation - 1)), requires_grad=True).cuda()  # requires_grad not necessarily necessary
in forward:
out = F.conv2d(input, weight=self.W * self.m, padding=dilation, bias=None)
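Putting it together, the whole layer looks something like the sketch below. The class name and argument names are just illustrative, I've built the same mask with index slicing instead of the list-comprehension one-liner, and I've moved the .cuda() inside the Parameter() call so that W stays a leaf (which, as far as I can tell, also avoids the torch.save complaint):

import torch
import torch.nn.functional as F
from torch.autograd import Variable
from torch.nn import Module, Parameter

class MaskedDilatedConv2d(Module):
    def __init__(self, n_in, n_out, dilation):
        super(MaskedDilatedConv2d, self).__init__()
        self.dilation = dilation
        k = 3 + 2 * (dilation - 1)  # spatial size of the dilated kernel
        # Dense weight of the dilated shape (you'd probably want a proper random init here).
        self.W = Parameter(torch.zeros(n_out, n_in, k, k).cuda())
        # Mask with 1s at the 3x3 taps, spaced `dilation` apart; same pattern as the snippet above.
        m = torch.zeros(n_out, n_in, k, k)
        m[:, :, ::dilation, ::dilation] = 1
        self.m = Variable(m.cuda())

    def forward(self, input):
        # Ordinary dense convolution with the masked weight, so cuDNN can be used.
        return F.conv2d(input, weight=self.W * self.m, padding=self.dilation, bias=None)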
Sorry for the wall of text, but hopefully this will prove enlightening and thorough if other people come along with similar issues with in-place ops. For the record, I'm using the build provided by conda install and am on Python 2.7 (my attempts at building from source crash, sigh). I tested these with CUDA 7.5 on a GTX 980 and CUDA 8.0 on a Titan X.
Best,
Andy