What is 'grad_fn=<AsStridedBackward>'

…what does it do, and why am I seeing it?

(This is PyTorch 0.4)

I have two networks. Both are multilayer bidirectional LSTMs, where each layer can have a different size, and which are trying to learn the initial hidden and cell configurations.

One of these (call it NET-A) is a prototype, but it produces adequate results and seems to work. It is actually four related models: one each for one, two, three, and four layers. The inputs are the layer sizes, but the construction of the network is brute force, and the forward function proceeds like this:

    # learnable initial hidden/cell kernels, expanded to the batch
    self.h1, self.c1   = self.init_hidden(batch_size, self.h1_kernel, self.c1_kernel)
    pack1              = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
    out1, _            = self.lstm1(pack1, (self.h1, self.c1))
    pad1               = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)[0]
    pad1               = self.drop(pad1)


The second (call it NET-B) tries to be more clever, taking a list of layer sizes, using ModuleLists and ParameterLists to keep track of layers and learnable kernels, etc. It does not work as well, but proceeds like this:

    for layer in range(len(self.sizes)):
        self.h.append(self.h_kernel[layer].repeat(1, batch_size, 1))
        self.c.append(self.c_kernel[layer].repeat(1, batch_size, 1))

    pads = []
    for layer in range(len(self.sizes) - 1):  # final layer handled separately
        pack   = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
        out, _ = self.lstm[layer](pack, (self.h[layer], self.c[layer]))
        pad    = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)[0]
        inputs = self.drop(pad)  # the next pass through the loop consumes this

In trying to track down the reduced performance, I’m setting up with seeded random variables and looking for differences. What I have noticed is something curious, which may or may not be the problem, but which I do not understand.

In the functioning NET-A, the pad1 variable has a grad_fn of TransposeBackward0 prior to the drop, and DropoutBackward0 after the drop.

In the poorly performing NET-B, the pad variable has a grad_fn of TransposeBackward0 prior to the drop, and AsStridedBackward after the drop.

So the question, reiterated from the top, is: what is this grad_fn, why is it different in the two versions, and what does it do?

A view’s grad_fn is replaced by an AsStridedBackward after an in-place operation on it or on its base. You can think of it as a generalized version of the backward of a view op.
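This is easy to reproduce in a few lines (a minimal sketch; the exact node names vary by version, e.g. `AsStridedBackward` in 0.4 vs `AsStridedBackward0` in recent releases):

```python
import torch

x = torch.randn(3, 4, requires_grad=True)
y = x * 2                  # non-leaf base, so in-place ops on its views are allowed
v = y.transpose(0, 1)      # a view; its grad_fn is a transpose node
print(type(v.grad_fn).__name__)   # TransposeBackward0

v.mul_(2)                  # in-place op on the view rebases its grad_fn
print(type(v.grad_fn).__name__)   # AsStridedBackward0 (AsStridedBackward in 0.4)
```

Gradients still flow correctly through the rebased node; only the recorded backward op changes.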

Ah, I think I see.

So this is really the result of one version renaming the variable and one not?

Would these be expected to act the same, or could this be the source of the slowly diverging models (under identical initialization and input) that I’m seeing?

Nah, it doesn’t make your model worse. It’s just PyTorch’s solution for in-place ops on views. If you use out-of-place dropout, it won’t do that.

You are obviously using fewer parameters in B. Isn’t it natural that it performs worse?

No, Net-A repeats the block shown multiple times by brute force, i.e., by additional similar code blocks. Very cumbersome, and very tedious to have to maintain separate models for different numbers of layers. A prototype, really.

Net-B has identical initial parameters (I’ve checked, manually) but loops over the layers. (There’s no obvious reason those for loops couldn’t be combined, either.)

(Final layer gets separate treatment in both, which is why it isn’t in the loop)

Hmm, I see. Sorry about that. I must admit that I don’t really understand the difference between the two formulations. Maybe post the entire forward?

That doesn’t seem to be necessary; whatever issue I was having has gone away. I suspect I had a psychologically invisible typo in the code somewhere and fixed it without realizing it. All testing now shows numerically identical operation.

Nevertheless, thank you for the answer and the offer.


I still have the same question: what is 'grad_fn=&lt;AsStridedBackward&gt;'?