What is 'grad_fn=<AsStridedBackward>'

…what does it do, and why am I seeing it?

(This is PyTorch 0.4)

I have two networks. Both are multilayer bidirectional LSTMs, where each layer can have a different size, and which are trying to learn the initial hidden and cell configurations.

One of these (call it NET-A) is a prototype, but it produces adequate results and seems to work. It is actually four related models: one each for one, two, three, and four layers. The inputs are the layer sizes, but the construction of the network is brute force, and the forward function proceeds like this:

    # learnable initial hidden/cell kernels, expanded to the batch
    self.h1, self.c1   = self.init_hidden(batch_size, self.h1_kernel, self.c1_kernel)
    pack1              = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
    out1, _            = self.lstm1(pack1, (self.h1, self.c1))
    pad1               = nn.utils.rnn.pad_packed_sequence(out1, batch_first=True)[0]
    pad1               = self.drop(pad1)


The second (call it NET-B) tries to be more clever, taking a list of layer sizes, using ModuleLists and ParameterLists to keep track of layers and learnable kernels, etc. It does not work as well, but proceeds like this:

    for layer in range(len(self.sizes)):
        self.h.append(self.h_kernel[layer].repeat(1, batch_size, 1))
        self.c.append(self.c_kernel[layer].repeat(1, batch_size, 1))

    pads = []
    for layer in range(len(self.sizes) - 1):  # final layer handled separately
        pack   = nn.utils.rnn.pack_padded_sequence(inputs, lengths, batch_first=True)
        out, _ = self.lstm[layer](pack, (self.h[layer], self.c[layer]))
        pad    = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)[0]
        inputs = self.drop(pad)  # the next pass through the loop consumes this

In trying to track down the reduced performance, I’m setting up with seeded random variables and looking for differences. What I have noticed is something curious, which may or may not be the problem, but which I do not understand.

In the functioning NET-A, the pad1 variable has a grad_fn of TransposeBackward0 prior to the drop, and DropoutBackward0 after the drop.

In the poorly performing NET-B, the pad variable has a grad_fn of TransposeBackward0 prior to the drop, and AsStridedBackward after the drop.

So the question, reiterated from the top, is: what is this grad_fn, why is it different in the two versions, and what does it do?

A view’s grad_fn is replaced by an AsStridedBackward after an in-place operation on it or on its base. You can think of it as a generalized version of the backward of a view op.
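This is easy to reproduce in a few lines (a minimal sketch; the exact node names vary by version, e.g. `AsStridedBackward` in 0.4 vs `AsStridedBackward0` in recent releases):

```python
import torch

x = torch.randn(3, 4, requires_grad=True)
y = x * 2                  # non-leaf base, so in-place ops on its views are allowed
v = y.transpose(0, 1)      # a view; its grad_fn is a transpose node
print(type(v.grad_fn).__name__)   # TransposeBackward0

v.mul_(2)                  # in-place op on the view rebases its grad_fn
print(type(v.grad_fn).__name__)   # AsStridedBackward0 (AsStridedBackward in 0.4)
```

Gradients still flow correctly through the rebased node; only the recorded backward op changes.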

Ah, I think I see.

So this is really the result of one version renaming the variable and one not?

Would these be expected to act the same, or could this be the source of the slowly diverging models (under identical initialization and input) that I’m seeing?

Nah, it doesn’t make your model worse. It’s just PyTorch’s solution for in-place ops on views. If you use out-of-place dropout, it won’t do that.

You are obviously using fewer parameters in B. Isn’t it natural that it performs worse?

No, Net-A repeats the block shown multiple times by brute force, i.e., by additional similar code blocks. Very cumbersome, and very tedious to have to maintain separate models for different numbers of layers. A prototype, really.

Net-B has identical initial parameters (I’ve checked, manually) but loops over the layers. (There’s no obvious reason those for loops couldn’t be combined, either.)

(Final layer gets separate treatment in both, which is why it isn’t in the loop)

Hmm, I see. Sorry about that. I must admit that I don’t really understand the difference between the two formulations. Maybe post the entire forward?

That doesn’t seem to be necessary; whatever issue I was having has gone away. I suspect I had a psychologically invisible typo in the code somewhere and fixed it without realizing it. All testing now shows numerically identical operation.

Nevertheless, thank you for the answer and the offer.


I still have the same question: what is 'grad_fn=&lt;AsStridedBackward&gt;'?