Problem in learning initial hidden state (h0, c0) in LSTM

Hi, I’m trying to learn the initial hidden states (h0 and c0) while training an LSTM model.

My code is as follows:


# inside model class
def init_hidden(self, device, bsz=1):
    # returns two tensors: one for h0, one for c0
    return tuple(
        nn.Parameter(torch.zeros(
            (self.args.numLayer * (int(self.args.bidir) + 1), bsz, 100),
            device=device))
        for _ in range(2))

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

hidden = model_.init_hidden(device, bsz)
for batch in iterator:
    f_, target = batch
    out = model_(f_,hidden)
    loss = criterion(out,target)
    loss.backward()
    for p in model_.parameters():  # SGD update for model parameters
        p.data.add_(p.grad, alpha=-lr)
    for h in hidden:  # SGD update for the initial hidden states
        h.data.add_(h.grad, alpha=-lr)
    model_.zero_grad()
    hidden = repackage_hidden(hidden)

In the code, hidden consists of h0 and c0. In the PyTorch word-level language model tutorial, I saw that the repackage_hidden function detaches the hidden states from the computational graph so that their gradients w.r.t. the loss are not backpropagated through the whole history. But it seems that once I call the function and feed the repackaged hidden states into the model, they never reattach to the computational graph: the traceback says that h.grad is None.
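To illustrate what I mean, here is a minimal standalone sketch (separate from my model) showing that detach() returns a new leaf tensor with requires_grad=False, so its grad stays None:

```python
import torch

# Standalone illustration (not my actual model): detach() returns a
# new leaf tensor with requires_grad=False, so no gradient is ever
# accumulated in it.
h = torch.zeros(3, requires_grad=True)   # stands in for h0
loss = (h * 2.0).sum()
loss.backward()
print(h.grad)             # tensor([2., 2., 2.]) -- populated as expected

h2 = h.detach()           # what repackage_hidden returns
print(h2.requires_grad)   # False
print(h2.grad)            # None -- the error I am seeing
```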

When I remove the repackaging part, it works well.

So my question is: how can I learn the initial hidden state using its gradient w.r.t. the loss?
My model is not a language model, but it handles many-to-many interaction. However, each sequence is independent, so there is no need to capture long-term dependencies beyond a single sequence.
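For reference, one alternative I have been considering (a minimal sketch with made-up sizes, not my actual model) is to register h0 and c0 as parameters in __init__ and expand them to the batch size in forward, so gradients flow into them through every forward pass and the usual parameter update handles them:

```python
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    """Sketch: h0/c0 registered as parameters so they are updated
    like any other weight. Sizes are made up for illustration."""
    def __init__(self, input_size=8, hidden_size=100, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        # learnable initial states, shape (num_layers, 1, hidden_size)
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))

    def forward(self, x):
        bsz = x.size(1)
        # expand the learned initial state to the batch size;
        # gradients flow back into h0/c0 through the expansion
        h0 = self.h0.expand(-1, bsz, -1).contiguous()
        c0 = self.c0.expand(-1, bsz, -1).contiguous()
        out, _ = self.lstm(x, (h0, c0))
        return out

model = SeqModel()
x = torch.randn(5, 4, 8)      # (seq_len, batch, input_size)
loss = model(x).sum()
loss.backward()
print(model.h0.grad.shape)    # torch.Size([1, 1, 100])
```

With this setup no manual repackaging of the initial states is needed, since each batch starts fresh from the learned h0/c0 rather than from the previous batch's final state.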