Hi, I’m trying to learn the initial hidden states (h0 and c0) while training an LSTM model.

My code is as follows:

```
# inside the model class
def init_hidden(self, device, bsz=1):
    num_directions = int(self.args.bidir) + 1
    return tuple(
        nn.Parameter(torch.zeros(
            (self.args.numLayer * num_directions, bsz, 100), device=device))
        for _ in range(2))  # (h0, c0)

# standalone helper, as in the word-level language model example
def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    else:
        return tuple(repackage_hidden(v) for v in h)

# training loop
hidden = model_.init_hidden(device, bsz)
for batch in iterator:
    f_, target = batch
    out = model_(f_, hidden)
    loss = criterion(out, target)
    loss.backward()
    for p in model_.parameters():  # SGD step on the model parameters
        p.data.add_(p.grad, alpha=-lr)
    for h in hidden:  # SGD step on the initial hidden state
        h.data.add_(h.grad, alpha=-lr)
    model_.zero_grad()
    hidden = repackage_hidden(hidden)
```

In the code, `hidden` consists of h0 and c0. In the PyTorch word-level language model tutorial, I saw that the `repackage_hidden` function detaches the hidden states from the computational graph so that their gradients are not backpropagated through the whole history. But it seems that once I call the function and feed the repackaged hidden states back into the model, they never re-attach to the computational graph: the traceback says `h.grad` is `NoneType`.
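Here is a minimal standalone repro of the behavior I mean (hypothetical names, no LSTM involved, just a leaf parameter that gets detached):

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.ones(3))   # stands in for the model weights
h = nn.Parameter(torch.zeros(3))  # stands in for the initial hidden state

# First pass: h is a grad-requiring leaf, so backward fills h.grad.
loss = (w * h).sum()
loss.backward()
print(h.grad is None)  # False

# Repackage as in the tutorial: detach() returns a new tensor
# with requires_grad=False.
h = h.detach()
w.grad = None

# Second pass: backward still runs (w requires grad),
# but the detached h never receives a gradient.
loss = (w * h).sum()
loss.backward()
print(h.grad is None)  # True
```

This matches what I see in my training loop: after repackaging, `h.grad` stays `None`.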

When I remove the repackaging part, it works fine.

So my question is: how can I learn the initial hidden state using its gradient w.r.t. the loss, while still cutting the graph between batches?
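Concretely, I wonder whether the fix is to re-enable gradient tracking right after detaching, something like this (just a sketch of the idea, not tested against my real model; `hidden` is the tuple returned by `repackage_hidden`):

```python
import torch

# stand-in for what repackage_hidden returns: detached tensors
hidden = (torch.zeros(2, 1, 4), torch.zeros(2, 1, 4))

# detach from the old graph, but keep h0/c0 as grad-requiring leaves
# so that h.grad gets populated again on the next backward pass
hidden = tuple(h.detach().requires_grad_(True) for h in hidden)
print(all(h.requires_grad for h in hidden))  # True
```

Is that the right way to do it, or is there a more standard pattern?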

My model is not a language model; it deals with many-to-many interaction. However, each sequence is independent, so there is no need to capture long-term dependencies beyond a single sequence.