I have an RNN, and after every batch I have the option of either detaching the hidden states or re-initializing them. It’s not clear to me which one I should choose. If the batches are independent, should you just re-initialize the hidden states (e.g. to all zeros), or should you pass the hidden state’s data to the next batch (but call detach so you don’t backprop through the entire dataset)?
Also, as a sanity check: if the batches were dependent, you would call detach rather than re-initialize, right?
With independent batches, you shouldn’t carry the hidden state from one batch to the next, even with detach; re-initialize it instead. Carrying it over wouldn’t make sense: each batch element would start from the hidden state left at the end of the previous batch’s corresponding sequence, but that sequence has no relation to the current one. On the other hand, if the batches are successive parts of a long sequence and you are doing truncated BPTT, then you should call detach.
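A minimal sketch of the truncated BPTT case, assuming a small `nn.GRU` and made-up random data (the sizes, `chunk_len`, the `head` layer and the loss are all placeholders for illustration, not anything from the question):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model: a GRU followed by a linear readout.
rnn = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

# Hypothetical data: 3 parallel streams, each one long sequence of 40 steps.
long_x = torch.randn(3, 40, 4)
long_y = torch.randn(3, 40, 1)
chunk_len = 10

hidden = None  # passing None makes nn.GRU start from an all-zero state
for start in range(0, long_x.size(1), chunk_len):
    x = long_x[:, start:start + chunk_len]
    y = long_y[:, start:start + chunk_len]

    # Successive chunks of the same sequences: keep the values of the
    # hidden state, but cut the graph so backward() stops at this chunk.
    if hidden is not None:
        hidden = hidden.detach()

    out, hidden = rnn(x, hidden)
    loss = nn.functional.mse_loss(head(out), y)

    opt.zero_grad()
    loss.backward()
    opt.step()

# For independent batches you would instead call rnn(x) (or pass fresh
# zeros) on every batch and never reuse `hidden`.
```

If you drop the `detach()` call here, the second `loss.backward()` should fail with an error about backpropagating through the graph a second time, since the carried hidden state still points into the previous chunk's graph.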
When you do re-initialize, you may find that it helps somewhat to learn the initial hidden state rather than fixing it at zeros.
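One way to learn the initial hidden state is to store it as an `nn.Parameter` and expand it across the batch. This is only a sketch: the wrapper class `GRUWithLearnedInit`, its attribute `h0` and the sizes are names and values I made up for illustration.

```python
import torch
import torch.nn as nn


class GRUWithLearnedInit(nn.Module):
    """Hypothetical wrapper: a GRU whose initial hidden state is trainable."""

    def __init__(self, input_size, hidden_size, num_layers=1):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size,
                          num_layers=num_layers, batch_first=True)
        # One learned vector per layer, shared by every sequence in a batch.
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))

    def forward(self, x):
        batch_size = x.size(0)
        # Broadcast the learned state to (num_layers, batch, hidden_size).
        h0 = self.h0.expand(-1, batch_size, -1).contiguous()
        return self.rnn(x, h0)


model = GRUWithLearnedInit(input_size=4, hidden_size=8)
out, h_n = model(torch.randn(5, 10, 4))  # batch of 5 independent sequences
out.sum().backward()                      # gradients reach model.h0 too
```

Because `h0` is a registered parameter, it shows up in `model.parameters()` and is updated by the optimizer along with the GRU weights.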