Thank you for your reply. I think I tried that, but it did not work. I have, however, found a solution: I was also using the feature network to produce the state for the next observation before computing gradients. When I changed the code so that only one observation had been put through the feature network before the gradients were computed, the error was no longer raised. I am not quite sure why, since I did not use that state in any loss or anywhere else, but maybe the network confused its histories because two observations had by then been put through it.
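
For anyone who hits the same thing, here is a framework-free toy sketch of one way a second forward pass could interfere with a pending gradient computation. It is only an illustration under my own assumptions: `ToyFeatureNet`, its cached input, and `grad_w` are hypothetical names, and I am not claiming this is what the actual library does internally. A real framework may raise an error in this situation rather than silently return a wrong value.

```python
class ToyFeatureNet:
    """Hypothetical one-weight 'network': y = w * x. It caches x for the backward step."""

    def __init__(self, w):
        self.w = w
        self._cached_input = None  # overwritten on every forward call

    def forward(self, x):
        self._cached_input = x
        return self.w * x

    def grad_w(self, upstream=1.0):
        # d(w * x)/dw = x, computed from whichever input was cached last
        return upstream * self._cached_input


obs, next_obs = 2.0, 5.0

# Order that caused trouble: two forward passes before the gradient step.
net = ToyFeatureNet(w=3.0)
features = net.forward(obs)
next_state = net.forward(next_obs)  # overwrites the cached input
stale_grad = net.grad_w()           # uses next_obs instead of obs

# Order that worked: one forward pass, then gradients, then the next observation.
net = ToyFeatureNet(w=3.0)
features = net.forward(obs)
correct_grad = net.grad_w()         # uses obs, as intended
next_state = net.forward(next_obs)

print(stale_grad, correct_grad)
```

In the first ordering the gradient is computed from the second observation even though that output never enters a loss, which matches the symptom of an unused forward pass still causing problems.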