Help with policy gradient update with LSTM

Hi, I have the following network, with its forward function shown here:

import torch.nn as nn
import torch.nn.functional as F

class RecurrentPNetwork(nn.Module):
    ''' Recurrent policy '''
    def __init__(self, state_space, action_space, hidden_space=64):
        super().__init__()
        self.fc1 = nn.Linear(state_space, hidden_space)
        self.rnn = nn.LSTM(hidden_space, hidden_space)  # ,batch_first=True)
        self.fc2 = nn.Linear(hidden_space, action_space)
        self.hidden_memory = []

    def forward(self, x):
        x = F.relu(self.fc1(x))
        if len(self.hidden_memory) == 0:
            h_t = None  # LSTM initializes (h_0, c_0) to zeros when None
        else:
            h_t = self.hidden_memory[-1]

        x, (new_h, new_c) = self.rnn(x, h_t)
        # attempted workaround for the in-place error described below:
        new_h = new_h.detach().requires_grad_()
        new_c = new_c.detach().requires_grad_()
        self.hidden_memory.append((new_h, new_c))  # store the hidden state history
        out = F.softmax(self.fc2(new_h), dim=-1)
        return out

Here I encode the action history in the hidden state, which I use to estimate the probability of the next action, so I have to unroll the model manually, one step at a time. When I optimize the model I want to take the history of the hidden states into account, which is why I introduce the hidden_memory list to store them.
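To make the manual unrolling concrete, here is a minimal, self-contained sketch of the pattern I mean (the sizes and the loop are illustrative, not my actual training code): each step feeds the previous hidden state back into the LSTM and the full history is kept in a list.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.LSTM(8, 8)   # input size 8, hidden size 8 (illustrative)
hidden = None         # when None, PyTorch initializes (h_0, c_0) to zeros
hidden_memory = []

for t in range(5):
    x = torch.randn(1, 1, 8)      # (seq_len=1, batch=1, features)
    out, hidden = rnn(x, hidden)  # feed the previous hidden state back in
    hidden_memory.append(hidden)  # keep the whole history for the update
```

After the loop, hidden_memory holds one (h_t, c_t) pair per step, each of shape (1, 1, 8).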

At first, h_0 and c_0 are initialized to 0 and the model runs for one episode, and it works. When the second episode begins I get this error: one of the variables needed for gradient computation has been modified by an inplace operation.
This happens at the LSTM cell, because h_t is now taken from the memory list. A way to avoid the error is to use detach() or detach().requires_grad_(), but then I miss out on these gradients and the model doesn't learn. What should I do?
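To show what I mean by missing out on the gradients, here is a tiny standalone example (not my actual model): after detach().requires_grad_(), the detached tensor does get a gradient, but nothing flows back through the cut to the tensor it came from.

```python
import torch

h = torch.zeros(3, requires_grad=True)
y = h * 2
yd = y.detach().requires_grad_()  # graph is cut here
loss = (yd * 3).sum()
loss.backward()

print(yd.grad)  # populated: gradient reaches the detached copy
print(h.grad)   # None: nothing flows back past the detach
```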