Trying to understand why output of nn.Linear (for output layer) isn't retaining

Here is some code for reference. I'm trying to understand why the output of the final linear layer gives None for its grad. The final layer's weights do give me a gradient, but I would have expected the output of the final layer to be considered a leaf node as well. I'm also trying to understand exactly what happens in the second-to-last line of my forward function: the linear layer seems to create a new tensor, but the gradient on that tensor is not retained. Even when I call .retain_graph_() on the final computation in forward(), the gradient is still None. It does work when I make a copy of the tensor and set retain_graph to True, though I don't quite understand why I can't alter the original tensor in the desired way. I'd really appreciate some help with this!
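
One way to test the leaf-node assumption directly is to inspect `is_leaf` and `grad_fn` on the output of a linear layer. This is only a minimal sketch: the names `layer`, `x`, and `y` are hypothetical and not part of my model, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the output layer; sizes are arbitrary.
layer = nn.Linear(in_features=2, out_features=1, bias=False)
x = torch.zeros(1, 2)
y = layer(x)

# The weight is created directly as a parameter, so autograd treats it as a leaf.
print("weight is leaf:", layer.weight.is_leaf)

# y is the result of an operation on the weight, so it carries a grad_fn
# and is reported as a non-leaf tensor.
print("output is leaf:", y.is_leaf)
print("output grad_fn:", y.grad_fn)
```

If the output reports `is_leaf` as False, that would be consistent with its `.grad` not being populated by default after `backward()`.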

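import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# Use the GPU when available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
dtype = torch.float
args = {'device': device, 'dtype': dtype}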

class SingleRNN(nn.Module):
    def __init__(self, n_inputs, n_outputsize):
        super(SingleRNN, self).__init__()

        numHU = 2
        bias = False

        self.Wxh = nn.Linear(
            in_features=n_inputs, out_features=numHU, bias=bias)

        self.Whh = nn.Linear(
            in_features=numHU + self.Wxh.out_features, out_features=numHU, bias=bias)

        self.Why = nn.Linear(
            in_features=numHU, out_features=n_outputsize, bias=bias)

    def forward(self, X0, H0):
        self.L0_X = self.Wxh(X0)
        self.L1_H = self.Whh(torch.cat((self.L0_X, H0), dim=1))
        # What is happening here? self.L1_H is a tensor passed to the forward
        # function of a Linear layer. Does that by default initialize the output
        # to not retain its gradient, even though it is a leaf node? I see the
        # same behavior with: self.Why(self.L1_H).retain_graph_(True)
        # However, this works and gives the desired gradient:
        #     self.L2_Y = tensor(self.Why(self.L1_H), retain_graph=True)
        # Why does altering the existing output tensor not work, and why do I
        # need to make a new copy?
        self.L2_Y = self.Why(self.L1_H)
        return self.L2_Y, self.L1_H


def main():
    torch.manual_seed(999)
    cudnn.benchmark = True

    N_INPUT = 4
    N_OUTPUTSIZE = 1

    X0 = torch.tensor([[0, 1, 2, 0]],
                      **args)  # input at t=0, shape 1 x 4

    H0 = torch.tensor([[0, 0]],
                      **args)  # initialize hidden state to 0's, shape 1 x 2

    Ytarg = torch.tensor([[1]], **args)

    # Move the model to the same device as the tensors (works on CPU-only machines too).
    model = SingleRNN(N_INPUT, N_OUTPUTSIZE).to(device)

    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for i in range(1):
        parameters = model.parameters()

        # Call the modules directly rather than through .forward().
        Y_out, H0 = model(X0, H0)

        loss = criterion(Y_out, Ytarg)
        loss.backward(retain_graph=True)
        optimizer.step()
        print(model.L2_Y.grad)  # prints None


if __name__ == "__main__":
    main()
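
For comparison, here is a minimal sketch of the three cases I describe, reduced to a single linear layer. It assumes that `Tensor.retain_grad()` is the method that applies here (my comments above write it as `.retain_graph_()`), and that "making a copy" corresponds to detaching the output and marking it with `requires_grad_(True)`. All names (`layer_a`, `layer_b`, `y_plain`, `y_kept`, `y_copy`) are hypothetical and the sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[1.0]])
layer_a = nn.Linear(2, 1, bias=False)

# Case 1: plain output of the layer. It is a non-leaf tensor, so .grad is
# not stored after backward() (reading it may emit a UserWarning).
y_plain = layer_a(x)
F.mse_loss(y_plain, target).backward()
print("plain output grad:", y_plain.grad)

# Case 2: ask autograd to keep the gradient of this non-leaf tensor.
# retain_grad() must be called before backward().
y_kept = layer_a(x)
y_kept.retain_grad()
F.mse_loss(y_kept, target).backward()
print("retained output grad:", y_kept.grad)

# Case 3: a detached copy marked as requiring grad becomes a new leaf.
# It receives a gradient, but it is cut off from the layer, so the
# layer's weight gets no gradient from this backward pass.
layer_b = nn.Linear(2, 1, bias=False)
y_copy = layer_b(x).detach().requires_grad_(True)
F.mse_loss(y_copy, target).backward()
print("copy grad:", y_copy.grad)
print("weight grad behind the copy:", layer_b.weight.grad)
```

In this sketch the copy in case 3 does report a gradient, but only because it is a fresh leaf that is disconnected from the weights, which may not be what is wanted during training.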