Found the reason for CUDA out of memory, but not a solution

Hello everyone,
I'm currently working with an LSTM to predict the remaining useful life of mechanical parts. I have a problem when I try to use the trained network to predict multiple timesteps ahead. I noticed that every time I use the predicted output as the next input, I get this error message after a few timesteps (mostly between 50 and 200 timesteps):

CUDA out of memory. Tried to allocate 2.50 MiB (GPU 0; 8.00 GiB total capacity; 6.28 GiB already allocated; 1.55 MiB free; 1.55 MiB cached)

This is the code:
inp, label = Validset[1]
print(inp.shape)
y = []
for i in range(100):
    print(i)
    torch.no_grad()
    inp = torch.tensor(inp)
    inp = inp.view(1, -1, 1)
    inp = inp.to(device)
    inp = inp.float()
    out, hn_cn = model.forward(inp)
    del inp
    out = out.reshape(-1)
    y.append(out[-1])                # keep the last timestep's prediction
    inp = out.clone().detach()       # feed the prediction back in as the next input
    del out
    torch.cuda.empty_cache()

I tried a GTX 1070 and a GTX 1080 SLI setup for this task, but both fail to predict enough timesteps.

I also noticed this problem while training an LSTM: there I tried to use an output tensor (1x1) as an input as well, and I got the same error message. Is there a way to free the memory after predicting one timestep? Or does anyone have a different solution?
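In case the cause is really autograd keeping each iteration's graph alive (I just realized that the torch.no_grad() line above only creates the context manager and never enters it), this is roughly the loop I would try, but I'm not sure it's correct. model, device and Validset are the same objects as in the code above:

# Sketch only, assuming the leak comes from autograd graphs being kept alive
# by the stored outputs. Wrapping the whole loop in "with torch.no_grad():"
# should prevent any graph from being built in the first place.
inp, label = Validset[1]
y = []
with torch.no_grad():                        # disable graph building for the whole loop
    inp = torch.as_tensor(inp).float().to(device)
    for i in range(100):
        out, hn_cn = model(inp.view(1, -1, 1))
        out = out.reshape(-1)
        y.append(out[-1].item())             # store a plain Python float instead of a CUDA tensor
        inp = out                            # feed the prediction back in; no detach needed under no_grad

Storing out[-1].item() instead of the tensor itself should also keep the predictions from holding on to GPU memory, if I understand it correctly.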

Thank you in advance