How can I release allocated memory inside a non-default CUDA stream?

In the following code, I notice that CUDA memory usage increases each time I create a new stream. Even after calling gc.collect() and torch.cuda.empty_cache(), both the allocated and reserved memory on the GPU keep growing with every iteration.

Is there a way to release the memory allocated by each non-default stream?

Here’s the code for reference:

import gc
import torch
import torch.nn as nn

def run(model, stream):
    model = model.to('cuda')
    input = torch.rand(10, 100).to('cuda')
    target = torch.rand(10, 100).to('cuda')
    model.train(True)
    # run the forward and backward pass on the given non-default stream
    with torch.cuda.stream(stream):
        output = model(input)
        loss = torch.nn.functional.mse_loss(output, target)
        loss.backward()
    stream.synchronize()
    print('allocated', torch.cuda.memory_allocated())
    print('reserved', torch.cuda.memory_reserved())

model = nn.Sequential(nn.Linear(100, 100), nn.Linear(100, 100))
for i in range(4):
    print(i)
    # create a fresh non-default stream on each iteration
    stream = torch.cuda.Stream()
    run(model, stream)
    model.train(False)
    model.to('cpu')
    # attempt to release GPU memory before the next iteration
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

And here is the memory allocation printout:

0
allocated 17219584
reserved 25165824
1
allocated 34258944
reserved 46137344
2
allocated 51298304
reserved 67108864
3
allocated 68337664
reserved 88080384
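
For what it's worth, the growth per iteration looks constant rather than compounding. Taking the differences between consecutive allocated readings above:

allocated = [17219584, 34258944, 51298304, 68337664]
print([b - a for a, b in zip(allocated, allocated[1:])])
# [17039360, 17039360, 17039360] -> exactly 16.25 MiB more allocated per new stream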

Why does the allocated memory keep increasing with each new stream, and how can I properly release it?
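
Do I need some more explicit per-stream cleanup, for example dropping the gradients and the stream object before emptying the cache? This is only a guess at what might be missing; I don't know whether any of these calls actually target per-stream allocations:

    # hypothetical extra cleanup per iteration -- unsure if any of this is the right approach
    model.zero_grad(set_to_none=True)   # drop the .grad tensors left behind by loss.backward()
    del stream                          # drop the Python reference to the stream
    torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()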