CUDA memory leak

Hi, I am trying to train several models in parallel using torch.multiprocessing's Pool.map(). Since my setup has multiple GPUs, I also pass a device to each training task so that the model is trained on that particular device.

The problem I face is a RuntimeError: CUDA error: out of memory after a while.

This happens after several models have been trained, and I can clearly see with watch nvidia-smi that the GPU memory accumulates over time.
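
For what it's worth, here is the kind of per-process check I am assuming I could drop into run() to watch the same growth from inside a worker (log_gpu_memory is just a helper name I made up; memory_allocated and memory_reserved only report the caching allocator of the current process, so this complements nvidia-smi rather than replacing it):

import torch


def log_gpu_memory(device: str):
    # rough per-process view of what the caching allocator currently holds
    dev = torch.device(device)
    if dev.type == "cuda":
        allocated = torch.cuda.memory_allocated(dev) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(dev) / 1024 ** 2
        print(f"{device}: {allocated:.1f} MiB allocated, {reserved:.1f} MiB reserved")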

I have posted a minimal example below which reproduces the issue.

Is there something that needs to be done at the end of run() to clean up the memory? I have sketched the kind of cleanup I have in mind after the example below.

PyTorch version: torch 1.6.0

import torch
from torch.multiprocessing import Pool
import random


class SimpleModule(torch.nn.Module):
    def __init__(self, input_: int, output: int):
        super().__init__()
        self.linear = torch.nn.Linear(input_, output)

    def forward(self, input_: torch.Tensor):
        return self.linear(input_)


def run(device: str):
    # build a small model on the given device and train it on random data
    model = SimpleModule(10, 10).to(device)
    optimizer = torch.optim.SGD(lr=1e+2,
                                params=model.parameters())
    loss_criterion = torch.nn.MSELoss()
    for i in range(int(1e+4)):
        optimizer.zero_grad()
        # random inputs and targets, created directly on the task's device
        inputs = torch.rand(5, 10).to(device)
        outputs = torch.rand(5, 10).to(device)
        preds = model(inputs)
        loss = loss_criterion(preds, outputs)
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    torch.multiprocessing.set_start_method("spawn", force=True)
    n_tasks = 500
    # spread the tasks randomly across all visible GPUs (or fall back to the CPU)
    if torch.cuda.is_available():
        available_device = [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    else:
        available_device = ["cpu"]
    devices = random.choices(available_device, k=n_tasks)
    pool = Pool(processes=50)
    pool.map(run, devices)
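
To make the question above concrete, this is the kind of cleanup I have in mind at the end of each task, written here as a wrapper around run() just for the sketch (run_with_cleanup is a name I made up, and I am not sure that gc.collect() plus torch.cuda.empty_cache() is actually the right fix):

import gc


def run_with_cleanup(device: str):
    run(device)
    # once run() returns, its model/optimizer/tensors should be unreferenced;
    # collect them and ask the caching allocator to hand its blocks back
    gc.collect()
    torch.cuda.empty_cache()

The idea would then be to call pool.map(run_with_cleanup, devices) instead of pool.map(run, devices), but I do not know whether this is the recommended way to release the memory between tasks.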