CPU RAM Usage Kills Training

Hi there,

I’m running into steadily climbing CPU RAM usage that causes the kernel to kill my training after about 50 epochs. GPU memory usage is not increasing between epochs, so there’s no leak on the GPU side. Are there any optimizations I can make to the following code?

for epoch in range(epochs):
    net.train()
    with tqdm(total=n_train, desc=f'Epoch {epoch + 1}/{epochs}', unit='img') as pbar:
        for batch in train_loader:
            imgs = batch['image']
            true_imgs = batch['true_imgs']

            imgs = imgs.to(device=device, dtype=torch.float32)
            true_imgs = true_imgs.to(device=device, dtype=torch.float32)

            optimizer.zero_grad()
            imgs_pred = net(imgs)
            loss = loss_fn(true_imgs, imgs_pred)
            loss.backward()
            optimizer.step()

            loss = loss.detach().cpu().numpy()
            writer.add_scalar('loss/train', loss, global_step)
            pbar.set_postfix(**{'loss (batch)': loss})
            pbar.update(imgs.shape[0])

    net.eval()
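
For what it’s worth, here is a minimal sketch of the same per-batch step with the loss logged via loss.item() instead of .detach().cpu().numpy(), so only a plain Python float is kept for each batch. I don’t know whether the numpy conversion is actually related to the growth, and the global_step counter is assumed here (the snippet above doesn’t show where it is set or incremented):

# Illustrative variant of the per-batch step only; net, optimizer, loss_fn,
# writer, device, and train_loader are assumed to be the same objects as above.
global_step = 0  # assumed counter; the original snippet does not show it
for batch in train_loader:
    imgs = batch['image'].to(device=device, dtype=torch.float32)
    true_imgs = batch['true_imgs'].to(device=device, dtype=torch.float32)

    optimizer.zero_grad()
    imgs_pred = net(imgs)
    loss = loss_fn(true_imgs, imgs_pred)
    loss.backward()
    optimizer.step()

    # .item() returns a plain Python float, so nothing from the autograd
    # graph and no numpy array is kept alive after this line.
    loss_value = loss.item()
    writer.add_scalar('loss/train', loss_value, global_step)
    global_step += 1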

Did you find a solution to this? I’m also facing the same issue.