CPU RAM Usage Kills Training

Hi there,

I’m running into steadily climbing CPU RAM usage that causes the kernel to kill my training after about 50 epochs. GPU memory usage is not increasing between epochs, so there’s no leak on the GPU side. Are there any optimizations I can make to the following code?

for epoch in range(epochs):
    net.train()
    with tqdm(total=n_train, desc=f'Epoch {epoch + 1}/{epochs}', unit='img') as pbar:
        for batch in train_loader:
            imgs = batch['image']
            true_imgs = batch['true_imgs']

            imgs = imgs.to(device=device, dtype=torch.float32)
            true_imgs = true_imgs.to(device=device, dtype=torch.float32)

            optimizer.zero_grad()
            imgs_pred = net(imgs)
            loss = loss_fn(true_imgs, imgs_pred)
            loss.backward()
            optimizer.step()

            loss = loss.detach().cpu().numpy()
            writer.add_scalar('loss/train', loss, global_step)
            pbar.set_postfix(**{'loss (batch)': loss})
            pbar.update(imgs.shape[0])

    net.eval()
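
For what it’s worth, here is a minimal sketch of the same per-batch step with the loss logged via loss.item() instead of .detach().cpu().numpy(), so only a plain Python float is kept for each batch. I don’t know whether the numpy conversion is actually related to the growth, and the global_step counter is assumed here (the snippet above doesn’t show where it is set or incremented):

# Illustrative variant of the per-batch step only; net, optimizer, loss_fn,
# writer, device, and train_loader are assumed to be the same objects as above.
global_step = 0  # assumed counter; the original snippet does not show it
for batch in train_loader:
    imgs = batch['image'].to(device=device, dtype=torch.float32)
    true_imgs = batch['true_imgs'].to(device=device, dtype=torch.float32)

    optimizer.zero_grad()
    imgs_pred = net(imgs)
    loss = loss_fn(true_imgs, imgs_pred)
    loss.backward()
    optimizer.step()

    # .item() returns a plain Python float, so nothing from the autograd
    # graph and no numpy array is kept alive after this line.
    loss_value = loss.item()
    writer.add_scalar('loss/train', loss_value, global_step)
    global_step += 1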

Did you find a solution to this? I’m also facing the same issue.