Training "never finishes" or system crashes using PyTorch - GPU has memory allocated but always has 0% utilization using DataLoader

My neural network training either “never finishes” or crashes the system (host memory reaches its limit, or a “DataLoader worker killed” error occurs) when using PyTorch. The GPU has memory allocated but always shows 0% utilization while the DataLoader is in use. I’ve tested several batch sizes and, in the DataLoader, different numbers of workers, shuffle set to true or false, and pin_memory set to true or false. In the tests I’ve done, I can’t use more than 1 worker, regardless of whether I increase or decrease the batch size. I’m in the dark and would be very grateful for any help, thanks.

This sounds as if you are running out of host memory, so check e.g. via htop if your Python process is indeed increasing its memory usage until it’s killed by the OS.
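
As a complement to watching htop, the resident memory of the process can also be logged from inside Python. This is only a minimal sketch, assuming a Linux host where /proc is available; the helper name rss_mib is hypothetical and not part of PyTorch.

```python
import os


def rss_mib(pid=None):
    """Return the resident set size of a process in MiB, or None if unavailable.

    Reads the VmRSS field from /proc/<pid>/status (Linux only).
    """
    pid = os.getpid() if pid is None else pid
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # The line looks like "VmRSS:    123456 kB".
                    return int(line.split()[1]) / 1024
    except OSError:
        pass
    return None


print(f"host memory in use by this process: {rss_mib()} MiB")
```

Calling it every few iterations shows whether the usage keeps growing. DataLoader workers are separate processes, so their PIDs would have to be checked as well.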

First, thank you very much for your attention! Yes, it is: I tracked it through htop and confirmed that memory usage hits the limit. But the machine has a lot of memory (384 GB), and the same code runs smoothly at normal speed in another environment that has much less memory.

Thanks for confirming. It’s weird that the same code apparently does not suffer from this issue in another environment. In any case, try to narrow down which part of the code increases the memory usage by running separate parts standalone. E.g. remove the actual model training and iterate the dataloader only. Afterwards do the opposite: remove the data loading pipeline, use static input tensors, and train the model only. Let me know if this helps in isolating the issue.
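
The "iterate the dataloader only" step could look roughly like the sketch below. It is not tied to PyTorch: time_batches is a hypothetical helper that accepts any iterable, and a plain list stands in for the real DataLoader here.

```python
import time


def time_batches(loader, max_batches=20):
    """Iterate `loader` without any training and return seconds spent per batch."""
    durations = []
    start = time.perf_counter()
    for index, _batch in enumerate(loader):
        now = time.perf_counter()
        durations.append(now - start)
        # flush=True so the output appears even if a later batch hangs.
        print(f"batch {index}: {durations[-1]:.3f} s", flush=True)
        start = now
        if index + 1 >= max_batches:
            break
    return durations


# Stand-in iterable; in the real script this would be the DataLoader.
timings = time_batches([("X", "y")] * 3)
```

With a real DataLoader the first duration also includes worker start-up, so a slow first batch followed by fast ones points at start-up cost rather than at per-sample loading.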

I was able to solve part of the problem by following the instructions on this page - https://nccadmin.webspace.durham.ac.uk/pytorch/#excessive-pytorch-dataloader-memory-usage - and then the instructions in this comment - https://github.com/pytorch/pytorch/issues/11929#issuecomment-649760983.
Now there is no DataLoader worker error and no memory overflow or system crash. Each DataLoader worker uses a thread to carry out its loads, but the GPU is still at 0% utilization despite having some memory allocated on it. Even when I vary the batch size, training for just 1 epoch does not complete in the time it takes in the other environment. In fact, I still haven’t been able to complete a single epoch in any test so far (I wait at most 40 minutes per test, whereas in the other environment training completes much faster). Perhaps there is a bottleneck related to the GPU. Can you help me, please?

Could this mean that your run is hanging somewhere or do you see any progress via debug print statements?

I set up training for 1 epoch. And yes, it’s probably hanging somewhere, but no error occurs now (earlier there were errors). Is there any way I can test the GPU’s utilization? Apparently it can receive data, since memory is allocated on it - the greater the number of workers I configure in the DataLoader, the greater the memory size on the GPU. I’ll start adding print statements; I haven’t done that yet.

Usually you would not try to load the data directly to the GPU in your Dataset or DataLoader but would move each batch to the GPU inside your training loop.
Could you check if the potential hang disappears if you load the data to the CPU first and move it to the GPU inside the training loop?
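
A minimal sketch of that pattern, with a toy TensorDataset and a linear model standing in for the real Dataset and network (both are assumptions for illustration). The Dataset only ever holds CPU tensors, and each batch is moved to the device inside the loop; the code falls back to the CPU when no GPU is present.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data created on the CPU; the Dataset never touches the GPU.
dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0,
                    pin_memory=torch.cuda.is_available())

model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for X, y in loader:
    # Each batch arrives as CPU tensors and is moved here, inside the loop.
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(X), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
```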

Coincidentally, I was just about to mention a test I forgot to report, using only the CPU (before your last comment). A while ago I adapted the code (removing the CUDA-related parts) so that I could test it on the CPU only, and it was possible to complete 1 epoch of training. So I think the problem is related to the GPU. However, I did that test before adding the new PyTorch code that uses multiprocessing via forkserver (a change that also required a small adjustment when reading the HDF file, as described in the second link I sent). What do you think of this multiprocessing tweak? I have gone back to adapting the code to use only the CPU, but this time with the multiprocessing change included, and I’m training this way right now. Do you think I should remove the multiprocessing change and repeat the CPU-only test?

I’m going to show you all my training code now, it’s not the whole program but I think it can help you understand better. Where in this code should I make that adjustment you mentioned (“Could you check if the potential hang disappears if you load the data to the CPU first and move it to the GPU inside the training loop?”)?

import datetime
import inspect
import os
import pdb
import sys

import hydra

#import aim
#from aim import Run
from torch.utils.tensorboard import SummaryWriter
from mlflow import log_metric, log_param, log_artifacts
from mlflow.tracking import MlflowClient
from torchinfo import summary

import torchmetrics

from copy import deepcopy

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.optim.lr_scheduler import CyclicLR, ReduceLROnPlateau
from tqdm import tqdm

import torch_optimizer as optim

import uuid

current_dir = os.path.dirname(
    os.path.abspath(inspect.getfile(inspect.currentframe())))
parent_dir = os.path.dirname(current_dir)
sys.path.insert(0, parent_dir)

from data_loader import SequenceLoader

torch.backends.cudnn.benchmark = True

@hydra.main(config_path="conf", config_name="config")
def main(cfg):
    
    if cfg.execution.experiment_id:
        experiment_id = cfg.execution.experiment_id
    else:
        experiment_id = str(uuid.uuid4())
        cfg.execution.experiment_id = experiment_id



    experiment_name = cfg.register.experiment_name
    
    # writer will output to ./runs/ directory by default
    writer = SummaryWriter(f"./runs/tb/{experiment_name}")
    
#    run = Run(experiment=experiment_name,
#                  repo=f"./runs/aim/{experiment_name}",
#                  run_hash=experiment_id)
#    
#    run["hparams"] = cfg.hparams
    
    ds_train = SequenceLoader(name='train', **cfg.dataloader)
    
    batch_size = cfg.hparams.batch_size
    seq_train = torch.utils.data.DataLoader(
        ds_train,
        batch_size=batch_size, 
        shuffle=True,
        num_workers=4,
        pin_memory=True)
    
    ds_train.load(0)
    
    ds_validation = SequenceLoader(name='validation', **cfg.dataloader)
    
    seq_validation = torch.utils.data.DataLoader(
        ds_validation,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True)
    
    ds_test = SequenceLoader(name='test', **cfg.dataloader)
    
    seq_test = torch.utils.data.DataLoader(
        ds_test,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True)
    
    
    from networks import Resnet, Wavenet, TCN
    
    model = Wavenet(n_inputs=1,
                n_outputs=1,
                n_features=cfg.hparams.n_features,
                x_len=cfg.dataloader.window_train,
                y_len=cfg.dataloader.window_predict,
                **cfg.wavenet)
    
    summary(model, input_size=(batch_size, 1, 144, 7, 7))
    model.cuda()
    model.train()
    
    optimizer = optim.AdaBound(
        model.parameters(),
        lr=cfg.hparams.lr,
        betas=tuple(cfg.hparams.betas),
        final_lr=cfg.hparams.lr2,
        gamma=1e-2,
        eps= 1e-8,
        weight_decay=cfg.hparams.weight_decay,
        amsbound=False,
    )
    
    best_loss_l1 = np.inf  # np.infty is an alias that was removed in NumPy 2.0
    best_model = None
    
    t = tqdm(range(cfg.hparams.num_epochs), ncols=120)
    log = []
    for idx_epoch, epoch in enumerate(t):
        print()
        list_train_loss = []
        list_train_loss_l1 = []
        list_train_loss_l1_rel = []
        list_train_l1_max = []
        list_train_loss_fft_l1 = []
        list_train_loss_fft_l1_rel = []
        list_train_fft_l1_max = []
        
        list_validation_loss = []
        list_validation_loss_l1 = []
        list_validation_loss_l1_rel = []
        list_validation_l1_max = []
        list_validation_loss_fft_l1 = []
        list_validation_loss_fft_l1_rel = []
        list_validation_fft_l1_max = []
        
        list_test_loss = []
        list_test_loss_l1 = []
        list_test_loss_l1_rel = []
        list_test_l1_max = []
        list_test_loss_fft_l1 = []
        list_test_loss_fft_l1_rel = []
        list_test_fft_l1_max = []
    
        l1_distribution_validation = []
        l1_distribution_test = []
    
        model.train()
        for seq in seq_train:
            X, y = seq
            X = X.cuda()
            y = y.cuda()
    
            optimizer.zero_grad()
            output = model(X)
    
            ffty = torch.fft.rfftn(y, dim=(-1, -2, -3))
            fftout = torch.fft.rfftn(output, dim=(-1, -2, -3))
    
            fft_res = (fftout - ffty)
            fft_res_abs = torch.abs(fft_res)
            loss_fft_l1 = torch.mean(fft_res_abs)
            loss_fft_l1_rel = torch.mean(fft_res_abs/(torch.abs(ffty) + 0.01))
            fft_l1_max = torch.max(fft_res_abs)
    
            res = (output - y)
            res_abs = torch.abs(res)
            loss_l1 = torch.mean(res_abs)
            loss_l1_rel = torch.mean(res_abs/(torch.abs(y)+0.01))
            l1_max = torch.max(res_abs)
    
            loss = loss_l1_rel + loss_fft_l1_rel
            loss.backward()
            optimizer.step()
        
            list_train_loss.append(loss.cpu().item())
            list_train_loss_l1.append(loss_l1.cpu().item())
            list_train_loss_l1_rel.append(loss_l1_rel.cpu().item())
            list_train_l1_max.append(l1_max.cpu().item())
            list_train_loss_fft_l1.append(loss_fft_l1.cpu().item())
            list_train_loss_fft_l1_rel.append(loss_fft_l1_rel.cpu().item())
            list_train_fft_l1_max.append(fft_l1_max.cpu().item())

        g_train_loss = np.mean(list_train_loss)       
        g_train_loss_l1 = np.mean(list_train_loss_l1)
        g_train_loss_l1_rel = np.mean(list_train_loss_l1_rel)
        g_train_l1_max = np.max(list_train_l1_max)
        g_train_loss_fft_l1 = np.mean(list_train_loss_fft_l1)
        g_train_loss_fft_l1_rel = np.mean(list_train_loss_fft_l1_rel)
        g_train_fft_l1_max = np.max(list_train_fft_l1_max)
    
        writer.add_scalar('loss/train', g_train_loss, idx_epoch)
        writer.add_scalar('loss_l1/train', g_train_loss_l1, idx_epoch)
        writer.add_scalar('loss_l1_rel/train', g_train_loss_l1_rel, idx_epoch)
        writer.add_scalar('l1_max/train', g_train_l1_max, idx_epoch)
        writer.add_scalar('loss_fft_l1/train', g_train_loss_fft_l1, idx_epoch)
        writer.add_scalar('loss_fft_l1_rel/train', g_train_loss_fft_l1_rel, idx_epoch)
        writer.add_scalar('fft_l1_max/train', g_train_fft_l1_max, idx_epoch)
    
        #run.track(g_train_loss, name='loss', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_loss_l1, name='loss_l1', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_loss_l1_rel, name='loss_l1_rel', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_l1_max, name='l1_max', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_loss_fft_l1, name='loss_fft_l1', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_loss_fft_l1_rel, name='loss_fft_l1_rel', step=idx_epoch, context={"subset":"train"})
        #run.track(g_train_fft_l1_max, name='fft_l1_max', step=idx_epoch, context={"subset":"train"})
        
        dict_metrics_train = {"loss/train": g_train_loss,
            "loss_l1/train": g_train_loss_l1,
            "loss_l1_rel/train": g_train_loss_l1_rel,
            "l1_max/train": g_train_l1_max,
            "loss_fft_l1/train": g_train_loss_fft_l1,
            "loss_fft_l1_rel/train": g_train_loss_fft_l1_rel,
            "fft_l1_max/train": g_train_fft_l1_max
        }
    
        with torch.no_grad():
            model.eval()
            for seq in seq_validation:
                X, y = seq
                X = X.cuda()
                y = y.cuda()
                output = model(X)
    
                #ffty = torch.fft.rfft(y, 1)
                #fftout = torch.fft.rfft(output, 1)
                ffty = torch.fft.rfftn(y, dim=(-1, -2, -3))
                fftout = torch.fft.rfftn(output, dim=(-1, -2, -3))
    
                fft_res = (fftout - ffty)
                fft_res_abs = torch.abs(fft_res)
                loss_fft_l1 = torch.mean(fft_res_abs)
                loss_fft_l1_rel = torch.mean(fft_res_abs/(torch.abs(ffty) + 0.01))
                fft_l1_max = torch.max(fft_res_abs)
    
                res = (output - y)
                res_abs = torch.abs(res)
                loss_l1 = torch.mean(res_abs)
                loss_l1_rel = torch.mean(res_abs/(torch.abs(y)+0.01))
                l1_max = torch.max(res_abs)
    
                loss = loss_l1_rel + loss_fft_l1_rel
                
                l1_distribution_validation.extend(res_abs.cpu().numpy())
        
                list_validation_loss.append(loss.cpu().item())
                list_validation_loss_l1.append(loss_l1.cpu().item())
                list_validation_loss_l1_rel.append(loss_l1_rel.cpu().item())
                list_validation_l1_max.append(l1_max.cpu().item())
                list_validation_loss_fft_l1.append(loss_fft_l1.cpu().item())
                list_validation_loss_fft_l1_rel.append(loss_fft_l1_rel.cpu().item())
                list_validation_fft_l1_max.append(fft_l1_max.cpu().item())
    
        g_validation_loss = np.mean(list_validation_loss)       
        g_validation_loss_l1 = np.mean(list_validation_loss_l1)
        g_validation_loss_l1_rel = np.mean(list_validation_loss_l1_rel)
        g_validation_l1_max = np.max(list_validation_l1_max)
        g_validation_loss_fft_l1 = np.mean(list_validation_loss_fft_l1)
        g_validation_loss_fft_l1_rel = np.mean(list_validation_loss_fft_l1_rel)
        g_validation_fft_l1_max = np.max(list_validation_fft_l1_max)
        
        writer.add_scalar('loss/validation', g_validation_loss, idx_epoch)
        writer.add_scalar('loss_l1/validation', g_validation_loss_l1, idx_epoch)
        writer.add_scalar('loss_l1_rel/validation', g_validation_loss_l1_rel, idx_epoch)
        writer.add_scalar('l1_max/validation', g_validation_l1_max, idx_epoch)
        writer.add_scalar('loss_fft_l1/validation', g_validation_loss_fft_l1, idx_epoch)
        writer.add_scalar('loss_fft_l1_rel/validation', g_validation_loss_fft_l1_rel, idx_epoch)
        writer.add_scalar('fft_l1_max/validation', g_validation_fft_l1_max, idx_epoch)
    
        #run.track(g_validation_loss, name='loss', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_loss_l1, name='loss_l1', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_loss_l1_rel, name='loss_l1_rel', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_l1_max, name='l1_max', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_loss_fft_l1, name='loss_fft_l1', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_loss_fft_l1_rel, name='loss_fft_l1_rel', step=idx_epoch, context={"subset":"validation"})
        #run.track(g_validation_fft_l1_max, name='fft_l1_max', step=idx_epoch, context={"subset":"validation"})
        #
        #d = aim.Distribution(distribution=l1_distribution_validation, bin_count=100)
        #run.track(d, name='dist', step=idx_epoch, context={"subset":"validation"})
        
        dict_metrics_validation = {"loss/validation": g_validation_loss,
            "loss_l1/validation": g_validation_loss_l1,
            "loss_l1_rel/validation": g_validation_loss_l1_rel,
            "l1_max/validation": g_validation_l1_max,
            "loss_fft_l1/validation": g_validation_loss_fft_l1,
            "loss_fft_l1_rel/validation": g_validation_loss_fft_l1_rel,
            "fft_l1_max/validation": g_validation_fft_l1_max
        }
        
        # save model
        if cfg.execution.save:
            if best_loss_l1 > g_validation_loss_l1:
                best_loss_l1 = g_validation_loss_l1
                best_model = deepcopy(model)
    
        with torch.no_grad():
            model.eval()
            for seq in seq_test:
                X, y = seq
                X = X.cuda()
                y = y.cuda()
                output = model(X)
                
                #ffty = torch.fft.rfft(y, 1)
                #fftout = torch.fft.rfft(output, 1)
                ffty = torch.fft.rfftn(y, dim=(-1, -2, -3))
                fftout = torch.fft.rfftn(output, dim=(-1, -2, -3))
    
                fft_res = (fftout - ffty)
                fft_res_abs = torch.abs(fft_res)
                loss_fft_l1 = torch.mean(fft_res_abs)
                loss_fft_l1_rel = torch.mean(fft_res_abs/(torch.abs(ffty) + 0.01))
                fft_l1_max = torch.max(fft_res_abs)
    
                res = (output - y)
                res_abs = torch.abs(res)
                loss_l1 = torch.mean(res_abs)
                loss_l1_rel = torch.mean(res_abs/(torch.abs(y)+0.01))
                l1_max = torch.max(res_abs)
    
                loss = loss_l1_rel + loss_fft_l1_rel
    
                l1_distribution_test.extend(res_abs.cpu().numpy())
        
                list_test_loss.append(loss.cpu().item())
                list_test_loss_l1.append(loss_l1.cpu().item())
                list_test_loss_l1_rel.append(loss_l1_rel.cpu().item())
                list_test_l1_max.append(l1_max.cpu().item())
                list_test_loss_fft_l1.append(loss_fft_l1.cpu().item())
                list_test_loss_fft_l1_rel.append(loss_fft_l1_rel.cpu().item())
                list_test_fft_l1_max.append(fft_l1_max.cpu().item())
    
        g_test_loss = np.mean(list_test_loss)       
        g_test_loss_l1 = np.mean(list_test_loss_l1)
        g_test_loss_l1_rel = np.mean(list_test_loss_l1_rel)
        g_test_l1_max = np.max(list_test_l1_max)
        g_test_loss_fft_l1 = np.mean(list_test_loss_fft_l1)
        g_test_loss_fft_l1_rel = np.mean(list_test_loss_fft_l1_rel)
        g_test_fft_l1_max = np.max(list_test_fft_l1_max)
        
        writer.add_scalar('loss/test', g_test_loss, idx_epoch)
        writer.add_scalar('loss_l1/test', g_test_loss_l1, idx_epoch)
        writer.add_scalar('loss_l1_rel/test', g_test_loss_l1_rel, idx_epoch)
        writer.add_scalar('l1_max/test', g_test_l1_max, idx_epoch)
        writer.add_scalar('loss_fft_l1/test', g_test_loss_fft_l1, idx_epoch)
        writer.add_scalar('loss_fft_l1_rel/test', g_test_loss_fft_l1_rel, idx_epoch)
        writer.add_scalar('fft_l1_max/test', g_test_fft_l1_max, idx_epoch)
    
        #run.track(g_test_loss, name='loss', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_loss_l1, name='loss_l1', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_loss_l1_rel, name='loss_l1_rel', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_l1_max, name='l1_max', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_loss_fft_l1, name='loss_fft_l1', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_loss_fft_l1_rel, name='loss_fft_l1_rel', step=idx_epoch, context={"subset":"test"})
        #run.track(g_test_fft_l1_max, name='fft_l1_max', step=idx_epoch, context={"subset":"test"})
    
        #d = aim.Distribution(distribution=l1_distribution_test, bin_count=100)
        #run.track(d, name='dist', step=idx_epoch, context={"subset":"test"})
        
        dict_metrics_test = {"loss/test": g_test_loss,
            "loss_l1/test": g_test_loss_l1,
            "loss_l1_rel/test": g_test_loss_l1_rel,
            "l1_max/test": g_test_l1_max,
            "loss_fft_l1/test": g_test_loss_fft_l1,
            "loss_fft_l1_rel/test": g_test_loss_fft_l1_rel,
            "fft_l1_max/test": g_test_fft_l1_max
        }
    
        log.append({**dict_metrics_train, **dict_metrics_validation, **dict_metrics_test})
    
        t.set_postfix(LTrain=g_train_loss,
                      LTest=g_test_loss,
                      LTValidation=g_validation_loss)
    
    ###################################################################################################################################################
    
    os.makedirs('logs/', exist_ok=True)
    
    ###################################################################################################################################################
   
    if cfg.execution.save:
        os.makedirs('models/', exist_ok=True)
        
        torch.save(best_model.state_dict(), f"models/model_{experiment_id}")

    ###################################################################################################################################################
    
    file_log = open(f"logs/file_log_{experiment_id}", "w")
    
    ###################################################################################################################################################
    
    df_log = pd.DataFrame(log)
    df_log.to_pickle(f'logs/log_train_{experiment_id}')
    
    ###################################################################################################################################################
    
    id_min = df_log["loss_l1/validation"].argmin()
    best_result = df_log.iloc[id_min]
    columns = df_log.columns
    
    ###################################################################################################################################################
    
    file_log.write("===================================================\n")
    print("===================================================")
    
    for key in cfg:
        print(f"{key}: {cfg[key]}")
        file_log.write(f"{key}: {cfg[key]}\n")  # newline so entries do not run together
    
    file_log.write("===================================================\n")
    print("===================================================")
    
    ###################################################################################################################################################
    
    file_log.write("===================================================\n")
    print("===================================================")
    
    for column, result in zip(columns, best_result):
        print(f"{column}\t{result}")
        file_log.write(f"{column}\t{result}\n")  # newline so entries do not run together
    
    file_log.write("===================================================\n")
    print("===================================================")
    
    ###################################################################################################################################################
    
    file_log.close()

if __name__ == "__main__":
    torch.multiprocessing.set_start_method('forkserver')
    main()   

Well, I just interrupted the CPU-only test I mentioned; a long time passed with no progress. I’ll test again after removing the multiprocessing part.

Yes, I would also recommend removing the multiprocessing part for now to try creating a minimal working solution. Once your CPU workload runs again, let me know if the GPU training also succeeds.

Ok, thank you. The last test I did using only the CPU didn’t work (I waited 40 minutes). I took a look at this page - PyTorch - CC Doc - and it seems that it is necessary to set the number of threads in PyTorch via torch.set_num_threads(). Is that right? I don’t know if you are familiar with Slurm, but that is the environment I am using. The Slurm script has an important option, “cpus-per-task=x” (which defines the number of threads), and everything indicates that it must be set to the same value as the number of DataLoader workers (the same page - PyTorch - CC Doc - also says so). Are you aware of this, and could you confirm that this is the case? There are two other important options in the Slurm script: “--tasks-per-node=x”, which seems to represent the number of processes each task has (lately I always leave it at 1), and “--tasks=x”, which represents the number of tasks on the nodes (devices); since I am using just 1 node (device), I always leave it at 1. The problem may be in the configuration of the Slurm script used to submit the job that runs my PyTorch program. I’m running another CPU-only test after calling the PyTorch function I mentioned to set the number of threads, making it equal to the number of workers (I set it to 8). So, if you happen to know Slurm and can confirm this information, I’d appreciate it, but if you don’t, no problem. Thank you very much for the help you continue to give me.

Maybe this situation is also related to my case, considering that PyTorch calls low-level modules that use OpenMP internally (other libraries such as NumPy may be in the same situation): Use of OMP_NUM_THREADS=1 for Python Multiprocessing - Stack Overflow.
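
One way to keep the thread count and the worker count within what Slurm granted is to read the allocation from the environment. The sketch below assumes SLURM_CPUS_PER_TASK is set (Slurm sets it only when --cpus-per-task is given) and otherwise falls back to the cores visible to the process. The helper names and the "one core per worker" split are assumptions for illustration, not something this thread confirms.

```python
import os


def cpu_budget():
    """Number of CPU cores this job may use.

    Prefers SLURM_CPUS_PER_TASK (set by Slurm when --cpus-per-task is given)
    and falls back to the cores visible to the process.
    """
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    if value:
        return int(value)
    if hasattr(os, "sched_getaffinity"):  # Linux: respects CPU affinity
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def split_budget(total, num_workers):
    """Split cores between DataLoader workers and main-process threads.

    Assumption of this sketch: one core per worker, and the remaining cores
    (at least one) for intra-op threads in the main process.
    """
    num_workers = max(0, min(num_workers, total - 1))
    return num_workers, max(1, total - num_workers)


workers, threads = split_budget(cpu_budget(), num_workers=4)
print(f"num_workers={workers}, main-process threads={threads}")
```

The two values could then be passed to DataLoader(num_workers=workers) and torch.set_num_threads(threads), so that the job does not start more busy threads than the cores it was allocated.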

@ptrblck I did several debugging runs with prints, and it’s hanging right at the “for seq in seq_train:” line. It doesn’t even reach the next line (I put a print right after “X, y = seq” and that print is never reached). What can I do now to test? I believe we can now rule out problems with data loading. Thanks in advance.
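
When prints only show that a line is never passed, Python's standard faulthandler module can dump the stack of every thread once a timeout expires, which shows where the main process is actually waiting. A minimal sketch; the context manager name hang_watchdog is hypothetical.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the tracebacks of all threads every `timeout_s` seconds
    for as long as the wrapped block is still running."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Stop the watchdog once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Usage in the training script (seq_train is the DataLoader from the post):
# with hang_watchdog(60):
#     for seq in seq_train:
#         ...
```

The dump covers only the process that installed the watchdog; DataLoader worker processes would need their own, for example via a worker_init_fn.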

I would recommend removing the custom multiprocessing usage and trying to get the code running with num_workers=0 in the DataLoader. If this is working, use num_workers=1 and see if this is causing the hang. If so, check if you are trying to push the CPU tensors to the GPU inside the Dataset or DataLoader.

@ptrblck I’m not using that multiprocessing part anymore; you already asked me about that. Ok, I will do these tests, but I didn’t understand how to do the last part: “If so, check if you are trying to push the CPU tensors to the GPU inside the Dataset or DataLoader”. Thanks!!

In one of your previous replies you’ve mentioned:

“the greater the number of workers I configure in the DataLoader, the greater the memory size on the GPU”

which indicates that each worker of the DataLoader consumes GPU memory, which is not the usual use case, since the DataLoader calling into the Dataset.__getitem__ would work with CPU tensors.
Are you still seeing an increase in GPU memory usage while increasing the number of workers?
If so, check the .device attribute of all used tensors inside your Dataset and make sure they are stored in the CPU.
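
A sketch of such a check, which collects the device of every tensor in a handful of samples. ToyWindows is a hypothetical stand-in for the real Dataset, and its tensor shapes are made up for illustration.

```python
import torch
from torch.utils.data import Dataset


def devices_in(sample):
    """Collect the device types of all tensors in a (possibly nested) sample."""
    if isinstance(sample, torch.Tensor):
        return {sample.device.type}
    if isinstance(sample, dict):
        sample = list(sample.values())
    if isinstance(sample, (list, tuple)):
        found = set()
        for item in sample:
            found |= devices_in(item)
        return found
    return set()


class ToyWindows(Dataset):
    """Hypothetical stand-in for the real Dataset; returns CPU tensors."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        return torch.zeros(1, 144, 7, 7), torch.zeros(1, 12, 7, 7)


dataset = ToyWindows()
found = set()
for index in range(len(dataset)):
    found |= devices_in(dataset[index])
print("devices used by dataset samples:", found)
```

Anything other than "cpu" in the result would mean the Dataset itself creates tensors on the GPU.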

@ptrblck You commented on using the GPU, but in the latest tests I adapted the code to use only the CPU, as agreed. But yes, as I recall, when I was testing with the GPU, the more I increased the number of workers, the more memory was allocated on the GPU. I already ran the test with the number of workers set to 0, and now I’m running the test with the value 1. With the value 0, execution was very slow and didn’t even reach the part where I said it gets stuck, even after a long wait. I also did the checks you asked for: in my script that handles the DataLoader-related processing, the following part is the only one that creates tensors:

        X = sample[0:self.window_train, :, :, :]
        y = sample[self.window_train:, :, :, :]

        X = Tensor(X.transpose(1, 0, 2, 3))
        y = Tensor(y.transpose(1, 0, 2, 3))

After adding the device prints you asked for:

        X = sample[0:self.window_train, :, :, :]
        y = sample[self.window_train:, :, :, :]

        X = Tensor(X.transpose(1, 0, 2, 3))
        print("X.device: "+str(X.device))
        y = Tensor(y.transpose(1, 0, 2, 3))
        print("y.device: "+str(y.device))

The prints return “cpu”.

I just finished the test with 1 worker, and even after a long time (more than 40 minutes), execution was stuck in the same place I had mentioned.