DataLoader: Proper Use

How do I use the DataLoader object properly? It either freezes indefinitely when its arguments are customized (e.g. num_workers > 0), hangs at certain iterations (even with drop_last=True to avoid unequal batches), or fails when the number of samples is large.
Are there general guidelines for choosing the batch size or the remaining parameters?

import os
import time
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_sampler = DistributedSampler(train, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train,
    batch_size=batch_size,
    sampler=train_sampler,
    num_workers=int(os.environ["SLURM_CPUS_PER_TASK"]),
    pin_memory=True,
    drop_last=True,
)

i = 1
for ep in np.arange(epochs):
    start_time = time.time()
    for count, sample in enumerate(train_loader, 0):
        # The same batch is used as both input and (shifted) target.
        inputs = sample
        target = sample
        target = target.to(torch.float32).to(local_rank)
        inputs = inputs.to(torch.float32).to(local_rank)
        output = model_ddp(inputs, local_rank)
        output = output.to(local_rank)
        # Shift by i so the model predicts the next step in the sequence.
        loss = criterion(output[:, :-i], target[:, i:])
        # Free the gradients instead of zeroing them.
        for param in model_ddp.parameters():
            param.grad = None
        loss.backward()
        optimizer.step()
        print_loss = loss.item()

I don’t see why the DataLoader shouldn’t work, provided you have properly initialised a process group with the correct ranks etc., but are you sure your code is overall doing what you want it to?

E.g.,
In for count, sample in enumerate(train_loader, 0):, sample could be a tuple of (x, y) that you haven’t properly unpacked: if your aim is inputs = sample[0] and target = sample[1], then assigning both inputs and target from the whole sample on separate lines does not achieve that.
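
If your Dataset does return (x, y) pairs, the unpacking would look something like this (a minimal sketch, assuming your __getitem__ returns a tuple):

for count, (inputs, target) in enumerate(train_loader, 0):
    # inputs and target are now the two elements of each batch
    inputs = inputs.to(torch.float32).to(local_rank)
    target = target.to(torch.float32).to(local_rank)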

And I’m not sure what the aim of the for param in model_ddp.parameters() loop is.

My custom datasets (of different sizes) are in the form (batch size, sequence, features), so I am using the same dataset both as the model input and as the target for comparison. I realized it does not work properly even if I set num_workers to 0, the batch size to 64, and keep the number of samples small (e.g., < 3): the job completes, but the output is erroneous.
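
To illustrate the format (a minimal sketch only; SequenceDataset and the random data are placeholders, not my actual dataset):

import torch
from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    # Each item is a (sequence, features) tensor, so a batch
    # collates to (batch size, sequence, features).
    def __init__(self, n_samples, seq_len, n_features):
        self.data = torch.randn(n_samples, seq_len, n_features)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]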

I have the thread settings adjusted as follows:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

As for the number of workers, I alternate between 0 and int(os.environ["SLURM_CPUS_PER_TASK"]). And for the local rank, I have tried these two ways of setting it for single-/multi-node training:

local_rank = int(os.environ['SLURM_LOCALID']) 
local_rank = rank - gpus_per_node * (rank // gpus_per_node)

Both are defined and initialized once outside the train() function and passed to it (as in the code above), but I am unsure whether they are the real cause of the behavior.
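
For what it’s worth, the arithmetic version is just a modulo, so the two should agree whenever SLURM assigns one task per GPU with contiguous ranks per node (sketch only; rank and gpus_per_node are the same variables as above):

local_rank = int(os.environ['SLURM_LOCALID'])                  # SLURM's per-node task id
local_rank = rank - gpus_per_node * (rank // gpus_per_node)    # arithmetic version
local_rank = rank % gpus_per_node                              # equivalent to the line above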

As for the for param in model_ddp.parameters() loop: to my knowledge, setting param.grad = None is a good and valid practice to speed up training (it frees the gradients rather than zeroing them).
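
I believe the same effect can also be obtained directly from the optimizer (a short sketch; set_to_none=True makes zero_grad assign None to the gradients instead of zeroing them):

# equivalent to looping over parameters and setting param.grad = None
optimizer.zero_grad(set_to_none=True)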

I have multiple models that use the same datasets. The calls to the train() function are done in a for loop, and I modified the above code according to the statement quoted further below:

import torch.optim as optim

def train(model):
    optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
    for ep in np.arange(epochs):
        try:
            # Recreate the sampler and loader every epoch (per the quoted advice).
            train_sampler = DistributedSampler(train, num_replicas=world_size, rank=rank)
            print("train_sampler")
            train_loader = DataLoader(
                train,
                batch_size=batch_size,
                sampler=train_sampler,
                num_workers=int(os.environ["SLURM_CPUS_PER_TASK"]),
                drop_last=True,
                persistent_workers=True,
            )
        except Exception as e:
            print("Error creating dataloader:", e)

        total_step = len(train_loader)
        iter_loader = iter(train_loader)
        print("ep2", ep, iter_loader)
        for count in range(total_step):
            optimizer.zero_grad()
            try:
                sample = next(iter_loader)
            except StopIteration:
                # Restart the iterator if it is exhausted early.
                iter_loader = iter(train_loader)
                sample = next(iter_loader)

How accurate is this statement: “When you use the same dataset with multiple models, you should create a new DataLoader instance for each model, so that each model has its own iterator and can iterate through the data independently.”

Obviously, it does not work even with the above modification, so I want to know whether this is actually required for training to execute properly, because it seems redundant.
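
For comparison, a minimal sketch of the alternative pattern (create the DataLoader once and call set_epoch on the sampler each epoch, so shuffling changes without rebuilding the loader; all names are the same as in the code above):

train_sampler = DistributedSampler(train, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train,
    batch_size=batch_size,
    sampler=train_sampler,
    num_workers=int(os.environ["SLURM_CPUS_PER_TASK"]),
    drop_last=True,
)

for ep in range(epochs):
    # reshuffle differently each epoch without recreating the loader
    train_sampler.set_epoch(ep)
    for sample in train_loader:
        ...  # training step as before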