How to load a huge dataset to CUDA

Hi, I am trying to train a model on a dataset of approximately 400k records using the GPU. While training the model, a lot of time is spent loading the data inside the for loop. How do I load the full dataset to CUDA directly from the DataLoader to improve the speed of execution?

import torch
import torch.nn as nn
import torch.optim as optim

model = Net().cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)

loss_func = nn.NLLLoss()

epochs = 3
loss_list = []
correct = 0
model.train()

for epoch in range(epochs):
    total_loss = []

    for i, (X_tr, Y_tr) in enumerate(train_ldr):
        # get the inputs; data is a list of [inputs, labels]
        X_tr = X_tr.unsqueeze(0).cuda()
        Y_tr = Y_tr.type(torch.LongTensor).cuda()

        optimizer.zero_grad()

        # Forward pass
        output = model(X_tr)

        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(Y_tr.view_as(pred)).sum().item()
        loss = loss_func(output, Y_tr)
        # Backward pass
        loss.backward()
        # Optimize the weights
        optimizer.step()

        total_loss.append(loss.item())
#       print(correct, Y_tr, pred)

        loss_list.append(sum(total_loss) / len(total_loss))
    print('Training [{:.0f}%]\tLoss: {:.4f}'.format(
        100. * (epoch + 1) / epochs, loss_list[-1]))

Could you share more information about your profiling and how you’ve narrowed down that the data transfer is indeed the bottleneck?

You could pre-load the entire dataset onto the GPU, but this would of course consume device memory and add to the startup time.

Sorry for the late reply. Since this is the only module that takes a lot of execution time, I suspected that data loading could be the problem, especially as I am using a batch size of 1. I tried to run my whole program in Docker with a GPU. Unlike the rest of the code, this module takes hours to execute with 400k records. Apart from this module I have not made any data transfers to CUDA from DataLoaders anywhere in my program. The GPU utilization shows only 4-5% usage.

Kindly suggest how to pre-load the entire dataset to the GPU, as I have run the same program in Google Colab Pro multiple times but am still facing the same issue.
Link to the program in Google Colab:
Google Colab.
@ptrblck: Please have a look at my code if possible.

Can a PyTorch NN with a batch size of 1 and a big dataset be used efficiently with GPUs?

I don’t know which “module” you mean and am still unsure how you’ve profiled it and narrowed down the bottleneck.
In any case, to preload the data, just push it to the device:

data = data.to(device)

and wrap it into e.g. a TensorDataset.
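
For example, a minimal sketch of preloading, assuming the full features and labels already exist as tensors (X_train / y_train are placeholder names here) and fit into device memory:

import torch
from torch.utils.data import TensorDataset, DataLoader

device = torch.device('cuda')

# one-time transfer of the full dataset to the GPU
X_train = X_train.float().to(device)   # placeholder name for your feature tensor
y_train = y_train.long().to(device)    # placeholder name for your label tensor

train_ds = TensorDataset(X_train, y_train)
# num_workers must stay at 0, since CUDA tensors cannot be handed to worker processes
train_ldr = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=0)

Every batch yielded by this loader is then already on the GPU, so no per-iteration host-to-device copy is needed inside the training loop.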

It depends on the model. If the GPU workload per iteration is tiny, your script might suffer from kernel launch and general CPU overhead.
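
As a rough check (the batch size and the reuse of your posted model/optimizer/loss_func are assumptions here), you could time one epoch for batch_size=1 vs. a larger batch, synchronizing before reading the clock so that pending GPU work is included:

import time
import torch

def time_one_epoch(loader):
    torch.cuda.synchronize()            # make sure no GPU work is still pending
    t0 = time.perf_counter()
    for X_tr, Y_tr in loader:
        X_tr = X_tr.cuda()
        Y_tr = Y_tr.long().cuda()
        optimizer.zero_grad()
        loss = loss_func(model(X_tr), Y_tr)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()            # wait for the GPU before stopping the timer
    return time.perf_counter() - t0

If the epoch time drops roughly in proportion to the batch size, the per-iteration CPU and kernel-launch overhead (not the GPU compute) was dominating at batch_size=1.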

Sorry, not a module, but the part where I am "training my model": the GPU utilization shows only 5-6% usage and it keeps on processing. For 400k records, it eventually disconnects and doesn't give any output, while it works fine for a small dataset. I have no idea why the GPU usage is only 5-6% while training my model. Is there something incorrect?

Training the model
model = Net().cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)

loss_func = nn.NLLLoss()

epochs = 3
loss_list = []
correct = 0
model.train()

for epoch in range(epochs):
    total_loss = []

    for i, (X_tr, Y_tr) in enumerate(train_ldr):
        # get the inputs
        X_tr = X_tr.unsqueeze(0).cuda()
        Y_tr = Y_tr.type(torch.LongTensor).cuda()

        optimizer.zero_grad()

        # Forward pass
        output = model(X_tr)

        pred = output.argmax(dim=1, keepdim=True)
#       correct += pred.eq(Y_tr.view_as(pred)).sum().item()
        loss = loss_func(output, Y_tr)
        # Backward pass
        loss.backward()
        # Optimize the weights
        optimizer.step()

        total_loss.append(loss.item())
#       print(correct, Y_tr, pred)

        loss_list.append(sum(total_loss) / len(total_loss))
    print('Training [{:.0f}%]\tLoss: {:.4f}'.format(
        100. * (epoch + 1) / epochs, loss_list[-1]))

A low GPU utilization could be caused by various bottlenecks, e.g. the data loading.
You could profile different parts of the code to narrow it down further.
One approach would be to profile the entire script via Nsight Systems as described here or via the PyTorch profiler.
Also, you could check the GPU utilization by removing the DataLoader and feeding CUDA tensors directly to the model, and you could profile the data loading standalone (i.e. without any model training) to see how long it takes to create each batch.
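
For instance, a sketch of timing the data loading on its own (train_ldr as in your code; the cap of 1000 batches is just an assumption to keep the measurement short):

import time

t0 = time.perf_counter()
n_batches = 0
for X_tr, Y_tr in train_ldr:      # no model and no .cuda() calls: pure data loading
    n_batches += 1
    if n_batches == 1000:
        break
elapsed = time.perf_counter() - t0
print('{:.3f} ms per batch'.format(elapsed / n_batches * 1e3))

If this per-batch time is close to the per-iteration time of the full training loop, the DataLoader is the bottleneck; if it is much smaller, the slowdown comes from the model or from the general per-step overhead instead.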