Torch dataloader is slow

I’m practicing using torch.utils.data.DataLoader to load batches of data onto the GPU, but I find it significantly slower than just indexing the data directly, roughly 10 times slower. I wonder why, or whether there is something I did wrong.
Here is my code:

import torch
import torch.nn as nn

for train_index, test_index in kf.split(X, Y):
    # Split train/test
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Wrap the in-memory arrays in a TensorDataset and a shuffling DataLoader
    train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=True)

    model = NNTrain()
    model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    model.train()
    for i in range(epochs):
        for X_batch, y_batch in train_dataloader:
            # Each batch comes back on the CPU; cast and move it to the GPU
            X_batch = X_batch.float().cuda()
            y_batch = y_batch.float().cuda()
            optimizer.zero_grad()
            y_batch_pred = model(X_batch)
            loss = nn.MSELoss()(y_batch_pred, y_batch)
            loss.backward()
            optimizer.step()

and the faster one:

for train_index, test_index in kf.split(X, Y):
    # Split train/test
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = NNTrain()
    model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    model.train()
    for i in range(epochs):
        for j in range(batch_num):
            # Slice a whole batch out of the NumPy arrays, then cast and move it to the GPU
            X_batch = torch.from_numpy(X_train[batch_size * j:batch_size * (j + 1), :]).float().cuda()
            y_batch = torch.from_numpy(y_train[batch_size * j:batch_size * (j + 1), :]).float().cuda()
            y_batch_pred = model(X_batch)
            loss = nn.MSELoss()(y_batch_pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

This might be expected, since the DataLoader adds overhead on top of the already loaded tensors, e.g. by (sketched right after this list):

  • shuffling the indices using a sampler
  • indexing the TensorDataset with each index separately
  • creating the batch by calling into the collate_fn
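
Roughly, each batch goes through something like the following for a map-style dataset. This is a simplified, hand-written sketch for illustration, not the actual implementation; the toy X_t / y_t tensors are just placeholders, and default_collate is the helper exposed in torch.utils.data in recent PyTorch releases:

import torch
from torch.utils.data import TensorDataset, RandomSampler, BatchSampler, default_collate

# Toy stand-ins for the already loaded training tensors
X_t = torch.randn(1000, 8)
y_t = torch.randn(1000, 1)
dataset = TensorDataset(X_t, y_t)

batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False)
for indices in batch_sampler:                     # shuffled list of 32 indices per batch
    samples = [dataset[i] for i in indices]       # one Python __getitem__ call per sample
    X_batch, y_batch = default_collate(samples)   # stacks the samples back into batch tensors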

This workflow is useful if you want to load the dataset lazily and process each sample with data augmentation techniques.
Since your dataset is already loaded, indexing it directly would be the faster approach, as in the sketch below.
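
Something like this minimal sketch would keep the per-epoch shuffling while slicing whole batches at once. The iter_batches helper and the X_train_t / y_train_t names are made up here, assuming the NumPy arrays have already been converted with torch.from_numpy:

import torch

def iter_batches(X, y, batch_size, shuffle=True):
    # Permute the row indices once per epoch, then slice whole batches
    # with a single advanced-indexing call instead of per-sample lookups.
    idx = torch.randperm(len(X)) if shuffle else torch.arange(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

# The inner training loop then becomes:
# for X_batch, y_batch in iter_batches(X_train_t, y_train_t, batch_size):
#     X_batch, y_batch = X_batch.float().cuda(), y_batch.float().cuda()
#     ...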

Thanks for the reply! I found a custom dataloader later and it’s much faster, so it’s all good now.

Hi @yuanc, is the custom dataloader you found public? If so, could you share it? Thanks.

Hi, I have a similar problem: my dataset is very large (1,200,000 images) and cannot fit into memory, so I can’t load and index it like in this example.
I also tried creating the dataset from an h5py file.

Training goes well with shuffle=False but gets slower and slower with shuffle=True.
Can you give me some advice on how to solve this problem?
Thanks a lot
