Torch dataloader is slow

I’m practicing using torch.utils.data.DataLoader to load batches of data onto the GPU, but I find it significantly slower than just indexing the data directly, roughly 10 times slower. I wonder why, or whether there is something I did wrong.
Here is my code:

import torch
import torch.nn as nn

for train_index, test_index in kf.split(X, Y):
    # Split train/test
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Wrap the in-memory arrays in a TensorDataset and a shuffling DataLoader
    train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=True)

    model = NNTrain()
    model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    model.train()
    for i in range(epochs):
        for X_batch, y_batch in train_dataloader:
            # Each batch comes back on the CPU; cast and move it to the GPU
            X_batch = X_batch.float().cuda()
            y_batch = y_batch.float().cuda()
            optimizer.zero_grad()
            y_batch_pred = model(X_batch)
            loss = nn.MSELoss()(y_batch_pred, y_batch)
            loss.backward()
            optimizer.step()

and the faster one:

for train_index, test_index in kf.split(X, Y):
    # Split train/test
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = NNTrain()
    model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

    model.train()
    for i in range(epochs):
        for j in range(batch_num):
            # Slice a whole batch out of the NumPy arrays, then cast and move it to the GPU
            X_batch = torch.from_numpy(X_train[batch_size * j:batch_size * (j + 1), :]).float().cuda()
            y_batch = torch.from_numpy(y_train[batch_size * j:batch_size * (j + 1), :]).float().cuda()
            y_batch_pred = model(X_batch)
            loss = nn.MSELoss()(y_batch_pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

This might be expected, since the DataLoader adds overhead on top of the already loaded tensors, e.g. by (sketched right after this list):

  • shuffling the indices using a sampler
  • indexing the TensorDataset with each index separately
  • creating the batch by calling into the collate_fn
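
Roughly, each batch goes through something like the following for a map-style dataset. This is a simplified, hand-written sketch for illustration, not the actual implementation; the toy X_t / y_t tensors are just placeholders, and default_collate is the helper exposed in torch.utils.data in recent PyTorch releases:

import torch
from torch.utils.data import TensorDataset, RandomSampler, BatchSampler, default_collate

# Toy stand-ins for the already loaded training tensors
X_t = torch.randn(1000, 8)
y_t = torch.randn(1000, 1)
dataset = TensorDataset(X_t, y_t)

batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=32, drop_last=False)
for indices in batch_sampler:                     # shuffled list of 32 indices per batch
    samples = [dataset[i] for i in indices]       # one Python __getitem__ call per sample
    X_batch, y_batch = default_collate(samples)   # stacks the samples back into batch tensors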

This workflow is useful if you want to load the dataset lazily and process each sample with data augmentation techniques.
Since your dataset is already loaded, indexing it directly would be the faster approach, as in the sketch below.
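
Something like this minimal sketch would keep the per-epoch shuffling while slicing whole batches at once. The iter_batches helper and the X_train_t / y_train_t names are made up here, assuming the NumPy arrays have already been converted with torch.from_numpy:

import torch

def iter_batches(X, y, batch_size, shuffle=True):
    # Permute the row indices once per epoch, then slice whole batches
    # with a single advanced-indexing call instead of per-sample lookups.
    idx = torch.randperm(len(X)) if shuffle else torch.arange(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

# The inner training loop then becomes:
# for X_batch, y_batch in iter_batches(X_train_t, y_train_t, batch_size):
#     X_batch, y_batch = X_batch.float().cuda(), y_batch.float().cuda()
#     ...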

Thanks for the reply! I found a custom dataloader later and it’s much faster, so it’s all good now.

Hi @yuanc, is the custom dataloader you found public? If so, could you share it? Thanks.

Hi, I have a similar problem: my dataset is very large (1,200,000 images) and cannot fit into memory, so I can’t load and index it like in this example.
I also tried creating the dataset from an h5py file.

Training goes well with shuffle=False but gets slower and slower with shuffle=True.
Can you give me some advice on how to solve this problem?
Thanks a lot
