DataLoader slow on PC, fast on Colab

Hi All,

I’m new to this forum and also quite new to PyTorch.

I’m running into an issue where the DataLoader seems to be quite slow, and I’m not sure what the bottleneck is. I’m running on a Ryzen 5700, 32 GB of memory, and a 4700 Super.

import time
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

def profile_data_loading(train_loader):
    start_time = time.time()  # Start timing

    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx == 0:
            # Time until the first batch arrives (worker startup + first fetch)
            load_time = time.time() - start_time
            print(f"Loading time: {load_time} seconds")

    # Time spent iterating over the remaining batches
    loop_time = time.time() - start_time - load_time
    print(f"Loop time: {loop_time} seconds")

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2, pin_memory=True)
profile_data_loading(train_loader)

On my PC I get:
cuda
Files already downloaded and verified
Loading time: 5.744622468948364 seconds
Loop time: 4.3751795291900635 seconds

On Colab I get:
cuda
Files already downloaded and verified
Loading time: 0.11000442504882812 seconds
Loop time: 13.87480902671814 seconds

What makes the loading time on my PC so slow, or is this to be expected with my setup?

Thanks,

The first thing I’d try is varying num_workers: set it to 0, slowly increase it, and see if the pattern changes.

Also, just to confirm since this is a local setup: do you have an SSD or an HDD?
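
If it helps, here is a minimal sketch of that sweep, reusing the profile_data_loading function and train_dataset from your post (the exact worker counts are just examples):

# Sketch: sweep num_workers and compare the first-batch time vs. the loop time.
# On Windows, run this inside an `if __name__ == '__main__':` block.
for workers in (0, 1, 2, 4, 8):
    loader = DataLoader(
        train_dataset,
        batch_size=64,
        shuffle=True,
        num_workers=workers,
        pin_memory=True,
    )
    print(f"num_workers={workers}")
    profile_data_loading(loader)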

Thanks for your reply,

I’ve played around with that a bit. I get a huge increase in performance when I set num_workers to 0: the loading time drops to about 0.013 seconds!
Increasing it to 1, I get 2.8 seconds, and it climbs further to 22 seconds at 8 workers. Does this make sense?

Increasing the num_workers lowers the loop time from 7.5 sec at 0 workers to 2 sec at 8 workers.

I’m using an SSD.

Thanks,

These are the results on my machine: torch 2.1, Ubuntu 18.04, Ryzen 3600, and an HDD drive.

# num_workers 0
profile_data_loading(train_loader)
#>> Loading time: 0.5861883163452148 seconds
#>> Loop time: 6.6978724002838135 seconds
# num_workers 2
profile_data_loading(train_loader)
#>> Loading time: 0.6597647666931152 seconds
#>> Loop time: 3.702897071838379 seconds
# num_workers 4
profile_data_loading(train_loader)
#>> Loading time: 0.6669812202453613 seconds
#>> Loop time: 2.085876941680908 seconds
# num_workers 8
profile_data_loading(train_loader)
#>> Loading time: 0.7236223220825195 seconds
#>> Loop time: 1.2904746532440186 seconds

Also, the results for subsequent epochs are usually much faster. I think the DataLoader or the OS is caching some information about the location of the files on storage.
Are you using a Windows machine?
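
By the way, a quick way to check the caching effect is to time a few consecutive passes over the same loader. Just a sketch, reusing profile_data_loading and train_loader from the earlier posts:

# Sketch: time several epochs in a row. If later passes are much faster,
# the OS (or the DataLoader workers) is likely caching the dataset files.
for epoch in range(3):
    print(f"epoch {epoch}")
    profile_data_loading(train_loader)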

I’m using a Windows 11 machine with Python 3.10.14 and PyTorch 2.2.2.
Your results look much more consistent, with or without multiple DataLoader workers, than mine.

Looking at my system resources, I see very limited activity on the PC with 8 workers:
SSD: no notable difference in utilization (1–2% of max)
CPU: 10%
Memory: a 3 GB increase in use (from 14.7 to 17.7 GB)

What could be limiting my performance compared to yours?

Take a look at this: https://pytorch.org/docs/stable/data.html#platform-specific-behaviors

This separate serialization means that you should take two steps to ensure you are compatible with Windows while using multi-process data loading:

  • Wrap most of your main script’s code within an if __name__ == '__main__': block, to make sure it doesn’t run again (most likely generating an error) when each worker process is launched. You can place your dataset and DataLoader instance creation logic here, as it doesn’t need to be re-executed in workers.
  • Make sure that any custom collate_fn, worker_init_fn or dataset code is declared as a top-level definition, outside the __main__ check. This ensures they are available in worker processes. (This is needed since functions are pickled as references only, not bytecode.)

I think for this to work you can’t use a Jupyter notebook; write a .py script.
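
For context, the difference comes from how worker processes are started: Linux defaults to fork, while Windows uses spawn, which re-imports your script in every worker. A quick sketch to check what your platform uses (purely illustrative):

import multiprocessing

# Typically prints "fork" on Linux and "spawn" on Windows; with spawn,
# every DataLoader worker re-imports the main script at startup.
print(multiprocessing.get_start_method())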

Thanks for your input, I changed the script accordingly:

import time
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def profile_data_loading(train_loader):
    start_time = time.time()  # Start timing

    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx == 0:
            load_time = time.time() - start_time  # End timing
            print(f"Loading time: {load_time} seconds")
        pass

    loop_time = time.time() - start_time - load_time
    print(f"Loop time: {loop_time} seconds")

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # Load CIFAR-10 dataset
    train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=8, pin_memory=True)

    profile_data_loading(train_loader)

if __name__ == "__main__":
    main()

However, that unfortunately had no impact on the results.
Still 22 seconds at 8 workers.

Any other suggestions?

Maybe try defining the dataset outside of main.
This part:

I found this as well:
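
In case it’s unclear what that would look like, here is a rough sketch of the suggestion: the dataset (and transform) at module level, with only the DataLoader creation and profiling left inside the __main__ guard. profile_data_loading is assumed to be the same function as in the earlier scripts:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Dataset defined at module level, outside main()
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

def main():
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                              num_workers=8, pin_memory=True)
    profile_data_loading(train_loader)

if __name__ == "__main__":
    main()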

I tried defining the dataset outside main, but unfortunately it had no effect.
Reading the link you provided, this seems to be a long-running issue on Windows that was never resolved. I guess sticking with num_workers=0 (or running on Linux) is the best way forward.

Thanks for the assistance!