DataLoader multiple workers works for torchvision dataSet but not my custom dataset

Liquidmasl · December 13, 2021, 10:53pm

Hi!

First off all, I am reading posts and github issues and threads since a few hours. I learned that Multithreading on Windows and/or Jupyter (Google colab) seams to be a pain or not working at all.
After a lot of trial and error, following a lot of advice it seams to work now for me, giving me an immense speed improvement. But sadly only with a downloaded Dataset. If I try it with my own it freezes up immidiatly and I have to restart the runtime, all meanwhile the CPU sits at 0%

I also read a lot of posts that are similiar to this one, and none seam to have the same issue. I believe there is an error in my DataSet class that I am unable to find. I hope someone can help me out.

This Test works perfectly fine:

import time
from tqdm import tqdm
import torch
import torchvision

def train(data_loader):
    start = time.time()
    for _ in tqdm(range(10)):
        for x in data_loader:
            pass
    end = time.time()
    return end - start

if __name__ == '__main__':
    train_dataset = torchvision.datasets.FashionMNIST(
        root=".", train=True, download=True,
        transform=torchvision.transforms.ToTensor()
    )

    batch_size = 32
    train_loader1 = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=0)
    train_loader2 = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=8)
    train_loader3 = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=8, persistent_workers=True)
    train_loader4 = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=10, persistent_workers=True)

    print(train(train_loader1))
    print(train(train_loader2))
    print(train(train_loader3))
    print(train(train_loader4))

this gets stuck immediatly after data_loader1, so as soon as I switch on multiple workers. It freezes and the CPU chills at 0%.
It looks like the iterator just never returnes a value? Similar to Multiple Dataloader Workers in multi Threading ?

import time
from torch.utils.data import DataLoader  # Gives easier dataset managment by creating mini batches etc.
from tqdm import tqdm  # For nice progress bar!
import numpy as np

def train(data_loader):
    start = time.time()
    for _ in tqdm(range(10)):
        for x in data_loader:
            pass
    end = time.time()
    return end - start

if __name__ ==  '__main__': 
  batch_size = 64

  root_dir = R'[REDACTED]\MiniStoneShaderDataset'
  dataset = smallShaderDataset(csv_file='Labels.csv', root_dir=root_dir )
  train_dataset, test_dataset = torch.utils.data.random_split(dataset, [99, 9900])

  train_loader1 = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory= True)
  train_loader2 = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory= True, persistent_workers = True)
  train_loader3 = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True, num_workers=8, pin_memory= True, persistent_workers = True)

  print(train(train_loader1))
  print(train(train_loader2))
  print(train(train_loader3))

Using this DataSet Class:

import pandas as pd
import torch
from torch.utils.data import Dataset
import numpy as np

#from skimage import io

class smallShaderDataset(Dataset):
  def __init__(self, csv_file, root_dir, transform=None):
    self.annotations = pd.read_csv(os.path.join(root_dir,csv_file))
    self.root_dir = root_dir
    self.transform = transform

  def __len__(self):
    return len(self.annotations)

  def __getitem__(self, index):

    data_path = os.path.join(self.root_dir,'LearnDataCombined', self.annotations.iloc[index,0] + ".npy")
    input = np.load(data_path) 

    parameters = self.annotations.iloc[index,1:]
    parameters = np.array([parameters],dtype = float).flatten()
    input = np.array(input,dtype = float)

    sample = {'input': input, 'parameters': parameters}

    return input, parameters
    #return sample

Everything works (slowly) with num_workers = 0

There is probably something wrong here wich i am unable to find in full tunnelvision mode. I really hope someone can help me find the issue.
If it really is a Bug in DataSet, why does the downloaded set work fine?

Thank you for any help!
Greetings

ejguan · December 17, 2021, 9:18pm

def train(data_loader):
    start = time.time()
    for _ in tqdm(range(10)):
        for x in data_loader:
            pass
    end = time.time()
    return end - start

This part is super confusing. Is this intentional that you want to iterate over DataLoader 10 times?

Liquidmasl · December 18, 2021, 5:21pm

Yes its intentional just so the results are easier to compare, also it is less vulnerable to single datapoints scewing the results

ejguan · December 21, 2021, 5:00pm

Have you tried to del train_loaderN after each step? You enabled pin_memory and persistent_worker for each DataLoader, then these threads and child processes won’t be cleaned up.

Liquidmasl · January 11, 2022, 4:25pm

I feel like thats how it shoule be because they are reused. I read something about winders and interactive IDE on github but i am not quite sure anymore. I will try your suggestion, but I think it actually never gets that far. it does not run out of memory or anything it just stops executing

ejguan · January 11, 2022, 4:39pm

Curious about are you running you code in Notebook? And, does it work with smaller number of worker?

Liquidmasl · January 11, 2022, 4:41pm

yes, in collab, it does not work with any number of workers above 0.
it DOES work with a different dataset though, but i described it quite thoroughly in the original question above

ejguan · January 11, 2022, 4:55pm

What I can recommend to try is to use Tensor rather than panda DataFrame for your self.annotations. I am not sure if this is something related to spawning process with pandas dataframe when using the multiprocessing in colab.

Liquidmasl · January 20, 2022, 6:12pm

good call, I will give that a try!
Thank you