Defining an iterator on a DataLoader is very slow

Hello. When I try to use iter(dataloader) to create an iterator over my DataLoader, the call itself is very slow. Can anyone offer suggestions? Here is my code:

class CellDataset(Dataset):
    def __init__(self, data_frame):
        self.data_frame = data_frame
        self.length = self.data_frame.shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        cell = torch.tensor(self.data_frame.iloc[idx], dtype=torch.float)
        return cell

dataset_train = CellDataset(data_frame=train)   # train is a large pandas DataFrame already in memory
dataloader_train = DataLoader(dataset_train, batch_size=batchsize, shuffle=True, num_workers=48)

dataiter = iter(dataloader_train)   # this line very slow, about 30s
data = next(dataiter)   # this line is normal

My case is a little special: my data is not image files on an SSD but a large pandas DataFrame in memory (about 8 GB), and I use a large number of workers. If I reduce the number of workers to, say, 8, then dataiter = iter(dataloader_train) is faster, but data = next(dataiter) is slower. When I try regular image data on an SSD, everything behaves normally.

Thanks in advance for any suggestion!

Each worker will create a copy of the Dataset, so if you preload the data, your memory usage should increase a lot, especially if you are using 48 workers.

I’m wondering why a single worker seems to be slow, since you are only slicing in-memory data.
Could you try using torch.from_numpy(self.data_frame.iloc[idx].values).float() in your __getitem__ and profile the code again?
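As a rough sketch of the same idea taken one step further (assuming the whole frame fits comfortably in RAM), you could also convert the DataFrame to a tensor once in __init__, so each __getitem__ is a plain tensor index instead of a pandas iloc call:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class CellDataset(Dataset):
    def __init__(self, data_frame):
        # Convert once up front; indexing a tensor is much cheaper
        # than DataFrame.iloc, and each worker copies a tensor
        # instead of the pandas object.
        self.data = torch.from_numpy(data_frame.to_numpy()).float()

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return self.data[idx]

# tiny example frame standing in for the real 8 GB one
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
ds = CellDataset(df)
print(len(ds), ds[0].tolist())
```

This is untested against your data, but it should remove the per-sample pandas overhead entirely.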


Thanks for your response. I use a machine with 48 threads and 256 GB RAM, so I preload the data and set num_workers=48.
I tried your code and got a similar result. It seems that num_workers=0 does not feed the GPU enough, with utilization below 30%. If I use 48 workers, the feeding speed is very fast, but the initialization of the iterator is slow. For example, if I run

for i, d in enumerate(dataloader_train):
    print(i, d.shape)

with 48 workers, it takes about 30 s before anything is printed. After that it is very fast, since the 48 workers have started working. If I use 0 workers, there is almost no initialization time, but the printing part is slow.
I guess the multiprocessing underneath the DataLoader is not well suited to a preloaded, very large dataset?
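One mitigation I have seen suggested (assuming a recent PyTorch, 1.7+, where DataLoader has a persistent_workers flag) is to keep the worker processes alive across epochs, so the expensive iterator startup is paid only on the first epoch. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(100, 4))

# persistent_workers=True keeps the worker processes alive between
# epochs, so worker startup happens only on the first iter() call.
loader = DataLoader(ds, batch_size=10, shuffle=True,
                    num_workers=2, persistent_workers=True)

for epoch in range(2):
    for (batch,) in loader:
        pass  # second epoch reuses the already-spawned workers
```

This does not make the first iter() call faster, but it avoids paying the startup cost on every epoch.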

@xnnba @ptrblck Hi, I am having the same issue: iter(dataloader_train) is very slow, even when using an hdf5 dataset. The issue persists during training, so my GPU utilization is ultra low, ~1%.

Did you come to a solution?

Any help would be appreciated.

I have also implemented the IterableDataset class, and the iter(…) call is still very slow…

How do you store the data (local SSD, network drive, etc.)?
Also, is the first iteration slow or all of them?
Have a look at this post for some background information.

@ptrblck I have an SSD and I’m on Windows (I know…). Reading the hdf5 file directly is very fast, but when I feed it to either the map-style or iterable Dataset class, then to the DataLoader, and then iterate it, it’s very slow, and the next() calls are all equally slow too…

Note: For what it’s worth, running a torchvision dataset, e.g. MNIST from a PyTorch *.pt file, is pretty fast.

Are you sure you are reading the content of the hdf5 file or are you just initializing it?
Could you post a small code snippet showing how you open the file and access it?

@ptrblck Sure, I actually thought of reading the hdf5 file outside the Dataset class, here it is:

import numpy as np
import torch as pt
from h5py import File

train_hf = File('train_images_labels.h5', 'r')

class IterableDataset(pt.utils.data.IterableDataset):
    def __init__(self, train_images_file):
        super().__init__()
        self.hf = train_images_file

    def __iter__(self):
        return self

    def __next__(self):
        idx = np.random.randint(0, len(self.hf['images']))  # random for testing purposes
        return (self.hf['images'][idx], self.hf['labels'][idx])

batch_size = 2  # takes about 1 second per sample, i.e. batch_size = 64 => ~64 s... same for next() calls

train_dataset = IterableDataset(train_hf)
print('done1')
train_loader = pt.utils.data.DataLoader(train_dataset, batch_size=batch_size)
print('done2')

iterator = iter(train_loader)

images, labels = next(iterator)   # each batch is an (images, labels) pair
print('done3', images.shape)
images, labels = next(iterator)
print('done4', images.shape)

train_hf.__bool__()   # True while the file is open
train_hf.close()
train_hf.__bool__()   # False after closing

Note: my hdf5 dataset is a bunch of images converted to numpy arrays.

Thanks for the code snippet!
I think File() might just open the file handle without actually reading the data from disk; the actual read is likely performed during indexing.

I’ve created this dummy code snippet to play around with it a bit:

import time

import h5py
import numpy as np

# Setup
d1 = np.random.random(size=(1000, 1000, 100))
hf = h5py.File('data.h5', 'w')
hf.create_dataset('dataset_1', data=d1)
hf.close()

# Load
t0 = time.time()
hf = h5py.File('data.h5', 'r')
t1 = time.time()
print('open took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
n1 = hf.get('dataset_1')
t1 = time.time()
print('get took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
n1 = np.array(n1)
t1 = time.time()
print('reading array took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
data = hf['dataset_1']
t1 = time.time()
print('get took {:.3f}ms'.format((t1-t0)*1000))

nb_iters = 100
t0 = time.time()
for idx in np.random.randint(0, 1000, (nb_iters,)):
    x = data[idx]
t1 = time.time()
print('random index takes {:.3f}ms per index'.format((t1 - t0)/nb_iters * 1000))

I’m no expert in hdf5, but it seems that the indexing takes much more time than opening the file, which points to lazy loading.

This seems to match @rasbt’s post.
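If your arrays fit in RAM, one possible workaround (a sketch, assuming an in-memory dataset is acceptable) is to force an eager read with a full slice, so later indexing touches memory instead of disk:

```python
import h5py
import numpy as np

# Build a tiny stand-in file; your real file already exists.
with h5py.File('demo.h5', 'w') as f:
    f.create_dataset('images', data=np.random.rand(8, 4, 4))
    f.create_dataset('labels', data=np.arange(8))

with h5py.File('demo.h5', 'r') as f:
    # [:] forces h5py to read the whole dataset eagerly;
    # afterwards images/labels are plain numpy arrays.
    images = f['images'][:]
    labels = f['labels'][:]

print(images.shape, labels.shape)
```

Your Dataset could then index images and labels directly, at the cost of the up-front load time and memory.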

Thanks. Will look into it. Will also try other methods. Thanks again!