Dataset: Loading from the same file with more than a single thread

Maximal · March 15, 2020, 11:15pm

Hey,

I’d like to write a dataset that reads examples from a single binary file and use it with the standard DataLoader with more than one worker.
I tried simply having my custom Dataset inherit from torch.utils.data.Dataset, like this:

class MyDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.data_file = open('data.bin', "rb")
        self.transform = ....

    def __getitem__(self, index):
        self.data_file.seek(index * NUM_BYTES)
        data = self.data_file.read(NUM_BYTES)
        # np.frombuffer replaces np.fromstring, which is deprecated for binary data
        data_np = np.frombuffer(data, dtype='uint8').reshape(DATA_SIZE, order="F")
        data_transformed = self.transform(data_np)
        return data_transformed, 0

This works fine with a single worker, but when I set the number of workers larger than 1, some of the data is just wrong. It almost looks as if the data loader is mixing different data points. I guess this is because seek and read together are not a single atomic operation. At first I simply tried a multiprocessing Lock, so that one worker would acquire the lock before the seek and release it after the read, but this didn’t really change anything.
I can obviously also move open into __getitem__, in which case everything works fine with multiple workers, but open is actually quite slow compared to seek and read, so I’d prefer not to do it like this.
So how would I correctly synchronize different workers?
Thanks!

Frank-Jing · November 2, 2021, 8:42am

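When loading data with multiple workers, it is better to open the file and keep the file descriptor inside the workers (i.e. in the __getitem__() method), not in the main process (i.e. in the __init__() method). A handle opened in __init__() is inherited by every forked worker, so all of them end up sharing one file offset.
You can refer to this post (and its replies): DataLoader, when num_worker >0, there is bug - #14 by piojanu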
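
To make the suggestion concrete, here is a minimal sketch of opening the file lazily, on the first __getitem__() call, so that open runs only once per worker instead of once per item. The class name LazyFileDataset, the helper write_demo_file, and the values chosen for NUM_BYTES and DATA_SIZE are assumptions for illustration, not taken from the original code, and the transform is left out.

```python
import os
import tempfile

import numpy as np
from torch.utils.data import Dataset

NUM_BYTES = 16      # assumed size of one record in bytes
DATA_SIZE = (4, 4)  # assumed shape of one record


class LazyFileDataset(Dataset):
    """Hypothetical dataset that opens its file on first access.

    The handle is created inside whichever process first calls
    __getitem__, so each DataLoader worker gets its own file offset.
    """

    def __init__(self, path):
        self.path = path
        self.data_file = None  # deliberately NOT opened in the main process
        self.num_items = os.path.getsize(path) // NUM_BYTES

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        if self.data_file is None:
            # Runs once per process, not once per item.
            self.data_file = open(self.path, "rb")
        self.data_file.seek(index * NUM_BYTES)
        data = self.data_file.read(NUM_BYTES)
        # frombuffer returns a read-only view; copy() makes it writable.
        data_np = np.frombuffer(data, dtype=np.uint8)
        data_np = data_np.reshape(DATA_SIZE, order="F").copy()
        return data_np, 0


def write_demo_file(num_items):
    """Write num_items records; record i is filled with the byte value i."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(num_items):
            f.write(bytes([i]) * NUM_BYTES)
    return path
```

It would then be used as usual, e.g. DataLoader(LazyFileDataset(path), batch_size=4, num_workers=4). One caveat: if you index the dataset in the main process before the workers are started, the handle is already open and gets inherited by the forked workers again, which would bring the original problem back.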