I’m facing an issue when writing my custom dataset object that I think is linked to the way multiprocessing is implemented for data loading.
My use case: my dataset object opens a binary file handle on a single very large file, and every time I need to return the i-th sample, I move the file handle with seek and read the corresponding number of bytes.
Since the file handle is created in the constructor, my initial belief was that I needed to protect the “seek/read” part of my code with a multiprocessing Lock, but I have the impression that either I’m not doing it properly or this is not the right way to do it.
I could provide the code, but it is fairly long and tied to the structure of my binary data format; in practice, the layout is the following:
```python
import os
import struct
from multiprocessing import Lock

import torch.utils.data as data
from torch.utils.data import DataLoader


class Dataset(data.Dataset):
    def __init__(self, filepath):
        super().__init__()
        self.fp = open(filepath, 'rb')
        self.fplock = Lock()
        self.row_format = '<?ifff'
        self.row_size = struct.calcsize(self.row_format)

    def __getitem__(self, idx):
        file_offset = .....  # computed as a function of idx
        with self.fplock:
            self.fp.seek(file_offset, os.SEEK_SET)
            row = self.fp.read(self.row_size)
        values = struct.unpack(self.row_format, row)
        return values

    def __len__(self):
        return xxxxx


dataset = Dataset(filepath)
loader = DataLoader(dataset, batch_size=xxxx, num_workers=7)
for X in loader:
    ...
```
Isn’t this the correct way to proceed? When I set num_workers=1, the data are read correctly, but if I set num_workers > 1, I can see that my data are not always decoded correctly. My feeling is that the worker processes are interfering with each other’s seek/read, even though I protect the critical section with a multiprocessing Lock.
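For what it’s worth, one workaround I’ve seen suggested (assuming the problem is that forked workers end up sharing the file handle opened in the parent) is to open the file lazily, once per process, instead of in the constructor. Here is a minimal, torch-free sketch of that idea; `RowReader` is a hypothetical stand-in for the Dataset class above, and the file/row layout is the same `'<?ifff'` format:

```python
import os
import struct
import tempfile

ROW_FORMAT = '<?ifff'   # same row layout as in my dataset
ROW_SIZE = struct.calcsize(ROW_FORMAT)

class RowReader:
    """Stand-in for the Dataset: opens the file lazily, once per process.

    In the real code this would subclass torch.utils.data.Dataset; torch is
    omitted so the sketch runs on its own.
    """
    def __init__(self, filepath):
        self.filepath = filepath
        self.fp = None          # opened on first use, inside the worker
        self._owner_pid = None

    def _file(self):
        # Re-open if we are in a different process than the one that
        # opened the handle (e.g. after the DataLoader forks a worker),
        # so each worker gets its own descriptor and file position.
        if self.fp is None or self._owner_pid != os.getpid():
            self.fp = open(self.filepath, 'rb')
            self._owner_pid = os.getpid()
        return self.fp

    def __getitem__(self, idx):
        fp = self._file()
        fp.seek(idx * ROW_SIZE, os.SEEK_SET)
        return struct.unpack(ROW_FORMAT, fp.read(ROW_SIZE))

# quick self-check: write two rows and read them back
with tempfile.NamedTemporaryFile(delete=False, suffix='.bin') as f:
    f.write(struct.pack(ROW_FORMAT, True, 1, 1.0, 2.0, 3.0))
    f.write(struct.pack(ROW_FORMAT, False, 2, 4.0, 5.0, 6.0))
    path = f.name

reader = RowReader(path)
row0 = reader[0]   # (True, 1, 1.0, 2.0, 3.0)
row1 = reader[1]   # (False, 2, 4.0, 5.0, 6.0)
os.unlink(path)
```

With this pattern no Lock is needed at all, since no file handle is ever shared between processes. I don’t know if this is the intended way to do it with DataLoader workers, though, which is why I’m asking.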
Thank you for your help!