I’m facing an issue when writing my custom dataset object that I think is linked to the way multiprocessing is implemented for data loading.
My use case: my dataset object opens a binary file handle on a single very large file, and every time I need to return the i-th sample, I move the file handle with seek and read the corresponding number of bytes.
Since the file handle is created in the constructor, my initial belief was that I needed to protect the “seek/read” part of my code with a multiprocessing Lock, but I have the impression that either I’m not doing it properly or this is not the right way to do it.
I could provide the code, but it is fairly long and tied to the structure of my binary data format; in practice, the layout is the following:
```python
import os
import struct
from multiprocessing import Lock

import torch.utils.data as data
from torch.utils.data import DataLoader


class Dataset(data.Dataset):
    def __init__(self, filepath):
        super().__init__()
        self.fp = open(filepath, 'rb')
        self.fplock = Lock()
        self.row_format = '<?ifff'
        self.row_size = struct.calcsize(self.row_format)

    def __getitem__(self, idx):
        file_offset = .....  # computed as a function of idx
        with self.fplock:
            self.fp.seek(file_offset, os.SEEK_SET)
            row = self.fp.read(self.row_size)
        values = struct.unpack(self.row_format, row)
        return values

    def __len__(self):
        return xxxxx


dataset = Dataset(filepath)
loader = DataLoader(dataset, batch_size=xxxx, num_workers=7)
for X in loader:
    ...
```
Isn’t this the correct way to proceed? When I set num_workers=1, the data are read correctly, but if I set num_workers > 1, I can see that my data are not always decoded correctly. My feeling is that the worker processes are interfering with each other’s seek/read, even though I protect the critical section with a multiprocessing Lock.
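For what it’s worth, one workaround I’ve seen suggested (assuming the problem is that forked workers end up sharing the file handle opened in the parent) is to open the file lazily, once per process, instead of in the constructor. Here is a minimal, torch-free sketch of that idea; `RowReader` is a hypothetical stand-in for the Dataset class above, and the file/row layout is the same `'<?ifff'` format:

```python
import os
import struct
import tempfile

ROW_FORMAT = '<?ifff'   # same row layout as in my dataset
ROW_SIZE = struct.calcsize(ROW_FORMAT)

class RowReader:
    """Stand-in for the Dataset: opens the file lazily, once per process.

    In the real code this would subclass torch.utils.data.Dataset; torch is
    omitted so the sketch runs on its own.
    """
    def __init__(self, filepath):
        self.filepath = filepath
        self.fp = None          # opened on first use, inside the worker
        self._owner_pid = None

    def _file(self):
        # Re-open if we are in a different process than the one that
        # opened the handle (e.g. after the DataLoader forks a worker),
        # so each worker gets its own descriptor and file position.
        if self.fp is None or self._owner_pid != os.getpid():
            self.fp = open(self.filepath, 'rb')
            self._owner_pid = os.getpid()
        return self.fp

    def __getitem__(self, idx):
        fp = self._file()
        fp.seek(idx * ROW_SIZE, os.SEEK_SET)
        return struct.unpack(ROW_FORMAT, fp.read(ROW_SIZE))

# quick self-check: write two rows and read them back
with tempfile.NamedTemporaryFile(delete=False, suffix='.bin') as f:
    f.write(struct.pack(ROW_FORMAT, True, 1, 1.0, 2.0, 3.0))
    f.write(struct.pack(ROW_FORMAT, False, 2, 4.0, 5.0, 6.0))
    path = f.name

reader = RowReader(path)
row0 = reader[0]   # (True, 1, 1.0, 2.0, 3.0)
row1 = reader[1]   # (False, 2, 4.0, 5.0, 6.0)
os.unlink(path)
```

With this pattern no Lock is needed at all, since no file handle is ever shared between processes. I don’t know if this is the intended way to do it with DataLoader workers, though, which is why I’m asking.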
Thank you for your help!