Hello,
I am facing a problem with DataLoader and a custom Dataset.
Here is my custom Dataset:
import torch as th
from torch.utils.data import Dataset
from os import listdir
from os.path import isdir, isfile, join
from tqdm import tqdm


class AudioDataset(Dataset):
    def __init__(self, dataset_path: str) -> None:
        super().__init__()

        assert isdir(dataset_path)

        # list the pre-computed magnitude / phase tensor files
        all_magn = [
            f for f in tqdm(listdir(dataset_path))
            if isfile(join(dataset_path, f)) and
            f.startswith("magn")
        ]
        all_phase = [
            f for f in tqdm(listdir(dataset_path))
            if isfile(join(dataset_path, f)) and
            f.startswith("phase")
        ]

        assert len(all_magn) == len(all_phase)

        # sort so the magn / phase files of the same sample share an index
        self.__all_magn = sorted(all_magn)
        self.__all_phase = sorted(all_phase)

        self.__dataset_path = dataset_path

    def __getitem__(self, index: int):
        magn = th.load(join(
            self.__dataset_path,
            self.__all_magn[index]
        ))
        phase = th.load(join(
            self.__dataset_path,
            self.__all_phase[index]
        ))

        # one sample: a (2, 512, 512) tensor (magnitude and phase channels)
        return th.stack([magn, phase], dim=0)

    def __len__(self):
        return len(self.__all_magn)
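Fetching a single item directly from the dataset works as expected; a quick check (path and tensor shapes as described below) looks roughly like this:

# quick check outside of any DataLoader
dataset = AudioDataset("/path/to/tensor/dir")
print(len(dataset))        # number of (magn, phase) pairs found
print(dataset[0].shape)    # torch.Size([2, 512, 512])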
The dataset is then loaded with:
from torch.utils.data import DataLoader

import audio


if __name__ == "__main__":
    audio_dataset = audio.AudioDataset("/path/to/tensor/dir")

    data_loader = DataLoader(
        audio_dataset,
        batch_size=8,
        shuffle=True,
        num_workers=10,
        drop_last=True
    )
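I iterate over the loader roughly like this (a stripped-down sketch; the actual per-batch work is omitted):

import time

# stripped-down iteration: just pull batches and time one epoch
start = time.time()

for batch in data_loader:
    # batch shape: (8, 2, 512, 512); real per-batch processing omitted
    pass

print(f"epoch time: {time.time() - start:.1f} s")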
The data itself loads correctly, but the DataLoader hangs while iterating. The iteration speed is not constant (my dataset contains 60k+ tensors of size (512, 512)): one epoch takes anywhere from 20 min to 1 h.
Note that the iteration speed is constant when I set num_workers = 0.
I have seen that this issue is quite common; how can I fix these hangs?
Python: the problem occurs with both 3.6 and 3.8
PyTorch: 1.9.0
CUDA: 11.1
NVIDIA driver: 460.84
Ubuntu: 20.04
Best regards