I have implemented a custom Dataset in PyTorch. A simplified version of the code is below:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, shared_dict, length):
        self.shared_dict = shared_dict
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # If this index is already in the shared dict, return the cached value.
        if index in self.shared_dict:
            return self.shared_dict[index]
        # Otherwise, the heavy image pre-processing happens here; it consumes a lot of time.
        data = self.preprocess(index)  # placeholder for the actual pre-processing
        # Cache the pre-processed data in the shared dict, keyed by the index, and return it.
        self.shared_dict[index] = data
        return data
Since I wanted to avoid repeating the expensive pre-processing in every epoch, I used the approach shown above: each sample is pre-processed once, cached in a dict shared across the DataLoader workers, and read back from the cache in subsequent epochs.
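For context, the shared dict is created with multiprocessing.Manager and handed to the dataset, roughly like this (a minimal sketch of my setup; dataset_length stands in for the real length):

from multiprocessing import Manager

manager = Manager()
shared_dict = manager.dict()  # proxy dict shared between the DataLoader worker processes

dataset = MyDataset(shared_dict, length=dataset_length)  # dataset_length is illustrative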
During training, multiple errors arise, including:
A BrokenPipeError raised on the for loop that enumerates the DataLoader (I have already tried placing this loop inside Python's main-function scope).
Workers quitting unexpectedly, memory/buffer errors, etc.
Tracing these errors always leads back to the multiprocessing Manager machinery.
Training configuration:
number of workers = 40
batch size = 150
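In code, the DataLoader and training loop look roughly like this (a sketch continuing the snippet above; the epoch count and loop body are illustrative):

from torch.utils.data import DataLoader

if __name__ == "__main__":
    loader = DataLoader(dataset, batch_size=150, num_workers=40)
    for epoch in range(10):                    # arbitrary epoch count, for illustration
        for step, batch in enumerate(loader):  # the BrokenPipeError is raised on this loop
            ...                                # forward/backward pass omitted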
Hardware Specs:
GPU: Nvidia A100
CPU: AMD EPYC 128-core processor
RAM: 1 TB
HDD/SSD: It's not a limitation either
Software Specs:
PyTorch: latest version
CUDA: 11
OS: Ubuntu
Dataset:
Consists of images and JSON files, totalling about 17 GB.
NOTE: The code works perfectly without the shared dict.