I have implemented a custom Dataset in PyTorch. A simplified version of the code is below:
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, shared_dict, length):
        self.shared_dict = shared_dict
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # If this index is already in the shared dict, return the cached value.
        if index in self.shared_dict:
            return self.shared_dict[index]
        # Otherwise, the heavy image pre-processing happens here; it consumes a lot of time.
        data = self.preprocess(index)  # placeholder for the actual pre-processing
        # Cache the pre-processed data in the shared dict, keyed by the index, and return it.
        self.shared_dict[index] = data
        return data
Since I wanted to avoid repeating the expensive pre-processing in every epoch, I used the approach shown above: each sample is pre-processed once, cached in a dict shared across the DataLoader workers, and read back from the cache in subsequent epochs.
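For context, the shared dict is created with multiprocessing.Manager and handed to the dataset, roughly like this (a minimal sketch of my setup; dataset_length stands in for the real length):

from multiprocessing import Manager

manager = Manager()
shared_dict = manager.dict()  # proxy dict shared between the DataLoader worker processes

dataset = MyDataset(shared_dict, length=dataset_length)  # dataset_length is illustrative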
During training, multiple errors arise, including:
A BrokenPipeError raised on the for loop that enumerates the DataLoader (I have already tried placing this loop inside Python's main-function scope).
Workers quitting unexpectedly, memory/buffer errors, etc.
Tracing these errors always leads back to the multiprocessing Manager machinery.
Training configuration:
number of workers = 40
batch size = 150
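In code, the DataLoader and training loop look roughly like this (a sketch continuing the snippet above; the epoch count and loop body are illustrative):

from torch.utils.data import DataLoader

if __name__ == "__main__":
    loader = DataLoader(dataset, batch_size=150, num_workers=40)
    for epoch in range(10):                    # arbitrary epoch count, for illustration
        for step, batch in enumerate(loader):  # the BrokenPipeError is raised on this loop
            ...                                # forward/backward pass omitted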
Hardware Specs:
GPU: Nvidia A100
CPU: AMD EPYC 128-core processor
RAM: 1 TB
HDD/SSD: It's not a limitation either
Software Specs:
PyTorch: latest version
CUDA: 11
OS: Ubuntu
Dataset:
Consists of images and JSON files, totalling about 17 GB.
NOTE: The code works perfectly without the shared dict.