DataLoader re-initializing the Dataset on each batch

I am trying to load data in my Dataset class's `__getitem__` function rather than in `__init__`, because the data is very large and cannot all be loaded into memory at once. Since the DataLoader keeps incrementing the index, I keep a record of the length of the previously loaded part of the data, but this counter, which is initialized in `__init__`, resets to 0 every time a new batch is loaded. Is there a way to avoid calling the Dataset's `__init__` function?

```python
import os
import glob

from torch.utils.data import Dataset


class Feeder(Dataset):  # class name not shown in the original post; assumed here
	def __init__(self, data_path, graph_args={}, train_val_test='train'):
		'''
		train_val_test: (train, val, test)
		'''
		self.data_path = data_path
		self.path_list = sorted(glob.glob(os.path.join(self.data_path, '*.txt')))
		self.all_feature = []
		self.all_adjacency = []
		self.all_mean_xy = []
		self.it = 0  # index of the next file to load
		#self.load_data()
		#total_num = len(self.all_feature)
		# equally choose validation set
		self.feature_num = 0  # total number of samples seen across loaded files
		self.prev = 0  # number of samples loaded before the current file

	def __getitem__(self, idx):
		# C = 11: [frame_id, object_id, object_type, position_x, position_y,
		# position_z, object_length, object_width, object_height, heading] + [mask]
		try:
			now_feature = self.all_feature[idx - self.prev].copy()
		except IndexError:
			# Current file exhausted: load the next file and shift the offset.
			path = self.path_list[self.it]
			self.it += 1
			self.all_feature, self.all_adjacency, self.all_mean_xy = generate_data(path)
			self.prev = self.feature_num
			self.feature_num += len(self.all_feature)
			now_feature = self.all_feature[idx - self.prev].copy()
		return now_feature
```
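
For context, a minimal sketch of how such a dataset would be consumed; the class name `Feeder`, the batch size, and the `__len__` required by the default sampler are assumptions, not taken from the original post:

```python
from torch.utils.data import DataLoader

# Assumes the class above also defines __len__ (omitted in the original post),
# since the DataLoader's default sampler calls len(dataset).
dataset = Feeder(data_path='./data', train_val_test='train')
loader = DataLoader(
    dataset,
    batch_size=32,    # placeholder value
    shuffle=False,    # indices must arrive in order for the try/except scheme to work
    num_workers=0,    # see the replies below regarding worker copies
)
for now_feature in loader:
    ...  # training step
```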

Be careful when trying to manipulate the Dataset from within a DataLoader if you are using multiple workers.
Each worker uses its own copy of the Dataset, so changes to the Dataset's internal state will not be reflected back in the main process.
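
Here is a minimal sketch (the toy dataset is just an illustration, not your code) that makes this copy behavior visible:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CountingDataset(Dataset):
    def __init__(self):
        self.calls = 0  # internal state we try to mutate from __getitem__

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        self.calls += 1  # with workers, this only updates the worker's copy
        return idx

ds = CountingDataset()
for _ in DataLoader(ds, batch_size=2, num_workers=2):
    pass
print(ds.calls)  # prints 0 with num_workers=2; prints 8 with num_workers=0
```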

Could this be the case for your issue?


The 1.7 release and current master will have reset functionality; see https://github.com/pytorch/pytorch/pull/35795

Thank you so much, setting num_workers to 0 solved the issue. But I guess I'm restricting the capabilities of the machine by loading the data sequentially.