Dealing with a large dataset without out-of-memory errors

Gkv · June 1, 2018, 4:24pm

Hi all,
How can I handle big datasets without running into an out-of-memory error? Is it OK to split the dataset into several small chunks and train the network on them one at a time? That is, train for several epochs on one chunk, save the model, then load it again to train on the next chunk.

Thanks in advance!

ptrblck · June 1, 2018, 4:43pm

Usually you don’t need to load your complete dataset into memory.
Using a DataLoader you will get mini-batches containing several samples, which are then used for training.
Have a look at the Data loading tutorial for an introduction.
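
A minimal sketch of that idea, assuming each sample is stored in its own file on disk. The class name `FileDataset`, the file layout, and the toy tensor shapes are made up for illustration and do not come from this thread. The `Dataset` keeps only file paths in memory and reads one sample per `__getitem__` call, while the `DataLoader` groups the samples into mini-batches.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader


class FileDataset(Dataset):
    """Hypothetical dataset: one {"x": feature, "y": label} dict per file."""

    def __init__(self, paths):
        self.paths = paths  # only the paths are held in memory

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = torch.load(self.paths[idx])  # read a single sample from disk
        return sample["x"], sample["y"]


# Write a few toy samples to disk so the sketch runs end to end.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(10):
    path = os.path.join(tmp_dir, "sample_%d.pt" % i)
    torch.save({"x": torch.randn(8), "y": torch.tensor(i % 2)}, path)
    paths.append(path)

loader = DataLoader(FileDataset(paths), batch_size=4, shuffle=True)
batch_shapes = [tuple(x.shape) for x, y in loader]
print(batch_shapes)
```

With 10 samples and `batch_size=4` the loader yields two full batches and one batch of two samples; at no point are all samples in memory together.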

Gkv · June 1, 2018, 5:00pm

Hi, thanks for your reply. I don’t think I need the data loader. I precomputed the ResNet features and saved them in a pickle file (around 50 GB). My model just uses these features. The problem is that I get an out-of-memory error after some epochs.
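
If the single 50 GB pickle is itself what exhausts host memory, one possible alternative is to store one file per video and load only the video that is currently needed. This is an assumption on my part, not something proposed in the thread. The sketch below uses only the standard library; the helper names `split_features` and `load_video` and the file naming scheme are hypothetical, and plain lists of floats stand in for the real per-frame feature tensors.

```python
import os
import pickle
import tempfile


def split_features(cnn_features, out_dir):
    """Write one pickle file per video; return {video name: file path}."""
    index = {}
    for n, (vid_name, frames) in enumerate(cnn_features.items()):
        path = os.path.join(out_dir, "video_%06d.pkl" % n)
        with open(path, "wb") as f:
            pickle.dump(frames, f, protocol=pickle.HIGHEST_PROTOCOL)
        index[vid_name] = path
    return index


def load_video(index, vid_name):
    """Load the features of a single video from disk."""
    with open(index[vid_name], "rb") as f:
        return pickle.load(f)


# Toy stand-in for the real feature dictionary.
toy_features = {"vid_a": [[0.1, 0.2], [0.3, 0.4]], "vid_b": [[0.5, 0.6]]}
index = split_features(toy_features, tempfile.mkdtemp())
frames = load_video(index, "vid_a")
print(len(frames), "frames loaded for vid_a")
```

Splitting an existing pickle still requires loading it once, so in practice the per-video files would ideally be written during feature extraction.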

Thanks

ptrblck · June 1, 2018, 6:08pm

Could you post your code which throws this error?

Gkv · June 2, 2018, 10:13am


import gc
import pickle
from random import shuffle

import torch
import torch.nn as nn

# cnn_feature_path, gpu_device, n_epochs, vid_name_list and
# ground_truth_class_number are defined elsewhere and not shown in this post

# cnn_feature_path is the path to the pickle file (about 50 GB)
# the videos do not all have the same length
with open(cnn_feature_path, 'rb') as s:
    # keys are the video names, values are lists of per-frame CNN features
    cnn_features = pickle.load(s)

class model(nn.Module):
    def __init__(self):
        super(model, self).__init__()
        self.s_lstm = nn.LSTM(2048, 1024, num_layers=1)
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 11)
        if torch.cuda.is_available():  # must be called; the bare attribute is always truthy
            self.s_lstm = self.s_lstm.cuda()
            self.fc1 = self.fc1.cuda()
            self.fc2 = self.fc2.cuda()

    def forward(self, vid_name):
        spatial_feature = cnn_features[vid_name]  # list of CNN features, one per frame
        feature_vector = self.get_spatial_feature_vector(spatial_feature)
        out1 = self.fc1(feature_vector)
        out = self.fc2(out1)
        return out

    def get_spatial_feature_vector(self, feature_list):
        # random initial hidden and cell state for every video
        hidden = (torch.randn(1, 1, 1024).cuda(gpu_device),
                  torch.randn(1, 1, 1024).cuda(gpu_device))
        self.s_lstm.flatten_parameters()
        for feature in feature_list:
            # feature.detach() plays the role of Variable(feature.data);
            # the volatile flag is dropped because it has no effect since PyTorch 0.4
            frame = feature.detach().view(1, 1, 2048).cuda(gpu_device)
            s_rep, hidden = self.s_lstm(frame, hidden)
        return s_rep  # LSTM output for the last frame

def train(model, n_epochs, vid_names):
    model.train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.00001)
    for i in range(n_epochs):
        cum_loss = 0
        shuffle(vid_names)
        cnt = 0
        for name in vid_names:
            cnt = cnt + 1
            # clear gradients before every step; otherwise they accumulate over the epoch
            optimizer.zero_grad()
            pred = model(name)
            target = torch.LongTensor(ground_truth_class_number).cuda(gpu_device)
            loss = criterion(pred.view(1, 11), target)
            cum_loss = cum_loss + loss.item()  # .item() replaces loss.data[0]
            loss.backward()
            optimizer.step()
            gc.collect()
        print('Epoch ' + str(i + 1) + ' loss ' + str(cum_loss))
        torch.cuda.empty_cache()

my_model = model()
my_model = my_model.cuda()
train(my_model, n_epochs, vid_name_list[:1000])  # working fine
train(my_model, n_epochs, vid_name_list[:5000])  # out-of-memory error after some epochs

ptrblck · June 2, 2018, 10:45am

Do you get this error in the first epoch after some iterations?
Could it be that some videos are much longer than others?

Could you check the GPU memory with nvidia-smi and check if it’s rising?
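
Besides nvidia-smi, the memory held by tensors of the current process can also be queried from within Python. The following is a small sketch; the helper name `gpu_memory_mb` is made up for illustration. Calling it once per epoch should show whether the allocated memory keeps growing.

```python
import torch


def gpu_memory_mb(device=None):
    """Return (allocated, peak) GPU memory in MB, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    mb = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device) / mb
    peak = torch.cuda.max_memory_allocated(device) / mb
    return allocated, peak


usage = gpu_memory_mb()
print("GPU memory (allocated MB, peak MB):", usage)
```

These numbers cover only tensors allocated by this process, so on a shared server they will be lower than the totals that nvidia-smi reports for the whole GPU.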

Gkv · June 2, 2018, 11:31am

I am using a shared GPU server. Whether the out-of-memory error occurs depends on how much of the GPU the other users are using. When I run on 1000 videos, it runs without any error. So my question is: can I train on one set of 1000 videos first, save the trained model using torch.save(), then load it and continue training on the next set of 1000 videos, and so on? Will this strategy work?

Thanks

ptrblck · June 2, 2018, 12:05pm

Yes, it will work. You should also save the state_dict of the optimizer if it stores any running stats (e.g. Adam).

As a workaround this should work for the moment, even though this error shouldn’t occur if the training batches all have the same number of samples.

Could you try that and report whether your code gets an OOM error in the “second epoch”?
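
A minimal sketch of such a checkpoint, assuming a toy `nn.Linear` in place of the real network. The helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the file name are made up for illustration. Both the model and the optimizer state are stored, so Adam's running averages survive the switch from one chunk to the next.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_checkpoint(path, model, optimizer, chunk_idx):
    """Store model weights, optimizer state and the index of the finished chunk."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "chunk_idx": chunk_idx}, path)


def load_checkpoint(path, model, optimizer):
    """Restore both states in place and return the index of the last finished chunk."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["chunk_idx"]


# Toy model standing in for the real network.
net = nn.Linear(4, 2)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# One dummy step so that Adam has running statistics to save.
net(torch.randn(3, 4)).sum().backward()
opt.step()

ckpt_path = os.path.join(tempfile.mkdtemp(), "chunk_0.pt")
save_checkpoint(ckpt_path, net, opt, chunk_idx=0)

# Later, e.g. in a new process: rebuild the objects, then restore their state.
net2 = nn.Linear(4, 2)
opt2 = torch.optim.Adam(net2.parameters(), lr=1e-3)
last_chunk = load_checkpoint(ckpt_path, net2, opt2)
print("resuming after chunk", last_chunk)
```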

Gkv · June 2, 2018, 12:54pm

Because of the varying video lengths I am unable to form batches, so I am using plain SGD. I have a question about the state_dict: what is the difference between saving the trained model using torch.save() and saving the state_dict?
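
Varying lengths do not necessarily rule out mini-batches. One common option, which is not mentioned in this thread, is to pad the sequences to a common length and pack them before they go into the LSTM. The sketch below uses a toy feature size of 16 and a hidden size of 8 in place of the 2048 and 1024 used in the posted model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three toy "videos" with 5, 3 and 7 frames and 16 features per frame.
videos = [torch.randn(5, 16), torch.randn(3, 16), torch.randn(7, 16)]
lengths = torch.tensor([v.size(0) for v in videos])

# Pad to the longest video: shape (max_frames, batch, features).
padded = pad_sequence(videos)

# Packing lets the LSTM skip the padded frames.
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

lstm = nn.LSTM(16, 8, num_layers=1)
_, (h_n, _) = lstm(packed)

# h_n[-1] holds the hidden state after the last real frame of each video,
# in the same order as the input list.
last_hidden = h_n[-1]
print(last_hidden.shape)
```

The batch is padded to its longest video, so grouping videos of similar length into the same batch would keep the padding overhead small.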

Thanks

justusschock · June 2, 2018, 12:59pm

Have a look at this doc page to see the differences between the model saving approaches.
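
As a rough illustration of the two approaches, here is a sketch with a toy `nn.Sequential` model; the file names are arbitrary. Saving the state_dict stores only the parameters, so the model has to be rebuilt in code before loading. Saving the module itself pickles the whole object, which ties the file to the class definitions being importable when it is loaded.

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 2))
tmp_dir = tempfile.mkdtemp()

# Approach 1: save only the parameters (the state_dict).
params_path = os.path.join(tmp_dir, "params.pt")
torch.save(net.state_dict(), params_path)

# To load, build the model in code first, then fill in the parameters.
restored = nn.Sequential(nn.Linear(4, 2))
restored.load_state_dict(torch.load(params_path))

# Approach 2: pickle the whole module object.
full_path = os.path.join(tmp_dir, "full_model.pt")
torch.save(net, full_path)

# Recent PyTorch releases need weights_only=False to unpickle a full module;
# only do this with files from a trusted source.
full = torch.load(full_path, weights_only=False)
print(type(full).__name__)
```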