Hi,
I am also facing a similar issue. In my case, I have all the features on disk as .pt files. I load them into RAM as global variables and use them in the dataloader by indexing. The problem is that the CPU RAM increases every epoch, and after some epochs the process gets killed by the OS. My question is: since I have already loaded the features into memory and the dataloader only indexes into them, how is this consuming extra memory?
Thanks
@kunasiramesh, @Gkv The memory issue might be related to the training procedure or another part of the code.
Could you post the code so that we can have a look?
Usually the computation graph is unintentionally stored somewhere, e.g. by using losses += loss instead of losses += loss.item().
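To illustrate, here is a small self-contained sketch of that pattern (the linear model, loss and random data are just placeholders, not your code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

bad_sum = 0.
good_sum = 0.
for _ in range(3):
    data = torch.randn(4, 10)
    target = torch.randn(4, 1)
    loss = criterion(model(data), target)

    # Accumulating the tensor keeps every iteration's computation graph alive:
    bad_sum = bad_sum + loss

    # .item() returns a plain Python float, so each graph can be freed:
    good_sum = good_sum + loss.item()

print(type(bad_sum), type(good_sum))  # <class 'torch.Tensor'> <class 'float'>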
@ptrblck: just to understand a bit more about OOM issues, if the computation graphs are stored unintentionally (assuming GPUs are used for training), it should lead to a GPU out-of-memory error. Am I right?
#new
global full_train_set1
global full_train_set2
global val_set1
global val_set2
full_train_set1 = torch.load('path to the features1 dict.pt')
full_train_set2 = torch.load('path to the features2 dict.pt')
val_set1 = torch.load('path to the val features1 dict.pt')
val_set2 = torch.load('path to the val features2 dict.pt')
class MyCustomDataset(Dataset):
    def __init__(self, ids, train):
        self.names = ids
        self.train = train

    def __getitem__(self, index):
        my_id = self.names[int(index)]
        if self.train:
            feature1 = full_train_set1[my_id]
            feature2 = full_train_set2[my_id]
            target = full_train_set1[my_id]['target']
        else:
            feature1 = val_set1[my_id]
            feature2 = val_set2[my_id]
            target = val_set1[my_id]['target']
        data = (feature1, feature2)
        return (data, target)

    def __len__(self):
        return len(self.names)
def train(epoch, train_loader, optimizer, criterian):
    print('Training epoch..', epoch)
    model.train()
    torch.cuda.empty_cache()
    train_loss = 0
    b_cnt = 0
    for data, target in train_loader:
        b_cnt = b_cnt + 1
        data = list(data)
        data[0] = data[0].cuda()
        data[1] = data[1].cuda()
        target = target.cuda().float()
        optimizer.zero_grad()
        pred = model(data).cuda()
        loss = criterian(pred, target).cuda()
        train_loss += math.sqrt(loss.item())
        loss.backward()
        optimizer.step()
        torch.cuda.empty_cache()
    gc.collect()
    torch.cuda.empty_cache()
    return train_loss / float(b_cnt)
def run():
    torch.cuda.empty_cache()
    n_epochs = 100
    print("Running...")
    global n_batches
    n_batches = 100
    global model
    model = my_model().cuda()
    model = nn.DataParallel(model, device_ids=[0, 1, 2])
    criterian = nn.MSELoss().cuda()
    train_ids = find_dataset(full_train_set1)  # train_ids is a list
    l_rate = 0.001
    optimizer = torch.optim.Adam(model.parameters(), lr=l_rate)
    train_loader = torch.utils.data.DataLoader(dataset=MyCustomDataset(train_ids, train=True),
                                               batch_size=n_batches,
                                               shuffle=True)
    for epoch in range(1, n_epochs + 1):
        gc.collect()
        train_loss = train(epoch, train_loader, optimizer, criterian)
        print('Epoch :' + str(epoch) + ': Train rmse:', train_loss)

run()
Since the gradient update is usually done on the CPU while the forward and backward passes run on the GPU, you could remove the .cuda() calls on pred and loss in your training loop and create the optimizer before moving the model to the GPU. Your run function would then look like this:
def run():
    torch.cuda.empty_cache()
    n_epochs = 100
    print("Running...")
    global n_batches
    n_batches = 100
    global model
    model = my_model()
    l_rate = 0.001
    optimizer = torch.optim.Adam(model.parameters(), lr=l_rate)
    model = model.cuda()
    model = nn.DataParallel(model, device_ids=[0, 1, 2])
    criterian = nn.MSELoss().cuda()
    train_ids = find_dataset(full_train_set1)  # train_ids is a list
    train_loader = torch.utils.data.DataLoader(dataset=MyCustomDataset(train_ids, train=True),
                                               batch_size=n_batches,
                                               shuffle=True)
    for epoch in range(1, n_epochs + 1):
        gc.collect()
        train_loss = train(epoch, train_loader, optimizer, criterian)
        print('Epoch :' + str(epoch) + ': Train rmse:', train_loss)
This would prevent the loss function and optimizer from living on the GPU (and thus decrease the GPU memory usage). However, since you run out of memory on the CPU, I would first try to load only the parts of your dataset you need just in time, instead of loading and caching the whole (probably huge) datasets up front. That way your CPU RAM should not increase but stay at approximately the same level of usage.
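A minimal sketch of such just-in-time loading could look like the code below, assuming you can split the big dictionaries into one small .pt file per sample. The LazyFeatureDataset name, the feature_dir layout and the 'feature1'/'feature2' keys are assumptions for illustration; only the 'target' key is taken from your posted code.

import os
import torch
from torch.utils.data import Dataset

class LazyFeatureDataset(Dataset):
    """Loads one sample's features from disk on demand instead of caching everything in RAM."""
    def __init__(self, ids, feature_dir):
        self.names = ids
        self.feature_dir = feature_dir

    def __getitem__(self, index):
        my_id = self.names[int(index)]
        # one small .pt file per sample (hypothetical layout), loaded just in time
        sample = torch.load(os.path.join(self.feature_dir, str(my_id) + '.pt'))
        data = (sample['feature1'], sample['feature2'])
        return (data, sample['target'])

    def __len__(self):
        return len(self.names)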
Hi,
I added those .cuda() calls after this problem occurred; the problem was there even without them. My feature dictionaries are around 6 GB in total, and I am running this on a system with 251 GB of RAM. How can this happen? Is there a problem with the global variables?
Can you try to run it without global variables (i.e., passing the model to the function as a parameter and loading your data inside your Dataset's __init__)? I usually try to avoid global variables.
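A rough sketch of that refactoring could look like this. It reuses my_model, find_dataset, the 'target' key and the placeholder paths from your snippet (so those are assumed to exist), and the hyperparameters are just copied from your run function:

import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class MyCustomDataset(Dataset):
    def __init__(self, feature_path1, feature_path2):
        # load the feature dicts here instead of keeping them as globals
        self.set1 = torch.load(feature_path1)
        self.set2 = torch.load(feature_path2)
        self.names = find_dataset(self.set1)

    def __getitem__(self, index):
        my_id = self.names[int(index)]
        data = (self.set1[my_id], self.set2[my_id])
        return (data, self.set1[my_id]['target'])

    def __len__(self):
        return len(self.names)

def train_one_epoch(model, train_loader, optimizer, criterian):
    model.train()
    train_loss, b_cnt = 0., 0
    for (feature1, feature2), target in train_loader:
        b_cnt += 1
        data = [feature1.cuda(), feature2.cuda()]
        target = target.cuda().float()
        optimizer.zero_grad()
        loss = criterian(model(data), target)
        train_loss += math.sqrt(loss.item())
        loss.backward()
        optimizer.step()
    return train_loss / float(b_cnt)

def run():
    dataset = MyCustomDataset('path to the features1 dict.pt', 'path to the features2 dict.pt')
    train_loader = DataLoader(dataset, batch_size=100, shuffle=True)
    model = nn.DataParallel(my_model().cuda(), device_ids=[0, 1, 2])
    criterian = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, 101):
        train_loss = train_one_epoch(model, train_loader, optimizer, criterian)
        print('Epoch :' + str(epoch) + ': Train rmse:', train_loss)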